arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.22821 2026-05-22 cs.CL cs.LG (updated version)

Tokenisation via Convex Relaxations

Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel

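Affiliations: ETH Zurich; Kensho Technologies

AI summary: This paper proposes ConvexTok, a tokenisation method based on convex relaxations. It formulates tokeniser construction as a linear program and solves it with convex optimisation tools, improving intrinsic tokenisation metrics and language-model bits-per-byte; downstream task performance also improves, though less consistently.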

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.
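
The bits-per-byte metric mentioned above normalises a language model's loss by the byte length of the text rather than by its token count, which is what makes models with different tokenisers comparable. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name and the per-token loss values are hypothetical.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Total negative log-likelihood, converted to bits, per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical per-token losses (in nats) for the same 12-byte string
# under two different tokenisers; the numbers are made up for illustration.
text = "tokenisation"
bpb_a = bits_per_byte([2.0, 2.0], text)                 # 2 tokens
bpb_b = bits_per_byte([1.0, 1.0, 1.0, 1.0, 1.0], text)  # 5 tokens
```

Because the denominator is the same for both tokenisers, only the total loss over the string matters, regardless of how many tokens it was split into.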

2605.22820 2026-05-22 cs.LG (updated version)

Integrable Elasticity via Neural Demand Potentials

Carlos Heredia, Daniel Roncel

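Affiliations: IAMM Research, Department of Applied Artificial Intelligence; DAMM

AI summary: This paper proposes ICDN, a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, so that elasticities can be derived exactly. On the Dominick's beer dataset, ICDN generalizes better out of sample than a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.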

Comments 44 pages, 7 figures

Abstract

We propose the Integrable Context-Dependent Demand Network (ICDN), a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, allowing elasticities to be derived exactly from the learned demand surface. On the Dominick's beer dataset, ICDN improves out-of-sample generalization over a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.
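
To illustrate why a smooth log-demand surface makes elasticities directly available: the own-price elasticity is the derivative of log-demand with respect to log-price. The sketch below uses a hypothetical quadratic log-demand function and a central finite difference; in a neural model, automatic differentiation would presumably take the place of the finite difference. None of the names or coefficients come from the paper.

```python
import math

def log_demand(log_price, context_shift=0.0):
    # Hypothetical smooth log-demand surface, quadratic in log-price.
    # context_shift stands in for any context-dependent level effect.
    return 3.0 + context_shift - 1.5 * log_price - 0.2 * log_price ** 2

def elasticity(f, log_price, h=1e-5, **kwargs):
    # Own-price elasticity = d log(q) / d log(p), via a central difference.
    return (f(log_price + h, **kwargs) - f(log_price - h, **kwargs)) / (2 * h)

lp = math.log(2.0)
numeric = elasticity(log_demand, lp)
analytic = -1.5 - 0.4 * lp  # derivative of the quadratic above
```

In a plain log-log model the quadratic term is absent and the elasticity is the constant slope; the quadratic term here makes it vary with the price level.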

2605.22817 2026-05-22 cs.LG cs.AI cs.CL cs.NE (updated version)

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

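Affiliations: MIT; Improbable AI Lab; MIT-IBM Computing Research Lab; Sakana AI

AI summary: This paper proposes Vector Policy Optimization (VPO), which trains policies to anticipate diverse downstream reward functions and thereby produce diverse solutions, improving test-time search performance.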

Comments 24 pages

Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
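
The link between diversity and pass@k can be made concrete with the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. This is a sketch with made-up counts, not the paper's evaluation code: a hypothetical low-entropy policy that always solves two of four problems is compared with a hypothetical diverse policy that solves every problem half of the time.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    # Average the per-problem estimate over a benchmark.
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 20
low_entropy = [20, 20, 0, 0]   # hypothetical: always right or always wrong
diverse = [10, 10, 10, 10]     # hypothetical: right half of the time everywhere

low_k1, low_k10 = mean_pass_at_k(low_entropy, n, 1), mean_pass_at_k(low_entropy, n, 10)
div_k1, div_k10 = mean_pass_at_k(diverse, n, 1), mean_pass_at_k(diverse, n, 10)
```

Both policies have the same pass@1, but only the diverse one benefits from a larger search budget, which is the kind of widening gap the abstract describes.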

2605.22814 2026-05-22 cs.LG (updated version)

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa

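Affiliations: University of Toronto; UC Berkeley; Wayve; Vector Institute; Simon Fraser University

AI summary: This work proposes a curiosity-driven reinforcement learning method that addresses exploration in sparse-reward, long-horizon tasks in 3D environments by introducing a persistent world model and episodic context. Trained on HM3D, the method outperforms RL-based active mapping baselines and generalizes to Gibson and AI-generated worlds.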

Abstract

Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.

2605.22786 2026-05-22 cs.AI cs.ET cs.LG cs.MA (updated version)

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

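Affiliations: Rensselaer Polytechnic Institute; IBM Research

AI summary: This paper proposes LCGuard, a framework that learns representation-level transformations before KV caches are shared across agents, in order to prevent sensitive information leakage. Evaluations across multiple model families and multi-agent benchmarks show that it reduces reconstruction-based leakage and attack success rates while maintaining task performance.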

Abstract

Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can improve efficiency and preserve richer task-relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV-based latent communication in multi-agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation-level transformations before cache artifacts are transmitted across agents. We formalize representation-level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent-specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task-relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining competitive task performance compared to standard KV-sharing baselines.

2605.22779 2026-05-22 cs.SE cs.LG (updated version)

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia

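Affiliations: Department of Electrical and Computer Engineering

AI summary: This paper proposes FAME, a failure-aware mixture-of-experts model for message-level log anomaly detection. It trains a lightweight router and domain experts from a small amount of labeled data, and reports high F1 and recall on the BGL and Thunderbird datasets.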

Comments 12 pages, 5 figures

Abstract

Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging. A single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.

2605.22776 2026-05-22 cs.LG cs.AI stat.CO stat.ML (updated version)

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

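Affiliations: Peter the Great St. Petersburg Polytechnic University

AI summary: This paper proposes SDPM, a generative model for continuous-time survival analysis that models the conditional distribution of the survival outcome with a denoising diffusion model, avoiding parametric assumptions on the event-time distribution. It operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator, and achieves competitive predictive performance on multiple real survival datasets.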

Abstract

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, $\mathbb{P}(T,δ\mid \mathbf{x})$, using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Codes implementing the proposed model are publicly available.
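
The step from generated (time, censoring indicator) samples to a survival curve can be sketched with a plain Kaplan-Meier estimator. The samples below are hard-coded, hypothetical stand-ins for draws from a model at a fixed covariate vector; nothing here reproduces the authors' implementation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group ties: everyone with time == t is still at risk at t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Hypothetical samples (T, delta) for one covariate vector.
sample_times = [1.0, 2.0, 2.0, 3.0, 4.0]
sample_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(sample_times, sample_events)
```

Censored samples never lower the curve directly; they only shrink the risk set for later event times, which is why the estimator relies on the conditionally independent censoring assumption stated in the abstract.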

2605.22775 2026-05-22 cs.LG cs.AI cs.HC (updated version)

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

Affiliations: Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio; Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio; Department of Neuroscience, Developmental and Regenerative Biology, College of Sciences, The University of Texas at San Antonio

AI summary: This paper proposes MambaGaze, which combines XMD encoding with a bidirectional Mamba-2 framework to handle frequent missing data and long-range temporal dependencies in eye-tracking data. Experiments show better cognitive load assessment than the baselines and feasibility for edge deployment.

Comments Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

Abstract

Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
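
The abstract describes XMD encoding only as raw features augmented with observation masks and time-deltas. The sketch below shows one plausible reading for a single gaze channel; the zero fill value and the convention that the delta is the time since the most recent observed sample are assumptions, and the function name is hypothetical.

```python
def xmd_encode(values, timestamps, fill=0.0):
    """Encode a series with gaps as (x, mask, delta) triples.

    values: list of floats, with None where the tracker lost the eye.
    mask:   1 if the sample was observed, 0 if missing.
    delta:  time since the most recent observed sample (0.0 before any).
    """
    encoded = []
    last_seen = None
    for v, t in zip(values, timestamps):
        mask = 0 if v is None else 1
        delta = 0.0 if last_seen is None else t - last_seen
        x = fill if v is None else v
        encoded.append((x, mask, delta))
        if mask:
            last_seen = t
    return encoded

# Hypothetical pupil-diameter samples with a two-frame blink.
gaze = [0.5, None, None, 0.7]
stamps = [0.0, 0.1, 0.2, 0.3]
xmd = xmd_encode(gaze, stamps)
```

The mask lets a downstream sequence model tell a filled value from a real one, and the growing delta tells it how stale the last real observation is.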

2605.22765 2026-05-22 cs.LG stat.ML (updated version)

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus

Affiliations: CMAP, Ecole polytechnique; Institute of Foundation Models; Inria, PSL Research University; MBZUAI; EPITA, LRE

AI summary: This paper studies the mismatch between the denoising posterior and the leave-one-out posterior in Uniform Diffusion Models, and improves model performance through better parameterization and sampling methods.

Comments preprint

Abstract

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

2605.22756 2026-05-22 cs.LG cs.DS (updated version)

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Christian Janos Lebeda, David Erb, Tudor Cebere, Aurélien Bellet

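Affiliations: PreMeDICaL, Inria; Université de Montpellier; INSERM; Technical University of Munich

AI summary: This paper proposes Lumberjack, an algorithm that substantially improves the utility of differentially private random forests by constructing large random decision trees and then applying privacy-preserving pruning. It introduces a new (ε,δ)-DP heavy hitter detection algorithm with error O_{ε,δ}(√log h) for trees of height h, which allows deeper trees and hence greater expressiveness under privacy constraints. Experiments on benchmark datasets show that Lumberjack outperforms prior DP random forest methods, with notable gains in the privacy-utility trade-off for practical privacy budgets.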

Abstract

Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,δ)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,δ}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

2605.22749 2026-05-22 cs.LG cs.AI (updated version)

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

Adis Alihodžić, Eva Tuba, Milan Tuba

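Affiliations: Department of Mathematical and Computer Sciences, University of Sarajevo; Singidunum University; Trinity University; Sinergija University

AI summary: This work studies the detection of cyber-physical anomalies in IoT-enabled smart grids using machine learning and metaheuristic feature optimization. Among several baseline models, tree-based ensembles perform best on the dataset considered, and after genetic-algorithm feature selection macro-F1 rises from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837.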

Abstract

Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line disturbances, from malicious actions, such as false data injection or unauthorized command execution. This chapter investigates this problem using the well-known MSU/ORNL Power System Attack Dataset. The proposed method combines machine learning with genetic-algorithm-based feature selection. The objective is twofold: to classify attack and natural events accurately, and to determine whether a reduced set of physically informative PMU/IED measurements can support reliable detection. Several baseline models are evaluated, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees. The results show that tree-based ensemble models are the most effective for the considered dataset, with Extra Trees providing the strongest full-feature baseline. After feature selection, the GA + Extra Trees model reduces the clean PMU feature space from 112 attributes to an average of 27.4 attributes over five runs, while increasing macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837. These results indicate that many synchronized electrical measurements are redundant. A compact subset of phasor-based features can still provide accurate and interpretable anomaly detection in smart grids.
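
Genetic-algorithm feature selection of the kind described above searches over binary masks, one bit per feature. The following is a generic sketch with tournament selection, one-point crossover, bit-flip mutation, and elitism; the operators, hyperparameters, and the toy fitness function are all assumptions rather than the chapter's configuration. In practice the fitness would be something like a cross-validated score of the classifier on the selected columns.

```python
import random

def run_ga(fitness, n_features, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Search binary feature masks; returns (best_mask, best_fitness_history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    def pick():
        # Binary tournament: the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Toy stand-in for a validation score: reward three "informative"
# features and penalise every selected feature slightly.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)

best_mask, fitness_history = run_ga(toy_fitness, n_features=12)
```

The per-feature penalty is what pushes the search toward compact subsets; without it, a mask selecting every feature would score at least as well as any smaller one on this toy objective.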

2605.22746 2026-05-22 cs.LG eess.AS stat.ML (updated version)

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

Berk Hayta, Hannah Laus, Simon Mittermaier, Felix Krahmer

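Affiliations: TU Munich; MCML; Infineon Technologies

AI summary: This paper proposes a simplified framework that approximates evidential deep learning objectives with plug-in losses for uncertainty estimation, shows that under a particular evidence-to-Dirichlet mapping the framework includes the standard softmax classifier, and validates it on the Google Speech Commands dataset.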

Abstract

Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are predicted by a learned neural network mapping. However, this approach can lead to computational challenges, as Dirichlet expected objectives are more complex than standard supervised learning losses, complicating their analysis and implementation. We address this issue by approximating the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean and show that, under mild assumptions, the approximation error decays with growing evidence for a broad class of loss functions, including mean-squared error and cross-entropy loss. As a special case, our analysis provides justification for the use of softmax in the context of uncertainty estimation, since under a particular evidence-to-Dirichlet mapping, our framework includes the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they achieve predictive accuracy and selective prediction performance comparable to classical EDL, while being simpler to implement using standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.
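
For the mean-squared-error case, the gap between the Dirichlet-expected loss and the plug-in loss at the Dirichlet mean has a closed form: it equals the sum of the component variances, m_k(1 - m_k)/(α_0 + 1), which shrinks as the total evidence α_0 grows. The sketch below checks this numerically and also shows that the choice α = exp(logits) makes the Dirichlet mean coincide with softmax. Whether that is exactly the mapping the authors use is an assumption; the function names are hypothetical.

```python
import math

def dirichlet_mean(alpha):
    a0 = sum(alpha)
    return [a / a0 for a in alpha]

def plugin_mse(alpha, y):
    # Squared error evaluated at the Dirichlet mean.
    m = dirichlet_mean(alpha)
    return sum((yk - mk) ** 2 for yk, mk in zip(y, m))

def expected_mse(alpha, y):
    # E||y - p||^2 for p ~ Dir(alpha): plug-in term plus summed variances.
    a0 = sum(alpha)
    m = dirichlet_mean(alpha)
    variance = sum(mk * (1.0 - mk) / (a0 + 1.0) for mk in m)
    return plugin_mse(alpha, y) + variance

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

y = [1.0, 0.0, 0.0]
base = [2.0, 1.0, 1.0]
# Same mean, growing evidence: scale alpha by 1, 10, 100.
gaps = [expected_mse([s * a for a in base], y) - plugin_mse([s * a for a in base], y)
        for s in (1, 10, 100)]

logits = [1.0, -0.5, 2.0]
mean_from_exp = dirichlet_mean([math.exp(v) for v in logits])
```

Scaling α leaves the mean, and therefore the plug-in loss, unchanged, so the shrinking gap isolates the effect of evidence alone.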

2605.22743 2026-05-22 cs.LG (updated version)

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

Javad Parsa, Enis Simsar, Amir Joudaki, Thomas Hofmann, André M. H. Teixeira

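Affiliations: Uppsala University; ETH Zurich; Sweden

AI summary: This paper proposes SeqLoRA, a bilevel optimization framework that jointly optimizes both LoRA factors to address representation interference when composing multiple custom concepts in text-to-image diffusion models, improving identity preservation and scalability.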

Abstract

Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit expressiveness and concept fidelity. To address this trade-off, we propose Sequential regularized LoRA (SeqLoRA), a constrained continual learning framework that jointly optimizes both LoRA factors via bilevel optimization. Theoretically, we establish strong convergence guarantees for our algorithm and model the residual layer activations as a matrix sub-Gaussian process to derive high-probability bounds on catastrophic forgetting. We further prove that learning the LoRA basis from data minimizes residual interference energy more effectively than frozen-basis methods. Experiments on multi-concept image generation demonstrate that SeqLoRA improves identity preservation and scalability across up to 101 concepts, while avoiding costly fusion and reducing attribute interference in composed generations.

2605.22736 2026-05-22 math.OC cs.LG cs.NA math.DG math.NA (updated version)

Optimization over the intersection of manifolds

Yan Yang, Bin Gao, Ya-xiang Yuan

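Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China

AI summary: This paper proposes a geometric method for optimization over the intersection of two manifolds that employs a retraction on only one manifold and updates the iterate along two orthogonal directions. It proves that clean intersection and intrinsic transversality are equivalent, and demonstrates the method on sparse and low-rank optimization problems.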

Comments 26 pages, 5 figures, 3 tables

Abstract

Optimization over the intersection of two manifolds arises in a broad range of applications, but is hindered by the coupled geometry of the feasible region. In this paper, we prove that the regularities -- clean intersection and intrinsic transversality -- are equivalent, which yields a tractable projection onto the tangent space of the intersection. Therefore, we propose a geometric method that employs a retraction on only one manifold and updates the iterate along two orthogonal directions. Specifically, the iterates stay on one manifold, and the two directions are responsible for asymptotically approaching the other manifold and decreasing the objective function, respectively. Under intrinsic transversality, we derive the convergence rate for both the feasibility and optimality measures, and show that every accumulation point is first-order stationary. Numerical experiments on problems stemming from sparse and low-rank optimization, including fitting spherical data, approximating hyperbolic embeddings on real data, and computing compressed modes, demonstrate the effectiveness of the proposed method.

2605.22731 2026-05-22 cs.LG cs.AI (updated version)

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

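Affiliations: Independent Researcher

AI summary: This paper studies LLM post-training methods, namely supervised fine-tuning (SFT), reinforcement learning (RL), and on-policy distillation (OPD), from a state-distribution perspective, and finds that the source and locality of training states can be as important as the form of the supervision signal.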

详情
AI中文摘要

大语言模型的后训练方法,如监督微调(SFT)、强化学习(RL)和蒸馏,通常通过其损失函数进行分析:最大似然、策略梯度、前向KL、反向KL或相关的目标层面变体。我们研究一个互补因素:监督信号所施加的状态分布。对于自回归策略,状态是提示加上已生成的前缀。SFT在固定数据集的状态上训练,而RL和在线策略蒸馏(OPD)在当前学习者诱导的状态上训练。我们将后训练形式化为状态分布塑造,并使用Qwen3-0.6B-Base在GSM8K上进行受控的小规模研究,以TruthfulQA和MMLU作为保持性(retention)评估。我们的结果显示出三种现象。第一,温和的SFT运行提升了GSM8K成绩且几乎没有遗忘,而压力SFT运行导致显著的保持性损失。第二,以退化的SFT教师为来源的OPD在GSM8K、TruthfulQA和MMLU上均超过该教师,尽管该教师是其唯一的监督来源。第三,轻量级的在线策略RL运行提升了GSM8K成绩,同时保持了原有能力。这些结果支持后训练的状态中心视角:训练状态的来源和局部性可能与监督信号的形式同样重要。

英文摘要

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.

2605.22724 2026-05-22 cs.LG cs.NA math.NA stat.ML 版本更新

Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning

多重神经算子在多任务学习中实现接近最优的速率

Adrien Weihs, Hayden Schaeffer

发表机构 * Department of Mathematics, University of California Los Angeles(数学系,加州大学洛杉矶分校)

AI总结 本文研究了共享多任务设置中学习一组算子的逼近复杂性和统计复杂性,重点探讨了多重神经算子(MNO)架构。对于广泛类别的Lipschitz多重算子映射,推导出逼近和统计泛化的近最优上界。同时,建立了参数复杂性诅咒并证明了相应的极小极大速率。这些结果表明,跨任务共享表示不会增加总体成本:多任务算子学习遵循与单算子学习相同的缩放定律。此外,本文还比较了MNO与基于拼接任务输入的DeepONet多任务扩展,并表明从最坏情况的逼近复杂性角度看,两种架构满足本质上相同的渐近速率。

详情
AI中文摘要

我们研究了在共享多任务设置中学习一组算子的逼近复杂性和统计复杂性,重点在于多重神经算子(MNO)架构。对于广泛类别的Lipschitz多重算子映射,我们推导出逼近和统计泛化的近最优上界。在下界方面,我们建立了参数复杂性诅咒,并证明了相应的极小极大速率。这些结果共同表明,跨任务共享表示不会增加总体成本:多任务算子学习遵循与单算子学习相同的缩放定律。此外,我们还比较了MNO与基于拼接任务输入的DeepONet多任务扩展,并表明从最坏情况的逼近复杂性角度看,两种架构满足本质上相同的渐近速率。

英文摘要

We study the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, with a focus on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, we derive near-optimal upper bounds for approximation and statistical generalization. On the lower-bound side, we establish a curse of parametric complexity and prove corresponding minimax rates. Together, these results show that shared representations across tasks do not increase the overall cost: multi-task operator learning follows the same scaling laws as single operator learning. We also compare MNO with a multi-task extension of DeepONet based on concatenated task inputs and show that, from a worst-case approximation-complexity perspective, both architectures satisfy essentially the same asymptotic rates.

2605.22723 2026-05-22 cs.LG cs.AI cs.IT math.IT 版本更新

The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler

高斯DDPM中协方差匹配的价值及兰扎斯采样器

Md Sahil Akhtar, Aymane El Gadarri, Vivek F. Farias, Adam D. Jozefiak

发表机构 * Electrical Engineering and Computer Science(电气工程与计算机科学系) Massachusetts Institute of Technology(麻省理工学院) Operations Research Center(运筹学研究中心) Sloan School of Management(斯隆管理学院)

AI总结 本文研究了高斯DDPM中协方差匹配在路径空间KL散度中的价值,提出兰扎斯采样器方法,通过矩阵自由技术实现最优反向协方差采样,从而提升采样质量。

详情
AI中文摘要

高斯DDPM中的核心误差度量是精确反向链与学习到的高斯反向过程之间的路径空间KL散度。这一量对分类器引导等过程尤为重要,因为这些过程扰动的是整个反向轨迹,而非仅终端样本。先前分析显示,随着去噪步数T增长,标准各向同性反向协方差会产生不可避免的Ω(1/T)路径KL误差。我们证明匹配完整后验协方差可以突破这一障碍,使路径KL误差降至O(1/T²)。为使完整协方差匹配实用化,我们引入兰扎斯高斯采样器(LGS),一种无需训练、无需显式矩阵(matrix-free)的方法,仅使用协方差-向量积即可从最优反向协方差采样,而这些乘积可通过后验均值的雅可比-向量积得到。LGS避免了稠密协方差存储和辅助协方差模型。我们证明LGS的近似误差随兰扎斯步数呈指数衰减,每个兰扎斯步骤仅需一次雅可比-向量积。实验表明,仅使用三个此类步骤即可在标准图像基准上提升样本质量,优于包括OCM-DDPM在内的强对角协方差基线。这表明完整协方差匹配既具有理论价值,又可实际用于快速DDPM采样。

英文摘要

A central error measure in Gaussian DDPMs is the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. This quantity is especially relevant for procedures such as classifier guidance, which perturb the entire reverse trajectory rather than only the terminal sample. Prior analyses show that standard isotropic reverse covariances suffer an unavoidable $\Omega(1/T)$ path-KL error as the number of denoising steps $T$ grows. We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to $O(1/T^2)$. To make full covariance matching practical, we introduce the Lanczos Gaussian sampler (LGS), a training-free, matrix-free method for sampling from the optimal reverse covariance using only covariance-vector products, which are available through Jacobian-vector products of the posterior mean. LGS avoids dense covariance storage and auxiliary covariance models. We prove that LGS approximation error decays exponentially in the number of Lanczos steps, where each Lanczos step requires a single Jacobian-vector product. Empirically, using just three such steps improves sample quality over strong diagonal-covariance baselines, including OCM-DDPM, across standard image benchmarks. This identifies full covariance matching as both theoretically valuable and practically accessible for fast DDPM sampling.
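To make the matrix-free idea concrete, here is a minimal sketch of the standard Lanczos approximation to $\Sigma^{1/2} z$, which turns a standard normal draw $z$ into an approximate sample from $N(0, \Sigma)$ using only products $v \mapsto \Sigma v$. It is not the authors' LGS implementation: the function name `lanczos_sqrt_apply`, the explicit test matrix, and the use of full reorthogonalisation are assumptions made for illustration. In the paper's setting the `matvec` callback would be backed by Jacobian-vector products of the posterior mean.

```python
import numpy as np

def lanczos_sqrt_apply(matvec, z, num_steps):
    """Approximate Sigma^{1/2} z with num_steps Lanczos iterations.

    matvec: callable v -> Sigma @ v for a symmetric PSD Sigma (hypothetical
            stand-in for a covariance-vector product).
    z:      start vector, e.g. a standard normal draw.
    """
    n = z.shape[0]
    z_norm = np.linalg.norm(z)
    Q = np.zeros((n, num_steps))       # Krylov basis, one column per step
    alphas = np.zeros(num_steps)       # diagonal of the tridiagonal matrix T
    betas = np.zeros(max(num_steps - 1, 0))  # off-diagonal of T
    q, q_prev, beta = z / z_norm, np.zeros(n), 0.0
    m = num_steps                      # number of basis vectors actually built
    for j in range(num_steps):
        Q[:, j] = q
        w = matvec(q) - beta * q_prev  # one matrix-vector product per step
        alphas[j] = q @ w
        w = w - alphas[j] * q
        # Full reorthogonalisation: fine for a small sketch, costly at scale.
        w = w - Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        if j == num_steps - 1:
            break
        beta = np.linalg.norm(w)
        if beta < 1e-12:               # Krylov space exhausted early
            m = j + 1
            break
        betas[j] = beta
        q_prev, q = q, w / beta
    T = (np.diag(alphas[:m])
         + np.diag(betas[: m - 1], 1)
         + np.diag(betas[: m - 1], -1))
    evals, evecs = np.linalg.eigh(T)
    evals = np.clip(evals, 0.0, None)  # guard against tiny negative round-off
    # T^{1/2} e_1 = V diag(sqrt(lambda)) V^T e_1, and V^T e_1 is row 0 of V.
    sqrt_T_e1 = evecs @ (np.sqrt(evals) * evecs[0, :])
    return z_norm * (Q[:, :m] @ sqrt_T_e1)
```

With as many steps as the dimension the result matches the exact square root up to round-off. The abstract's claim concerns how quickly the error shrinks for far fewer steps, which this sketch does not attempt to verify.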

2605.22719 2026-05-22 cs.LG 版本更新

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

从激活中读出任务失败:对GPT-2 small在间接宾语识别任务上的稀疏特征审计

Mahdi Nasermoghadasi

发表机构 * Research Division, BrightMind AI(BrightMind AI研究部) Texas Tech University(德克萨斯理工大学) University of Texas at Arlington(德克萨斯大学阿灵顿分校)

AI总结 该研究审计了GPT-2 small在间接宾语识别(IOI)任务的失败与成功样本上稀疏自编码器特征的差异,发现特定特征与任务失败高度相关,并通过多种对照实验表明其为相关特征而非充分原因。

Comments 10 pages, 7 figures

详情
AI中文摘要

我们报告了一个小型、可复现的审计,考察GPT-2 small的哪些稀疏自编码器(SAE)特征在间接宾语识别(IOI)任务的失败样本与成功样本上激活不同。在300个提示上,GPT-2 small达到79.7%的准确率;在Bloom (2024)发布的第8层残差流SAE的24,576个特征中,有146个通过Holm校正的显著性阈值,105个达到大效应量(|Cohen's d| > 0.8)。与失败相关性最强的单个特征是特征17,491(d=+2.93,Neuronpedia标签为'cryptographic keys',即加密密钥),它基本不激活,除非提示中被转移的物品是'the keys'(钥匙);在这类提示上GPT-2 small的失败率为93.3%,而在其他七种物品上为7.5%(Fisher精确检验p = 8.79 x 10^-33)。我们对这一相关特征进行了机制性论断应当通过的三项对照检验。 (i) 因果消融:在45个keys提示的所有token位置上将残差流中的特征17,491置零,并不能恢复准确率(6.7% -> 4.4%);该特征是相关因素,而非该层的充分原因。 (ii) 表示基线:对原始768维残差流做逻辑回归,5折ROC AUC=0.929,与前100个SAE特征(0.927)相当;SAE基底增加的是可解释性,而非预测能力。 (iii) 种子鲁棒性检查:在五个随机种子下,keys子集的失败率保持在75.0-93.3%(行为效应是真实的),但特征17,491仅在5次运行中的1次是|d|最大的特征。因此,方法学上的贡献是审计流程本身(低成本、模型无关、能找出带名称的相关特征),而非通过该流程发现的任何单个特征。我们发布了代码、300个提示的语料库、300x24,576激活矩阵、消融和基线脚本以及图表。完整流程可在笔记本电脑(Apple M3 Max,无独立GPU)上运行。

英文摘要

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0--93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
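The two statistics named in the abstract, Cohen's d and the Holm correction, are standard and can be sketched in a few lines. This is not the released audit code. The abstract does not say which per-feature test produced the p-values, so the sketch takes them as given, and the function names are illustrative.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation: (mean(a) - mean(b)) / s_pooled."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: boolean mask of rejected hypotheses."""
    p = np.asarray(p_values, float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    # Visit p-values from smallest to largest; stop at the first failure.
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject
```

In an audit of this kind, `a` and `b` would be one feature's activations on failed and successful prompts, and `holm_reject` would be applied once across the p-values of all 24,576 features.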

2605.22717 2026-05-22 cs.SD cs.AI cs.LG cs.MM 版本更新

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

实时音乐扩散模型:交互式音乐生成扩散模型的高效微调与后训练

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

发表机构 * UC San Diego(加州大学圣迭戈分校) MIT(麻省理工学院) Adobe(Adobe公司)

AI总结 本文研究了音频扩散模型能否通过块级KV缓存高效地转化为交互式模型,从而在消费级硬件上实现。提出的Live Music Diffusion Models (LMDMs)通过块级KV缓存恢复并超越了离散Live Music Models (LMMs)的推理复杂度,并通过ARC-Forcing范式实现稳定的后训练对齐,从而在无需显式RL或奖励模型的情况下减少误差累积。

详情
AI中文摘要

交互式流式音乐生成有望将生成模型用于现场表演和协同创作,这是离线模型无法做到的。然而,最先进的模型处于离散自回归(AR)范式,训练和推理都需要工业级的计算资源。在本文中,我们研究音频扩散模型(在开源社区中得到广泛支持,但本质上是非流式、双向的)能否被高效地改造为可在消费级硬件上运行的交互式模型。通过审视现代块级外绘(outpainting)扩散流程,我们发现推理过程中存在关键的低效问题,导致其计算效率严格劣于离散AR模型。我们提出了Live Music Diffusion Models (LMDMs),这是对生成扩散过程的一种简单修改,通过块级KV缓存先追平、再超越离散Live Music Models (LMMs)的推理复杂度。与LMMs不同,LMDMs还能通过我们新提出的ARC-Forcing范式实现稳定的后训练对齐,无需任何显式RL或奖励模型即可减少误差累积。我们展示了LMDMs在多个创意领域中的应用,包括文本条件生成、基于草图的音乐合成和即兴合奏(jamming)。最后,我们展示了如何在真实的艺术家与AI合作中将LMDMs用作生成式乐器:把LMDMs作为“生成式延迟”,实时变换音乐家的即兴演奏以获得多变的音色效果,同时在消费级游戏笔记本电脑上本地运行。

英文摘要

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

2605.22711 2026-05-22 cs.LG cs.AI 版本更新

Abstraction for Offline Goal-Conditioned Reinforcement Learning

离线目标条件强化学习中的抽象

Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

发表机构 * FLAIR, MLRG University of Oxford(FLAIR、MLRG 牛津大学)

AI总结 本文提出了一种在离线目标条件强化学习中利用抽象的方法,通过引入相对化选项和不同层次的表示,提高了在相似状态空间上下文中的经验复用能力,从而提升了性能。

详情
AI中文摘要

马尔可夫决策过程(MDPs)在现实中的目标条件强化学习(GCRL)中往往由于对称性和状态-目标对之间的共享结构而表现出显著的冗余性。虽然分层策略已被提出以通过时间抽象减少时间跨度来改进离线GCRL,但本文证明层次结构也能够实现绝对抽象。通过引入相对化选项以及为不同层次的层次结构引入不同的表示,我们展示了智能体如何在相似的状态空间上下文中重用经验。基于这一框架,我们介绍了两种简单的算法用于学习相对化选项和从绝对参考框架中抽象。我们的实验表明,这种归纳偏置在离线GCRL中显著提高了性能。

英文摘要

Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

2605.22703 2026-05-22 cs.LG 版本更新

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

剪裁瓶颈:通过近边界信号的随机恢复稳定RLVR

Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 本文研究了可验证奖励强化学习(RLVR)中由硬剪裁的刚性决策导致的训练不稳定问题,提出了一种名为近边界随机救援(NSR)的简单方法,通过随机保留略微越界的token来恢复丢失的信号,从而提升训练稳定性和性能。

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为扩展大语言模型推理能力的核心范式,但其优化过程常常受到训练不稳定和收敛次优的困扰。通过系统剖析基于剪裁的GRPO类目标,我们发现硬剪裁带来的刚性剪裁决策是所研究的RLVR设置中的关键实际瓶颈。具体而言,我们的分析表明,有信息量的信号可能位于刚超出剪裁阈值的近边界区域,因此被标准硬剪裁规则丢弃。值得注意的是,一旦精确识别出这个瓶颈,即使在边界处做简单的随机扰动也能恢复可观的性能提升。基于这一发现,我们提出了近边界随机救援(NSR),一种最小化、即插即用的修改,通过随机保留这些略微越界的token来恢复丢失的信号。虽然NSR借助随机采样可以被解释为在期望意义下引入隐式梯度衰减,但我们的消融实验表明,其随机的、局限于边界附近的救援机制始终比确定性梯度衰减更有效。在7B到30B的模型规模以及稠密和MoE架构上的大量实验验证了这一点:作为即插即用的方案,NSR显著提高了训练稳定性,并相对DAPO和GSPO等强基线取得了一致的提升。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
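The abstract does not spell out the exact rescue rule, so the following is only one plausible reading, written as a token-level gradient mask on top of PPO-style hard clipping. The band width `delta`, the retention probability `keep_prob`, and the function name are hypothetical parameters introduced here, not values or names from the paper.

```python
import numpy as np

def nsr_gradient_mask(ratio, advantage, eps=0.2, delta=0.05, keep_prob=0.5, rng=None):
    """Return True for tokens that still contribute a gradient.

    Hard clipping zeroes the gradient when the importance ratio has moved past
    1 + eps with a positive advantage, or below 1 - eps with a negative one.
    This sketch re-admits such tokens with probability keep_prob, but only if
    they lie within `delta` of the threshold (the near-boundary band).
    """
    rng = np.random.default_rng() if rng is None else rng
    ratio = np.asarray(ratio, float)
    advantage = np.asarray(advantage, float)
    over = (advantage > 0) & (ratio > 1 + eps)     # clipped on the upper side
    under = (advantage < 0) & (ratio < 1 - eps)    # clipped on the lower side
    clipped = over | under
    near = (over & (ratio <= 1 + eps + delta)) | (under & (ratio >= 1 - eps - delta))
    rescued = near & (rng.random(ratio.shape) < keep_prob)
    return ~clipped | rescued
```

Setting `keep_prob=0` recovers plain hard clipping, while `keep_prob=1` deterministically widens the clip range by `delta`. Intermediate values give the stochastic behaviour that, per the abstract, resembles an implicit gradient decay in expectation.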

2605.22691 2026-05-22 cs.LG cond-mat.stat-mech 版本更新

Posterior Collapse as Automatic Spectral Pruning

后验坍缩作为自动谱剪枝

Johannes Hirn

发表机构 * Image Processing Laboratory (IPL), Universitat de València, Paterna, València 46980, Spain(图像处理实验室(IPL),瓦伦西亚大学,帕特尔纳,瓦伦西亚 46980,西班牙)

AI总结 本文研究了β-VAE中的后验坍缩现象,揭示其本质上是一种自动谱剪枝过程,通过分析不同β值下的均衡解,展示了潜在模式从最不有用的到最有用的逐步解耦的崩溃过程。

详情
AI中文摘要

我们证明了β-VAE中的后验坍缩实现了自动谱剪枝。如果一个潜在模式对重建的贡献低于由β设定的截断值,它就会坍缩。因此,不同β值下的平衡解呈现出一连串坍缩:潜在模式按从最无用到最有用的顺序依次解耦。我们通过Landau稳定性分析将这一现象作为损失函数的推论导出。我们定义了一个对潜变量重缩放不变的序参量,它对活跃的潜在模式进行排序,其坍缩阈值指明应优先检查哪些有效变量。在线性高斯情形下,坍缩谱、效用谱和归一化PCA谱三者一致,且每次坍缩都遵循平均场规律。我们在WorldClim数据集上检验了这些预测。

英文摘要

We show that posterior collapse in $\beta$-VAEs implements automatic spectral pruning. A latent mode collapses if its contribution to reconstruction is below the cutoff set by $\beta$. Equilibrium solutions with different $\beta$ thus reveal a cascade of collapses as latent modes decouple from least to most useful. We derive this as a consequence of the loss via a Landau stability analysis. We define a latent-rescaling-invariant order parameter that ranks active latent modes and whose collapse thresholds identify which effective variables to inspect first. In the linear Gaussian case, the collapse spectrum, utility spectrum, and normalized PCA spectrum coincide, and each collapse follows a mean-field law. We test these predictions on the WorldClim dataset.

2605.22679 2026-05-22 cs.CV cs.LG 版本更新

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

将嵌入概念化:面向视觉-语言模型的稀疏解缠

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

发表机构 * Faculty of Mathematics and Computer Science, Jagiellonian University(雅盖隆大学数学与计算机科学学院) Doctoral School of Exact and Natural Sciences, Jagiellonian University(雅盖隆大学精确与自然科学博士学校) Centre for Credible AI, Warsaw University of Technology(华沙技术大学可信人工智能中心)

AI总结 本文提出CEDAR方法,通过稀疏解缠技术在不增加维度的情况下揭示预训练嵌入的组成结构,从而提升视觉-语言模型的可解释性和与人类感知的一致性。

详情
AI中文摘要

视觉-语言模型学习了强大的多模态嵌入,但其内部语义仍然模糊。尽管稀疏自编码器(SAEs)可以提取可解释的特征,但它们依赖于扩展表示维度,这会破坏原始几何结构并引入冗余。我们引入CEDAR(通过自适应旋转进行概念嵌入解缠),一种事后方法,能够在不增加维度的情况下揭示预训练嵌入的组成结构。通过学习具有top-k稀疏瓶颈的可逆变换,CEDAR将语义信息集中到轴对齐的解缠坐标中。在CLIP-like架构中,单个坐标可以与文本概念进行解释,而对于生成模型如BLIP,它们可以解码为自然语言描述。实验表明,CEDAR在重建-稀疏性权衡方面具有竞争力,同时产生更可解释且更符合人类感知的解释。我们的结果表明,视觉-语言表示中的显性纠缠可以通过适当的基变换来解决,从而消除对过度扩展的需要。

英文摘要

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
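The contrast with overcomplete SAEs is easiest to see in code: the sparse code below has the same dimension as the input embedding. This sketch assumes the invertible transformation is an orthogonal rotation (suggested by the method's name but not confirmed by the abstract) and uses a fixed matrix where CEDAR would learn one. All function names are illustrative.

```python
import numpy as np

def top_k_sparsify(z, k):
    """Keep the k largest-magnitude coordinates of z and zero the rest."""
    z = np.asarray(z, float)
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

def rotate_sparsify_reconstruct(x, rotation, k):
    """Rotate x, apply the top-k bottleneck, and map back with the inverse.

    rotation: orthogonal d x d matrix, so its inverse is its transpose and the
              sparse code lives in the same d dimensions as x.
    """
    z_sparse = top_k_sparsify(rotation @ x, k)
    x_hat = rotation.T @ z_sparse
    return z_sparse, x_hat
```

Because the rotation preserves norms, the squared reconstruction error equals the squared norm of the coordinates that were dropped. A learned rotation therefore aims to pack as much of each embedding as possible into a few axes.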

2605.22666 2026-05-22 math.CO cs.LG math.PR 版本更新

Holographic functions and neural networks

全息函数与神经网络

Balazs Szegedy

发表机构 * Rényi Institute of Mathematics(雷尼数学研究所)

AI总结 本文研究了全息函数的复杂性,通过三种不同方法(采样性质、结构性质和计算性质)探讨了全息函数的复杂性界限,并证明了这三种性质在参数上是等价的。

详情
AI中文摘要

模糊布尔函数是映射 $f:\cube^n\to [0,1]$,其中 $n\in\mathbb N$。我们引入并比较了三种刻画此类函数具有有界复杂度的方式。第一种是采样性质:函数值 $f(x)$ 可以在小误差范围内、以高概率从 $x$ 的有界个随机选取的坐标的取值中恢复。我们称之为全息性质。第二种是结构性质:$f$ 一致接近于一个以有界多个有界线性坐标形式为变量的有界次数多项式。第三种是计算性质:$f$ 一致接近于一个神经网络的输出,该网络具有有界个非输入神经元、有界的Lipschitz激活函数和有界的输入权重。我们证明这三种性质在参数发生定量变化的意义下是等价的。从全息性到多项式结构的推导使用了超图正则性弱版本的一个变体。

英文摘要

A fuzzy Boolean function is a map $f:\cube^n\to [0,1]$, where $n\in\mathbb N$. We introduce and compare three ways of saying that such a function has bounded complexity. The first is a sampling property: the value $f(x)$ can be recovered, up to small error and with high probability, from the values of a bounded number of randomly chosen coordinates of $x$. We call this the holographic property. The second is a structural property: $f$ is uniformly close to a bounded-degree polynomial in boundedly many bounded linear coordinate forms. The third is computational: $f$ is uniformly close to the output of a neural network with a bounded number of non-input neurons, bounded Lipschitz activation functions and bounded incoming weights. We prove that these three properties are equivalent up to quantitative changes of the parameters. The implication from holography to polynomial structure uses a variant of a weak version of hypergraph regularity.
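A simple function that appears to fit the sampling property is the coordinate average $f(x) = \frac{1}{n}\sum_i x_i$ on $\{0,1\}^n$. This example is chosen here for illustration and does not come from the paper. By Hoeffding's inequality, averaging $m$ coordinates drawn uniformly with replacement misses $f(x)$ by $t$ or more with probability at most $2e^{-2mt^2}$, a bound that does not depend on $n$. A minimal sketch:

```python
import numpy as np

def coordinate_mean(x):
    """f(x): the fraction of coordinates of x equal to 1."""
    return float(np.mean(x))

def sampled_estimate(x, num_samples, rng):
    """Estimate f(x) from num_samples coordinates drawn uniformly with replacement."""
    idx = rng.integers(0, len(x), size=num_samples)
    return float(np.mean(x[idx]))
```

With `num_samples = 1000` and tolerance `t = 0.1`, the Hoeffding bound on the failure probability is $2e^{-20}$, whether `x` has a thousand coordinates or a billion.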

2605.22658 2026-05-22 cs.CV cs.LG cs.MM eess.IV 版本更新

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass: 探索通过稀疏自编码器实现可解释对齐以增强推理分割

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Peng Cheng Laboratory(鹏城实验室) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Meituan, Beijing(美团,北京) University of Chinese Academy of Sciences(中国科学院大学) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院)

AI总结 本文提出SegCompass,一种通过稀疏自编码器实现可解释对齐的端到端模型,以提升推理分割的性能和可解释性。

Comments Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables

详情
AI中文摘要

尽管大语言模型提供了强大的组合推理能力,但现有推理分割流程未能清晰地将这种推理与视觉感知连接起来。当前方法,如潜在查询对齐,虽然端到端但却是不透明的“黑箱”。相反,文本定位读出仅可读但不真正可解释,通常作为无约束的后处理步骤。为弥合这一可解释性差距,我们提出了SegCompass,一种端到端模型,利用稀疏自编码器(SAE)建立一个显式、可解释且可微的对齐路径。给定一个图像-指令对,SegCompass首先生成一个思维链(CoT)轨迹。该方法的核心是一个将CoT和视觉标记映射到共享高维稀疏概念空间的SAE。一个查询代码本从该空间中选择显著概念,然后通过槽映射器在空间上定位到多槽热图,引导最终的掩码解码器。整个模型联合训练,将强化学习用于推理路径与标准分割监督相结合。这种由SAE驱动的接口提供了显著比潜在查询更可追溯的“白盒”连接,比文本读出更连贯。在五个具有挑战性的基准测试中,SegCompass匹配或超越了最先进的性能。关键的是,我们的视觉和定量分析显示,所学稀疏概念的质量与最终掩码准确性之间存在强相关性,证实了SegCompass通过其增强且可检查的对齐实现了优越的结果。代码可在https://github.com/ZhenyuLU-Heliodore/SegCompass获取。

英文摘要

While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

2605.22653 2026-05-22 cs.DS cs.LG 版本更新

The Secretary Problem with a Stochastic Precursor

带随机前导的秘书问题

Franziska Eberle, Alexander Lindermayr

发表机构 * Institut für Mathematik, Technische Universität Berlin, Germany(柏林技术大学数学研究所,德国)

AI总结 本文研究了带随机前导的秘书问题,展示了预测仅因其到达时间而有价值。在随机顺序模型中,单个均匀时间的前导可使成功概率达到至少1/2,优于经典1/e的基准。在对抗性顺序模型中,足够集中的前导可恢复常数成功保证。

详情
AI中文摘要

在学习增强的在线算法中,预测通常因其提供的价值估计、解决方案或算法推荐而被重视。本文表明,预测仅因其到达时间而有价值。我们研究了带随机前导的秘书问题:一种无内容的信号,保证在最佳项目之前到达,但其他时间是随机的。该信号不携带额外信息;然而,其到达时间本身改变了最优停止策略的结构。我们分别在随机顺序和对抗性顺序模型中刻画了最优策略。在随机顺序中,单个均匀时间的前导可使成功概率达到至少1/2,优于经典1/e的基准。随着前导时间越来越晚,成功概率接近1。在对抗性顺序中,对于传统模型无法提供强保证的情况,足够集中的前导可恢复常数成功保证。我们的结果表明,这种新型的异步时间信息是在线决策中的独特且强大的建议形式,可能对其他问题也有效。

英文摘要

In learning-augmented online algorithms, predictions are usually valued for what they say: a value estimate, a solution, or an algorithmic recommendation. This paper shows that predictions can also be valuable solely due to their arrival time. We study the fundamental secretary problem augmented with a stochastic precursor: a content-free signal that is guaranteed to arrive no later than the best item, but is otherwise stochastically timed. The signal does not carry any additional information; nevertheless, its timing alone changes the structure of optimal stopping. We characterize optimal policies in the random-order and adversarial-order models. In random order, a single uniformly timed precursor already gives success probability at least $\frac12$, improving on the classic $\frac1e$ benchmark. With increasingly late precursors, the success probability approaches $1$. In adversarial order, for which traditional models do not admit strong guarantees, sufficiently concentrated precursors recover constant success guarantees. Our results show that such novel forms of asynchronous temporal information are a distinct and powerful form of advice in online decision making and may also be effective for other problems.
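The one-half figure can be reproduced in a small simulation under assumptions that go beyond the abstract. Here the precursor's slot is drawn uniformly from the positions up to and including the best item's, and the decision rule is "after the signal, accept the first item that beats everything seen so far". Both choices are illustrative and not necessarily the model or the optimal policy analysed in the paper.

```python
import random

def precursor_rule_succeeds(n, rng):
    """One random-order instance; True if the rule picks the best of n items."""
    values = list(range(n))            # larger is better; n - 1 is the best item
    rng.shuffle(values)
    best_pos = values.index(n - 1)
    # Assumption: the signal arrives just before the item at precursor_pos,
    # with precursor_pos uniform on 0..best_pos (never later than the best item).
    precursor_pos = rng.randint(0, best_pos)
    seen_max = max(values[:precursor_pos], default=-1)
    # After the signal, accept the first item better than everything before it.
    for v in values[precursor_pos:]:
        if v > seen_max:
            return v == n - 1
    return False

def success_rate(n, trials, seed=0):
    """Monte Carlo estimate of the success probability of the rule above."""
    rng = random.Random(seed)
    return sum(precursor_rule_succeeds(n, rng) for _ in range(trials)) / trials
```

Under these assumptions the rule wins exactly when the best item among those preceding the overall best arrived before the signal. A short calculation gives probability $\frac12 + \frac{1}{2n}$, consistent with the abstract's bound of at least $\frac12$ and well above the classical $\frac1e$.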

2605.22649 2026-05-22 cs.CV cs.LG 版本更新

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

从基线到随访:利用因果层次变分自编码器在UK Biobank中生成脊柱DXA图像

Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

发表机构 * School of Electronics and Computer Science(电子与计算机科学学院) University of Southampton(南安普顿大学) MRC Lifecourse Epidemiology Centre(英国医学研究理事会生命历程流行病学中心) University of Southampton, Southampton General Hospital(南安普顿大学南安普顿总医院) Computer Science University of Southampton(南安普顿大学计算机科学)

AI总结 本文提出了一种基于元数据的因果层次变分自编码器,用于在UK Biobank中生成一致的脊柱DXA图像,通过基线到随访的设置评估因果一致性,展示了年龄干预下关键椎体形态学变量的高一致性,支持了在解剖上合理的DXA图像合成。

Comments 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

详情
AI中文摘要

双能X射线吸收法(DXA)广泛用于大规模骨骼评估,但学习可控且可解释的、针对特定因素的解剖变异仍具挑战性。我们提出了一种以元数据为条件的因果层次变分自编码器(CHVAE),用于基于UK Biobank(UKB)因果一致地生成前后位(AP)脊柱DXA图像。模型在首次成像访问的3,743个原始AP脊柱扫描上训练,并以参与者基本属性和腰椎形态测量为条件。因果一致性在基线到随访的设置中通过溯因-行动-预测(abduction--action--prediction,AAP)进行评估:从基线图像中溯因推断潜在变量,将年龄干预为重复成像时的取值,然后将得到的反事实随访形态测量与观察到的重复成像测量进行比较。结果表明,在年龄干预下,关键椎体形态测量变量在绝对水平上具有较强的一致性,支持与干预对齐的、解剖上合理的DXA图像合成。

英文摘要

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction--action--prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.

2605.22644 2026-05-22 cs.LG 版本更新

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

为何SGD不是布朗运动:对随机动力学的新视角

Igor Ignashin, Anna Radovskaya, Andrew Semenov, Egor Lopatin, Stanislav Potapov, Aleksandr Kovalenko, Andrey Veprikov, Aleksandr Shestakov, Andrey Leonidov, Aleksandr Beznosikov

发表机构 * Basic Research of Artificial Intelligence Laboratory (BRAIn Lab)(人工智能基础研究实验室(BRAIn Lab)) Innopolis University(因诺普利斯大学) P.N. Lebedev Physical Institute of the Russian Academy of Sciences(俄罗斯科学院列别杰夫物理研究所)

AI总结 本文从离散更新出发,提出了一种将SGD视为在波动损失景观中确定性动力学的新方法,揭示了在临界点附近SGD的动力学行为,并通过实验验证了其在神经网络模型中的表现。

Comments Preprint

详情
AI中文摘要

随机梯度下降(SGD)通常被建模为朗之万(Langevin)过程,即假设小批量噪声的作用相当于布朗运动。然而,这种近似依赖于连续时间极限和sqrt(eta)噪声缩放,与有限学习率下的离散SGD更新并不匹配。本文提出一种替代表述,将SGD视为在由小批量采样引起的涨落损失景观中的确定性动力学。直接从离散更新出发,我们推导了参数分布的主方程,并得到一个离散福克-普朗克(Fokker--Planck)方程,它与标准朗之万形式在eta^2阶上不同。利用这一框架,我们分析了SGD在损失临界点附近的动力学。我们表明,其行为沿平均海森矩阵的特征基分解为定性不同的区域。特别地,近乎平坦的方向不存在平稳分布:方差随时间增长,对应于沿谷底的有效扩散,其系数与学习率成正比。我们在计算机视觉和自然语言处理的神经网络模型上提供了支持这些预测的实验证据,观察到受限模式与扩散模式之间明显的定性区分。

英文摘要

Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker--Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.

2605.22622 2026-05-22 cs.LG math.OC 版本更新

A note on convergence of Wasserstein policy optimization

关于Wasserstein策略优化收敛性的注记

David Šiška, Yufei Zhang

发表机构 * School of Mathematics, University of Edinburgh(爱丁堡大学数学学院) Department of Mathematics, Imperial College London(伦敦帝国理工学院数学系)

AI总结 本文探讨了Wasserstein策略优化在连续状态和动作空间中的收敛性问题,通过利用均场分析和log-Sobolev不等式,论证了在熵正则化的马尔可夫决策过程框架下,WPO算法能够线性收敛到全局最优解。

详情
AI中文摘要

Wasserstein Policy Optimization (WPO) 是一种最近提出的强化学习算法,利用Wasserstein梯度流来优化连续动作空间中的随机策略。尽管在实践中取得了成功,但在连续状态和动作空间的环境中,WPO的理论收敛性质尚未完全确立。在本文中,我们论证了在熵正则化的马尔可夫决策过程框架下,WPO能够线性收敛。这是通过利用均场分析中借助log-Sobolev不等式研究梯度流收敛的最新进展来实现的。假设梯度流方程存在足够正则的解,我们展示了沿流的能量单调耗散,并建立了局部log-Sobolev不等式。最终,这些性质使我们能够论证价值函数应线性收敛到全局最优解。

英文摘要

Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.

2605.22621 2026-05-22 cs.CR cs.LG cs.NI 版本更新

UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection

UNAD+: 一种用于未知网络攻击检测的可解释混合框架

Saif Alzubi, Frederic Stahl

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) German Research Center for Artificial Intelligence GmbH (DFKI)(德国人工智能研究中心(DFKI)) Marine Perception(海洋感知)

AI总结 本文提出UNAD+框架,结合无监督集成与监督精修阶段,通过集成可解释性层提升未知网络攻击检测的性能和透明度。

详情
AI中文摘要

对先前未见的网络攻击的检测仍然是入侵检测系统面临的主要挑战。尽管监督学习方法在已知攻击类别上通常表现良好,但当新的攻击类型未出现在训练数据中时,它们就会受到限制。无监督方法更适合检测零日攻击,因为它们不需要带标签的攻击样本,但通常误报率较高,限制了其实际可用性。本文提出了UNAD+,一个由先前提出的Unknown Network Attack Detector (UNAD)发展而来的增强型未知网络攻击检测框架。UNAD+由三部分组成:一个仅用良性样本训练并采用加权多数投票(WMV)的无监督集成、一个在伪标记检测结果上训练的监督精修阶段,以及一个同时提供局部和全局解释的事后可解释性层。该框架在CICIDS2017和NSL-KDD基准数据集上进行了评估。结果表明,UNAD+优于原始UNAD框架,在各基准数据集上取得了超过98%的F1分数,同时显著降低了误报,并通过集成的可解释性增强了透明度和部署适用性。

英文摘要

The detection of previously unseen network attacks remains a major challenge for intrusion detection systems. Although supervised learning methods often perform well on known attack classes, they are limited when new attack types are not represented in the training data. Unsupervised methods are more suitable for detecting zero-day attacks, as they do not require labelled attack samples, but they often suffer from high false positive rates, which limits their real-world usefulness. This paper presents UNAD+, an enhanced framework for unknown network attack detection derived from the previously proposed Unknown Network Attack Detector (UNAD). UNAD+ combines a benign-only unsupervised ensemble with Weighted Majority Voting (WMV), a supervised refinement stage trained on pseudo-labelled detections, and a post hoc explainability layer that provides both local and global explanations. The framework was evaluated on the CICIDS2017 and NSL-KDD benchmark datasets. The results show that UNAD+ improves on the original UNAD framework, achieving F1-scores above 98% across the benchmark datasets while significantly reducing false positives and enhancing transparency and deployment suitability through integrated explainability.
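
The Weighted Majority Voting step mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not UNAD+'s actual implementation: the function name, the 0/1 vote encoding, and the 0.5 threshold are all assumptions, and the abstract does not say how the detector weights are obtained.

```python
from typing import Sequence


def weighted_majority_vote(
    votes: Sequence[int],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> int:
    """Combine binary detector votes (1 = anomalous, 0 = benign).

    Returns 1 when the weighted share of anomaly votes exceeds `threshold`.
    The threshold value is a hypothetical choice for illustration.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    anomaly_weight = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if anomaly_weight / total > threshold else 0


if __name__ == "__main__":
    # One trusted detector flags the flow; two weaker ones do not.
    print(weighted_majority_vote([1, 0, 0], [0.6, 0.2, 0.2]))
    # With equal weights the same votes are outvoted.
    print(weighted_majority_vote([1, 0, 0], [1.0, 1.0, 1.0]))
```

In a pipeline like the one described, flows flagged this way would serve as pseudo-labels for the supervised refinement stage.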

2605.22620 2026-05-22 cs.LG cs.CL 版本更新

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

两个优于一个:一种无崩溃的多奖励RLIF训练框架

Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

发表机构 * Bangladesh University of Engineering and Technology(孟加拉工程科技大学) West Virginia University(西弗吉尼亚大学) University of Aberdeen(阿伯丁大学) Fogsphere (Redev.AI Ltd, UK)(Fogsphere(Redev.AI Ltd,英国)) University College London(伦敦大学学院)

AI总结 本文提出一种多奖励RLIF框架,通过分解训练信号为答案级奖励和完成级奖励,并结合GDPO归一化和KL-Cov正则化,提升稳定性和鲁棒性,同时在数学推理和代码生成任务中接近监督RLVR方法的性能。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)显著提升了大语言模型的推理能力,但通常依赖于来自人工标注或标准答案的外部监督。最近,基于内部反馈的强化学习(RLIF)作为一种可扩展的无监督替代方法出现,利用从模型自身提取的信号。然而,现有RLIF方法通常依赖单一内部奖励,可能导致奖励黑客、熵崩溃和推理结构退化。我们提出一种多奖励RLIF框架,将训练信号分解为两个互补成分:基于聚类投票的答案级奖励和基于逐token自确定性(self-certainty)的完成级奖励。为了稳健地结合这些信号,我们应用基于GDPO的归一化以减少奖励尺度不平衡。我们进一步引入KL-Cov正则化,针对导致不成比例熵减少的低熵token分布,保持探索并防止后期崩溃。在数学推理和代码生成基准上,我们的方法相比以往的无监督RL方法提高了稳定性和鲁棒性,同时在性能上接近监督RLVR方法。这些结果表明,互补的内部奖励结合针对性正则化可以支持稳定的长周期推理,而无需依赖外部真实监督。代码将很快发布。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
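
The two-reward decomposition can be illustrated with a small sketch. Several parts are assumptions made only for illustration: answers are clustered by exact string match, the self-certainty scores are supplied as precomputed numbers, and "GDPO-based normalization" is taken to mean standardising each reward separately within a group of sampled completions before summing. The abstract does not spell out any of these details.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence


def answer_level_rewards(answers: Sequence[str]) -> List[float]:
    """Reward 1.0 for completions whose answer is in the largest cluster.

    Clustering here is exact string match (an assumption); ties go to the
    answer that appears first.
    """
    majority, _count = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardise one reward stream within the group (zero mean, unit scale)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(
    answers: Sequence[str], self_certainty: Sequence[float]
) -> List[float]:
    """Normalise each reward separately, then sum, so neither scale dominates."""
    a = normalize(answer_level_rewards(answers))
    c = normalize(self_certainty)
    return [x + y for x, y in zip(a, c)]


if __name__ == "__main__":
    group_answers = ["4", "4", "5", "4"]
    group_certainty = [12.0, 9.5, 14.0, 10.0]  # hypothetical raw scores
    print(combined_advantages(group_answers, group_certainty))
```

Normalising before summing is what keeps a large-magnitude signal (here the raw certainty scores) from swamping the 0/1 voting reward.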

2605.22613 2026-05-22 cs.LG 版本更新

Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

为LLM引导的程序发现设计的进化多任务优化

Halil Alperen Gozeten, Xuechen Zhang, Emrullah Ildiz, Ege Onur Taga, Tara Javidi, Samet Oymak

发表机构 * University of Michigan - Ann Arbor(密歇根大学安娜堡分校) University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出了一种进化多任务优化(EMO)方法,用于LLM引导的程序发现,通过两个阶段框架EMO-STA(共享后适应)在多个任务家族中提高了程序发现的效率和鲁棒性,同时展示了共享进化在减少过拟合方面的优势。

详情
AI中文摘要

最近的LLM引导的进化搜索方法表明,迭代程序变异可以发现强大的算法,但它们通常独立地优化每个任务,即使相关任务共享可重用的结构。我们介绍了用于LLM引导的程序发现的进化多任务优化(EMO),并提出了EMO-STA(Shared-Then-Adapt,共享后适应),这是一个两阶段框架:首先在任务家族中进化一个可执行程序的共享档案,然后将选定的共享候选程序适应到每个目标任务。在EMO-STA中,我们探索了多种适应策略,包括从共享档案中进行预热启动、适应平均表现最佳的共享程序,以及适应在每个目标任务上表现最佳的共享程序。在八个跨越连续优化、几何构造、建模和算法优化的任务家族中,EMO-STA在大多数设置中优于计算量相当的单任务进化,其中STA Best-Local在分布内适应最强,而STA Best-Shared能够稳健地迁移到未见过的任务。计算分配实验表明,将任务家族级预算中相当大的一部分分配给共享进化始终是有益的,大致平衡的共享与适应预算往往是最优的。除了计算效率外,我们还展示了共享进化可以缓解低证据设置(例如训练数据很少)中的过拟合,包括ARC任务和时间序列特征工程,其方式是优先选择能够跨所有任务泛化的程序,而不是利用任务特定的脆弱特征。

英文摘要

Recent LLM-guided evolutionary search methods have shown that iterative program mutation can discover strong algorithms, but they typically optimize each task independently, even when related tasks share reusable structure. We introduce Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, and propose EMO-STA (Shared-Then-Adapt), a two-stage framework that first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task. Within EMO-STA, we explore multiple adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the shared program that performs best on each target task. Across eight task families spanning continuous optimization, geometric construction, modeling, and algorithmic optimization, EMO-STA improves over matched-compute single-task evolution in most settings, with STA Best-Local providing the strongest in-distribution adaptation and STA Best-Shared yielding robust transfer to unseen tasks. Compute-allocation experiments show that allocating a substantial fraction of the family-level budget to shared evolution is consistently beneficial, with roughly balanced shared and adaptation budgets often being optimal. Beyond compute efficiency, we show that shared evolution can mitigate overfitting in low-evidence settings (e.g. few training data), including ARC tasks and time-series feature engineering, by favoring programs that generalize across all tasks rather than exploiting task-specific brittle artifacts.

2605.22612 2026-05-22 cs.CY cs.AI cs.LG 版本更新

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

医疗LLM基准测试的可靠性仅取决于其显式假设

Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

发表机构 * Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学)

AI总结 本文提出医疗LLM基准测试的评估-部署差距源于隐式假设,而非基准设计问题,并通过BenchmarkCards和分阶段评估方法来解决这一问题。

Comments 13 pages, 1 figure

详情
AI中文摘要

基准测试对于医疗评估是必要的,但不足以预测部署性能。我们的观点是,评估-部署差距并非源于基准设计不当,而是源于关于用户如何与模型交互的隐式假设,这些假设无法仅通过基准测试本身来揭示。为了使这一观点更明确,我们提出了将假设分为两类的分类:任务假设,可通过对话数据单独测试;以及结果假设,需要结果数据和行为研究来测试。关键的是,结果假设依赖于人类行为,即使设计良好的基准也无法直接观察。为了证明该框架的实用性,我们回顾性分析了一个医疗RCT作为案例研究,并发现差距自然分为大致相等的任务和结果差距。为此,我们做出了两项贡献:首先,我们提出BenchmarkCards,一种记录假设的工具;其次,我们提出分阶段评估,一种系统测试假设并评估性能的程序。

英文摘要

Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation--deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

2605.22611 2026-05-22 cs.LG 版本更新

Benchmarking Machine Learning Architectures for Antimicrobial Stewardship in Pediatric ICUs

对儿科ICU中抗菌药物使用管理的机器学习架构进行基准测试

Niklas Raehse, Luregn J. Schlapbach, Daphné Chopard

发表机构 * Department of Intensive Care and Neonatology and Children’s Research Center University of Zurich University Children’s Hospital Zurich(重症护理与新生儿科及儿童研究中心,苏黎世大学苏黎世儿童医院) Department of Health Sciences and Technology ETH Zurich(健康科学与技术系,苏黎世联邦理工学院) Department of Computer Science ETH Zurich(计算机科学系,苏黎世联邦理工学院)

AI总结 本研究针对儿科ICU中抗菌药物使用管理的机器学习模型进行基准测试,通过公共数据集和私人机构队列系统评估了四种临床相关的目标,发现预测性能主要由目标流行率和数据集特征决定,而非模型复杂度,序列模型在粗粒度下提升了精度-召回权衡,但细粒度建模带来的收益有限,且校准效果较差。

Comments 16 pages, 6 figures, code: https://anonymous.4open.science/r/AMS_intervention_prediction-C024

详情
AI中文摘要

抗菌药物使用管理(AMS)在儿科重症监护室(PICUs)中至关重要,其中诊断不确定性常导致广谱抗生素使用,增加抗菌药物耐药性和潜在的长期危害。机器学习为从电子健康记录数据中识别患者层面的使用管理干预机会提供了有前途的方法,但以往研究主要集中在成人群体和静态表格表示上。我们展示了在PICU中对AMS干预预测的系统性基准研究,涵盖了公共数据集和私人机构队列。我们定义了四个临床相关的代理目标以减少抗生素暴露:静脉到口服转换、降级、停用和短程治疗。在统一的评估框架下,我们比较了表格、基于序列和基于图的时序模型在多个时间分辨率下的表现。我们发现,预测性能主要由目标流行率和数据集特征驱动,而非模型复杂度。序列模型在粗粒度(24小时)下比表格方法在精度-召回权衡上有所提升,而更精细的时间建模提供有限的额外收益。然而,这些收益是以较差的校准为代价的,更简单的表格模型产生更可靠的概率估计。多任务学习仅产生微小改进,表明在使用管理目标之间共享结构有限。我们的发现强调了目标设计、时间表示和校准在临床机器学习中的重要性,并为开发可靠的决策支持系统提供实用指导。

英文摘要

Antimicrobial stewardship (AMS) is critical in pediatric intensive care units (PICUs), where diagnostic uncertainty often drives broad-spectrum antibiotic use, increasing antimicrobial resistance and potential long-term harms. Machine learning offers a promising approach for identifying patient-level opportunities for stewardship interventions from electronic health record data, yet prior work has focused largely on adult populations and static tabular representations. We present a systematic benchmarking study of AMS intervention prediction in the PICU across a public dataset and a private institutional cohort. We define four clinically relevant proxy targets for reducing antibiotic exposure: intravenous-to-oral switching, de-escalation, discontinuation, and short-course therapy. Under a unified evaluation framework, we compare tabular, sequence-based, and graph-based temporal models at multiple temporal resolutions. We find that predictive performance is driven primarily by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off over tabular approaches at coarse (24-hour) resolution, while finer temporal modeling provides limited additional benefit. However, these gains come at the cost of poorer calibration, with simpler tabular models yielding more reliable probability estimates. Multi-task learning produces only marginal improvements, suggesting limited shared structure across stewardship targets. Our findings highlight the importance of target design, temporal representation, and calibration in clinical machine learning, and provide practical guidance for developing reliable decision support systems for pediatric AMS.

2605.22604 2026-05-22 cs.CR cs.AI cs.LG cs.SE 版本更新

Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

无卡人工智能银行业创新:基于机器学习算法的全面框架用于网络安全与欺诈防范

Md Israfeel

发表机构 * Computer Engineering, University of Central Florida, Orlando, Florida, USA

AI总结 本文提出了一种全面的框架,利用机器学习算法增强无卡人工智能银行系统的网络安全和欺诈防范能力,通过AI驱动的数据加密生成虚拟卡,减少信息泄露风险。

详情
AI中文摘要

无卡人工智能(AI)银行业的发展标志着金融领域的一次范式转变,为用户提供前所未有的安全性和便利性。本文概述了一个全面的框架,旨在增强网络安全,引入自动生成的虚拟卡,并在无卡AI银行系统中减轻欺诈风险。该框架设想了一种未来银行架构,利用AI驱动的数据加密技术来创建安全的虚拟卡以实现无缝交易。通过强调安全的通信渠道,它确保了银行系统、持卡人和第三方供应商之间的金融活动的完整性。基于AI的授权方法在验证每一笔交易的同时,主动识别潜在欺诈,展示了该框架在加强无卡AI银行业安全方面的有效性。初始方法,包含一个AI驱动的基于特征的银行系统,确保生成带有加密数据的虚拟卡,减少信息暴露并降低欺诈风险。整合机器学习算法为潜在的欺诈活动增加了一层保护。最后,所提出的框架为无卡AI银行系统建立了一个全面的网络安全和欺诈防范范式。其实施使金融机构能够应对传统银行业相关的安全问题,为一个不仅抗欺诈而且对用户安全和方便的未来银行业景观铺平道路。

英文摘要

The advent of cardless artificial intelligence (AI) banking heralds a paradigm shift in the financial landscape, offering users unprecedented security and convenience. This paper outlines a comprehensive framework designed to enhance cybersecurity, introduce auto-generated virtual cards, and mitigate fraud risks within cardless AI banking systems. The framework envisions a future banking architecture that employs AI-powered data cryptography to create secure virtual cards for seamless transactions. By emphasizing secure communication channels, it ensures the integrity of financial activities among banking systems, cardholders, and third-party vendors. AI-based authorization methodologies play a pivotal role in authenticating each transaction while proactively identifying potential fraud, demonstrating the framework's efficacy in fortifying cardless AI banking security. The initial approach, featuring an AI-driven, feature-based banking system, ensures the generation of virtual cards with encrypted data, minimizing information exposure and reducing fraud risks. Integrating a machine learning algorithm adds an additional layer of protection against potential fraudulent activities. In conclusion, the proposed framework establishes a holistic cybersecurity and fraud-mitigation paradigm for cardless AI banking systems. Its implementation empowers financial institutions to address security concerns associated with traditional banking, paving the way for a future banking landscape that is not only fraud-resistant but also secure and convenient for users.

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO 版本更新

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

MoSA: 通过学习残余各向异性来缓解连续动力学中现实到模拟差距的运动约束应力适应

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

发表机构 * Hong Kong University of Science(香港科学大学) MMLab, Chinese University of Hong Kong, Hong Kong SAR(香港中文大学MMLab, 香港特别行政区) The University of Hong Kong, Hong Kong SAR(香港大学, 香港特别行政区)

AI总结 本文提出MoSA框架,通过运动约束应力适应来缓解连续动力学中现实到模拟差距,利用各向同性模型作为物理先验,并学习残余应力算子以捕捉轻微各向异性和非均匀性,最终在机器人操作中验证了其有效性。

详情
Journal ref
International Conference on Machine Learning 2026
AI中文摘要

从视觉观测中学习现实世界的动力学对于各种领域至关重要。一种常见策略是通过估计物理参数来校准模拟器,但准确性最终受限于底层物理模型,这些模型通常假设材料是均质且各向同性的。即便这一假设合理,现实中的物体通常仍表现出轻微的各向异性和非均匀性。在近各向同性的主干模型得到良好校准后,这些残余效应成为进一步缩小现实到模拟差距的关键瓶颈。虽然神经网络可以端到端地拟合动力学,但这种黑盒建模会丢弃强物理先验,导致数据效率低和过拟合。因此,我们提出了MoSA,一种运动约束应力适应框架,针对这些残余效应以进一步改进现实到模拟的动力学学习。MoSA使用各向同性模型作为物理先验,并学习残余应力算子以捕捉轻微各向异性和非均匀性。它在一个物理信息级联网络中,通过微平面约束的再分布逐步适应应力。我们进一步通过监督变形场的时间和空间导数来施加运动约束。实验表明,我们学习的动力学在准确性、泛化性和鲁棒性方面均表现更优,同时学习到了具有物理意义的残余各向异性。最后,我们在机器人操作设置中验证了MoSA,显示更好的现实到模拟动力学建模能够转化为更可靠的模拟到现实迁移。项目页面可在https://mercerai.github.io/MoSA/上获取。

英文摘要

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

2605.22596 2026-05-22 cs.LG 版本更新

Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network

因子化扩散策略:基于单一分数网络的组合泛化机器人控制

Sayan Mitra, Ege Yuceel, Noah Giles, Abhishek Pai

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出了一种因子扩散策略,通过单一共享的扩散网络实现通用机器人控制,该网络在推理时能将分数分解为各因子的加法形式,从而在训练任务预算上从因子基数的乘积减少到求和,通过轨迹管证书将分数界转化为闭环状态轨迹管,实验验证了其泛化界和证书的有效性。

详情
AI中文摘要

机器人任务通常由一组因子来指定,如要抓取的对象、要避开的障碍物、目标的颜色等。为每一种因子取值组合收集专家示范,其数量会呈组合式增长。我们提出了因子化扩散策略:一个单一共享的扩散网络,通过每个因子的空标记dropout进行训练,在推理时其分数可以跨因子加性分解。在给定动作-观测对的条件下,若因子之间近似条件独立,这种组合能以有界的一致误差近似真实联合分数,从而将训练任务预算从因子基数的乘积减少到求和。轨迹管证书将此分数层面的界通过反向时间采样ODE和一个收缩跟踪控制器,转化为闭环状态轨迹管,其半径分解为ODE敏感性常数和每个因子的分数误差预算。不同于将单独训练的网络组合在一起的组合扩散控制方法,我们使用一个共享网络。无人机竞速实验验证了泛化界和证书的有效性。在基于状态的多门竞速中,因子化策略通过了90%的留出门(与oracle持平),而K网络组合基线则下降到3%;在基于视觉的单门穿越中,它能够零样本迁移至未见场地,成功率提升11.7个百分点,坠毁率降低2.4倍。

英文摘要

Robotic tasks are typically specified by a tuple of factors, such as the object to be grasped, the obstacles to be avoided, the color of the target, and so on. Collecting expert demonstrations for every combination of factor values grows combinatorially. We present factored diffusion policies: a single shared diffusion network trained with per-factor null-token dropout, whose score decomposes additively across factors at inference. Under approximate conditional independence between factors given the action-observation pair, this composition approximates the true joint score with a bounded uniform error, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate chains this score-level bound through the reverse-time sampling ODE and a contracting tracking controller into a closed-loop state-trajectory tube whose radius factors into an ODE-sensitivity constant and a per-factor score-error budget. Unlike compositional-diffusion methods for control that combine separately trained networks, we use one shared network. Drone racing experiments confirm both the generalization bound and the certificate. On state-based multi-gate racing, the factored policy passes 90% of held-out gates -- matching an oracle -- while a K-network composition baseline collapses to 3%; on vision-based single-gate traversal, it transfers zero-shot to an unseen venue with +11.7pp success-rate gain and 2.4X crash-rate reduction.
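
To make the additive composition and the product-to-sum budget concrete, here is a small sketch. The composition rule used (unconditional score plus one correction per factor) is a common form under approximate conditional independence and is an assumption here; the abstract only states that the score "decomposes additively across factors". The function names and toy numbers are hypothetical.

```python
from math import prod
from typing import List, Sequence, Tuple


def compose_score(
    uncond: Sequence[float], per_factor: Sequence[Sequence[float]]
) -> List[float]:
    """Approximate the joint score as s(a) + sum_k (s(a | c_k) - s(a)).

    `uncond` is the score with every factor replaced by its null token;
    each entry of `per_factor` is the score with only factor k conditioned.
    All scores come from the same shared network in the described setup.
    """
    composed = list(uncond)
    for s_k in per_factor:
        for i, (cond_i, uncond_i) in enumerate(zip(s_k, uncond)):
            composed[i] += cond_i - uncond_i
    return composed


def task_budgets(cardinalities: Sequence[int]) -> Tuple[int, int]:
    """Return (tasks for every combination, tasks for one factor at a time)."""
    return prod(cardinalities), sum(cardinalities)


if __name__ == "__main__":
    # Two-dimensional action score, two factors (toy values).
    print(compose_score([1.0, 0.0], [[2.0, 0.0], [1.0, 3.0]]))
    # Factors with 4, 5 and 3 possible values.
    print(task_budgets([4, 5, 3]))
```

With cardinalities 4, 5 and 3, covering every combination needs 60 training tasks, while covering each factor value once needs 12.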

2605.22593 2026-05-22 cs.LG 版本更新

Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

深度集成是否真的在图神经网络中捕捉了不确定性?

Pedro C. Vieira, Pedro Ribeiro, Viacheslav Borovitskiy

发表机构 * University of Edinburgh(爱丁堡大学) DCC/FCUP, University of Porto(波尔图大学理学院计算机科学系)

AI总结 本文研究了深度集成在图神经网络中的有效性,发现其在不确定性量化中效果有限,主要归因于模型优化噪声的稳定而非不确定性估计的提升,揭示了集成崩溃现象。

详情
AI中文摘要

尽管深度集成被广泛认为是深度学习中不确定性量化的默认方法,但其在图结构数据上的有效性往往只是基于计算机视觉等领域的成功经验而被简单假设。我们专门研究了消息传递图神经网络中的标准深度集成。在七个代表不同任务和复杂度的数据集上进行基准测试,我们发现集成相对于单个模型带来的改进出人意料地小。观察到的边际收益主要来自稳定点预测中的优化噪声,而非产生明显更好的不确定性估计。通过偶然性-认知性(aleatoric-epistemic)分解,我们识别出认知崩溃:独立训练的网络一致地收敛到过于相似的预测。由于分歧是集成捕捉认知不确定性的根本机制,这种多样性的缺乏抵消了其关键优势。进一步分析这一现象,我们认为这种崩溃是由函数空间而非权重空间的凸性驱动的,即不同的参数解诱导出几乎相同的行为。我们的结果表明,深度集成的成功并不能无缝迁移到图机器学习中。

英文摘要

While deep ensembles are widely considered to be the default method for uncertainty quantification in deep learning, their effectiveness for graph-structured data is often simply assumed based on successes in domains like computer vision. We investigate standard deep ensembles specifically for message-passing graph neural networks. Benchmarking across seven datasets representing varied tasks and complexities, we reveal that ensembles provide surprisingly little improvement over a single model. Instead, the observed marginal gains stem primarily from stabilizing optimization noise in point predictions rather than yielding meaningfully better uncertainty estimates. Through an aleatoric-epistemic decomposition, we identify epistemic collapse: independently trained networks consistently converge to overly similar predictions. Because disagreement is the fundamental mechanism through which ensembles capture epistemic uncertainty, this lack of diversity neutralizes their key advantage. Analyzing this phenomenon further, we suggest this collapse is driven by functional rather than weight-space convexity, where distinct parameter solutions induce almost identical behavior. Our results suggest that deep ensemble success does not seamlessly transfer to graph machine learning.
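
The aleatoric-epistemic decomposition referred to above is commonly computed from the ensemble members' predictive distributions: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average member entropy, and epistemic uncertainty is their difference (the mutual information). The sketch below uses this standard entropy-based form; whether the paper uses exactly this variant is an assumption.

```python
import math
from typing import Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)


def decompose(member_probs: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for one input.

    `member_probs[m][k]` is the probability member m assigns to class k.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_p = [
        sum(member[k] for member in member_probs) / n_members
        for k in range(n_classes)
    ]
    total = entropy(mean_p)
    aleatoric = sum(entropy(member) for member in member_probs) / n_members
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    # Members agree: uncertainty is purely aleatoric, epistemic term vanishes.
    print(decompose([[0.7, 0.3], [0.7, 0.3]]))
    # Members disagree confidently: uncertainty is purely epistemic.
    print(decompose([[1.0, 0.0], [0.0, 1.0]]))
```

The first case is what epistemic collapse looks like in this decomposition: when members make near-identical predictions, the disagreement term is close to zero regardless of ensemble size.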

2605.22581 2026-05-22 cs.CV cs.AI cs.LG 版本更新

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

SceneAligner: 在真实场景中实现基于3D的平面图定位

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

发表机构 * Cornell University(康奈尔大学) Kempner Institute, Harvard University(哈佛大学Kempner研究所)

AI总结 本文提出了一种在真实场景中实现平面图定位的方法,通过将任务锚定在场景的重建3D表示中,解决了现有方法难以应用于大规模建筑和栅格化平面图的问题。

Comments Project Page: https://Cornell-VAILab.github.io/SceneAligner

详情
AI中文摘要

许多公共建筑提供带有'你在这里'指示器的平面图,以帮助游客辨别方位。平面图定位旨在通过确定视觉观测拍摄于平面图中的哪个位置,以计算方式复现这一能力。然而,现有方法通常假设受控的小规模环境和精确的矢量化平面图,限制了它们在大规模建筑和栅格化平面图中的应用能力。在本文中,我们提出了一种在真实场景中实现平面图定位的方法,通过将任务锚定在场景的重建3D表示中。给定一组无约束的图像集合,我们的方法重建一个重力对齐的3D场景,并将其投影为2D密度图,作为平面图的代理。平面图定位则被表述为通过2D相似变换将该代理与输入平面图对齐。为了弥合密度图与建筑平面图之间的外观差距,我们适配了一个2D基础模型来学习跨模态的对应关系,并引入了一种微调方案,在鼓励语义对齐匹配的同时保持结构一致性。大量实验表明,与先前方法相比有显著的改进,包括在极稀疏设置中,甚至仅使用单张输入图像时。我们的代码和数据将公开提供。

英文摘要

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
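
The two geometric steps named in the abstract, projecting a gravity-aligned point cloud into a 2D density map and aligning it with a floorplan through a 2D similarity transform, can be sketched as below. This is only an illustration: it assumes the z axis is vertical so projection simply drops z, bins points on a regular grid, and takes the transform parameters as given, whereas the described method estimates them from learned cross-modal correspondences.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def density_map(
    points: Sequence[Point3D], cell: float, width: int, height: int
) -> List[List[int]]:
    """Count points per grid cell after dropping the vertical (z) coordinate."""
    grid = [[0] * width for _ in range(height)]
    for x, y, _z in points:
        col = int(x // cell)
        row = int(y // cell)
        if 0 <= col < width and 0 <= row < height:
            grid[row][col] += 1
    return grid


def similarity_transform(
    point: Point2D, scale: float, theta: float, tx: float, ty: float
) -> Point2D:
    """Apply uniform scale, rotation by `theta` (radians), then translation."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)


if __name__ == "__main__":
    cloud = [(0.5, 0.5, 1.0), (0.6, 0.4, 2.0), (1.5, 0.2, 0.0)]
    print(density_map(cloud, cell=1.0, width=2, height=1))
    print(similarity_transform((1.0, 0.0), scale=2.0, theta=math.pi / 2, tx=1.0, ty=1.0))
```

A similarity transform has four degrees of freedom (scale, rotation, two translations), which is why a proxy map in arbitrary reconstruction units can still be registered to a floorplan of unknown scale.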

2605.22566 2026-05-22 cs.LG 版本更新

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

GraphFlow: 一种用于高效LLM代理服务的基于图的工作流管理方法

Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su

发表机构 * School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, China(西安交通大学网络空间安全学院) School of Artificial Intelligence, Shandong University, Jinan, China(山东大学人工智能学院) Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai, China(中国科学院上海高等研究院)

AI总结 本文提出了一种基于图的工作流管理系统GraphFlow,通过统一图结构wGraph动态生成任务特定的工作流,并利用该结构管理KV缓存;在五个基准数据集上,其平均性能比现有最优方法提升约4.95个百分点,内存占用减少约4倍。

Comments Accepted to ICML 2026

详情
AI中文摘要

基于大型语言模型(LLM)的代理在结构化指令(通常称为工作流)的引导下,在复杂任务上表现出强大的推理和执行能力。然而,现有的工作流辅助代理服务系统通常依赖于预定义模板和浅层匹配机制,限制了它们捕捉深层语义关系和泛化到以前未见过的任务的能力。为了解决这些限制,我们提出了一种新的工作流管理范式,用一个统一的图来表示工作流,称为wGraph,其中每个节点对应一个原子操作。wGraph作为共享的基底,任务特定的工作流从中动态实例化。基于wGraph原语,我们引入了GraphFlow系统,通过两个关键设计高效地将工作流整合到代理服务中。首先,自适应工作流生成根据任务语义和约束要求从wGraph动态构建工作流。其次,工作流状态管理利用wGraph结构高效管理键值(KV)缓存,减少代理服务中的冗余计算。在五个基准数据集上的大量实验表明,GraphFlow始终优于最先进的方法,平均性能提升约4.95个百分点,同时内存占用减少约4倍。

英文摘要

Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4$\times$ reduction in memory footprint.

2605.22564 2026-05-22 cs.CL cs.LG cs.SE 版本更新

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

SynAE: 一个用于评估工具调用代理合成数据质量的框架

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Microsoft Research(微软研究院)

AI总结 本文提出SynAE框架,用于评估多轮工具调用代理合成数据的质量,通过四个指标类别评估合成数据的有效性、保真度和多样性,揭示单一指标不足以全面表征合成数据质量。

详情
AI中文摘要

如今,工具调用代理通常在静态执行轨迹数据集上进行评估或测试,包括输入命令、代理响应和相关工具调用。然而,内部生产数据集往往不足或无法使用;例如,它们可能包含敏感或专有数据,或过于稀疏,无法支持全面测试(尤其是预部署前)。在这些情况下,实践者越来越多地用合成数据替代或补充真实数据进行评估。关键挑战是量化这些合成数据集与真实数据之间的关系。我们介绍了SynAE,一个用于评估多轮工具调用代理合成基准如何复制和增强真实数据轨迹特征的评估框架。SynAE在四个指标类别中评估合成数据的效度、保真度和多样性:(i)任务指令和中间响应,(ii)工具调用,(iii)最终输出,(iv)下游评估。我们通过近期代理基准评估SynAE,并通过现实且受控的生成方案测试常见的合成数据失败模式。SynAE能够检测数据效度、保真度和多样性的细粒度变化,并表明没有单一指标足以全面表征合成数据质量,从而推动对合成数据的多轴评估。SynAE的演示可在https://synae-2026-synae-demo.static.hf.space/index.html获取,代码在https://github.com/wsqwsq/SynAE。

英文摘要

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

2605.22561 2026-05-22 cs.LG 版本更新

Regret-Based $(ε,δ)$-optimal Stopping Criteria for Bayesian Optimization

基于遗憾的贝叶斯优化(ε,δ)-最优停止准则

Haowei Wang, Jingyi Wang, Qiyu Wei

发表机构 * National University of Singapore(新加坡国立大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) The University of Manchester(曼彻斯特大学)

AI总结 本文提出了一种基于更紧的高斯过程上置信界(GP-UCB)即时遗憾界限的停止准则,确保在终止时以高概率1-δ获得ε-最优解,并通过数值实验验证其有效性。

Comments 21 pages

详情
AI中文摘要

贝叶斯优化(BO)是一种广泛使用的迭代黑盒优化方法,利用高斯过程(GP)替代模型。在实践中,BO通常在耗尽固定评估预算后终止,这可能导致不必要的成本,并且无法对解的质量提供最优性保证。最近的研究在开发实用的停止准则方面取得了实证进展,但理论上可靠的停止准则仍有待完善。在本文中,我们给出了GP上置信界(GP-UCB)在任意给定迭代中可证明更紧的即时遗憾界。然后,我们基于此更紧的界提出GP-UCB的停止准则,确保终止时以高概率1-δ获得ε-最优解。我们通过数值实验验证并展示了所提停止准则的有效性和效率。

英文摘要

Bayesian optimization (BO) is a widely used iterative black-box optimization method that utilizes Gaussian process (GP) surrogate models. In practice, BO is typically terminated after a fixed evaluation budget is exhausted, which can incur unnecessary cost and provides no optimality guarantee on solution quality. Recent research in developing a practical stopping criterion has made empirical progress, yet a theoretically sound stopping criterion remains a work in progress. In this work, we present provably tighter instantaneous regret bounds for GP upper confidence bound (GP-UCB) at any given iteration. Then, we propose stopping criteria for GP-UCB based on this tighter bound that ensures an $ε$-optimal solution with high probability $1-δ$ upon termination. Numerical experiments are performed to validate and demonstrate the effectiveness and efficiency of our stopping criteria.
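
The general shape of a regret-based stopping rule can be sketched on a finite candidate set. This uses the generic gap between the largest upper confidence bound and the best lower confidence bound among evaluated points, not the tighter bound derived in the paper; the confidence multiplier `beta` is assumed to be chosen so that the bounds hold with probability at least $1-δ$, and all names are hypothetical.

```python
import math
from typing import Sequence


def regret_upper_bound(
    mu: Sequence[float],
    sigma: Sequence[float],
    evaluated: Sequence[int],
    beta: float,
) -> float:
    """Bound the simple regret of the best evaluated candidate (maximisation).

    mu[i], sigma[i]: GP posterior mean and standard deviation at candidate i.
    evaluated: indices of candidates that have already been queried.
    If |f(x) - mu(x)| <= sqrt(beta) * sigma(x) holds for every candidate,
    the optimum is at most max UCB and the incumbent is at least its LCB.
    """
    root = math.sqrt(beta)
    ucb_max = max(m + root * s for m, s in zip(mu, sigma))
    lcb_best = max(mu[i] - root * sigma[i] for i in evaluated)
    return ucb_max - lcb_best


def should_stop(
    mu: Sequence[float],
    sigma: Sequence[float],
    evaluated: Sequence[int],
    beta: float,
    epsilon: float,
) -> bool:
    """Stop once the regret bound certifies an epsilon-optimal incumbent."""
    return regret_upper_bound(mu, sigma, evaluated, beta) <= epsilon


if __name__ == "__main__":
    mu = [0.0, 1.0, 0.5]
    sigma = [0.5, 0.01, 0.01]
    print(should_stop(mu, sigma, evaluated=[1, 2], beta=4.0, epsilon=0.05))
    print(should_stop(mu, sigma, evaluated=[1, 2], beta=4.0, epsilon=0.01))
```

A tighter instantaneous regret bound, as the abstract describes, would let the same kind of test fire earlier for a given $ε$.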

2605.22556 2026-05-22 cs.LG 版本更新

ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation

ImplicitTerrainV2: 基于小波引导的空间自适应神经地形表示

Haoan Feng, Xin Xu, Leila De Floriani

发表机构 * University of Maryland, College Park(马里兰大学学院市分校)

AI总结 本文提出ImplicitTerrainV2,通过结合频谱控制机制、小波引导的空间自适应性、导数感知监督和训练后模型压缩,实现了紧凑高效的神经地形数据格式,提升了地形分析的精度和效率。

Comments 14 pages, 8 figures

详情
AI中文摘要

数字高程模型(DEMs)是地理信息系统(GIS)中地形分析的基础,但其常见的栅格形式依赖插值进行离格采样,并依赖有限差分算子进行基于导数的分析。隐式神经表示(INRs)提供了一种连续的替代方案,但先前的地形INRs缺乏显式的频率控制,忽视了地形的梯度结构,并且模型过大、训练成本过高,难以实际部署。我们提出了ImplicitTerrainV2,通过结合频谱控制机制、小波引导的空间自适应性、导数感知监督和训练后模型压缩,将地形INRs推进为紧凑、高效的神经地形数据格式。其核心是小波复杂度场(WCF),它从解析计算的小波系数中推导出空间自适应的频率掩码,将高频表达能力集中到复杂地形区域。同一复杂度场还指导复杂度感知的自适应采样,将训练集中在高复杂度区域;同时梯度匹配施加额外监督,以强制地形DEMs的光滑流形结构,从而提高导数保真度。训练后混合精度量化和熵编码将存储减少到1.23 bpp,PSNR仅下降0.28 dB。在50个瑞士地形图块上,ImplicitTerrainV2达到66.25 dB的端到端PSNR,比先前工作提高了5.70 dB,同时参数量减少3.2倍,在单个GPU上每个图块训练时间为55秒。我们的压缩神经格式在率失真性能上可与几种成熟的DEM编解码器相竞争,同时还支持离格点查询、闭式导数计算和分辨率无关重建,这可能使许多下游GIS应用受益。

英文摘要

Digital elevation models (DEMs) underpin terrain analysis in Geographic Information Systems (GIS), but in their common raster form, they rely on interpolation for off-grid sampling and finite-difference operators for derivative-based analysis. Implicit neural representations (INRs) offer a continuous alternative, but prior terrain INRs lack explicit frequency control, neglect the gradient structure of terrain, and remain too large and costly to train for practical deployment. We present ImplicitTerrainV2, which advances terrain INRs toward a compact, efficient neural terrain data format by combining a spectral control mechanism with wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression. At its core, a wavelet complexity field (WCF) derives spatially-adaptive frequency masks from analytically computed wavelet coefficients, localizing high-frequency capacity to complex terrain regions. The same field guides complexity-aware adaptive sampling that concentrates training in high-complexity regions, while gradient matching applies extra supervision to enforce the smooth manifold structure of terrain DEMs for improved derivative fidelity. Post-training mixed-precision quantization and entropy coding reduce storage to 1.23 bpp with a 0.28 dB PSNR drop. On 50 Swiss terrain tiles, ImplicitTerrainV2 reaches 66.25 dB end-to-end PSNR, improving over the prior work by 5.70 dB while using 3.2x fewer parameters and training in 55 s per tile on a single GPU. Our compressed neural format is competitive with several established DEM codecs in rate-distortion performance, while additionally supporting off-grid point queries, closed-form derivative evaluation, and resolution-independent reconstruction, which may benefit many downstream GIS applications.
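
For readers unfamiliar with the two figures of merit quoted above, the sketch below shows how PSNR (in dB) and storage rate (bits per pixel, bpp) are conventionally computed. The choice of `peak` (for example the elevation range of a tile) is an assumption; the abstract does not state which peak value the authors use.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float], peak: float) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Storage rate of a compressed tile in bits per pixel."""
    return 8.0 * num_bytes / (height * width)


if __name__ == "__main__":
    # Toy two-sample "tile" with a 100 m elevation range.
    print(psnr([0.0, 100.0], [1.0, 99.0], peak=100.0))
    # A hypothetical 1024-byte model file describing a 64 x 64 tile.
    print(bits_per_pixel(1024, 64, 64))
```

Because PSNR is logarithmic, the reported 0.28 dB drop after quantization corresponds to only a small relative increase in mean squared error.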

2605.22549 2026-05-22 stat.ML cs.LG 版本更新

A Martingale Kernel Independence Test

一个鞅核独立性检验

Felix Laumann, Zhaolu Liu, Mauricio Barahona

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 本文提出两种学生化统计量,通过自归一化和半样本分割,实现了无需排列校准的独立性检验,显著提升了计算效率和测试性能。

详情
AI中文摘要

Hilbert-Schmidt Independence Criterion (HSIC) 及其联合独立性扩展 dHSIC 是退化 V 统计量,其依赖于数据的加权 χ² 零极限分布迫使使用排列校准,导致每次检验的成本乘以排列次数,实际中约为两个数量级。通过将最近用于两样本检验的鞅 MMD 构造应用于(联合)独立性问题,我们引入了两个学生化统计量,其零分布无论数据分布如何均为标准正态分布,因此单次正态分位数查找即可完全替代排列步骤。第一个统计量 mHSIC 是两个经验中心化 Gram 矩阵的 Hadamard 积的自归一化下三角和。在独立性和核四阶矩有界的条件下,它收敛于标准正态分布。它对所有固定备择假设一致,且在不进行样本分割的情况下以样本量的二次成本运行,与有偏 HSIC V 统计量相当。第二个统计量 mdHSIC 通过单次半样本分割实现有限样本一致性:中心化在一半样本上估计,下三角自归一化鞅在另一半上运行,使条件均值残差缩小到关于 d 指数级小的量,因此在任意固定的联合检验变量数下,统计量渐近服从标准正态分布,每次检验的成本仅随 d 线性增长。在每变量输入维度从 1 到 500、联合检验变量数从 2 到 10 的合成数据上,两种统计量与排列校准基线的经验 I 类错误率和检验功效相当,同时运行速度快 25 到 60 倍。

英文摘要

The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$χ^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from $1$ to $500$ and between $2$ and $10$ jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running $25$ to $60\times$ faster.
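
The description of $m\mathrm{HSIC}$ above (a self-normalised lower-triangular sum of the Hadamard product of two centred Gram matrices) can be illustrated with a minimal pure-Python sketch. The Gaussian kernel, the `bandwidth` parameter, the function names, and the exact normalisation (sum of row-wise increments divided by the root of their sum of squares) are assumptions made for illustration; the paper's actual statistic may differ in its details.

```python
import math


def centred_gram(xs, bandwidth=1.0):
    """Doubly centred Gaussian Gram matrix for a list of scalars (assumed kernel)."""
    n = len(xs)
    k = [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in xs] for a in xs]
    row_mean = [sum(r) / n for r in k]   # K is symmetric, so column means match
    total_mean = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def studentised_lower_triangular(xs, ys, bandwidth=1.0):
    """Hypothetical studentised statistic built from the strictly lower triangle."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two samples of equal length >= 3")
    kc = centred_gram(xs, bandwidth)
    lc = centred_gram(ys, bandwidth)
    # One increment per row i: sum over j < i of the Hadamard product entries.
    increments = [sum(kc[i][j] * lc[i][j] for j in range(i))
                  for i in range(1, len(xs))]
    scale = math.sqrt(sum(d * d for d in increments))
    if scale == 0.0:
        raise ValueError("degenerate sample: all increments are zero")
    return sum(increments) / scale
```

Under this sketch the cost is quadratic in the sample size, and the returned value would be compared against a standard normal quantile instead of a permutation distribution.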

2605.22537 2026-05-22 cs.LG 版本更新

F-TIS: Harnessing Diverse Models in Collaborative GRPO

F-TIS: 利用多样化模型进行协作GRPO

Nikolay Blagoev, Oğuzhan Ersoy, Wendelin Boehmer, Lydia Yiyu Chen

发表机构 * Gensyn(盖森) TU Delft(代尔夫特理工大学) University of Neuchatel(纳沙泰尔大学)

AI总结 本文提出F-TIS方法,通过利用异构模型在协同GRPO训练中提高本地模型的学习效果,实现了高效的通信和一致的最终模型收敛,同时在某些情况下提升了模型在分布外任务上的泛化能力。

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

详情
AI中文摘要

像GRPO这样的强化学习方法在LLM后训练中变得非常流行。在GRPO中,模型产生一组提示的完成,这些完成会得到奖励,策略会朝着相对高奖励的完成更新。由于模型的自回归性质,这种训练风格的生成阶段可以极其耗时。为了解决这个问题,先前的工作试图将推理步骤分布到许多节点上,并行工作。这些工作主要假设训练中的同质模型,以保持样本尽可能接近on-policy。这一假设可能在去中心化系统中不切实际,因为具有不同计算能力和偏好的各方可能希望在同一个任务上合作。因此,去中心化训练需要一种能够处理异构模型的方法——不同的模型在同一个任务上协作。然而,这会导致训练过程中出现高度离策略的样本,而先前的工作已经指出离策略样本可能会影响GRPO的收敛。为了实现异质性,我们提出了过滤截断重要性采样(F-TIS)——一种GRPO风格的训练范式,可以利用离策略样本来改进本地模型的学习。我们的框架允许各种模型在同一个RL训练运行中协作,同时保持高效的通信。我们广泛评估了F-TIS在各种异构设置中的表现,并展示了它在最终模型收敛方面与纯on-sample训练相同。此外,我们观察到在某些设置中,F-TIS在分布外任务上的泛化能力优于on-policy训练,使模型性能提高了高达12%。

英文摘要

Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same tasks. However, this leads to highly off-policy samples during training, and prior work has found that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve the local model's learning. Our framework allows various models to collaborate in the same RL training run while remaining communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits identical final model convergence to purely on-sample training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing the model's performance by up to 12%.
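
The two ingredients named in this abstract, group-relative rewards and truncated importance weights for off-policy samples, can be sketched as follows. This is a generic illustration, not the F-TIS algorithm itself: the function names, the `cap` and `min_ratio` parameters, and in particular the filtering rule (zeroing samples whose ratio falls below `min_ratio`) are hypothetical placeholders, since the abstract does not specify how filtering is done.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardised within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def truncated_weights(logp_local, logp_behaviour, cap=2.0, min_ratio=0.0):
    """Importance ratios pi_local / pi_behaviour, truncated at `cap`.

    `min_ratio` is an assumed filter: samples with a smaller ratio get weight 0.
    """
    weights = []
    for lp, lb in zip(logp_local, logp_behaviour):
        ratio = math.exp(lp - lb)
        weights.append(0.0 if ratio < min_ratio else min(ratio, cap))
    return weights
```

A per-sample policy-gradient weight would then be the product of the advantage and the truncated ratio. Truncation bounds the variance that highly off-policy completions from other models would otherwise introduce.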

2605.22531 2026-05-22 cs.LG 版本更新

Disentanglement Beyond Generative Models with Riemannian ICA

超越生成模型的解缠:黎曼ICA

Edmond Cunningham

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文提出黎曼ICA,一种不依赖生成模型的解缠方法,通过引入解缠张量来研究局部解缠特性,为理解无生成假设下的特征解缠提供了理论基础。

详情
AI中文摘要

在解缠理论基础与现代表示学习实践之间存在差距。现有的理论框架,特别是独立成分分析(ICA)及其非线性变体,假设数据背后存在统计独立的潜在变量,使得解缠等同于识别生成数据的潜在变量。这种生成框架具有可解释性和理论依据,但其强假设使其难以应用于现代表示学习。现代预训练编码器通常学习出具有解缠特性的特征,而无需做出生成假设,但缺乏解释这些特征作为独立变化因素的一般理论。本文通过引入黎曼ICA,将ICA的全局生成模型替换为局部几何结构。RICA基于观察到,在ICA中,数据点的潜在变化因素可以通过从该点出发的径向曲线映射到潜在空间中的轴对齐直线来理解。我们利用黎曼几何正式化这一观点,并以与现有生成方法一致的方式提出我们的理论。我们的主要贡献是解缠张量,它编码了我们称为点解缠的二阶解缠概念。该张量依赖于数据对数似然的Hessian以及模型诱导的里奇曲率。在受控源恢复设置中,RICA在多个流形上恢复了源,而ICA基线的成功取决于用于表示观测的坐标。本文为研究无生成模型假设下的局部解缠提供了理论基础。

英文摘要

There is a gap between the theoretical foundations of disentanglement and the practice of modern representation learning. Existing theoretical frameworks, particularly Independent Component Analysis (ICA) and its nonlinear variants, assume a generative model with statistically independent latent variables underlying the data so that disentanglement amounts to identifying the latents that could have generated the data. This generative framework is interpretable and theoretically justified, but its strong assumptions make it difficult to apply to modern representation learning. Modern pretrained encoders often learn features that exhibit disentangled properties without making generative assumptions, yet there is no general theory for interpreting these features as independent factors of variation. We take a step toward such a theory by introducing Riemannian ICA (RICA), which replaces ICA's global generative model with local geometric structure. RICA is founded on the observation that in ICA, the factors of variation underlying a data point can be understood through radial curves emanating from the point that map to axis-aligned lines in the latent space. We formalize this perspective using Riemannian geometry and introduce our theory in a way that is consistent with the existing generative approach. Our main contribution is the disentanglement tensor, which encodes a second-order notion of disentanglement that we call pointwise disentanglement. This tensor depends on the Hessian of the data log likelihood as well as the Ricci curvature induced by the model. In a controlled source recovery setting with known ground-truth sources, RICA recovers sources across several manifolds, while the success of ICA baselines depends on the coordinates used to represent the observations. Our work provides a theoretical basis for studying local disentanglement without assuming a global generative model.

2605.22529 2026-05-22 cs.LG cs.AI 版本更新

Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

在网络安全AI中稳定可解释性脆弱性:公共基准数据集中的多重共线性影响与缓解

Ioannis J. Vourganas, Anna Lito Michala

发表机构 * Netrity Ltd(Netrity有限公司) University of Glasgow(格拉斯哥大学)

AI总结 本文研究了在入侵检测(IDS)中使用AI可解释性时的一个未被探索但重要的漏洞:多重共线性导致的不稳定性。尽管广泛依赖于事后可解释性工具如SHAP或LIME,但相关特征对解释鲁棒性的影响未被评估。我们引入了一个正式定理,表明多重共线性会放大归因方差。这证明了在多重共线性下,解释和特征重要性是非可识别的。在代表性的基准数据集UNSW-NB15上,通过一系列全面的实验验证了该定理。评估了四种广泛使用的模型家族,包括线性、基于树的、核和神经网络模型,在基于VIF和相关性阈值的完整和剪枝特征集上。我们提出了新的指标Explanability Fragility Score,并提出了两种新的缓解方法,具有变量整合复杂度。CAA-Filtering专注于通过分组训练模型的归因来稳定解释。SHARP是一种新的训练时间正则化框架,通过惩罚归因不稳定性,使可解释性稳定性可控且单调提高。研究结果支持稳定的预测性能,使用Kendall's τ量化在重采样解释中的不稳定性。这项工作对XAI在安全关键领域中的可信度和可重复性有直接影响,并促使将多重共线性缓解措施纳入IDS流程,为从业者提供了一套指南。

Comments 35 pages, 3 figures, submitted to ACM TAISAP

详情
AI中文摘要

本文研究了在入侵检测(IDS)中使用AI可解释性时的一个未被探索但重要的漏洞:多重共线性导致的不稳定性。尽管广泛依赖于事后可解释性工具如SHAP或LIME,但相关特征对解释鲁棒性的影响未被评估。我们引入了一个正式定理,表明多重共线性会放大归因方差。这证明了在多重共线性下,解释和特征重要性是非可识别的。在代表性的基准数据集UNSW-NB15上,通过一系列全面的实验验证了该定理。评估了四种广泛使用的模型家族,包括线性、基于树的、核和神经网络模型,在基于VIF和相关性阈值的完整和剪枝特征集上。我们提出了新的指标Explanability Fragility Score,并提出了两种新的缓解方法,具有变量整合复杂度。CAA-Filtering专注于通过分组训练模型的归因来稳定解释。SHARP是一种新的训练时间正则化框架,通过惩罚归因不稳定性,使可解释性稳定性可控且单调提高。研究结果支持稳定的预测性能,使用Kendall's τ量化在重采样解释中的不稳定性。这项工作对XAI在安全关键领域中的可信度和可重复性有直接影响,并促使将多重共线性缓解措施纳入IDS流程,为从业者提供了一套指南。

英文摘要

This paper investigates an unexplored yet impactful vulnerability in AI explainability used in intrusion detection (IDS): multicollinearity-induced instability. Despite extensive reliance on post-hoc explainability tools such as SHAP or LIME, the impact of correlated features on explanation robustness has not been evaluated. We introduce a formal theorem stating that multicollinearity inflates attribution variance. This demonstrates that explanations and feature importances are non-identifiable under multicollinearity. A suite of comprehensive experiments validates the theorem on a representative benchmark dataset, UNSW-NB15. Four widely used families of models are evaluated, including linear, tree-based, kernel, and neural, across full and pruned feature sets based on VIF and correlation thresholding. We propose the novel metric of Explanability Fragility Score and two novel methods to mitigate it with varying integration complexity. CAA-Filtering focuses on stabilising explanations by grouping attributions of trained models. SHARP is a novel training-time regularisation framework that penalises attribution instability, enabling controllable and monotonic improvement of explainability stability. The findings support stable predictive performance, using Kendall's τ to quantify instability across bootstrapped explanations. This work has direct implications for the trustworthiness and reproducibility of XAI in security-critical contexts, and motivates incorporating multicollinearity mitigations into IDS pipelines, providing a set of guidelines for practitioners.
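
Kendall's τ, which the abstract uses to quantify instability across bootstrapped explanations, can be computed between two attribution vectors as in the sketch below. This is the plain tau-a variant without tie correction; which variant the paper uses, and the function name `kendall_tau`, are assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 across resampled explanations would indicate a stable feature ranking. Values near 0 would indicate that the ranking changes substantially between bootstrap runs, which is the kind of fragility correlated features are said to induce.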

2605.22507 2026-05-22 cs.LG stat.ML 版本更新

Generative Modeling by Value-Driven Transport

通过价值驱动传输进行生成建模

Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu

发表机构 * Universitat Pompeu Fabra Barcelona(巴塞罗那庞培乌法布拉大学) ETH Zürich(苏黎世联邦理工学院) ICREA & Universitat Pompeu Fabra Barcelona(ICREA与巴塞罗那庞培乌法布拉大学)

AI总结 本文提出了一种基于测度传输的离散时间随机控制表述的新生成建模框架,通过线性规划的对偶变量直接编码最优控制策略,并开发了高效的免模拟原始-对偶算法来计算近似最优价值函数和价值驱动传输(VDT)策略,这些策略在多个实验中表现出优越的性能和良好的可扩展性。

详情
AI中文摘要

我们提出了一种基于测度传输的离散时间随机控制表述的新生成建模框架。通过借鉴控制理论中的经典结果,我们将问题表述为一个线性规划,其对偶变量对应于控制问题的最优价值函数,后者直接编码了最优控制策略。利用这一线性规划表述,我们开发了一种高效的免模拟原始-对偶算法,用于计算近似最优价值函数及其相关的价值驱动传输(VDT)策略,这些策略近似于真正的最优策略。我们展示了经过良好训练的 VDT 策略与其他基于流、扩散或 Schrödinger 桥的最新方法相比具有许多有利的性质:它们产生直线传输路径,可以快速且鲁棒地模拟,并且可以用与扩散模型和基于流的模型完全相同的方式增强(例如,条件生成、无分类器引导、无配对的数据到数据翻译都很容易整合)。我们在一系列实验中评估了我们的方法,结果表明其性能强劲且具有良好的可扩展性潜力。

英文摘要

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the optimal value function of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated value-driven transport (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

2605.22506 2026-05-22 cs.CR cs.LG 版本更新

EnCAgg: Enhanced Clustering Aggregation for Robust Federated Learning against Dynamic Model Poisoning

EnCAgg: 增强型聚类聚合用于对抗动态模型中毒的联邦学习

Tianyun Zhang, Zhen Yang, Haozhao Wang, Ru Zhang, Yongfeng Huang

发表机构 * School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) Department of Electronic Engineering, Tsinghua University(清华大学电子工程系)

AI总结 本文提出了一种新的鲁棒聚合方法,通过利用少量已知的良性客户端作为参考,准确识别和过滤恶意梯度,同时保留尽可能多的良性梯度,即使恶意客户端的数量未知且变化。方法包括密度基低维梯度聚类、增强聚类低维梯度生成模型和低维梯度重新聚类。

详情
AI中文摘要

联邦学习面临越来越多的模型中毒攻击威胁,这些攻击损害了其在提高隐私保护方面的应用。现有的防御方法通常依赖于固定的阈值或使用固定数量的聚类来进行区分恶意梯度和良性梯度。然而,这些方法难以适应恶意客户端的动态中毒策略,且由于客户端本地数据集的异质性,常常导致良性梯度的丢失。为了解决这些问题,我们提出了一种新的鲁棒聚合方法,该方法利用少量已知的良性客户端作为参考,能够准确识别和过滤恶意梯度,同时尽可能保留良性梯度,即使恶意客户端的数量未知且变化。首先,我们引入了一种基于密度的低维梯度聚类方法,将梯度投影到两个最分散的维度,并应用基于密度的聚类来识别恶意梯度,同时保留聚类中的良性梯度和可能的良性异常值。其次,我们设计了一种增强聚类低维梯度生成模型,该模型学习生成与良性簇边界对齐的伪梯度。这些伪梯度充当桥梁,连接稀疏的良性梯度异常值。第三,我们引入了低维梯度重新聚类,将生成的伪梯度与真实梯度一起聚类,以恢复被误分类为噪声点的良性梯度,使更多的良性梯度能够参与聚合。在MNIST、CIFAR-10和MIND数据集上的广泛实验表明,我们的方法在动态中毒场景下表现出卓越的保真度和鲁棒性。

英文摘要

Federated learning faces increasing threats from model poisoning attacks, which undermine its application to privacy protection. Existing defense methods typically rely on fixed thresholds or perform clustering with a fixed number of clusters to distinguish malicious gradients from benign ones. However, these methods adapt poorly to the dynamic poisoning strategies of malicious clients, and often lose benign gradients because of the heterogeneity of clients' local datasets. To address these problems, we propose a novel robust aggregation method that leverages a small number of known benign clients as references, enabling accurate identification and filtering of malicious gradients while retaining as many benign gradients as possible, even when the number of malicious clients is unknown and variable. First, we introduce a density-based low-dimensional gradient clustering method, which projects gradients onto the two most divergent dimensions and applies density-based clustering to identify malicious gradients while retaining clustered benign gradients and potentially benign outliers. Second, we design an enhancing clustering low-dimensional gradient generator model, which learns to generate pseudo-gradients aligned with the boundary of the benign cluster. These pseudo-gradients act as bridges to connect sparse benign gradient outliers. Third, we introduce low-dimensional gradient re-clustering that clusters the generated pseudo-gradients together with real gradients to recover benign gradients misclassified as noise points, enabling more benign gradients to participate in aggregation. Extensive experiments on the MNIST, CIFAR-10, and MIND datasets demonstrate that our method exhibits superior fidelity and robustness under dynamic poisoning scenarios.

2605.22502 2026-05-22 cs.AI cs.LG 版本更新

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

将代理工作流编译为LLM权重:在成本上减少两个数量级的情况下实现接近前沿质量

Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo

发表机构 * University of Melbourne(墨尔本大学)

AI总结 本文研究如何将代理工作流编译为LLM权重以提高效率,通过在旅行预订、Zoom支持和保险索赔等任务中验证,展示了编译方法在减少成本的同时保持高质量性能。

Comments 19 pages

详情
AI中文摘要

代理编排框架已大量涌现,LangGraph、CrewAI、Google ADK、OpenAI Agents SDK、Semantic Kernel、Strands和LlamaIndex的GitHub星标合计超过290,000个。所有框架都遵循相同模式:一个外部编排器位于LLM之上,每回合注入指令并路由决策。最近的工作表明,对于过程性任务,只需在前沿模型的系统提示中提供过程,其效果就优于这种架构[Dennis et al., 2026a],但代价是消耗上下文窗口、每次对话都需要一个前沿模型,并将专有过程暴露给第三方提供者。将过程编译到小型微调模型的权重中(即创建一个"潜行代理",subterranean agent)应能解决所有这些问题,先前工作(SimpleTOD、FireAct、SynTOD、WorkflowLLM、Agent Lumos)已表明该技术可行。然而,开发者在采用上却压倒性地倾向于编排。我们识别了三个感知障碍,并在旅行预订(14个节点)、Zoom支持(14个节点,产品特定知识)和保险索赔(55个节点,6个决策中心)中通过实证方法逐一解决。

英文摘要

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

2605.22498 2026-05-22 cs.LG cs.AI cs.SC 版本更新

The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning

神经编译器:程序到网络的翻译用于混合科学机器学习

Lucas Sheneman

发表机构 * Institute for Interdisciplinary Data Sciences(跨学科数据科学研究所) University of Idaho(爱达荷大学)

AI总结 该研究提出了一种神经编译器,能够将程序转换为可微的PyTorch模块,用于混合科学机器学习,通过符号规范生成正确且可微的模块,实现系统化的可组合性。

Comments 21 pages, 10 figures, 10 tables. Preprint; source code available at https://github.com/sheneman/neural_compiler

详情
AI中文摘要

科学机器学习经常需要将已知的物理规律与从数据中学习的未知参数或校正项相结合。现有方法要么忽略已知结构,要么将其编码为软惩罚项,要么需要为每个方程手动编写PyTorch代码。我们提出了神经编译器(The Neural Compiler),一种将用一阶类Scheme表达式语言编写的程序转换为冻结、可微的PyTorch模块的系统。这些模块在浮点精度范围内与源程序一致,并通过autograd提供梯度。在混合模型中,编译模块精确编码已知的物理规律,而学习组件则建模未知的剩余部分。我们在六个实验领域评估了该编译器:费曼物理方程、Lotka-Volterra动力学、阻尼摆、一维热方程、三维向量力学以及组合泛化。对于单个方程,编译模块与手动编写的PyTorch实现在数值上一致,表明编译不会造成精度损失。仅用1到4个可训练参数,编译模型在大多数情况下就能以低于1%的误差恢复物理常数,而具有超过8500个参数的标准PINN基线的误差为7%到93%。编译模块的组合误差为零,而神经近似方法在深层组合链中可能累积较大误差。编译器的主要价值不在于比手动编写的方程更高的精度,而在于系统化的可组合性:它从符号规范生成正确且可微的模块,而无需手动重写每个方程。该系统支持51个基本操作,包括向量和矩阵代数,能够实现PDE离散化和混合科学模型。这种字符串输入、模块输出的接口也为大语言模型提供了自然的目标,这些模型可以将科学描述翻译成可执行的可微模块。

英文摘要

Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand-written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first-order Scheme-like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating-point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka-Volterra dynamics, a damped pendulum, a one-dimensional heat equation, three-dimensional vector mechanics, and compositional generalization. Compiled modules match hand-coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand-coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string-in, module-out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.

2605.22496 2026-05-22 cs.LG 版本更新

The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces

噪声中的信号:通过因子化潜在空间中的拟合性检验进行分布外检测

Philipp Bomatter, Jack Geary, Henry Gouk

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 本文提出了一种基于因子化潜在空间中拟合性检验的分布外检测方法SITN,该方法无需访问分布外数据,计算开销小,并能严格控制误报率。

详情
AI中文摘要

深度生成模型为分布外(OOD)检测提供了自然的基础,但先前的工作表明,它们给出的似然在区分分布内与分布外数据方面是出了名地不可靠。在本文中,我们利用连续归一化流的微分同胚性质和质量保持性质来解决这个问题。我们的分析表明,分布外样本会被映射到在噪声先验下高度非典型的噪声样本,而这种非典型性无法由似然捕捉。基于这一观察,我们提出了一种新方法 Signal in the Noise (SITN),用于单样本级别的分布外检测。SITN 不需要访问分布外数据,计算开销极小,并能严格控制误报率。基于标准基准和合成扰动的全面评估表明了该方法的有效性,并且它不存在基于似然的方法所固有的复杂性偏差。

英文摘要

Deep generative models offer a natural foundation for out-of-distribution (OOD) detection, yet prior work has shown that their assigned likelihoods are notoriously unreliable indicators for in- vs out-of-distribution data. In this paper, we address this problem by leveraging the diffeomorphic and mass-preserving properties of continuous normalising flows. Our analysis shows that OOD samples are mapped to noise samples that are highly atypical under the noise prior in ways not captured by the likelihood. Based on this observation, we propose a new method -- Signal in the Noise (SITN) -- for OOD detection on the single-sample level. SITN requires no access to OOD data, incurs minimal computational overhead, and provides strict control of false positive rates. Comprehensive evaluations through standard benchmarks and synthetic perturbations highlight the method's effectiveness and the absence of the complexity bias inherent to likelihood-based methods.

2605.22493 2026-05-22 cs.LG cs.AI cs.RO 版本更新

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

理解动作分块行为克隆中的多模态失败

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

发表机构 * NCT-Dresden(NCT-德累斯顿)

AI总结 研究行为克隆在多模态情况下失败的机制,分析不同多模态参数化在动作分块策略中的不同失效方式,并提出通过调整正则化程度和改进生成策略来提升鲁棒性的方法。

详情
AI中文摘要

当相同的观察允许多个有效动作时,行为克隆变得困难。我们研究了动作分块策略中的这一问题,并展示了不同多模态参数化以不同的方式失败。对于隐变量策略,后验-先验正则化使部署时的采样更可靠,但过度正则化会移除区分演示模式所需的动作条件信息。减少这种正则化可以保留模式信息,但此时成功取决于先验是否覆盖相关隐变量区域。对于动作空间生成策略,多模态性受到基础到动作传输的平滑性限制:具有小Lipschitz常数的映射无法将大量分离的模式分配显著概率。覆盖许多模式需要基础空间中的陡峭过渡或动作空间中的非支持桥接区域。在合成多模态任务和机器人模拟基准上的实验支持了这些机制。

英文摘要

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.
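
The smoothness constraint on action-space generative policies can be made concrete with a one-line argument. The symbols below are introduced only for illustration and are not taken from the paper: $T$ is the base-to-action map with Lipschitz constant $L$, and $M_1$, $M_2$ are two demonstrated modes whose points are at least $\Delta$ apart in action space.

```latex
\[
\Delta \;\le\; \lVert T(z_1) - T(z_2) \rVert \;\le\; L \,\lVert z_1 - z_2 \rVert
\quad\Longrightarrow\quad
\lVert z_1 - z_2 \rVert \;\ge\; \frac{\Delta}{L},
\qquad z_1 \in T^{-1}(M_1),\; z_2 \in T^{-1}(M_2).
\]
```

Under these assumptions the preimages of the two modes are separated by a gap of width at least $\Delta / L$ in base space. Any base probability mass inside that gap is mapped to actions outside both modes, so covering many well-separated modes needs either a large $L$ (sharp transitions that shrink the gap) or acceptance of off-support bridge regions, consistent with the trade-off the abstract describes.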

2605.22488 2026-05-22 cs.LG 版本更新

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

表示不等于计算:一个变换器的因果测试,检验候选算法中间变量

Ishita Darade, Sushrut Thorat

发表机构 * MKSSS's Cummins College of Engineering for Women(MKSSS女子工程学院) Institute of Cognitive Science(认知科学研究所) Osnabrück University(奥斯纳布吕克大学)

AI总结 本文研究了变换器在执行算术任务时如何整合组件,发现模型虽然能准确回答问题,但其内部表示与计算路径之间存在因果分离,表明探针结果可能与实际因果观察有显著差异。

Comments 16 pages, 4 figures

详情
AI中文摘要

结构化提示要求根据任务相关的关系整合组件。网络如何实现这种整合在语言或视觉任务中往往难以判断,因为这些关系很少精确到足以定义候选内部算法。算术提供了一个更清晰的环境。我们研究了一个训练于基数提取的变换器:给定N,B和D,它必须报告N的基数-B展开式中B^D的系数。闭式解,即floor(N/B^D) mod B,提供了显式的候选算法中间变量。在三个种子下,模型在测试的数字-基数交集上达到了99.83%的准确答案,建立了可靠的任务能力。线性探针解码了这些中间变量,使分阶段的算术计算成为可能。因果测试则将表示与使用分开:在局部路由中,从具有D作为输入的流到输出位置,行为取决于早期的D选择性通信,与N和B无关。相关地,稀疏电路搜索发现大部分N、B和D的路线是分开的,它们在晚期而非由探针建议的分阶段路线中结合。因此,模型表示了使闭式解合理的中间变量,但识别的局部因果路线并未将它们传递到输出流。这一案例表明,基于探针的结论可能与实际因果观察有显著差异,即使有显式的算法假设。

英文摘要

Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
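
The closed-form solution $\lfloor N/B^D \rfloor \bmod B$ quoted in the abstract can be checked directly with integer arithmetic. The sketch below is only a reference implementation of the task's ground truth, not of the Transformer. The function names are chosen for illustration, and `digits` is an added helper that cross-checks the formula against repeated division.

```python
def base_digit(n: int, base: int, d: int) -> int:
    """Coefficient of base**d in the base-`base` expansion of n."""
    if n < 0 or base < 2 or d < 0:
        raise ValueError("expected n >= 0, base >= 2, d >= 0")
    return (n // base ** d) % base


def digits(n: int, base: int) -> list:
    """Little-endian digit list of n, obtained by repeated division."""
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(r)
    return out or [0]
```

The intermediate quantities `base ** d` and `n // base ** d` are the kind of candidate algorithmic intermediates that the linear probes in the paper decode.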

2605.22481 2026-05-22 cs.LG math.ST stat.TH 版本更新

When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

当更强的触发器反噬:高维背景下后门攻击的理论

Donald Flynn, Hadas Yaron Goldhirsh, Jonathan P. Keating, Inbar Seroussi

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) School of Mathematical Science, Tel Aviv University(特拉维夫大学数学科学学院) School of Mathematical Science and Computer Science, Tel Aviv University(特拉维夫大学数学科学与计算机科学学院)

AI总结 本文研究了在高维情况下后门毒化攻击的行为,发现更强的训练触发器有助于防御者,并通过高维理论分析了后门攻击的核心机制和影响因素。

详情
AI中文摘要

后门毒化攻击在高维情况下表现出反直觉的行为:更强的训练触发器有助于防御者。我们研究了在比例极限下(p/n→κ)的正则化广义线性模型在高斯混合数据上的表现,通过改变训练触发强度α(相对于固定的测试触发强度)来研究。三种现象出现:(i)干净测试准确率随着α增加而增加;(ii)攻击成功率在有限的α后达到峰值然后下降;(iii)最危险的触发方向是数据协方差的最小特征向量。我们为平方损失证明了所有三个结果,并通过高斯代理固定点系统将(i)和(ii)扩展到一般的凸GLM损失。我们识别出一个与κ成比例的有限样本噪声底噪是(i)背后机制,这在经典n>>p分析中是不可见的。在CIFAR-10和高斯代理上的实验与理论紧密吻合;ResNet-18实验显示在非凸设置下也出现了相同现象。

英文摘要

Backdoor poisoning attacks behave counter-intuitively in high dimensions: stronger training triggers can help the defender. We study regularised generalised linear models on Gaussian-mixture data in the proportional regime ($p/n \to κ$), varying the training trigger strength $α$ against a fixed test trigger. Three phenomena emerge: (i) clean test accuracy increases with $α$; (ii) attack success peaks at a finite $α$ and then declines; and (iii) the most damaging trigger direction is the minimum eigenvector of the data covariance. We prove all three results in closed form for the squared loss, and extend (i) and (ii) to general convex GLM losses via a Gaussian-proxy fixed-point system. We identify a finite-sample noise floor proportional to $κ$ as the mechanism behind (i), invisible to classical $n \gg p$ analysis. Experiments on CIFAR-10 and Gaussian surrogates match the theory closely; ResNet-18 experiments show the same phenomena beyond the convex setting.

2605.22480 2026-05-22 cs.LG cs.AI 版本更新

Implicit Regularization of Mini-Batch Training in Graph Neural Networks

图神经网络中mini-batch训练的隐式正则化

Clement Wang, Antoine Vialle, Robin Vaysse, Thomas Bonald

发表机构 * Institut Polytechnique de Paris(巴黎理工学院) Mirakl

AI总结 本文研究了图神经网络中mini-batch训练的隐式正则化现象,发现简单的随机节点采样方法在多个数据集上表现优异,且效率更高。

详情
AI中文摘要

图神经网络(GNN)的mini-batch训练与i.i.d.数据训练有本质区别:采样子图会改变拓扑结构并引入边界效应,导致先前工作发展出结构感知采样器以保持局部连接性和减少嵌入方差。令人惊讶的是,我们证明了最简单的可能方案,即随机节点采样(RNS),在均匀采样的诱导子图上训练,在10个数据集中的8个上在墙钟时间和内存消耗上匹配或优于全图训练。为了解释这一点,我们对图mini-batch随机梯度下降(SGD)应用反向误差分析,并显示其隐式最小化采样损失加上一个与mini-batch梯度方差成比例的正则化量,该量直接由采样器塑造。尽管RNS丢弃了局部结构,但它产生了一组预期损失更接近全图损失,且每批梯度方差更低的mini-batch,从而得到更好的隐式目标。我们的分析将图采样器的选择重新定义为一种隐式正则化形式,并将RNS识别为一种强大的、有理论基础的可扩展GNN训练方法。

英文摘要

Mini-batch training of Graph Neural Networks (GNNs) is fundamentally different from training on i.i.d. data: sampling a subgraph alters the topology and introduces boundary effects, leading prior work to develop structure-aware samplers that preserve local connectivity and reduce embedding variance. Surprisingly, we demonstrate that the simplest possible scheme, Random Node Sampling (RNS), training on the induced subgraph of uniformly sampled nodes, matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory. To explain this, we apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler. Although RNS discards local structure, it produces mini-batches whose expected loss is closer to the full-graph loss, and whose per-batch gradients have lower variance, yielding a better implicit objective. Our analysis reframes the choice of graph sampler as a form of implicit regularization, and identifies RNS as a strong, theoretically grounded method for scalable GNN training.
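
Random Node Sampling as described above (training on the induced subgraph of uniformly sampled nodes) amounts to a few lines of code. The sketch below assumes an edge-list graph representation. The function name, its signature, and the relabelling of nodes to local indices are choices made for illustration, not the authors' implementation.

```python
import random


def random_node_subgraph(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph.

    Returns the sorted global node ids and the edges whose endpoints were both
    sampled, relabelled to local indices 0..batch_size-1.
    """
    nodes = sorted(rng.sample(range(num_nodes), batch_size))
    local = {v: i for i, v in enumerate(nodes)}
    sub_edges = [(local[u], local[v]) for u, v in edges
                 if u in local and v in local]
    return nodes, sub_edges
```

Every edge with an endpoint outside the batch is dropped, which is the loss of local structure the abstract mentions. The backward-error-analysis result concerns the variance of the gradients computed on such batches.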

2605.22476 2026-05-22 cs.LG cs.CL 版本更新

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

结构稀疏注意力用于具有次二次序列复杂度的实体跟踪

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen

发表机构 * ESPCI PSL(ESPCI 法国巴黎大学) LAMSADE, Université Paris Dauphine - PSL(LAMSADE 巴黎dauphine大学-巴黎科学实验室)

AI总结 本文提出了一种结构稀疏注意力机制,用于在长序列中高效维护和更新实体和属性的潜在状态,通过减少计算复杂度提升实体跟踪的效率和准确性。

Comments 12 pages, 1 figure, 9 tables

详情
AI中文摘要

实体跟踪需要在长序列中维护和更新实体和属性的潜在状态。最近的特定任务注意力运算可以通过在单个层内进行多跳状态传播,将深度Transformer堆栈压缩成几层,但其密集评估仍很昂贵。我们显示在这种情况下,学习的注意力具有很强的结构特性:大部分质量集中在局部块对角邻域,具有轻量的跨块残差。利用这一点,我们推导出一种分块评估的解析式算子,保持块内交互的精确性,并通过缩减系统路由跨块交互。所得到的评估是序列长度的次二次复杂度$O(n^{4/3}d)$(当$d\approx n$时为$O(n^{7/3})$)。在受控跟踪基准上,我们的方法在保持密集运算准确性的同时,通过标准化测量协议减少了12-29%的实时时钟时间,并在可比的精确匹配准确性下,比紧凑的密集Transformer快高达2.4倍。我们进一步提供了关于块大小和模型容量的消融实验,并识别了一个限制:当同时演化的属性数量超过注意力头的数量时,性能会崩溃。

英文摘要

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length, $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by 12% to 29% under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

2605.22472 2026-05-22 cs.LG version update

Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning

Julian Gutheil, Simon Hitzginger, Robert Legenstein

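Affiliations Institute of Machine Learning and Neural Computation, Graz University of Technology, Graz, Austria

AI summary This paper studies how winner-take-all bottlenecks force the extraction of categorical latent factors in multi-task learning, proves that the resulting representations are highly symbolic, and shows experimentally that they benefit generalization.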

English abstract

Winner-take-all (WTA) networks constitute a central circuit motif in cortical networks of the brain. In addition, WTA-like activations are abundant in modern deep learning models in the form of the softmax activation, for example in the attention layers of transformers. While their role in the extraction of latent factors has been studied for relatively simple generative models, their role in the context of highly non-linearly entangled latent factors has remained elusive. In this article, we show that a WTA bottleneck within a deep neural network can, under certain well-defined conditions, enforce the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, we prove that the representation that emerges in the WTA bottleneck is highly symbolic, where a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. We furthermore show empirically on two datasets that this also holds for architectures and setups that do not fully comply with the assumptions of our theorem, and we demonstrate the advantages of the acquired symbolic representation for generalization. Our proposed model provides insights into the generalization capabilities of deep neural networks with WTA-like components and may serve as an interface between symbolic and subsymbolic AI systems.
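
To illustrate why softmax counts as a WTA-like activation, the sketch below compares a hard winner-take-all with a temperature-scaled softmax. The function names and the `temperature` parameter are illustrative assumptions; the paper's actual bottleneck may be defined differently.

```python
import math


def softmax(xs, temperature=1.0):
    """Soft WTA: the largest input gets the largest share of the output."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def hard_wta(xs):
    """Hard WTA: a one-hot vector marking the single largest input."""
    winner = max(range(len(xs)), key=lambda i: xs[i])
    return [1.0 if i == winner else 0.0 for i in range(len(xs))]


activations = [0.2, 1.5, -0.3]
print(hard_wta(activations))
print(softmax(activations))          # graded competition
print(softmax(activations, 0.01))    # low temperature approaches one-hot
```

As the temperature shrinks, the softmax output approaches the one-hot code of the hard WTA, which is the sense in which a single unit can signal the presence of a single feature.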

2605.22471 2026-05-22 cs.LG version update

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

Maya Bechler-Speicher, Gilad Yehudai, Gil Harari, Clayton Sanford, Amir Globerson, Joan Bruna

Affiliations Courant Institute of Mathematical Sciences, New York University; John A. Paulson School of Engineering and Applied Sciences, Harvard University; Google Research; Tel-Aviv University

AI summary This paper studies fundamental trade-offs in graph tokenization for Transformers, analyzes how different tokenizations affect model expressivity, and shows experimentally that different tasks favor different structural views.

English abstract

Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views and that combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.

2605.22463 2026-05-22 quant-ph cs.LG version update

Reinforcement learning for ion shuttling on trapped-ion quantum computers

Maximilian Schier, Lea Richtmann, Christian Staufenbiel, Tobias Schmale, Daniel Borcherding, Michèle Heurs, Bodo Rosenhahn

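Affiliations Institute for Information Processing (tnt), L3S, Leibniz University Hannover, Germany; Institute for Gravitational Physics, Leibniz University Hannover, Germany; QUDORA Technologies GmbH; Institute for Theoretical Physics, Leibniz University Hannover, Germany

AI summary This paper uses reinforcement learning to optimize ion shuttling on trapped-ion quantum computers. The strategy, learned through direct interaction with the problem, reduces shuttling operations by up to 36.3% and applies readily to different chip architectures.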

Comments 15 pages + 9 pages supplementary material, 6 figures

English abstract

Scalable trapped-ion quantum computing is commonly realized with modular chips that feature distinct zones with specific functionalities, such as storage, state preparation, and gate execution. To execute a quantum circuit, the ions must be transported between these zones. This process is called ion shuttling. To achieve reliable computation results, the shuttling process must be optimized. However, as the number of ions increases, this becomes a high-dimensional optimization problem where optimal solutions cannot be computed efficiently. We demonstrate, to the best of our knowledge, the first use of reinforcement learning (RL) for the optimization of ion shuttling. RL is well-suited for such scenarios, as it enables learning a strategy through direct interaction with the problem. We show that our RL approach outperforms current state-of-the-art heuristic techniques, yielding a reduction in shuttling operations of up to 36.3%. Furthermore, we show that our method is easily applicable to various chip architectures. Our approach offers a versatile method to study shuttling efficiency during chip design and is therefore a highly relevant tool for future, more complex architectures.

2605.22455 2026-05-22 cs.CV cs.AI cs.LG physics.optics version update

Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light

Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti

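Affiliations University of Glasgow; Dotphoton

AI summary This paper proposes a synthetic RAW augmentation method for finer-grained evaluation of person detection models in low light. It generates low-light samples that match the camera sensor's noise model, improving data coverage for benchmarks.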

Comments Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)

English abstract

Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the autonomous driving safety-critical case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.

2605.22454 2026-05-22 cs.LG cs.AI version update

Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee

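Affiliations Department of Computer Science, University of North Carolina at Charlotte

AI summary This paper studies value-based data rehearsal for multi-cyclic continual reinforcement learning and introduces Qreg+NWLU, which improves learning efficiency, forgetting mitigation, and knowledge transfer.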

English abstract

Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic regularization. This actor-centric approach overlooks the potential of data rehearsal for value function approximation. Moreover, existing evaluations in CRL rarely consider multi-cyclic environments where task sequences repeat, a critical real-world scenario that exacerbates forgetting and plasticity. We investigate data rehearsal for Deep Q-Networks using Q-value regularization in multi-cyclic settings and propose Qreg+NWLU which introduces two simple modifications: (1) continuous data rehearsal that dynamically collects and updates stored Q-values throughout training, and (2) "No-Wait" regularization that applies immediately rather than after the first task. Together, these modifications yield improvements in learning efficiency, forgetting mitigation, and knowledge transfer over Qreg and conventional CRL methods within value function approximation settings.

2605.22438 2026-05-22 stat.ML cs.GT cs.LG version update

Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions

Luigi Foscari, Matilde Tullii, Vianney Perchet

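Affiliations Università degli Studi di Milano; Crest-Ensae; IP Paris; CRITEO AI Team

AI summary This paper studies learning to bid in feedback-manipulated auctions and proposes an algorithm that combines a robust interval-elimination branch with an optimistic branch to cope with manipulated feedback, together with a matching lower bound in the single-active-region case.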

English abstract

Shilling is the use of artificial bids to make competition appear stronger and push prices upward. We study repeated first-price auctions in which shilling affects feedback but not allocation: the learner wins or loses against the real competing bid, but after a loss observes the maximum of the real bid and an independent shill bid. Thus the manipulation changes what the learner observes and hence how it learns to bid, without changing the outcome of the current auction. We analyze regret with respect to the best bid benchmark, assuming that the shill-bid distribution is known. Even then, shilling can mask the real bid, while useful side information appears only through intermittent low-shill events. Our algorithm combines a robust interval-elimination branch, which ignores the shilled report and achieves the dynamic-pricing rate $\tilde{\mathcal{O}}(T^{2/3})$, with an optimistic branch, which debiases losing-side reports, exploits the resulting suffix information when it is reliable, and achieves the first-price-auction rate $\tilde{\mathcal{O}}(\sqrt{T})$. A validation and racing procedure lets the algorithm use these optimistic updates without knowing the right scale or feedback geometry in advance. We complement the upper bounds with a matching lower bound, up to logarithmic factors, in the single-active-region case. Overall, the results show that even feedback-only shilling can sharply alter the statistical difficulty of repeated bidding.
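
The feedback model described in the abstract can be sketched as a single auction round. This is a minimal illustration, not the authors' algorithm: the function name `play_round`, the tie-breaking rule (the learner wins ties), and returning `None` after a win are all assumptions, since the abstract only specifies what is observed after a loss.

```python
def play_round(bid, real_bid, shill_bid):
    """One first-price round with feedback-only shilling.

    Allocation depends only on the real competing bid; the shill bid
    only distorts what the learner sees after losing.
    """
    won = bid >= real_bid  # assumed tie-breaking: the learner wins ties
    if won:
        feedback = None  # assumption: no competing-bid report after a win
    else:
        feedback = max(real_bid, shill_bid)  # shilled losing-side report
    return won, feedback


print(play_round(0.5, 0.7, 0.9))   # loss, real bid masked by the shill bid
print(play_round(0.5, 0.7, 0.2))   # loss, low-shill event reveals the real bid
print(play_round(0.8, 0.7, 0.95))  # win, the shill bid does not change allocation
```

The second call corresponds to the intermittent low-shill events mentioned in the abstract, in which the report equals the real competing bid.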

2605.22437 2026-05-22 cs.CR cs.AI cs.LG version update

Characterizing the Fault Response of the Intel Neural Compute Stick 2 Under Single-Pulse Electromagnetic Fault Injection

Štefan Kučerák, Jakub Breier, Xiaolu Hou

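Affiliations Faculty of Informatics and Information Technologies, Slovak University of Technology; State Key Laboratory of Blockchain and Data Security; TTControl GmbH

AI summary This paper characterizes the fault response of the Intel Neural Compute Stick 2 under single-pulse electromagnetic fault injection. A systematic campaign finds four reproducible outcome classes, and the authors discuss mitigation strategies graded by class.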

English abstract

Vision processing units and other commercial neural-network inference accelerators are increasingly deployed in safety-relevant edge applications, but their fault response under transient hardware disturbances remains poorly characterized in the open literature. For the Intel Movidius Myriad X, packaged as the Intel Neural Compute Stick 2 (NCS2), only a single feasibility study has been published. We report a systematic single-pulse electromagnetic fault injection (EMFI) campaign on the NCS2 running three ImageNet-trained convolutional neural networks (ResNet-18, ResNet-50, VGG-11) on the OpenVINO runtime. Across 1,536 spot-test trials at characterized hotspots and approximately 16,000 parameter-search trials, single pulses produce four reproducible outcome classes: no measured accuracy change, minor silent data corruption, major persistent degradation that survives across subsequent inferences until model reload, and device hangs requiring USB power-cycling; these outcomes are respectively interpreted as no-effect, SDC with possible SET-like or small persistent-state mechanisms, SEU-like persistent corruption, and SEFI-like loss of functionality. Two findings are central. First, the major-degradation class can be induced at 18-31% of trials at characterized hotspots, with post-collapse top-1 accuracy below five percent and persistence across all subsequent inferences until explicit model reload - a regime that no inference-API-level mechanism detects. Second, this regime is also inducible by pulses delivered to an idle device with the model already loaded, demonstrating that load-time integrity checks alone are insufficient. We discuss mitigation strategies graded by class, focusing on mechanisms implementable at the application level without modification to the device firmware or the OpenVINO runtime.

2605.22432 2026-05-22 cs.LG version update

AMUSE: Anytime Muon with Stable Gradient Evaluation

Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, Chulhee Yun

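Affiliations KAIST; KRAFTON; Seoul National University

AI summary This paper studies the mechanism behind Muon and proposes AMUSE, which combines Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE enables anytime training without a learning rate schedule and improves the performance-iteration Pareto frontier on vision tasks and large language model pretraining.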

Comments 41 pages, 25 figures

English abstract

Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains only partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.
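
The idea of orthogonalizing a momentum matrix can be illustrated with a small sketch. The plain cubic Newton-Schulz iteration below is a swapped-in textbook way to approximate the orthogonal polar factor of a matrix; it is not taken from the AMUSE paper, and the helper names (`matmul`, `transpose`, `orthogonalize`) and the step count are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which
    pushes singular values in (0, sqrt(3)) toward 1.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]  # singular values now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


# Toy "momentum" matrix with very unequal singular values (2 and 0.5).
momentum = [[0.0, 2.0], [-0.5, 0.0]]
print(orthogonalize(momentum))
```

After orthogonalization both directions receive the same update magnitude, which matches the abstract's description of the bulk component being increased relative to the dominant directions.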

2605.22416 2026-05-22 cs.LG cs.DC cs.PF version update

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

An Xuan Nguyen

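Affiliations Ho Chi Minh City, Vietnam

AI summary This paper proposes Asymmetric Virtual Memory Paging to manage the two different cache types in hybrid Mamba-Transformer inference. It places the two cache types in physically distinct pools and migrates capacity between them when needed, improving memory utilization and inference throughput.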

Comments 11 pages, 8 figures, 6 tables. Code and reproducibility artifacts at https://github.com/codepawl/cachepawl

English abstract

Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.
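
The allocation policy described in the abstract (two pools, with capacity migrated only when an allocation fails) can be sketched as follows. This is a simplified illustration under assumptions: the class `DualPool`, its method names, and the use of a single uniform page unit for both pools are hypothetical and do not reflect the cachepawl code.

```python
class DualPool:
    """Two page pools (KV cache and SSM state) that can trade free capacity."""

    def __init__(self, kv_pages, ssm_pages):
        self.free = {"kv": kv_pages, "ssm": ssm_pages}
        self.migrations = 0

    def allocate(self, kind, pages):
        """Return True on success, False on out-of-memory."""
        other = "ssm" if kind == "kv" else "kv"
        shortfall = pages - self.free[kind]
        if shortfall > 0:
            # Migration is triggered only by an allocation failure.
            if self.free[other] < shortfall:
                return False  # neither pool can cover the request
            self.free[other] -= shortfall
            self.free[kind] += shortfall
            self.migrations += 1
        self.free[kind] -= pages
        return True


pool = DualPool(kv_pages=4, ssm_pages=4)
print(pool.allocate("kv", 6), pool.free, pool.migrations)
print(pool.allocate("ssm", 3), pool.free)
```

Because migration happens only on failure, the same request sequence always produces the same pool state, which is the determinism property the abstract points to.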

2605.22411 2026-05-22 cs.CL cs.AI cs.LG version update

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Jianing Yin, Tan Tang

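Affiliations State Key Lab of CAD&CG

AI summary This paper proposes DeferMem, a long-term memory framework that decouples the problem into high-recall candidate retrieval and query-conditioned evidence distillation to improve the accuracy and efficiency of long-term memory QA.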

Comments 31 pages, 3 figures

English abstract

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

2605.22410 2026-05-22 cs.LG version update

Minimum Description Length based Granular-Ball Tree Regularization for Spectral Clustering

Zeqiang Xian, Caihui Liu, Yong Zhang, Wenjing Qiu

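Affiliations Department of Mathematics and Computer Science, Gannan Normal University; Key Laboratory of Data Science and Artificial Intelligence of Jiangxi Education Institutes, Gannan Normal University

AI summary This paper proposes a granular-ball tree regularized spectral clustering method based on Minimum Description Length. It builds a granular-ball tree through local MDL model selection, uses reciprocal neighborhood continuity to discourage splits that break reliable local connections, uses the stable leaf balls to supply coding-scale information for regularizing the sample-level affinity graph, and introduces a shared-neighbor bridge code to adjust weak local bridge relations. In this way it connects interpretable local representation learning with affinity graph construction in a unified spectral clustering framework.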

Comments 28 pages, 5 figures, 6 tables

English abstract

Spectral clustering largely depends on the affinity graph, yet constructing a graph that preserves reliable local connectivity while adapting to heterogeneous data structures remains challenging. Existing granular-ball-based spectral clustering methods usually reduce graph complexity by using coarse-grained representatives. However, the learned local regions are often treated as graph nodes or anchors, and their structural information is not sufficiently used to regularize the original sample-level graph. To address this issue, this paper proposes a Minimum Description Length based Granular-Ball Tree-Regularized Spectral Clustering method, termed MDL-GBTRSC. The proposed method constructs a granular-ball tree through local MDL model selection, with reciprocal neighborhood continuity used to discourage splits that break reliable local connections. The stable leaf balls obtained from the tree provide coding-scale information for regularizing the sample-level affinity graph. In addition, a shared-neighbor bridge code is introduced to adjust weak local bridge relations without requiring an additional user-specified threshold. In this way, MDL-GBTRSC connects interpretable local representation learning with affinity graph construction in a unified spectral clustering framework. Experiments on real and synthetic datasets show that MDL-GBTRSC achieves the best average ARI and NMI under the adopted fixed-configuration protocol compared with classical spectral clustering baselines and representative granular-ball, micro-cluster, and anchor-based methods.

2605.22401 2026-05-22 cs.LG cs.NE q-bio.NC version update

Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology

Nils Leutenegger

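Affiliations Independent Researcher

AI summary Through a cross-species comparison, this study finds that early visual alignment is conserved between humans and macaques, while higher-area alignment depends on model capacity and stimulus domain.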

Comments 9 pages, 6 figures

English abstract

Does the relationship between learning rules and brain alignment generalize across species? We extend our prior finding that untrained CNNs match backpropagation at human V1 by testing the same five learning rules against macaque electrophysiology. The rules are backpropagation (BP), feedback alignment (FA), predictive coding (PC), spike-timing-dependent plasticity (STDP), and an untrained random-weights baseline. The macaque data come from two datasets: MajajHong2015 (V4/IT, 3,200 stimulus presentations, 88/168 neurons) and FreemanZiemba2013 (V1/V2, 135 stimuli, 102/103 neurons). Using RSA with identical model weights from our human study, we find: (1) all models achieve higher alignment with macaque early visual cortex (rho = 0.15-0.30 at V1/V2) than with human fMRI (rho = 0.01-0.08), consistent with the higher signal-to-noise ratio of electrophysiology; (2) STDP and PC produce the highest macaque V1/V2 alignment (rho ~ 0.30 and 0.28), consistent with their leading position among trained rules in human V1; (3) at IT, learning rule rankings show no detectable correlation across species (Kendall's tau = 0.00, p = 1.00), though this null result is expected given that n = 5 provides power only at tau = +/-1.0, and is further confounded by stimulus set differences; (4) a pretrained ResNet-50 (ImageNet) achieves rho = 0.25 at macaque IT, substantially above all custom CNN conditions (rho = 0.07-0.14), suggesting IT alignment is limited by model capacity and training data rather than by the learning rule. Noise ceilings, multi-seed variability (5 seeds), and a stimulus-control analysis are reported. These results demonstrate that early visual alignment is robust across species, while higher-area alignment is modulated by model capacity and stimulus domain.
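
The remark that n = 5 provides power only at tau = +/-1.0 can be checked by brute force. The sketch below assumes an exact two-sided permutation test at the 0.05 level with no ties, which the abstract does not state explicitly; the function names are hypothetical.

```python
from itertools import permutations


def kendall_tau(order):
    """Kendall's tau between `order` and the identity ranking (no ties)."""
    n = len(order)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if order[i] < order[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def exact_two_sided_p(tau_obs, n):
    """Share of all n! rankings whose |tau| is at least |tau_obs|."""
    taus = [kendall_tau(p) for p in permutations(range(n))]
    hits = sum(1 for t in taus if abs(t) >= abs(tau_obs) - 1e-12)
    return hits / len(taus)


for tau in (1.0, 0.8, 0.0):
    print(tau, exact_two_sided_p(tau, 5))
```

With five learning rules there are only 120 possible rankings, so perfect agreement or perfect reversal (2 of 120) is the only outcome below 0.05; one swapped pair (tau = 0.8) already gives 10 of 120, and tau = 0.00 gives p = 1.00 as reported.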

2605.22390 2026-05-22 cs.LG version update

A Posterior-Predictive Variance Decomposition for Epistemic and Aleatoric Uncertainty in Wind Power Forecasting

Yinsong Chen, Samson S. Yu, Kashem M. Muttaqi

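Affiliations School of Engineering, Deakin University; ARC Training Centre in Energy Technologies for Future Grids, School of Engineering, University of Wollongong

AI summary This paper proposes a posterior-predictive variance decomposition that separates epistemic from aleatoric uncertainty in wind power forecasting by splitting total uncertainty into aleatoric and epistemic components, and introduces a wind-power-specific evaluation framework to validate the decomposition.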

English abstract

Accurate wind power forecasting requires reliable uncertainty quantification, yet most existing methods report a single predictive uncertainty that conflates epistemic and aleatoric sources. This paper applies the law of total variance to the joint setting of heteroscedastic neural network regression and Bayesian posterior approximation, deriving an explicit decomposition of total uncertainty (TU) into aleatoric (AU) and epistemic (EU) components. The resulting estimators are compatible with standard posterior-approximation methods and with $\beta$-NLL training to regulate the mean--variance learning trade-off. A wind power--specific evaluation framework is proposed to validate disentanglement without access to ground-truth uncertainty labels, comprising three modules: controlled synthetic experiments to verify responses to heteroscedastic noise and distribution shift; data-property--driven validation on a real-world wind turbine SCADA dataset; and dataset-size scaling experiments to examine the predicted asymptotic behavior of EU. Across synthetic and real-world experiments, the decomposed AU and EU components respond in theoretically consistent directions to noise structure, distributional shift, and training-scale variation, supporting the theoretical consistency and operational utility of the proposed decomposition and evaluation protocol.
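
The law of total variance splits the predictive variance as Var[y | x] = E[sigma^2(x)] + Var[mu(x)], where the expectation and variance are taken over posterior samples of the network. A minimal sketch of the resulting estimators follows; the function name `decompose_uncertainty` and the three hypothetical posterior samples are illustrative assumptions, and the paper's exact estimators may differ.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split total predictive variance for one input x.

    means[s] and variances[s] are the predicted mean and noise variance
    from posterior sample s (for example, one ensemble member).
    """
    aleatoric = fmean(variances)   # AU: average predicted noise variance
    epistemic = pvariance(means)   # EU: spread of the predicted means
    total = aleatoric + epistemic  # TU = AU + EU
    return total, aleatoric, epistemic


# Hypothetical outputs of three posterior samples at a single input.
means = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.06]
print(decompose_uncertainty(means, variances))
```

With more training data the posterior samples should agree more closely on the mean, so EU shrinks while AU stays at the level of the irreducible noise.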

2605.22387 2026-05-22 cs.LG cs.CE version update

Hybrid Kolmogorov-Arnold Network and XGBoost Framework for Week-Ahead Price Forecasting in Australia's National Electricity Market

Houxuan Zhou, Sriram Prasad, Chenghao Huang, Jiajie Feng, Hao Wang

发表机构 * Department of Data Science and AI, Faculty of IT, Monash University, Australia(数据科学与人工智能系,IT学院,墨尔本大学,澳大利亚) School of Electrical Engineering and Computer Science, University of Queensland, Australia(电气工程与计算机科学学院,昆士兰大学,澳大利亚) Monash Energy Institute, Monash University, Australia(墨尔本能源研究所,墨尔本大学,澳大利亚)

AI总结 本文提出了一种混合 KAN+XGBoost 框架,用于预测澳大利亚国家电力市场的周前电价。该框架结合了 Kolmogorov-Arnold 网络的全局非线性表示能力和 XGBoost 的局部鲁棒性,以捕捉长期依赖和短期价格波动。实验表明,该模型的 MAE 比 XGBoost 降低约 12%,比朴素基线降低 50% 以上。

Comments The 24th IEEE International Conference on Industrial Informatics, 2026

详情
AI中文摘要

准确的电价预测(EPF)对于市场参与者开展运营规划和风险管理至关重要,但由于价格波动剧烈、动态非线性且极端价格尖峰频发,这一任务仍具挑战性。这些挑战在澳大利亚国家电力市场(NEM)中尤为突出,其高可再生能源渗透率进一步增加了不确定性。本文研究周前电价预测,并提出一种混合 KAN+XGBoost 框架,将 Kolmogorov-Arnold 网络(KAN)与基于树的学习方法相结合。该方法结合了 KAN 的全局非线性表示能力与 XGBoost 的局部鲁棒性,以同时捕捉长期依赖和短期价格波动。实验在真实 NEM 数据上采用扩展窗口评估策略进行。结果表明,所提模型优于 SARIMAX、长短期记忆网络(LSTM)、单独的 KAN 和 XGBoost 等基准方法,其 MAE 比 XGBoost 降低约 12%,比朴素基线降低 50% 以上。这些结果表明,混合学习策略为高度动态的电力市场中的电价预测提供了一种有效且稳健的方案。

英文摘要

Accurate electricity price forecasting (EPF) is essential for market participants to support operational planning and risk management, yet remains challenging due to strong volatility, nonlinear dynamics, and frequent extreme price spikes. These challenges are particularly pronounced in the Australian National Electricity Market (NEM), where high renewable penetration further increases uncertainty. This paper investigates week-ahead electricity price forecasting and proposes a hybrid KAN+XGBoost framework that integrates Kolmogorov-Arnold Networks (KAN) with tree-based learning. The proposed approach combines the global nonlinear representation capability of KAN with the local robustness of XGBoost to capture both long-term dependencies and short-term price fluctuations. Experiments are conducted on real-world NEM data using an expanding window evaluation strategy. The results demonstrate that the proposed model outperforms benchmark methods, including SARIMAX, Long Short-Term Memory (LSTM), standalone KAN, and XGBoost, reducing MAE by approximately 12% compared to XGBoost and by over 50% compared to a naive baseline. The results suggest that hybrid learning strategies provide an effective and robust solution for electricity price forecasting in highly dynamic electricity markets.
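The expanding window evaluation strategy mentioned above can be sketched as follows: the training set always starts at the first observation and grows by one forecast horizon per fold, and each fold is scored with MAE. This is an illustrative sketch only; the function names, the fold sizes, and the toy numbers are assumptions and are not taken from the paper.

```python
def expanding_window_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices) where the training set only grows."""
    if initial_train <= 0 or horizon <= 0:
        raise ValueError("initial_train and horizon must be positive")
    train_end = initial_train
    while train_end + horizon <= n_samples:
        # Train on everything seen so far, test on the next `horizon` points.
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += horizon

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Ten observations, four in the first training window, two-step test folds.
splits = [(len(train), list(test))
          for train, test in expanding_window_splits(10, 4, 2)]
```

Because every test fold lies strictly after its training window, no future prices leak into training, which is the usual reason for preferring this scheme over shuffled cross-validation in forecasting.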

2605.22385 2026-05-22 cs.LG 版本更新

Efficient Higher-order Subgraph Attribution via Message Passing

通过消息传递实现高效的高阶子图归因

Ping Xiong, Thomas Schnake, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

发表机构 * BIFOLD -- Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究院) Department of Artificial Intelligence, Korea University, Seoul 136-713, Korea(高丽大学人工智能系) RIKEN Center for AIP, Japan(日本理化学研究所AIP中心)

AI总结 本文提出了一种基于消息传递的高效算法,能够以相对于网络深度的线性时间用GNN-LRP对子图进行归因,并将子图归因推广为同时考虑邻近图特征的形式,实验表明该方法带来显著加速且具有较高的实用性。

Comments Published in ICML 2022

详情
Journal ref
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:24478-24495, 2022
AI中文摘要

近年来,解释图神经网络(GNN)变得越来越重要。高阶解释方案,如GNN-LRP(面向GNN的逐层相关性传播),已成为揭示不同特征如何相互作用、进而解释GNN的有力工具。GNN-LRP为各层节点之间的游走(walk)给出相关性归因,而子图归因则表示为指数级数量的此类游走之和。在本工作中,我们证明这种指数复杂度是可以避免的。具体而言,我们提出了新的算法,能够以相对于网络深度的线性时间用GNN-LRP对子图进行归因。这些算法借助利用分配律的消息传递技术推导而来,从而直接计算高阶解释所需的量。我们进一步将这些高效算法加以改造,用于计算一种广义的子图归因,该归因还考虑了邻近的图特征。实验结果表明,所提算法带来了显著加速,并展示了这一广义子图归因方法的高实用性和可扩展性。

英文摘要

Explaining graph neural networks (GNNs) has become more and more important recently. Higher-order interpretation schemes, such as GNN-LRP (layer-wise relevance propagation for GNN), emerged as powerful tools for unraveling how different features interact thereby contributing to explaining GNNs. GNN-LRP gives a relevance attribution of walks between nodes at each layer, and the subgraph attribution is expressed as a sum over exponentially many such walks. In this work, we demonstrate that such exponential complexity can be avoided. In particular, we propose novel algorithms that enable to attribute subgraphs with GNN-LRP in linear-time (w.r.t. the network depth). Our algorithms are derived via message passing techniques that make use of the distributive property, thereby directly computing quantities for higher-order explanations. We further adapt our efficient algorithms to compute a generalization of subgraph attributions that also takes into account the neighboring graph features. Experimental results show the significant acceleration of the proposed algorithms and demonstrate the high usefulness and scalability of our novel generalized subgraph attribution method.

2605.22380 2026-05-22 cs.CL cs.LG 版本更新

Multi-Stage Training for Abusive Comment Detection in Indic Languages

印度诸语言辱骂评论检测的多阶段训练

Pranshu Rastogi, Madhav Mathur, Ramaneswaran S, Kshitij Mohan

发表机构 * Department of CSE, JIIT Noida(计算机科学与工程系,杰皮信息技术学院,诺伊达) Department of ICE, NSUT Delhi(ICE系,内塔吉·苏巴斯理工大学,德里) Department of IT, VIT Vellore(信息技术系,韦洛尔理工学院,韦洛尔) Department of CSE, IIIT Delhi(计算机科学与工程系,英德拉普拉斯塔信息技术学院,德里)

AI总结 本文结合基于语言的预处理和多个模型的集成,分析其在印度诸语言辱骂评论检测中的性能,并提出一条最小化误报率的流水线,以避免损害言论自由。

Comments 4 pages, EAM2021 selected

详情
AI中文摘要

近年来,社交媒体已成为日益流行的交流工具,人们用它来分享想法、交换信息和讨论观点。鉴于其普及程度和广泛影响,社交媒体必须保持为一个安全的空间。社交媒体上产生的内容可能带有辱骂性,因此检测此类内容变得越来越重要。本文采用基于语言的预处理和多个模型的集成,并分析它们在辱骂评论检测上的性能。通过大量实验,我们提出了一条最小化误报率(将非辱骂内容标记为辱骂内容)的流水线,使这类系统能够在不损害言论自由的前提下检测辱骂评论。

英文摘要

In recent years social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive and it has become increasingly important to detect such content. In this paper, we use a language-based preprocessing and an ensemble of several models and analyze their performance of abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive as abusive) so that these systems can detect abusive comments without undermining the freedom of expression.

2605.22379 2026-05-22 cs.HC cs.AI cs.LG 版本更新

Cross-Subject EEG Emotion Recognition Based on Temporal Asynchronous Alignment Contrastive Learning

基于时间异步对齐对比学习的跨被试EEG情绪识别

Ying Xie, Yi Zheng, Zehui Xiao, Wenkai Lu, Mengting Liu

发表机构 * School of Biomedical Engineering, Shenzhen Campus of Sun Yat-sen University(中山大学生物医学工程学院深圳校区) School of Computer Science and Technology, Tianjin University(天津大学计算机科学与技术学院)

AI总结 本文提出了一种基于时间异步对齐的对比学习(TA2CL)框架,用于解决跨被试EEG情绪识别中因不同被试响应时间不一致而导致的识别问题,通过改进相似性计算策略,提升模型对被试间差异和时间延迟的鲁棒性。

Comments 16 pages, 7 figures

详情
AI中文摘要

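随着科学技术的发展,情绪研究的重要性日益凸显。近年来,基于脑电图(EEG)的情绪识别因其客观性和高时间分辨率,已成为一个活跃的研究领域。然而,大多数现有方法侧重于优化编码器结构以增强特征提取能力,而对相似性计算策略关注较少,尤其忽略了不同被试之间响应可能存在的时间不对齐问题。为了解决这些不足,本文受自然语言处理(NLP)中ColBERT的晚期交互机制启发,提出了一种基于时间异步对齐的对比学习(TA2CL)框架。该方法将传统的全局“硬对齐”相似性计算方式转变为细粒度的局部匹配机制,使模型能够自适应地搜索并对齐两段EEG信号之间“局部高度相关”的片段,从而有效缓解被试间差异和时间延迟的影响。实验结果表明,所提方法在多个公开数据集上取得了良好的性能:在FACED数据集上,九分类任务的准确率为64.5%,二分类任务的准确率为79.5%;在SEED和SEED-V数据集上,准确率分别为86.4%和70.1%,验证了该方法的有效性和泛化能力。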

英文摘要

With the advancement of science and technology, the importance of emotion research has become increasingly evident. Electroencephalography (EEG)-based emotion recognition has emerged as an active research area in recent years, owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to enhance feature extraction capabilities, while paying relatively little attention to similarity calculation strategies, particularly overlooking the potential temporal misalignment of responses among different subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. This method transforms the traditional global "hard alignment" similarity calculation approach into a fine-grained local matching mechanism, enabling the model to adaptively search for and align "locally highly correlated" segments between two EEG signals, thereby effectively mitigating the effects of inter-subject differences and temporal delays. Experimental results demonstrate that the proposed method achieves strong performance across multiple public datasets. Specifically, on the FACED dataset, it achieves an accuracy of 64.5% for the nine-class classification task and 79.5% for the binary classification task, while on the SEED and SEED-V datasets, it achieves accuracies of 86.4% and 70.1%, respectively, validating the method's effectiveness and generalization capability.

2605.22377 2026-05-22 cs.LG 版本更新

Towards Explainability of SLMs by investigating Token Level Activation

通过研究token层面激活实现SLMs的可解释性

Sayantani Ghosh, Rajashik Datta, Amit Kumar Das, Amlan Chakrabarti

发表机构 * Information Technology(信息技术) A.K. Choudhury School of Information Technology(A.K. Choudhury 信息技术学院) Computer Science & Engineering(Artificial Intelligence)(计算机科学与工程(人工智能)) Institute of Engineering & Management(工程与管理学院) Computer Science & Engineering(计算机科学与工程) University of Calcutta(加尔各答大学)

AI总结 本文提出了一种轻量且模型无关的框架,利用BERT第8层隐藏状态的激活强度量化token层面的表示重要性,揭示了激活强度集中于语义信息丰富的token,为以注意力为中心的分析提供了可解释且计算高效的替代方法,有助于将BERT从黑箱模型转变为更透明的玻璃箱模型。

详情
AI中文摘要

以BERT为代表的基于Transformer的语言模型拥有超过1.1亿个参数,已彻底改变了自然语言理解,但其内部机制对研究人员和从业者而言在很大程度上仍不透明。传统的基于注意力的可解释性方法往往强调结构上重要但语义较弱的token(如标点符号),而不是有意义的语义关系。本文提出了一种轻量且模型无关的框架,利用BERT第8层隐藏状态的激活强度来量化token层面的表示重要性。所提出的激活流网络(AFN)框架使用第8层隐藏表示的L2范数计算token激活强度,从而能够直接对语义显著的token进行排序。本文进一步引入了基于阈值的激活桶划分,使用经验性的上四分位数激活边界将token分为高激活(HIGH)和低激活(LOW)两组。实验观察表明,语义上有意义的实词始终位于高激活桶中,并主导表示激活的变化,而起结构支撑作用的token贡献相对较小。结果表明,第8层是一个关键的语义整合区域,平衡了结构信息与语义信息的处理。通过揭示激活强度如何集中于语义信息丰富的token,本文为以注意力为中心的分析提供了一种可解释且计算高效的替代方法,有助于将BERT从“黑箱”转变为更透明的自然语言理解“玻璃箱”模型。

英文摘要

Transformer-based language models such as BERT having 110M+ parameters have revolutionized natural language understanding, yet their internal mechanisms remain largely opaque to researchers and practitioners. Traditional attention-based interpretability methods often emphasize structurally important but semantically weak tokens such as punctuation marks rather than meaningful semantic relationships. This work introduces a lightweight and model-agnostic framework for quantifying token-level representational importance using hidden-state activation strengths at Layer 8 of BERT. The proposed Activation Flow Network (AFN) framework computes Token Activation Strength using the L2 norm of Layer-8 hidden representations, enabling direct ranking of semantically salient tokens. The study further introduces a threshold-based activation bucket formulation that partitions tokens into HIGH-activation and LOW-activation groups using an empirical upper-quartile activation boundary. Experimental observations demonstrate that semantically meaningful content words consistently occupy the HIGH-activation bucket and dominate representational activation shifts, while structurally supportive tokens contribute comparatively less. The results suggest that Layer 8 acts as a critical semantic consolidation zone balancing structural and semantic information processing. By revealing how activation magnitudes concentrate around semantically informative tokens, this work provides an interpretable and computationally efficient alternative to attention-centric analysis, contributing toward transforming BERT from a "black box" into a more transparent "glass box" model for natural language understanding.
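The two steps named in the abstract, an L2 norm per token followed by an upper-quartile split into HIGH and LOW buckets, can be sketched in plain Python. The two-dimensional vectors below are hypothetical stand-ins for real Layer-8 hidden states (768-dimensional in BERT-base), and the quantile interpolation method and the `>=` comparison are assumptions, since the abstract does not specify them.

```python
import math
from statistics import quantiles

def activation_strength(hidden_vector):
    """L2 norm of one token's hidden-state vector."""
    return math.sqrt(sum(x * x for x in hidden_vector))

def bucket_tokens(tokens, hidden_states):
    """Label each token HIGH or LOW against the upper-quartile strength."""
    strengths = [activation_strength(h) for h in hidden_states]
    # Third quartile of the observed strengths; the interpolation method
    # and the >= comparison are assumptions, not taken from the paper.
    threshold = quantiles(strengths, n=4, method="inclusive")[2]
    labels = [(tok, s, "HIGH" if s >= threshold else "LOW")
              for tok, s in zip(tokens, strengths)]
    return labels, threshold

# Hypothetical toy hidden states, one vector per token.
tokens = ["the", "storm", "destroyed", "bridge", "."]
hidden = [[1, 0], [3, 4], [6, 8], [5, 12], [0, 2]]
labels, threshold = bucket_tokens(tokens, hidden)
high_tokens = [tok for tok, _, label in labels if label == "HIGH"]
```

With these toy values the threshold is 10.0, so "destroyed" and "bridge" land in the HIGH bucket while the article and the punctuation mark stay LOW.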

2605.22376 2026-05-22 cs.LG 版本更新

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

目标对齐的贝尔曼备份用于跨域离线强化学习

Wei Liu, Ting Long

发表机构 * School of Artificial Intelligence(人工智能学院) Jilin University(吉林大学)

AI总结 本文提出了一种基于目标域贝尔曼目标对齐的跨域离线强化学习方法,通过评估源域转移与目标域贝尔曼目标的一致性来提升策略学习性能。

详情
AI中文摘要

跨域离线强化学习(CDRL)旨在利用从源域收集的数据来改进目标域的策略学习。现有方法通常通过度量源域数据与目标域转移的相似性来评估其可迁移性,并隐式地进行转移级别的选择:被判定为相似的转移会被赋予更高的权重或奖励,而不相似的转移则被降权。然而,转移级别的相似性并不一定意味着长期回报的一致性。即使是视觉上或动力学上相似的转移,在目标域中也可能导致显著不同的结果,这会误导策略学习并降低性能。为了解决这个问题,我们重新审视了策略学习的根本目标。由于策略优化最终依赖贝尔曼目标来评估决策的质量,我们提出根据源域转移与目标域贝尔曼目标的对齐程度来评估其可迁移性,而不是依据表面的转移相似性。基于这一见解,我们提出了目标对齐的贝尔曼备份(TABB)方法,通过度量源域数据对目标域中贝尔曼目标准确估计的贡献来有选择地利用源域数据。我们在目标域数据极为有限的多种跨域离线RL设置中评估了TABB。实验结果表明,TABB始终取得了优异的性能。

英文摘要

Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by measuring its similarity to target-domain transitions, and implicitly perform transition-level selection. Transitions that are considered similar are assigned higher weights or rewards, while dissimilar ones are down-weighted. However, transition-level similarity does not necessarily imply consistency in long-term returns. Even visually or dynamically similar transitions may lead to significantly different outcomes in the target domain, which can mislead policy learning and degrade performance. To address this issue, we revisit the fundamental objective of policy learning. Since policy optimization ultimately relies on Bellman targets to evaluate the quality of decisions, we propose to assess the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity. Based on this insight, we propose a method termed Target-Aligned Bellman Backup (TABB), which selectively leverages source-domain data by measuring their contribution to accurate Bellman target estimation in the target domain. We evaluate TABB across a broad range of cross-domain offline RL settings with highly limited target-domain data. Experimental results show that TABB consistently achieves strong performance.

2605.22372 2026-05-22 cs.LG 版本更新

ASAP: Attention Sink Anchored Pruning

ASAP:注意力汇点锚定剪枝

Jaehyuk Lee, Hanyoung Kim, Yanggee Kim, Donghun Lee

AI总结 本文提出ASAP方法,通过将注意力汇点作为特征,利用懒惰随机游走建模视觉Transformer的信息流,实现单次剪枝过程中的token分区和背景冗余压缩,从而在保持或超越基线精度的同时,提升吞吐量达48%。

详情
AI中文摘要

视觉Transformer(ViT)在高分辨率下因自注意力的二次复杂度而面临严重的计算瓶颈。现有的token缩减方法依赖局部指标(如单层注意力分数),这些指标天然容易受到注意力汇点现象的影响,即无信息的token反而优先于显著的前景目标被保留下来。我们提出ASAP(Attention Sink Anchored Pruning),这是一种无需训练的框架,将这一汇点转化为可利用的特性。ASAP将ViT的信息流建模为懒惰随机游走,并将汇点识别为概率质量的主要累积点。通过在累积转移矩阵中计算到汇点的扩散距离,ASAP利用径向扩散聚类对token进行划分,并通过转移权重池化一次性压缩背景冗余。在图像、视频和视觉-语言任务上的大量实验表明,ASAP优于现有最先进方法,在保持甚至超越基线精度的同时,将吞吐量最高提升48%。

英文摘要

Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics - such as single-layer attention scores - that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining - or even exceeding - baseline accuracy.

2605.22368 2026-05-22 cs.LG cs.AI cs.SE 版本更新

VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

VeriScale:对抗性测试套件缩放用于可验证代码生成

Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)科学与工程学院) School of Mathematics, Jilin University(吉林大学数学学院) School of Mathematical Sciences, Tongji University(同济大学数学科学学院) Zhiyuan College, Shanghai Jiao Tong University(上海交通大学致远学院) School of Future Technology, South China University of Technology(华南理工大学未来技术学院) Institute of Natural Sciences, Shanghai Jiao Tong University(上海交通大学自然科学研究院) MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学MOE-LSC、CMA-上海)

AI总结 本文提出VeriScale框架,通过对抗性实现扩展和缩减测试套件,提升代码生成的可验证性,实验表明VerinaPlus显著暴露了模型弱点,而VerinaLite在低成本下保持判别能力。

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地用于软件工程,构建高质量基准对于评估生成代码的功能正确性乃至形式可验证性至关重要。然而,现有基准受限于正负测试用例的数量和质量,导致模型在生成规范和实现方面的能力被高估。为此,我们提出VeriScale,一种由对抗性实现驱动的新框架,包含两个阶段:测试套件扩展,用于构建多样且具有挑战性的测试用例;以及测试套件缩减,用于将其提炼为紧凑而有区分力的套件。虽然VeriScale具有通用性,但我们在Verina上将其实例化,构建了VerinaPlus(将原始测试套件扩展了83倍以上)和VerinaLite(一个14倍的轻量变体)。在八个最先进LLM上的实验表明,VerinaPlus暴露了被原始基准掩盖的显著模型弱点,表现为SpecGen和CodeGen任务上的分数大幅下降,而VerinaLite以很小一部分评估成本保持了这种区分能力。增强后的基准和源代码已在https://github.com/XiaoyangLiu-sjtu/VeriScale公开。

英文摘要

As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by the adversarial implementations. It consists of two stages: test-suite expansion to construct diverse and challenging test cases, and test-suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83$\times$, and VerinaLite, a lightweight 14$\times$ variant. Our experiments across eight state-of-the-art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu-sjtu/VeriScale.

2605.22355 2026-05-22 cs.CL cs.AI cs.LG 版本更新

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

TransitLM: 一个大规模数据集和基准,用于无地图的公共交通路线生成

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu

发表机构 * Alibaba Group(阿里巴巴集团) AMAP

AI总结 本文提出TransitLM,一个包含1300万条公共交通路线规划记录的数据集,用于无地图的公共交通路线生成,展示了通过数据训练模型生成有效路线的能力。

详情
AI中文摘要

公共交通路线规划传统上依赖于结构化的地图基础设施和复杂的路由引擎,而现有的数据集不支持训练模型绕过这种依赖。我们提出了TransitLM,一个包含来自四个中国城市的超过1300万条公共交通路线规划记录的数据集,覆盖120,845个车站和13,666条线路,作为持续预训练语料库和用于三个评估任务的基准数据。实验表明,使用TransitLM训练的LLM能够生成结构上有效的路线,精度高,并且能够隐式地将任意GPS坐标映射到合适的车站,而无需显式映射。这些结果表明,公共交通路线规划可以完全从数据中学习,从而实现端到端、无地图的路线生成,直接从起止点信息生成。数据集和基准可在https://huggingface.co/datasets/GD-ML/TransitLM获取,评估代码在https://github.com/HotTricker/TransitLM。

英文摘要

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

2605.22341 2026-05-22 cs.LG cond-mat.dis-nn 版本更新

A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

一种用于在线Softmax分类中三分之一缩放的边界层机制

Marcel Kühn, Yoon Thelge, Bernd Rosenow

发表机构 * Institute for Theoretical Physics, Leipzig University(理论物理研究所,莱比锡大学) ScaDS.AI Dresden/Leipzig(ScaDS.AI 德累斯顿/莱比锡)

AI总结 本文研究了在线教师-学生模型中平滑替代损失与离散标签之间的不匹配如何产生幂律学习曲线的边界层机制,揭示了测试损失和泛化误差的α^{-1/3}缩放特性,以及学习率调度对泛化误差的改进。

Comments 20 pages, 7 figures

详情
AI中文摘要

硬标签分类通常使用平滑的替代损失进行训练,其中最常见的是softmax交叉熵。我们分离出一种渐近机制:在在线教师-学生模型中,平滑替代损失与离散标签之间的这种不匹配会产生幂律学习曲线。减去平均logit后,热力学极限下的动力学在中心化变量中闭合,即不断增长的中心化学生-教师对齐度D和残余学生方差Δ。在后期,远离教师决策边界的样本已被高置信度地分类,其贡献呈指数级衰减。只有宽度为O(D^{-1})的边界层仍然活跃,而固定学习率的在线梯度下降所带来的噪声使Δ保持非零。作为训练时间α的函数,后期解给出α^{-1/3}的幂律,它不仅适用于测试损失,也适用于泛化误差ε_g(即1减去测试准确率)。这比同一模型的贝叶斯最优参考α^{-1}慢得多。我们进一步表明,学习率调度可以将泛化误差改进到接近ε_g ~ α^{-1/2}的幂律。模拟结果支持所预测的序参量动力学和学习曲线。使用相关高斯输入和白化预训练特征的受控实验表明,数据结构可以主导瞬态过程。因此,我们的结果是一种渐近的、互补的机制,而不是对神经缩放定律谱解释的替代。

英文摘要

Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $Δ$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $Δ$. As a function of the training time $α$ the late-time solution yields a $α^{-1/3}$ power law not only for the test loss but also for the generalization error $ε_g$, i.e., one minus test accuracy. This is much slower than the $α^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards a $ε_g \sim α^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.
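To make the three exponents in the abstract concrete, the following short derivation works out how much additional training time each power law needs to halve the generalization error. It assumes a pure power law $ε_g \propto α^{-p}$ in the late-time regime and ignores prefactors and transients; the multiplier $k$ is introduced here for illustration only.

```latex
\varepsilon_g(\alpha) \propto \alpha^{-p}
\quad\Longrightarrow\quad
\frac{\varepsilon_g(k\alpha)}{\varepsilon_g(\alpha)} = k^{-p},
\qquad
k^{-p} = \tfrac{1}{2} \iff k = 2^{1/p}.

% Training-time multiplier k required to halve the error:
%   p = 1/3 (fixed learning rate):       k = 2^{3} = 8
%   p = 1/2 (learning-rate schedule):    k = 2^{2} = 4
%   p = 1   (Bayes-optimal reference):   k = 2^{1} = 2
```

Under these assumptions, halving the error costs eight times more training time at the $α^{-1/3}$ rate, four times at $α^{-1/2}$, and only twice at the $α^{-1}$ reference rate.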

2605.22340 2026-05-22 cs.LG 版本更新

From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching

从快照到轨迹:通过条件流匹配学习单细胞基因表达动力学

Siyu Pu, Qingqing Long, Xiaohan Huang, Haotian Chen, Jiajia Wang, Meng Xiao, Xiao Luo, Hengshu Zhu, Yuanchun Zhou, Xuezhi Wang

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出单细胞流匹配(scFM)方法,通过条件流匹配学习单细胞基因表达的动力学,解决时间点不连续和长时间预测中的分布漂移问题,提升轨迹推断的准确性和时间一致性。

详情
AI中文摘要

单细胞RNA测序(scRNA-seq)提供细胞状态的高维表达谱,使得以数据驱动的方式建模细胞随时间的动态变化成为可能。在实践中,时间分辨的scRNA-seq只在少数离散时间点以未配对的快照群体形式采集,留下了较大的时间间隙,这促使人们在未测量的时间点上进行轨迹推断。现有方法主要沿两个方向发展:最优传输(OT)对齐在观测快照之间提供分布层面的匹配,而连续时间生成模型则通过学习到的动力学支持预测。然而,仍存在两个挑战:(i)未配对的快照使相邻时间点之间的局部转移变得模糊,导致监督不稳定;(ii)长时程预测依赖重复积分,微小的建模误差会不断累积并造成分布漂移。为了解决这些挑战,我们提出单细胞流匹配(scFM),一种基于耦合条件流匹配的潜在生成框架。首先,我们计算相邻快照之间的熵正则化OT耦合,并用它们构建软加权的流匹配目标,以学习时间依赖的速度场。其次,我们学习双向速度场,并利用其一致性来细化耦合,在稀疏监督下改进时间连贯性。第三,我们引入分布层面的对齐和潜在动力学正则化,以锚定长时程推演并缓解漂移。在真实时间序列scRNA-seq数据集上的实验表明,scFM在时间内插和外推两种设置下都稳定地提升了分布预测性能。此外,在中间时间点缺失的情况下,scFM给出了更准确的轨迹重建和时间上连贯的可视化,表明其更忠实地恢复了潜在的时序基因表达动力学。

英文摘要

Single-cell RNA sequencing (scRNA-seq) provides high-dimensional profiles of cellular states, enabling data-driven modeling of cellular dynamics over time. In practice, time-resolved scRNA-seq is collected at only a few discrete time points as unpaired snapshot populations, leaving substantial temporal gaps. This motivates trajectory inference at unmeasured time points. Existing methods mainly follow two directions, optimal-transport (OT) alignment provides distribution-level matching between observed snapshots, while continuous-time generative models support forecasting via learned dynamics. However, two challenges remain: (i) unpaired snapshots render local transitions between adjacent time points ambiguous, leading to unstable supervision; and (ii) long-horizon prediction relies on repeated integration, where small modeling errors compound and cause distribution drift. To address these challenges, we propose single-cell Flow Matching (scFM), a latent generative framework based on coupling-conditioned flow matching. First, we compute entropically regularized OT couplings between adjacent snapshots and use them to construct soft, weighted flow-matching targets for learning time-dependent velocity fields. Second, we learn bidirectional velocity fields and leverage their consistency to refine couplings and improve temporal coherence under sparse supervision. Third, we introduce distribution-level alignment and latent dynamic regularization to anchor long rollouts and mitigate drift. Experiments on real-world time-series scRNA-seq datasets show that scFM consistently improves distributional prediction performance for both temporal interpolation and extrapolation. Moreover, scFM yields more accurate trajectory reconstruction and temporally coherent visualizations where intermediate time points are absent, indicating a more faithful recovery of underlying temporal gene expression dynamics.

2605.22338 2026-05-22 cs.LG 版本更新

Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction

物理引导的生成求解器:连接数据驱动先验与守恒定律以稳定时空场重建

Ziyuan Zhu, Keyu Hu, Zhifei Chen, Yuhao Shi, Ming Bao, Jing Zhao, Gang Wang, Haitan Xu, Jiadong Li, Qijun Zhao, Xiaodong Li, Minghui Lu, Yanfeng Chen

发表机构 * School of Advanced Manufacturing Engineering, Nanjing University(南京大学先进制造工程学院) National Laboratory of Solid State Microstructures, Nanjing University(南京大学固态微结构国家实验室) Suzhou Acoustics Industry Technology Research Institute Co., Ltd.(苏州声学工业技术研究所有限公司) School of Mechanical and Electric Engineering, Soochow University(苏州大学机电工程学院) Shishan Laboratory, Nanjing University(南京大学狮山实验室)

AI总结 本文提出了一种物理引导的生成求解器,通过分离稳定的先验学习与推理时的守恒定律强制执行,解决了从稀疏测量中重建连续物理场的问题,同时在声学和气象学中实现了高效且稳定的场重建。

详情
AI中文摘要

从稀疏测量中重建连续物理场是一个核心逆问题,但数据驱动的生成模型可能产生违反控制动力学的状态。我们提出了一种物理引导的生成求解器,将稳定的先验学习与推理时的守恒定律约束分离开来。Martingale-Regularized Score Matching利用Score Fokker-Planck约束对分数预训练进行正则化,从而得到动力学上稳定的先验。Physics-Informed Implicit Score Sampling随后利用物理残差的梯度引导去噪轨迹,无需重新训练即可将样本投影到可容许流形附近。在声学中,该方法从稀疏传感器联合生成声压和质点速度,从而构成能够抑制空间混叠的密集虚拟阵列。同一框架还可推广到极端稀疏条件下的真实ERA5气象场。总体而言,这项工作为求解高维逆问题建立了一个严谨且可推广的范式,弥合了生成式人工智能与第一性原理科学之间的差距。

英文摘要

Reconstructing continuous physical fields from sparse measurements is a central inverse problem, but data-driven generative models can produce states that violate governing dynamics. We introduce a physics-informed generative solver that separates stable prior learning from inference-time enforcement of conservation laws. Martingale-Regularized Score Matching regularizes score pretraining with a Score Fokker-Planck constraint, yielding a dynamically stable prior. Physics-Informed Implicit Score Sampling then guides denoising trajectories by gradients of physical residuals, projecting samples toward admissible manifolds without retraining. In acoustics, the method co-generates pressure and particle velocity from sparse sensors, enabling dense virtual arrays that suppress spatial aliasing. The same framework generalizes to real-world ERA5 meteorological fields under extreme sparsity. Together, this work establishes a rigorous and generalizable paradigm for solving high-dimensional inverse problems, bridging the gap between generative artificial intelligence and first-principles science.

2605.22335 2026-05-22 cs.LG 版本更新

Learning Causal Orderings for In-Context Tabular Prediction

面向上下文表格预测的因果顺序学习

Sascha Xu, Sarah Mameche, Jilles Vreeken

AI总结 本文研究了如何在表格预测中同时推断和强制因果结构,通过拓扑变量顺序形式进行因果结构推断,提出TabOrder模型利用因果顺序约束注意力机制,在学习的因果顺序下仅基于先于目标的特征进行预测,并通过似然目标无监督学习最优变量顺序,同时探讨了样本缺失对因果方向识别的影响。

详情
AI中文摘要

面向表格数据的上下文学习在观测设置下树立了很高的预测标准;但它主要依赖相关性结构,而这种结构在分布偏移或干预下会变得不可靠。虽然已有成熟的因果结构发现方法,但它们往往侧重于结构可识别性,并且与本可从中受益的预测架构相互脱节。为了弥合这两种视角,我们研究如何在表格预测中以拓扑变量顺序的形式同时推断并施加因果结构。与标准架构不同,我们的模型TabOrder使用受因果顺序约束的注意力,仅依据在学习到的因果顺序中先于目标的特征进行预测。与因果发现方法类似,TabOrder通过基于似然的目标以无监督方式学习最优变量顺序。我们在标准函数模型类下论证了这一选择的合理性,并研究了样本缺失(表格数据中的常见挑战)如何与因果方向识别相互作用。实验上,我们证实TabOrder在完成预测和插补任务的同时能够恢复准确的变量顺序,并为干预下的真实生物数据提供了见解。

英文摘要

In-context learning for tabular data sets strong predictive standards in observational settings; it however primarily relies on correlational structure, which becomes unreliable under distribution shift or intervention. While established methods to discover causal structure exist, they are often focused on structure identifiability and decoupled from the predictive architectures that could benefit from them. To bridge these perspectives, we study how to simultaneously infer and enforce causal structure in the form of topological variable orderings into tabular prediction. Unlike standard architectures, our model TabOrder uses causal order-constrained attention, basing predictions only on features that precede a target under a learned causal order. Similar to causal discovery methods, TabOrder learns the optimal variable ordering in an unsupervised manner through a likelihood-based objective. We justify this choice under standard functional model classes and also study how sample missingness, a common challenge in tabular data, interacts with causal direction identification. Empirically, we confirm that TabOrder recovers accurate variable orderings while addressing prediction and imputation tasks, as well as gives insight into real-world biological data under intervention.

2605.22334 2026-05-22 cs.LG 版本更新

Riemannian geometry meets fMRI: the advantages of modeling correlation manifolds and eigenvector subspaces

黎曼几何与fMRI的结合:建模相关流形和特征向量子空间的优势

Mario Severino, Manuela Moretto, Robert A. McCutcheon, Mattia Veronese

发表机构 * Department of Information Engineering, University of Padova(信息工程系,帕多瓦大学) Department of Neuroimaging, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King’s College London(神经影像系,精神病学、心理学与神经科学研究所(IoPPN),伦敦国王学院) Department of Psychiatry, University of Oxford(精神病学系,牛津大学) Oxford Health NHS Foundation Trust, Warneford Hospital(牛津健康国家卫生信托基金,沃内福德医院) Department of Psychosis Studies, Institute of Psychiatry, Psychology and Neuroscience, King’s College London(精神病学研究系,精神病学、心理学与神经科学研究所,伦敦国王学院)

AI总结 本文提出了一种可扩展的几何框架,通过Off-log度量和Grassmannian子空间判别方法,改进了fMRI数据的分析,提高了敏感性和预测性能。

详情
AI中文摘要

相关矩阵是功能脑网络的基本概括,但标准分析通常独立地处理各个元素,忽略了相关矩阵空间的弯曲几何。现有的几何方法往往缺乏闭式运算,或依赖任意的脑区排序,限制了可扩展性。我们提出了一种可扩展的几何框架,包含两个组成部分:(i)Off-log度量,一种将相关矩阵映射为对称零对角矩阵的平滑变换。它使距离、Fréchet均值和线性模型都具有闭式表达,从而无需复杂的流形优化即可进行标准统计建模。(ii)Grassmannian子空间判别,通过特征向量子空间之间的主角距离比较被试,消除了固有的符号和基的模糊性。这两个组成部分都可以集成到标准机器学习工作流中,用于推断、回归和分类。在两个临床队列(帕金森病和精神病)和三个衰老fMRI数据集上的验证表明,Off-log度量提高了置换检验的灵敏度,并在分类中达到或超过黎曼和欧几里得基线。脑年龄预测性能相当,其中黎曼度量在三个队列中的两个上表现最佳。Grassmannian方法始终优于欧几里得基线,并突显了与疾病相关的网络。总体而言,几何感知的表示提高了灵敏度和预测性能,同时仍易于大规模部署。

英文摘要

Correlation matrices are fundamental summaries of functional brain networks, yet standard analyses often treat entries independently, ignoring the curved geometry of correlation space. Existing geometric methods frequently lack closed-form operations or depend on arbitrary region ordering, limiting scalability. We introduce a scalable geometric framework with two components: (i) the Off-log metric, a smooth transformation mapping correlation matrices to symmetric zero-diagonal matrices. This enables closed-form expressions for distances, Frechet means, and linear models, allowing standard statistical modeling without complex manifold optimization. (ii) Grassmannian subspace discrimination, which compares subjects via principal-angle distances between eigenvector subspaces, resolving inherent sign and basis ambiguities. Both components integrate into standard machine-learning workflows for inference, regression, and classification. Validated across two clinical cohorts (Parkinson's and psychosis) and three ageing fMRI datasets, the Off-log metric increased sensitivity in permutation tests and matched or exceeded Riemannian and Euclidean baselines in classification. Brain-age prediction performance was comparable, with Riemannian metrics excelling in two of three cohorts. The Grassmannian method consistently outperformed Euclidean baselines, highlighting disease-relevant networks. Overall, geometry-aware representations improve sensitivity and predictive performance while remaining straightforward to deploy at scale.

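2605.22331 2026-05-22 cs.LG cs.AI cs.DC Updated version

SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection

Santiago Ospitia, John Sanabria, John Garcia-Henao

Affiliations: School of Systems Engineering and Computing, University of Valle; Digital Medicine Unit, Balgrist University Hospital; Nucleus-AI Research

AI summary: This paper presents the SepsisAI-Orchestrator platform, which addresses the challenge of deploying AI models for early sepsis detection by integrating HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier, and a Streamlit clinical dashboard, and demonstrates U-shaped scaling behavior through load testing.

Comments 13 pages, 5 figures. Submitted to BioCARLA 2025 Workshop

Details
AI abstract (translated from Chinese)

Despite strong predictive results in the clinical machine-learning literature, moving these models to the bedside is still held back by system-level barriers: heterogeneous data representations, the lack of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency demands of hospital environments. We present SepsisAI-Orchestrator, an open-source modular platform intended to close this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served through a REST API, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 of 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution is the surrounding infrastructure and its empirical characterization under load. In k6 tests with 50-1000 concurrent virtual users, we find that the replica count must match the number of physical CPU threads on the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3 s to 1.41 s (a 57.3% reduction) and eliminates all request failures, whereas over-provisioning to 24 or 48 replicas degrades performance because of scheduler contention. To our knowledge, this U-shaped scaling behavior has not previously been quantified for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.

English abstract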

Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50-1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over-provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U-shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.
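The latency figure quoted in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the 57.3% reduction from the two p95 values and shows one way to express the "one replica per CPU thread" rule of thumb. The helper `suggested_replicas` is hypothetical and is not part of the platform.

```python
import os

def relative_reduction(before, after):
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before

# p95 latencies quoted in the abstract, in seconds.
p95_3_replicas = 3.3
p95_12_replicas = 1.41
reduction_pct = 100 * relative_reduction(p95_3_replicas, p95_12_replicas)

def suggested_replicas(cpu_threads=None):
    """Hypothetical helper: one replica per hardware thread, never fewer than one."""
    threads = cpu_threads if cpu_threads is not None else (os.cpu_count() or 1)
    return max(1, threads)

print(f"p95 reduction: {reduction_pct:.1f}%")
print(f"replicas for a 12-thread host: {suggested_replicas(12)}")
```

Matching the replica count to the thread count is only the operating point reported for this workload; other hosts and models may need their own load test.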

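2605.22304 2026-05-22 cs.AI cs.DB cs.LG Updated version

Evaluation of Pipelines for Data Integration into Knowledge Graphs

Marvin Hofer, Erhard Rahm

Affiliations: ScaDS.AI Dresden/Leipzig; Leipzig University

AI summary: This paper proposes the KGI-Bench benchmark for evaluating pipelines that integrate different input data into an existing knowledge graph. It analyses the quality of the output knowledge graph with three metrics (coverage, correctness, and consistency) and provides benchmark datasets in the movie domain to evaluate 12 pipelines.

Details
AI abstract (translated from Chinese)

Integrating new data into a knowledge graph (KG) usually involves different tasks that run within a workflow or pipeline. Many pipelines are possible for a given integration problem, but there is not yet a general approach for evaluating the overall quality and performance of such pipelines in order to determine the best choices. We therefore propose a new benchmark, KGI-Bench, for evaluating pipelines that integrate different kinds of input data into an existing KG. We evaluate a pipeline by analysing its output, i.e., the updated KG, with three complementary quality metrics: coverage, correctness, and consistency. We also provide benchmark datasets for the movie domain (a seed KG, overlapping input data in three formats, and a reference KG as ground truth). To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyse their behaviour across different input data formats and design choices.

English abstract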

Integrating new data into knowledge graphs (KGs) typically involves different tasks that are executed within workflows or pipelines. There are many possible pipelines for a specific integration problem, but there is not yet a general approach for evaluating the overall quality and performance of such pipelines in order to determine the best choices. We therefore propose a new benchmark, KGI-Bench, to evaluate integration pipelines that ingest different kinds of input data into an existing KG. We evaluate pipelines by analyzing their output, i.e., the updated KG, with three complementary quality metrics: coverage, correctness, and consistency. We also provide benchmark datasets (seed KG, overlapping input data in three formats, reference KG as ground truth) for the movie domain. To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyze their behavior across different input data formats and design choices.

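2605.22300 2026-05-22 cs.AI cs.LG cs.MA Updated version

Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

Fiona Y. Wong, Markus J. Buehler

Affiliations: Laboratory for Atomistic and Molecular Mechanics (LAMM); Department of Biological Engineering; Department of Mechanical Engineering; Department of Civil and Environmental Engineering; Center for Computational Science and Engineering, Schwarzman College of Computing; Massachusetts Institute of Technology

AI summary: Using cross-domain benchmarks, this paper examines when coordinated AI agents improve scientific inference from partial evidence. It finds that when different disciplines each capture only part of a phenomenon, cross-channel composites outperform single-channel baselines, although decomposition does not always improve overall performance.

Details
AI abstract (translated from Chinese)

Scientific evidence often spans instruments, databases, and disciplines, so no single source records a phenomenon in full. This makes it hard to determine when coordinated AI agents can outperform simpler scientific workflows. We evaluate the question with a cross-domain benchmark covering four scientific tasks: mapping molecular structure to musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites outperform single-channel baselines: climate-vector emergence reaches an AUROC of 0.944 and exoplanet vetting reaches an AUROC of 0.955. However, the exoplanet workflow is nearly tied with a strong combined-summary baseline, which shows that decomposition does not always improve overall performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifacts and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by an explicit comparator.

English abstract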

Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955. However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifact and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.

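2605.22291 2026-05-22 cs.LG Updated version

Long-term Fairness with Selective Labels

Giovani Valdrighi, Isabel Valera, Marcos Medeiros Raimundo

Affiliations: Department of Computer Science, Saarland University, Saarbrücken, Germany

AI summary: This paper studies long-term fairness in the selective-labels setting. It proposes a new framework that combines observed data with a label predictor model to estimate the true fairness measure, and a new reinforcement learning algorithm for effective long-term fair decision-making.

Details
AI abstract (translated from Chinese)

Long-term fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision policies and population behaviour. Most previous methods evaluate performance and fairness measures from observable features and labels, where the labels are assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., the ability to repay a loan) are selective labels, because they are revealed only after a positive decision (e.g., when the loan is granted). In this paper, we study long-term fairness in the selective-labels setting and show analytically that naive solutions cannot guarantee fairness. To close this gap, we introduce a new framework that uses the observed data and a label predictor model to estimate the true value of the fairness measure, decomposing it into the observed fairness and the bias from label predictions. This lets us derive sufficient conditions for satisfying true fairness by using the confidence of the predictor model. Finally, we build on our theoretical results to propose a new reinforcement learning algorithm for effective long-term fair decision-making. In semi-synthetic environments, the proposed algorithm is comparable in fairness and performance to an agent with oracle access to the true labels.

English abstract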

Long-term fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision-making policies and population behavior. Most previous approaches evaluate performance and fairness measures from observable features and a label, which is assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., ability to repay the loan) are selective labels as they are only revealed based on positive decisions (e.g., when a loan is granted). In this paper, we study long-term fairness in the selective labels setting and analytically show that naive solutions do not guarantee fairness. To address this gap, we then introduce a novel framework that leverages both the observed data and a label predictor model to estimate the true fairness measure value by decomposing it into the observed fairness and bias from label predictions. This allows us to derive sufficient conditions to satisfy true fairness from observable quantities by using the confidence in the predictor model. Finally, we rely on our theoretical results to propose a novel reinforcement learning algorithm for effective long-term fair decision-making with selective labels. In semisynthetic environments, the proposed algorithm reached comparable fairness and performance to an agent with oracle access to the true labels.

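2605.22286 2026-05-22 cs.LG cs.AI Updated version

EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

Zhaomin Wu, Jiayi Li, Bingsheng He

Affiliations: Department of Computer Science, National University of Singapore

AI summary: This paper studies robust depression tracking from counseling transcripts across single-session and multi-session regimes. It introduces the LongCounsel multi-session counseling dataset and the EmoTrack framework, which combines LLM-extracted clinical signals with frozen turn-level semantic embeddings, trains symptom-specific predictors, and incorporates prior sessions through compact cross-session memory. Experiments show strong performance on the real single-session benchmark.

Details
AI abstract (translated from Chinese)

Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions that need timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly when data are scarce, while prompt-based LLM methods are data-efficient but usually process each transcript as a whole and offer limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision, for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors on the resulting transcript representation. When prior sessions are available, EmoTrack can also incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and stays competitive with the strongest longitudinal baseline on LongCounsel.

English abstract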

Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.

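2605.22275 2026-05-22 cs.LG Updated version

Adaptive Measurement Allocation for Learning Kernelized SVMs Under Noisy Observations

Artur Miroszewski

Affiliations: Φ-lab, European Space Agency (ESA/ESRIN), Frascati, Italy

AI summary: This paper proposes an adaptive measurement-allocation strategy for learning kernelized support vector machines from noisy observations. By combining geometric sensitivity and active-set instability, it concentrates measurements on the decision-critical regions of the kernel matrix, improving support-vector recovery, margin estimation, and decision-function accuracy.

Comments 20 pages, 9 figures

Details
AI abstract (translated from Chinese)

Kernel methods are usually formulated under the assumption of exact access to the Gram matrix. However, in emerging settings such as quantum machine learning, each kernel entry must be inferred from noisy observations, and its accuracy depends on how a limited measurement budget is allocated. Even so, existing methods mostly rely on uniform allocation, which reduces estimator variance equally but ignores the highly non-uniform dependence of kernelized classifiers on the Gram matrix. In this paper, we propose an adaptive measurement-allocation strategy for learning kernelized support vector machines from noisy Bernoulli observations. Our approach combines two complementary principles: (i) geometric sensitivity, which captures how perturbations of individual kernel entries affect the classifier margin, and (ii) active-set instability, which quantifies the probability of discrete changes in support-vector membership caused by measurement noise. These signals define a task-aware allocation scheme that concentrates measurements on the most decision-critical regions of the kernel matrix. We provide a theoretical analysis showing that the benefit of adaptive allocation is governed by the heterogeneity of the induced kernel importance structure, so that either the adaptive or the uniform strategy is preferable depending on the regime. Experiments on synthetic datasets show that, under a fixed measurement budget, adaptive allocation significantly improves support-vector recovery, margin estimation, and decision-function accuracy. A dual-coefficient stability criterion further enables early stopping, reaching near-optimal performance at a fraction of the measurement cost. Additional experiments on quantum kernels derived from real-world data reveal regime-dependent behaviour consistent with known phenomena such as kernel concentration.

English abstract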

Kernel methods are typically formulated under the assumption of exact, noise-free access to the Gram matrix. However, in emerging settings such as quantum machine learning, each kernel entry must be inferred from noisy observations, and its accuracy depends on how a limited measurement budget is allocated. Despite this, existing approaches overwhelmingly rely on uniform allocation, which equalizes estimator variance but ignores the highly non-uniform dependence of kernelized classifiers on the Gram matrix. In this work, we introduce an adaptive measurement-allocation strategy for learning kernelized Support Vector Machines (SVMs) from noisy Bernoulli observations. Our approach combines two complementary principles: (i) geometric sensitivity, capturing how perturbations of individual kernel entries affect the classifier margin, and (ii) active-set instability, quantifying the probability of discrete changes in support-vector membership induced by measurement noise. These signals define a task-aware allocation scheme that concentrates measurements on the most decision-critical regions of the kernel matrix. We provide a theoretical analysis showing that the benefit of adaptive allocation is governed by the heterogeneity of the induced kernel importance structure, leading to distinct regimes in which adaptive or uniform strategies are preferable. Empirical evaluations on synthetic datasets demonstrate that adaptive allocation significantly improves support-vector recovery, margin estimation, and decision-function accuracy under fixed measurement budgets. A dual-coefficient stability criterion further enables early stopping, achieving near-optimal performance while using only a fraction of the measurement cost. Additional experiments on quantum kernels derived from real-world data reveal a regime-dependent behavior aligned with known phenomena such as kernel concentration. Together...
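The contrast between uniform and task-aware allocation described in this abstract can be illustrated with a small NumPy sketch. It assumes that an importance score per kernel entry is already available and simply splits a fixed shot budget in proportion to it. The paper's actual scores (geometric sensitivity and active-set instability) are not reproduced here; the `importance` values, the kernel values, and both function names are hypothetical.

```python
import numpy as np

def allocate_shots(importance, budget, min_shots=1):
    """Split a shot budget across kernel entries in proportion to importance."""
    importance = np.asarray(importance, dtype=float)
    shots = np.full(importance.size, min_shots, dtype=int)
    remaining = budget - shots.sum()
    if remaining < 0:
        raise ValueError("budget too small for min_shots per entry")
    weights = importance / importance.sum()
    extra = np.floor(weights * remaining).astype(int)
    shots += extra
    # Hand out the shots lost to rounding, most important entries first.
    leftover = max(int(remaining - extra.sum()), 0)
    order = np.argsort(-importance)
    shots[order[:leftover]] += 1
    return shots

def estimate_entries(true_values, shots, rng):
    """Estimate each kernel entry as the mean of Bernoulli(true_value) samples."""
    return rng.binomial(shots, true_values) / shots

rng = np.random.default_rng(0)
true_kernel = np.array([0.9, 0.6, 0.5, 0.4, 0.2, 0.1])   # hypothetical kernel entries
importance = np.array([4.0, 2.0, 1.0, 1.0, 1.0, 1.0])    # hypothetical importance scores

adaptive_shots = allocate_shots(importance, budget=1000)
uniform_shots = allocate_shots(np.ones(6), budget=1000)
adaptive_estimate = estimate_entries(true_kernel, adaptive_shots, rng)
uniform_estimate = estimate_entries(true_kernel, uniform_shots, rng)
```

Entries with higher importance receive more shots and therefore lower estimator variance, which is the effect the adaptive scheme relies on; when all importance scores are equal the allocation reduces to the uniform baseline.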

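2605.22266 2026-05-22 cs.LG cs.AI Updated version

Detecting Atypical Clients in Federated Learning via Representation-Level Divergence

Cristian Pérez-Corral, Jose I. Mestre, Alberto Fernández-Hernández, Manuel F. Dolz, Enrique S. Quitana-Ortí

Affiliations: Universitat Politècnica de València; Universitat Jaume I

AI summary: This paper proposes a lightweight geometric signal that quantifies the functional deviation between a client and the global model in order to detect atypical clients in federated learning. By evaluating changes in the activation-induced partition of the input space, it distinguishes stable but heterogeneous clients from clients that diverge significantly from the global regime.

Details
AI abstract (translated from Chinese)

Federated learning lets distributed clients train collaboratively on heterogeneous data, but this heterogeneity often leads to unstable updates and degraded global performance. Moreover, in practical deployments, client updates may deviate from expected behaviour not only because of benign non-i.i.d. data distributions but also because of distribution shifts or anomalous inputs, which raises concerns about the reliability of the aggregation process. In this work, we propose a lightweight geometric signal that quantifies the functional deviation of a client relative to the global model. Rather than comparing model parameters or gradients, our approach measures how each client's local training changes the activation-induced partition of the input space, evaluated on a shared probe set. This yields a permutation-invariant, interpretable measure of client-global divergence that captures differences in how the model processes data. We show that this signal effectively identifies clients that induce atypical functional changes, distinguishing stable but heterogeneous clients from those whose updates diverge significantly from the global regime. The proposed measure therefore offers a simple tool for monitoring client behaviour and enabling risk-aware aggregation strategies in federated learning systems.

English abstract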

Federated learning enables collaborative training across distributed clients with heterogeneous data, but such heterogeneity often leads to unstable updates and degraded global performance. Moreover, in practical deployments, client updates may deviate from the expected behavior not only because of benign non-i.i.d. data distributions, but also because of distributional shifts or anomalous inputs, raising concerns about the reliability of the aggregation process. In this work, we propose a lightweight geometric signal to quantify the functional deviation of a client with respect to the global model. Instead of comparing model parameters or gradients, our approach measures how the local training of each client alters the activation-induced partition of the input space, evaluated on a shared probe set. This yields a permutation-invariant, interpretable metric of client-global divergence that captures differences in how data is processed by the model. We show that this signal effectively identifies clients that induce atypical functional changes, distinguishing stable yet heterogeneous clients from those whose updates significantly diverge from the global regime. As a result, the proposed metric provides a simple tool for monitoring client behavior and enabling risk-aware aggregation strategies in federated learning systems.
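One way to make the idea of an activation-induced partition concrete is sketched below for a single ReLU layer. Each probe input is assigned the on/off pattern of the hidden units, two probes lie in the same region when their patterns match, and the divergence is the fraction of probe pairs whose same-region relation differs between the global and the client model. This is an illustrative construction under those assumptions, not the metric defined in the paper; all names and the toy weights are hypothetical.

```python
import numpy as np

def activation_patterns(weights, bias, probes):
    """Binary ReLU on/off pattern of one hidden layer for each probe input."""
    return (probes @ weights + bias) > 0

def co_membership(patterns):
    """Boolean matrix that is True where two probes share the same activation pattern."""
    return (patterns[:, None, :] == patterns[None, :, :]).all(axis=2)

def partition_divergence(global_params, client_params, probes):
    """Fraction of probe pairs whose same-region relation differs between the two models."""
    same_global = co_membership(activation_patterns(*global_params, probes))
    same_client = co_membership(activation_patterns(*client_params, probes))
    upper = np.triu_indices(len(probes), k=1)
    return float(np.mean(same_global[upper] != same_client[upper]))

# Toy example: the global model splits the plane on x > 0, the client on y > 0.
probes = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
global_model = (np.array([[1.0], [0.0]]), np.array([0.0]))
client_model = (np.array([[0.0], [1.0]]), np.array([0.0]))
toy_divergence = partition_divergence(global_model, client_model, probes)

# Permuting hidden units changes the parameters but not the partition.
rng = np.random.default_rng(0)
w, b = rng.normal(size=(2, 4)), rng.normal(size=4)
perm = np.array([2, 0, 3, 1])
random_probes = rng.normal(size=(30, 2))
permuted_divergence = partition_divergence((w, b), (w[:, perm], b[perm]), random_probes)
```

Comparing co-membership rather than raw activation vectors is what makes the score insensitive to neuron ordering, which parameter-space distances are not.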

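2605.22263 2026-05-22 cs.LG cs.AI Updated version

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

Hongbin Zhang, Chaozheng Wang, Kehai Chen, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology; Peng Cheng Laboratory; Keeta AI, Meituan

AI summary: This paper proposes Direction-Adaptive Self-Distillation (DASD), which improves LLM reasoning through entropy-routed directional supervision. Its analysis finds that uniform teacher supervision suppresses exploration, and DASD achieves the best results across six mathematical reasoning benchmarks.

Comments Under Review

Details
AI abstract (translated from Chinese)

On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm in which the model acts as its own teacher: conditioned on privileged information such as a reference trace or hint, the same policy provides dense token-level supervision for its own rollouts. However, recent studies show that OPSD harms complex reasoning by suppressing predictive uncertainty, which supports exploration and hypothesis revision. Our token-level analysis shows that this failure comes from applying a uniform direction of teacher supervision to tokens with different levels of uncertainty: conforming to the privileged self-teacher suppresses exploration at high entropy, while deviating from the teacher lowers step accuracy at low entropy. We therefore propose Direction-Adaptive Self-Distillation (DASD), which reframes privileged self-distillation from uniform teacher imitation as entropy-routed directional supervision: high-entropy tokens are pushed away from the privileged teacher to preserve exploration, while low-entropy tokens are pulled toward the teacher to stabilize step-level execution. On six mathematical reasoning benchmarks, DASD achieves the best macro Avg@16 among strong RLVR and self-distillation baselines. Pass@k, reasoning-health, and generalization analyses show that these average gains come from preserving exploration without sacrificing step-level execution.

English abstract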

On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm in which the model serves as its own teacher: conditioned on privileged information such as a reference trace or hint, the same policy provides dense token-level supervision on its own rollouts. However, recent studies show that OPSD degrades complex reasoning by suppressing predictive uncertainty, which supports exploration and hypothesis revision. Our token-level analysis shows that this failure arises from applying a uniform direction of teacher supervision across tokens with different uncertainty levels: conformity to the privileged self-teacher suppresses exploration at high entropy, while deviation from the teacher degrades step accuracy at low entropy. Accordingly, we propose Direction-Adaptive Self-Distillation (DASD), which reframes privileged self-distillation from uniform teacher imitation into entropy-routed directional supervision: high-entropy tokens are pushed away from the privileged teacher to preserve exploration, while low-entropy tokens are pulled toward the teacher to stabilize step-level execution. Across six mathematical reasoning benchmarks, DASD achieves the best macro Avg@16 over strong RLVR and self-distillation baselines. Pass@k, reasoning-health, and generalization analyses show that these average gains come from preserving exploration without sacrificing step-level execution.
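The routing rule in this abstract, pulling low-entropy tokens toward the teacher and pushing high-entropy tokens away, can be sketched as a sign applied to a per-token divergence. The NumPy code below is only an illustration under several assumptions that the abstract does not confirm: entropy is measured on the student distribution, the routing uses a fixed threshold, and the per-token term is KL(teacher || student). The function names, threshold, and logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy (in nats) of each row of p."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def kl(p, q):
    """KL(p || q) for each row."""
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

def routed_directions(student_probs, threshold):
    """+1 pulls a token toward the teacher, -1 pushes it away (assumed routing rule)."""
    return np.where(entropy(student_probs) > threshold, -1.0, 1.0)

def routed_loss(student_logits, teacher_logits, threshold):
    """Mean over tokens of direction * KL(teacher || student)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return float(np.mean(routed_directions(p_s, threshold) * kl(p_t, p_s)))

# Two tokens over a 4-word vocabulary: one confident, one maximally uncertain.
student_logits = np.array([[5.0, 0.0, 0.0, 0.0],
                           [0.0, 0.0, 0.0, 0.0]])
teacher_logits = np.array([[4.0, 1.0, 0.0, 0.0],
                           [3.0, 0.0, 0.0, 0.0]])
directions = routed_directions(softmax(student_logits), threshold=0.7)
loss = routed_loss(student_logits, teacher_logits, threshold=0.7)
```

An unbounded negative KL term would be unstable in real training, so a practical objective would need clipping or a bounded divergence; the sketch only shows how the direction is chosen per token.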

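2605.22262 2026-05-22 cs.SD cs.LG eess.AS Updated version

Automatic Contextual Audio Denoising

Diep Luong, Konstantinos Drossos, Mikko Heikkinen, Tuomas Virtanen

Affiliations: Tampere University; Nokia

AI summary: This paper proposes an automatic contextual audio denoising method that infers the acoustic scene class to distinguish useful from irrelevant sound components, which improves denoising.

Details
AI abstract (translated from Chinese)

Audio context determines which sound components and sources are relevant and which listeners may perceive as irrelevant (noise). For example, traffic noise is informative in urban surveillance but is noise during a phone call at the same location. Most current audio denoising systems use a fixed target-noise definition and so often remove useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC), and events typical of that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes the OC components, and compare it with variants that have no context inference, oracle context, and separately provided uninformative context. On paired clean/noisy data across diverse contexts, where an OC component in one context may be IC in another, our method outperforms the other approaches on standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

English abstract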

Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise during a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical of that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against three variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms the other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

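2605.22259 2026-05-22 cs.LG cs.CV cs.RO Updated version

An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion

Jan Nausner, Michael Hubner

Affiliations: Center for Digital Safety & Security, Austrian Institute of Technology GmbH (AIT)

AI summary: This paper proposes an OSINT-aided heterogeneous sensor fusion method that builds a new evidence hierarchy and combines contextual information with domain knowledge to improve the classification of CBRNE threats. Experiments show advantages in robustness to clutter and prior mismatch, with classification accuracy of up to 95%.

Comments 6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review

Details
AI abstract (translated from Chinese)

Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors can usually detect only a subset of the relevant threats, vary in reliability, or may provide only indirect threat indications, which makes threat classification difficult. High clutter rates on the sensor side also pose a major challenge for fusion systems. In addition, the limited availability of high-quality datasets hinders the development of learning-based detection and classification models in smart sensors. To mitigate these sensor-related shortcomings, a context-aware and domain-knowledge-enhanced fusion process is proposed. First, a new evidence hierarchy is established that can model direct, indicative, and contextual information. Second, contextual information about the environment is brought into the fusion process by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to build a Bayesian threat-type classification mechanism that incorporates domain knowledge. The proposed method is evaluated in simulated scenarios, and the results show the benefit of the fusion approach in robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.

English abstract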

Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability or can even provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process, by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
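The Bayesian classification step described in this abstract can be illustrated with a plain Bayes update, in which a context-derived prior is multiplied by the likelihoods of each piece of evidence and renormalised. The sketch assumes the evidence sources are conditionally independent given the threat class, which the abstract does not state. The class names, prior, and likelihood values are invented for illustration and are not taken from the paper.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """Posterior over classes from a prior and a list of per-class likelihood vectors."""
    posterior = np.asarray(prior, dtype=float)
    for likelihood in likelihoods:
        # Conditional independence is assumed, so likelihoods multiply.
        posterior = posterior * np.asarray(likelihood, dtype=float)
    return posterior / posterior.sum()

# Hypothetical threat classes and numbers, for illustration only.
classes = ["chemical", "radiological", "explosive"]
context_prior = [0.5, 0.2, 0.3]        # e.g. informed by OSINT about the location
direct_evidence = [0.7, 0.1, 0.2]      # likelihood of a direct sensor reading per class
indicative_evidence = [0.4, 0.3, 0.3]  # likelihood of an indirect indication per class

posterior = bayes_update(context_prior, [direct_evidence, indicative_evidence])
predicted_class = classes[int(np.argmax(posterior))]
```

With these numbers the unnormalised products are 0.14, 0.006, and 0.018, so the first class receives roughly 85% of the posterior mass. A mismatched prior shifts this result, which is why robustness to prior mismatch is one of the properties evaluated in the paper.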

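2605.22257 2026-05-22 cs.LG cs.AI cs.LO Updated version

What are the Right Symmetries for Formal Theorem Proving?

Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang, Bernardo Cuenca Grau, Jinwoo Kim, İsmail İlkan Ceylan

Affiliations: University of Oxford; KAIST; TU Wien; AITHYRA

AI summary: This paper examines which symmetries formal theorem proving should respect. It proposes rewriting categories, a category-theoretic framework for formalizing proof equivariance and success invariance, and uses test-time methods to improve the robustness and performance of LLM-based theorem provers.

Details
AI abstract (translated from Chinese)

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial changes in problem representation: semantically equivalent statements can show drastically different proof success rates, revealing a failure to respect the symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework that captures the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two notions of symmetry: proof equivariance, which governs how proof distributions transform under rewrites, and success invariance (i.e., invariance of the success probability), which requires equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance because they operate on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property and show large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings, prove theoretically that they recover success invariance in the sampling limit, and show experimentally that they improve robustness and performance under a fixed inference budget. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical way to approximate it.

English abstract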

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.
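The claim that aggregating over equivalent rewritings recovers success invariance can be illustrated with a toy simulation. It assumes a finite set of equivalent formulations with fixed per-formulation success probabilities, and a test-time procedure that samples a formulation uniformly from that set before each attempt. Both are simplifying assumptions, since the paper works with generally non-invertible rewrites; the probabilities and function names are hypothetical.

```python
import random

def direct_success_prob(class_probs, index):
    """Success probability when the prover is run on one fixed formulation."""
    return class_probs[index]

def aggregated_success_prob(class_probs):
    """Exact success probability when each attempt samples a formulation uniformly."""
    return sum(class_probs) / len(class_probs)

def simulate_aggregated(class_probs, n_samples, rng):
    """Monte Carlo estimate of the aggregated success probability."""
    successes = 0
    for _ in range(n_samples):
        p = rng.choice(class_probs)      # pick an equivalent rewriting
        successes += rng.random() < p    # one proof attempt on that rewriting
    return successes / n_samples

# Hypothetical success rates of one prover on four equivalent statements.
class_probs = [0.9, 0.5, 0.15, 0.05]
rng = random.Random(0)

spread_direct = max(class_probs) - min(class_probs)
exact = aggregated_success_prob(class_probs)
estimate = simulate_aggregated(class_probs, n_samples=20000, rng=rng)
```

The direct success rate depends heavily on which formulation happens to be given, while the aggregated rate depends only on the equivalence class, so every member of the class is solved with the same probability as the number of samples grows.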

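2605.22248 2026-05-22 cs.LG Updated version

No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation

Bradley Stanley-Clamp, Anson Lei, Hannah M. Christensen, Ingmar Posner

Affiliations: Applied AI Lab, University of Oxford, UK; Atmospheric, Oceanic and Planetary Physics, University of Oxford, UK

AI summary: This paper studies the importance of out-of-distribution generalisation in climate emulation. It proposes a new evaluation framework that uses seasonal variation to test emulator robustness and shows how physically motivated decompositions improve out-of-distribution performance without substantially sacrificing in-distribution performance.

Comments 36 pages, 12 figures

Details
AI abstract (translated from Chinese)

Climate emulation is an out-of-distribution (OOD) projection task. This is exactly the challenge on which modern machine learning (ML) methods are most likely to fail. Consequently, although current ML emulators trained on the present climate perform well in distribution, their future reliability under the inevitable distribution shifts of a changing climate remains a critical but poorly understood blind spot. Addressing this challenge requires a fundamental change in how we understand, evaluate, and design climate emulators. In this work, we first confirm that climate change produces a statistically significant and progressively growing shift in the distribution of atmospheric states, which makes standard evaluation protocols insufficient. We establish empirically that seasonal variation is an effective proxy for these long-term climate shifts, giving access to real-world distribution shifts without resorting to heuristics such as synthetic perturbations. Motivated by this link, we introduce a new evaluation framework that uses seasonal shifts as a rigorous, zero-overhead testbed for emulator robustness. Our systematic characterisation confirms that current state-of-the-art hybrid-ML emulators degrade significantly under these realistic shifts. Finally, we identify compositional generalisation, the ability to form new combinations from observed elementary components, as a principled route towards robust climate emulation. We show that physically motivated decompositions substantially improve OOD performance without substantially sacrificing in-distribution performance, offering a route towards ML-driven climate emulators that are robust to an unknown future.

English abstract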

Climate emulation is an out-of-distribution (OOD) projection task. This is precisely the challenge where modern Machine Learning (ML) methods are most prone to failure. Consequently, while current ML emulators trained on present climate achieve high in-distribution performance, their future reliability under the inevitable distribution shifts of a changing climate remains a critical, poorly understood blind spot. Addressing this challenge requires a fundamental shift in how we understand, evaluate, and design climate emulators. In this work, we first confirm that climate change drives a statistically significant and progressively growing shift in atmospheric state distributions, rendering standard evaluation protocols insufficient. We empirically establish that seasonal variation serves as an effective proxy for these long-term climate shifts, providing access to real-world distribution shifts without recourse to heuristics like synthetic perturbations. Motivated by this link, we introduce a novel evaluation framework that leverages seasonal shifts as a rigorous, zero-overhead testbed for emulator robustness. Our systematic characterisation confirms that current state-of-the-art hybrid-ML emulators degrade significantly under these realistic shifts. Finally, we chart a path forward by identifying compositional generalisation, the ability to form novel combinations from observed elementary components, as a principled route towards robust climate emulation. We demonstrate that physically motivated decompositions substantially improve OOD performance with only modest trade-offs against in-distribution performance, providing an avenue towards ML-driven climate emulators robust to an unknown future.

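2605.22243 2026-05-22 cs.LG cs.AI stat.AP Updated version

Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Junyu Yan, Damian Machlanski, Kurt Butler, Panagiotis Dimitrakopoulos, Ewen M Harrison, Bruce Guthrie, Sotirios A Tsaftaris

Affiliations: School of Engineering, University of Edinburgh; Causality in Healthcare AI Hub (CHAI); Advanced Care Research Centre, Usher School of Population Health Sciences, University of Edinburgh; Centre for Medical Informatics, Usher School of Population Health Sciences, University of Edinburgh

AI summary: This paper proposes an explainable AI recommender that uses a data-driven approach to improve the predictive performance of existing interpretable statistical models. Its main contribution is to use explainable AI techniques to produce three types of recommendation that improve a model's predictive power and transparency.

Comments 41 pages, 7 figures

Details
AI abstract (translated from Chinese)

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, optimising a predictive study by hand is challenging when tens or even hundreds of features need selection, transformation, or interaction modelling. Although complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability that decision-making requires. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve the predictive performance of existing interpretable statistical models. The framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate these patterns into three types of recommendation: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing the predictive performance of a baseline Cox Proportional Hazards (CPH) model (i.e., with no interactions or non-linear terms) against an augmented CPH model that incorporates the recommendations suggested by our method. The primary analysis predicts the time to the first fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and calibration also improved (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by the existing literature. The method also proved effective on two additional public datasets, showing wider applicability. The proposed Exploratory AI Recommender shows the potential of explainable AI and data-driven study design to improve both the development process and the performance of high-dimensional transparent predictive models.

English abstract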

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve the process of developing, and the performance of high-dimensional transparent predictive models.

2605.22235 2026-05-22 cs.LG math.DS 版本更新

Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics

具有Kolmogorov-Arnold网络的全纯神经ODEs用于复杂动力学的可解释发现

Bhaskar Ranjan Karn, Dinesh Kumar

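AI summary This paper proposes Holomorphic KAN-ODE, a holomorphic Neural ODE framework built on Kolmogorov-Arnold Networks for discovering interpretable governing equations of complex dynamical systems. A differentiable Cauchy-Riemann regularization preserves holomorphic structure, and the approach is validated on six families of complex dynamical systems.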

Comments 16 pages. Comments are welcome

Details
AI Chinese abstract

由全纯映射(如z² + c)支配的复杂动力系统表现出具有极端初始条件敏感性的分形边界。从数据准确建模这些结构需要尊重底层复解析几何的方法,但神经普通微分方程(Neural ODEs)中的多层感知机(MLP)缺乏复解析先验,违反柯西-黎曼条件,并作为不透明的近似器无法提供 governing equations。我们引入了全纯KAN-ODE框架,用Kolmogorov-Arnold网络(KAN)取代MLP,其可学习的B样条激活函数位于网络边,并将柯西-黎曼方程作为可微正则化以保持全纯结构。我们在六个复杂动力系统家族上进行了评估,涵盖多项式和超越类。仅使用280个参数(比MLP基线少16倍),网络在所有六个系统上实现了速度场R² > 0.95,正确识别了所有六个 governing symbolic families 通过自动样条到公式拟合,并重建了Julia集分形边界,与98.0%一致。关键的是,模型在10%观测噪声下仅表现出4%的MSE退化,而MLP则退化了15.2倍,且在从二次到三次动力学的迁移学习中实现了90.4%的改进。虽然MLP在点重建误差上更低,因为其容量更大,但KAN唯一提供了可解释的符号方程,强制了全纯结构,并具有优越的噪声鲁棒性,这些能力在黑盒架构中完全缺失。这些结果确立了KANs作为MLP的参数高效、可解释的替代方案,用于具有全纯动力学的物理信息发现。

English abstract

Complex dynamical systems governed by holomorphic maps such as $z^2 + c$ exhibit fractal boundaries with extreme sensitivity to initial conditions. Accurately modelling these structures from data requires methods that respect the underlying complex-analytic geometry, yet Multi-Layer Perceptrons (MLPs) within Neural Ordinary Differential Equations (Neural ODEs) lack complex-analytic priors, violate the Cauchy--Riemann conditions, and function as opaque approximators incapable of yielding governing equations. We introduce Holomorphic KAN-ODE, a framework that replaces the MLP with a Kolmogorov-Arnold Network (KAN) whose learnable B-spline activations reside on network edges, and incorporates Cauchy--Riemann equations as a differentiable regularization to preserve holomorphic structure. We evaluate on six families of complex dynamical systems spanning polynomial and transcendental classes. With only 280 parameters ($16\times$ fewer than the MLP baseline), the network achieves velocity-field $R^2 > 0.95$ on all six systems, correctly identifies all six governing symbolic families through automatic spline-to-formula fitting, and reconstructs Julia set fractal boundaries with up to 98.0\% agreement. Crucially, the model exhibits only 4\% MSE degradation under 10\% observation noise versus $15.2\times$ for MLPs, and achieves 90.4\% improvement in transfer learning from quadratic to cubic dynamics. While the MLP attains lower pointwise reconstruction error due to its larger capacity, the KAN uniquely provides interpretable symbolic equations, enforced holomorphic structure, and superior noise resilience, capabilities that are entirely absent in black-box architectures. These results establish KANs as a parameter-efficient, interpretable alternative to MLPs for physics-informed discovery of holomorphic dynamics.
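
The Cauchy-Riemann regularization mentioned above can be illustrated with a small numerical sketch. This is only an illustration under stated assumptions, not the paper's implementation: the names `cr_residual` and `cr_penalty` are hypothetical, and central finite differences stand in for the automatic differentiation a trained network would presumably use. The penalty is the mean squared value of $\partial f/\partial x + i\,\partial f/\partial y$, which vanishes exactly when the Cauchy-Riemann equations hold.

```python
def cr_residual(f, z, h=1e-5):
    """Squared Cauchy-Riemann residual of f at the complex point z.

    Hypothetical helper: uses central finite differences, not autodiff.
    """
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # derivative along the real axis
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # derivative along the imaginary axis
    # df_dx + 1j*df_dy is twice the Wirtinger derivative with respect to conj(z);
    # it is zero exactly when u_x = v_y and u_y = -v_x.
    return abs(df_dx + 1j * df_dy) ** 2


def cr_penalty(f, points, h=1e-5):
    """Mean squared Cauchy-Riemann residual over a batch of sample points."""
    return sum(cr_residual(f, z, h) for z in points) / len(points)


points = [complex(0.3, -0.2), complex(-1.0, 0.5), complex(0.0, 1.5)]
holomorphic = lambda z: z * z + complex(-0.4, 0.6)   # a map of the form z^2 + c
non_holomorphic = lambda z: z.conjugate()            # violates Cauchy-Riemann everywhere

print(cr_penalty(holomorphic, points))      # close to zero
print(cr_penalty(non_holomorphic, points))  # close to 4
```

In a training loop, a term like this would be added to the data-fitting loss with some weight, so that the learned vector field is pushed toward holomorphic maps.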

2605.22223 2026-05-22 cs.LG Version update

How Many Different Outputs Can a Transformer Generate?

变换器能生成多少种不同的输出?

Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan

Affiliations * Department of Mathematics, National University of Singapore, Singapore, 117543; School of Computing, National University of Singapore, Singapore, 117543; Aix Marseille Univ, CNRS, I2M, Marseille, France; Department of Electrical and Computer Engineering, National University of Singapore

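AI summary This paper studies how a handful of characteristics of a transformer's architecture can closely predict, both qualitatively and quantitatively, the number of different sequences it can output. It provides an upper bound depending on the prompt length, shown empirically to be tight up to a factor of less than 10 across architectures and model sizes. The analysis also explains previously observed empirical failures of transformers on simple sequence tasks such as copying and cramming.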

Comments ICML 2026 Spotlight

Details
AI Chinese abstract

我们研究如何仅利用变换器架构中的少量特性来紧密预测其能生成的不同序列数量,包括定性和定量分析。我们提供一个依赖于提示长度的上限,实验证明在不同架构和模型大小下,该上限紧致于10倍以内。我们的分析还为之前在简单序列任务(如复制和填塞)中观察到的变换器经验性失败提供了理论解释。形式上,我们证明了(i)可访问序列的最大长度(即变换器能为某些提示生成的序列)与提示长度成线性增长,(ii)超过临界阈值后,可访问序列的比例随序列长度呈指数衰减,(iii)提示长度与可访问序列长度之间的线性系数具有理论上限。值得注意的是,这些结果即使在无界上下文和计算时间下也成立。

English abstract

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

2605.22221 2026-05-22 cs.LG cs.AI cs.LO Version update

Can Transformers Learn to Verify During Backtracking Search?

Transformer能否在回溯搜索中学习验证?

Yin Jun Phua, Tony Ribeiro, Tuan Nguyen, Katsumi Inoue

Affiliations * Yin Jun Phua (corresponding author): Institute of Science Tokyo, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan; Tony Ribeiro: Centrale Nantes, CNRS, Laboratoire des Sciences du Numérique de Nantes, LS2N, UMR 6004, F-44000 Nantes, France; National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; Steelous Protocol, 8-20-32, Ginza, Chuo-ku, Tokyo 104-0061, Japan; Tuan Nguyen: Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hai Ba Trung, Ha Noi, Vietnam; Katsumi Inoue: National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

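AI summary This paper studies whether transformers can learn to verify during backtracking search. It shows that decoder-only transformers trained on cumulative traces suffer from scattered retrieval and history entanglement, and addresses them with localization and Selective State Attention (SSA), respectively. SSA is tested on 3-SAT, graph coloring, Blocks World, and backtracking parsing.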

Details
AI Chinese abstract

回溯搜索是经典约束求解器、规划器和定理证明器的基础。最近的基于Transformer的推理系统探索其自身中间步骤的搜索树。一种常见的训练方法是在离线求解器轨迹上拟合自回归的下一个令牌损失。模型的输入在每一步都是所有先前决策的累积轨迹。最优的继续或回溯预测器仅依赖于当前搜索状态,因为到达相同状态的两条轨迹允许相同的延续。我们证明,仅使用累积轨迹训练的解码器Transformer在两种方式上未能满足这一要求:轨迹可以将状态特征散列到许多位置(散列检索),并且预测器可以基于轨迹而非状态(历史纠缠)。我们通过局部化解决散列检索问题,这是一种轨迹级的修复方法,将每个决策块重写以局部化状态特征。我们通过选择性状态注意力(SSA)解决历史纠缠问题,这是一种固定注意力掩码,可以在不修改训练数据、目标或参数的情况下强制结构化基于状态的决策。我们专注于矛盾传播后发生的反应验证。我们在3-SAT、图着色、Blocks World和回溯解析中测试SSA。在仅在先前历史上不同的相同状态对中,SSA发出相同的决定,而自回归训练的因果基线则不会。我们的贡献是针对序列轨迹数据的Transformer行为诊断,配以结构化修复。预训练语言模型在搜索其自身推理步骤时可能面临相同的失败。我们的分析为推理时的上下文清除作为不重新训练的情况下应用相同隔离的方法提供了候选方案。

English abstract

Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model's input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.
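
The idea of a fixed attention mask that keeps a decision from conditioning on earlier history can be sketched as follows. This is a minimal illustration under stated assumptions, not the SSA mask defined in the paper: it assumes a hypothetical `block_ids` list that labels each token with the decision block (search state) it belongs to, and it lets a token attend only to positions at or before itself inside the same block.

```python
def state_only_causal_mask(block_ids):
    """Boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumption (hypothetical encoding): block_ids[k] identifies the decision
    block of token k. Attention is causal (j <= i) and restricted to the
    token's own block, so earlier blocks (prior history) are hidden.
    """
    n = len(block_ids)
    return [
        [j <= i and block_ids[j] == block_ids[i] for j in range(n)]
        for i in range(n)
    ]


mask = state_only_causal_mask([0, 0, 1, 1, 1])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Because the mask depends only on block membership, two traces that end in the same block content present the same visible context to the decision token, which is the property the abstract attributes to state-based decisions.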

2605.22217 2026-05-22 cs.LG cs.CL Version update

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

生存或崩溃:自我博弈强化学习中数据门控与奖励基础的不对称作用

Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang

Affiliations * University of California, Santa Barbara; Cisco Research

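AI summary This paper studies the asymmetric roles of data gating and reward grounding in self-play reinforcement learning. A strict data-level gate is sufficient for stability under every reward variant tested, whereas no reward variant is sufficient once the gate is removed; this asymmetry exposes the 'Grounded Proposer Paradox'.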

Details
AI Chinese abstract

自我博弈强化学习通过语言模型自行生成任务进行训练,实现提出者与求解者的共同进化,无需人工标注。最近的系统报告了显著的推理提升,但崩溃和不稳定性普遍存在且理解不足。主流观点将其视为奖励设计问题,但我们认为自我博弈的稳定性由两个不同的调节机制决定:数据层面的门控,决定哪些由提出者生成的任务进入训练池,以及奖励信号,更新已准入任务的策略。通过在Python输出预测任务和确定性DSL双胞胎任务上的受控实验,我们发现这两个机制是不对称的。严格的数据门控在我们测试的每种奖励变体下都能保证稳定性,包括没有地面真实信息访问的自一致性奖励;而一旦移除门控,没有任何奖励变体足以保证稳定性。这种不对称性揭示了我们称之为'基础提出者悖论'的反直觉耦合:具有地面真实信息访问的提出者在与自一致性求解器配对时,会比无地面真实信息的提出者更快崩溃,因为训练集中在形成最快路径到虚假自一致性吸引子的干净任务上。将二进制门控替换为连续严格性参数ε进一步揭示了两阶段相变:训练侧指标在低ε时解耦,而验证准确率在ε远高于时才保持。数据层面的门控,而非奖励校准,是自我博弈稳定性的绑定约束。

English abstract

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

2605.22207 2026-05-22 eess.SY cs.LG cs.SY Version update

Kernel-Based Safe Exploration in Deep Reinforcement Learning

基于核的深度强化学习安全探索

Rupak Majumdar, Nikhil Singh, Sadegh Soudjani

Affiliations * Max Planck Institute for Software Systems

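AI summary This paper proposes a kernel-based method for safe exploration in deep reinforcement learning. It learns an optimal policy and a barrier function simultaneously during exploration; the barrier bounds the probability of reaching unsafe states, and the probabilistic safety guarantees improve with more exploration.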

Comments Accepted at L4DC Conference (22 Jan 2026)

Details
AI Chinese abstract

安全性在将深度强化学习算法部署到现实世界时是一个主要关注点。一种有前景的方向是学习一个屏障函数,以确保学习的策略不会访问危险区域。屏障函数是从状态到实数的函数,它将初始状态赋予低值,将危险状态赋予高值,并在每次转移中减少期望值;这样的函数可用于限制到达危险状态的概率。以前的研究直接从探索数据中学习屏障函数,但需要大量数据或对系统动力学的限制。在本文中,我们展示了如何利用核嵌入来学习深度强化学习中随机系统的屏障函数。我们的算法,称为基于核的安全探索(KBSE),在探索过程中同时学习最优策略和屏障函数。屏障函数是通过迭代计算得到的,并以条件均值嵌入表示,随着探索的增加,它们提供更好的概率安全保证。探索算法使用学习到的屏障函数来识别安全违规。在发生违规时,它会干预,将危险动作改为安全动作,从而确保探索仅限于限制到达危险状态概率的动作。我们评估了KBSE在多个复杂的连续控制基准上的性能。实验结果表明,我们的新算法适用于合成概率安全的控制策略,而不会影响奖励的累积。

English abstract

Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a \emph{barrier function} along with the policy. A barrier is a function from states to reals that assigns low values to the initial states, high values to the unsafe states, and decreases in expectation on each transition; such a function can be used to bound the probability of reaching unsafe states. Previous attempts learned a barrier function directly from exploration data, but this required either large amounts of data or restrictions on the system dynamics. In this paper, we show how kernel embeddings can be used to learn barrier functions during deep reinforcement learning for stochastic systems with unknown dynamics. Our algorithm, \emph{kernel-based safe exploration (KBSE)}, learns an optimal policy and a barrier simultaneously during exploration. The barriers are computed iteratively, represented as conditional mean embeddings, and provide better probabilistic safety guarantees with more exploration. The exploration algorithm uses the learned barrier functions to identify safety violations. In the case of violation, it intervenes to modify the unsafe action to a safe action, thereby ensuring that the exploration is restricted to actions that bound the probability of reaching unsafe states. We evaluate KBSE on several complex continuous control benchmarks. Experimental results establish our new algorithm to be suitable for synthesizing control policies that are probabilistically safe without degradation in reward accumulation.
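
How a barrier function bounds the probability of reaching unsafe states can be made concrete with the standard supermartingale argument. The conditions below are the textbook form and are an assumption here; the paper's exact conditions may differ. The symbols $X_0$ (initial states), $X_u$ (unsafe states), and $\gamma$ (the level on the initial set) are introduced only for this illustration.

```latex
% Assumed standard barrier conditions for a nonnegative function B : X -> [0, \infty)
B(x) \le \gamma \quad \forall x \in X_0, \qquad
B(x) \ge 1 \quad \forall x \in X_u, \qquad
\mathbb{E}\,[\, B(x_{t+1}) \mid x_t = x \,] \le B(x) \quad \forall x \in X .

% B(x_t) is then a nonnegative supermartingale, so Ville's maximal inequality gives
\Pr\Big[ \exists\, t \ge 0 : x_t \in X_u \Big]
\;\le\; \Pr\Big[ \sup_{t \ge 0} B(x_t) \ge 1 \Big]
\;\le\; B(x_0) \;\le\; \gamma
\qquad \text{for } x_0 \in X_0 .
```

A smaller $\gamma$ therefore gives a stronger safety guarantee, which matches the description of low values on initial states and high values on unsafe states.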

2605.22205 2026-05-22 cs.AI cs.LG Version update

Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

技能编织:通过模块化技能包实现高效的LLM改进

Zhuo Li, Guodong Du, Zesheng Shi, Weiyang Guo, Weijun Yao, Yuan Zhou, Jiabo Zhang, Jing Li

Affiliations * Harbin Institute of Technology, Shenzhen, China; The Hong Kong Polytechnic University; Huawei Technologies Co., Ltd.; Shanghai Jiao Tong University

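AI summary This work proposes SkillWeave, a framework that uses modular skillpacks to let LLMs specialize across domains under a fixed memory budget, with SkillZip compressing the skillpacks for efficient deployment. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and a 32B monolithic LLM while achieving up to 4x speedup.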

Comments Accepted by ACL2026

Details
AI Chinese abstract

大型语言模型日益需要在多样化领域中进行专门化,但现有方法难以在多领域能力与严格的内存和推理约束之间取得平衡。本文介绍了SkillWeave,一种模块化改进框架,使LLM能够在固定内存预算下实现专业化。SkillWeave将通用模型的全部能力划分为技能包——轻量、领域特定的delta模块——以重新组织和细化模型的内部知识。为了高效部署,SkillWeave集成了SkillZip将技能包压缩为紧凑且推理友好的格式,从而在低延迟执行下实现强大的多领域性能。在多任务和代理基准上,一个9B的SkillWeave模型优于多个基线,并甚至超越了32B的单体LLM,同时实现了高达4倍的速度提升。

English abstract

Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWeave, a modular improvement framework that enables LLMs to specialize under fixed memory budgets. SkillWeave partitions full capabilities of a general-purpose model into skillpacks -- lightweight, domain-specific delta modules -- that reorganize and refine the model's internal knowledge. For efficient deployment, SkillWeave integrates SkillZip to compress skillpacks into compact and inference-ready format, enabling strong multi-domain performance with low-latency execution. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and even surpasses a 32B monolithic LLM, while achieving up to 4x speedup.

2605.22200 2026-05-22 cs.CV cs.AI cs.LG Version update

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

OSS: 2024-2025 开放缝合技能基于视觉的评估挑战

Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d'Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi

Affiliations * Department of Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC) Dresden; The Centre for Tactile Internet with Human-in-the-Loop (CeTI), TUD Dresden University of Technology; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen; Center for Tooth-, Mouth- and Jaw Medicine, University Göttingen; Institute of Medical Informatics, University Hospital RWTH Aachen; Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology; German Cancer Research Center (DKFZ); Muroran Institute of Technology; Niigata University of Health and Welfare; Konica Minolta, Inc.; Jmees, Inc.; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Center Algoritmi/LASI, University of Minho; Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho; ICVS/3B's - PT Government Associate Laboratory; Institute for AI in Medicine (IKIM), University Medicine Essen; The Faculty of Data and Decisions Science, Technion - Israel Institute of Technology; UCL Hawkes Institute, University College London; School of Computing, Queen's University; Department of Transdisciplinary Medicine, Seoul National University Hospital; Interdisciplinary Program in Medical Informatics, Seoul National University; Department of Clinical Medical Sciences, Seoul National University; Institute of Convergence Medicine with Innovative Technology, Seoul National University Hospital; Department of Surgery, Seoul National University College of Medicine and Seoul National University Hospital

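AI summary This paper presents the results of the OSS Challenge, a MICCAI challenge that benchmarks vision-based skill assessment in open surgery. Using the challenge dataset and its multiple tasks, it compares the submitted methods and highlights both the promise and the current limitations of video-based assessment.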

Comments Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA

Details
AI Chinese abstract

通过有效的训练实现高水平的外科技能对于最佳的患者结果至关重要。自动化、数据驱动的技能评估有潜力改善外科训练。尽管基于机器学习的方法在微创手术技能评估中越来越受欢迎,但其在开放手术中的应用仍然有限。我们提出了一个专门的MICCAI挑战,旨在基准测试和推进开放手术中的基于视觉的技能评估。挑战数据集包含在干实验室环境中用静态GoPro相机记录的开放缝合训练任务视频,除了主要视频模态外,还包含仪器轨迹数据。OSS挑战连续两年举办,分别包含两个和三个独立任务:(1) 将技能水平分类为四个类别,(2) 预测涵盖八个类别的完整客观结构化评估技术技能分数,(3) 跟踪手部和手术工具。参与者提交了多种解决方案,包括基于深度学习的视频模型、跟踪驱动的方法和混合方法。通用的空间时间视频模型始终实现了最强的性能,尽管概念上多样的方法在执行良好的情况下也能达到竞争水平。预测细粒度的OSATS分数仍然具有挑战性,但受益于增加的训练数据。关键点跟踪由于频繁的遮挡和出帧实例而变得困难,限制了当前基于运动的技能分析的应用。这项工作评估了创新和多样的解决方案,突显了基于视频的评估在开放手术中的潜力和当前限制,并识别了推进自动化技能评估向临床影响发展的关键方向。

English abstract

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

2605.22195 2026-05-22 cs.LG Version update

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

思维图增强:由强化学习驱动的LLM自适应提示方法

Manuel Noah Riesen, Peter Alfred von Niederhäusern

Affiliations * School of Engineering and Computer Science, Bern University of Applied Sciences

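AI summary This paper proposes Reinforced Graph of Thoughts (RGoT), which uses reinforcement learning to automatically generate, from a human-defined set of operations, a graph of operations adapted to the task's complexity for prompting large language models.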

Comments 26 pages (including appendix), 16 figures

Details
AI Chinese abstract

Graph of Thoughts (GoT),作为一种针对大型语言模型(LLMs)的通用提示范式,已被证明在复杂问题解决中具有用处。通过执行一系列操作的图,LLM的思维被结构化为任意图,形成实际的思维图。最初,操作图是手动定义的,需要深入了解问题的解决方案。这种静态的操作图缺乏适应性。我们提出Reinforced Graph of Thoughts (RGoT),一种利用强化学习(RL)自动从人类定义的集合中生成操作图的自动化方法。结果表明,在某些约束下,可以以自动化的方式构建适应任务复杂度的操作图。

English abstract

Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.

2605.22191 2026-05-22 cs.LG cs.IT math.IT Version update

Bandit Convex Optimization with Gradient Prediction Adaptivity

带梯度预测自适应的带状凸优化

Shuche Wang, Adarsh Barik, Vincent Y. F. Tan

Affiliations * Department of Mathematics, National University of Singapore, Singapore; Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

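AI summary This paper studies whether optimistic gradient predictions can improve worst-case regret guarantees in bandit convex optimization in a prediction-adaptive manner. For the two-point feedback setting it proposes Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT), whose gradient estimator has variance that scales with the prediction error, yielding a regret bound of $O(\sqrt{d\,\mathbb{E}[S_T]})$. An information-theoretic lower bound of $\Omega(\sqrt{\mathbb{E}[S_T]})$ shows that the algorithm is optimal up to a factor of $\sqrt{d}$.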

Details
AI Chinese abstract

带状凸优化(BCO)是一种具有部分反馈的在线学习框架,其中学习者在每一轮中只观察所选决策点的损失。在本工作中,我们研究乐观梯度预测是否能在预测自适应的方式下改进最坏情况下的后悔保证。具体而言,给定梯度预测m_t,我们寻求与累积预测误差S_T=∑_{t=1}^T ||∇f_t(x_t)-m_t||^2相关的后悔界。我们首先得出一个负结果:在单点反馈协议下,即使S_T=o(T),仍存在不可避免的Ω(√T)的后悔下界,表明梯度估计的方差从根本上阻碍了准确预测的好处。为克服这一障碍,我们提出了适用于双点反馈设置的Two-Point Variance-Reduced Optimistic Gradient Descent(TP-VR-OPT)算法。其关键思想是新颖的方差减少梯度估计器,其方差与预测误差而非梯度范数相关。这导致了O(√(dE[S_T]))的后悔界,其中d是决策维度。补充这一结果,我们建立了信息论下界,其规模为Ω(√E[S_T]),提供了预测自适应后悔的最佳可实现性的基本特征,并证明TP-VR-OPT在至多√d因子内是最佳的。我们进一步开发了自适应变体,消除了对E[S_T]或时间范围T的先验知识的需求,并将我们的框架扩展到非平稳环境,建立了同时适应累积预测误差和比较路径长度的动态后悔保证。

English abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose \emph{Two-Point Variance-Reduced Optimistic Gradient Descent} (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt d$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
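
One way a two-point gradient estimator can have variance that scales with the prediction error is to use the prediction $m_t$ as a control variate. The sketch below is an assumption made for illustration and is not claimed to be the TP-VR-OPT estimator from the paper: the function names, the uniform sphere sampling, and the smoothing radius `delta` are all hypothetical choices. It also includes a direct computation of the cumulative prediction error $S_T$ as defined in the abstract.

```python
import math
import random


def unit_vector(d, rng):
    """Sample a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point_estimate(f, x, m, delta, rng):
    """Hypothetical control-variate estimator built from two loss queries.

    g = m + d * (directional finite difference - <m, u>) * u
    The random correction is proportional to how far the prediction m is
    from the true directional derivative, so it shrinks when m is accurate.
    """
    d = len(x)
    u = unit_vector(d, rng)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    directional = (f(x_plus) - f(x_minus)) / (2 * delta)   # approximates <grad f(x), u>
    predicted = sum(mi * ui for mi, ui in zip(m, u))        # <m, u>
    scale = d * (directional - predicted)
    return [mi + scale * ui for mi, ui in zip(m, u)]


def cumulative_prediction_error(gradients, predictions):
    """S_T = sum over t of ||grad_t - m_t||^2."""
    return sum(
        sum((g - m) ** 2 for g, m in zip(grad, pred))
        for grad, pred in zip(gradients, predictions)
    )
```

For a quadratic loss the symmetric difference is exact, so with a perfect prediction the estimate coincides with the true gradient regardless of the sampled direction; with a poor prediction the estimate remains centered on the gradient but fluctuates more.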

2605.22188 2026-05-22 cs.LG math.OC stat.ML Version update

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

从顺序节点到GPU批处理:并行分支限界法用于最优k-稀疏广义线性模型

Jiachang Liu, Andrea Lodi

Affiliations * Jacobs Technion-Cornell Institute, Cornell Tech and Technion–IIT

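AI summary This paper proposes a CPU-GPU framework that processes branch-and-bound nodes in batches on GPUs, substantially accelerating problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models.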

Details
AI Chinese abstract

GPU在大规模优化的一阶方法中显著加速了计算,尤其是在连续优化中。然而,这种成功并未顺利转移到具有离散变量、组合结构和非线性目标的问题中,例如验证卡数约束下的广义线性模型的最优解。主要挑战包括分支限界(BnB)中异构节点的顺序处理以及CPU和GPU之间频繁的数据移动。我们提出了一种简单、通用且模块化的CPU-GPU框架,该框架可以在GPU上批量处理多个BnB节点。该框架围绕一组GPU高效的子程序构建,并利用填充和轻量级自定义内核来处理不规则的节点数据结构。实验表明,该框架在挑战性实例上实现了1到2个数量级的加速,并且在最优性间隙方面达到了零。该框架还可以扩展以收集整个Rashomon集,从而启用下游的统计分析,如变量重要性分析和在二次用户特定度量(例如分类中的AUC)下的模型选择。

English abstract

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).
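
The role of padding in batching irregular branch-and-bound nodes can be sketched in a few lines. This is an illustrative assumption rather than the framework's actual data layout: here each node is represented by a hypothetical list of free feature indices, and nodes with different numbers of indices are padded to a common width with an accompanying validity mask, so that a batch has a rectangular shape suitable for a single batched kernel.

```python
def pad_node_supports(supports, pad_value=-1):
    """Pad variable-length index lists to a rectangular batch.

    Assumption (hypothetical representation): supports[k] holds the free
    feature indices of branch-and-bound node k. Returns the padded rows and
    a boolean mask marking which entries are real (True) or padding (False).
    """
    width = max((len(s) for s in supports), default=0)
    padded = [list(s) + [pad_value] * (width - len(s)) for s in supports]
    mask = [[k < len(s) for k in range(width)] for s in supports]
    return padded, mask


padded, mask = pad_node_supports([[0, 3], [1], [2, 4, 5]])
print(padded)
print(mask)
```

The mask lets downstream batched computations ignore the padded slots, which is one common way to trade a small amount of wasted work for uniform array shapes.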

2605.22185 2026-05-22 cs.CV cs.LG Version update

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

增强多模态大语言模型以用于安全关键驾驶视频分析

Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

Affiliations * Verizon Connect

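AI summary This study enhances the perception and reasoning of multimodal large language models in safety-critical driving scenarios by fusing downsampled video frames with synchronized high-frequency telematics data and semantic insights from specialized computer vision models, so that the models identify and describe safety-critical events in real-world driving footage more accurately.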

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Details
AI Chinese abstract

近年来,多模态大语言模型(MLLMs)在一般视觉理解方面展现了出色的性能。然而,其在安全关键驾驶场景中的应用受限于无法准确感知和推理罕见高风险动态事件(如碰撞或接近碰撞)的能力。为此,我们提出了一种增强MLLM感知能力的流程,通过融合降采样视频帧与同步高频telematics数据(IMU和GPS)以及专用计算机视觉模型的语义信息生成高质量的伪标签,包括描述性标题和问答对,专门用于训练MLLM识别和描述现实驾驶中的安全关键事件(SCEs)。我们通过微调开源QwenVL-2.5模型并使用DoRA适配器展示了该方法的有效性:实验表明在少于50M可训练参数和有限计算预算下,显著提高了识别和解释安全关键事件的能力。

English abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and limited computational budget.

2605.22182 2026-05-22 cs.LG Version update

IKNO: Infinite-order Kernel Neural Operators

IKNO:无限阶核神经算子

Pengyuan Zhu, Ivor W. Tsang, Yueming Lyu

Affiliations * Nanyang Technological University; Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR)

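AI summary This paper proposes IKNO, a neural operator built from infinite-order kernel integrals, to address the limited expressivity of existing models that rely on first-order kernel integral approximations. Two complementary constructions, IKNO-Vanilla and IKNO-TP, provide efficient global information aggregation, and the method achieves SOTA accuracy on nearly all benchmark datasets.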

Details
AI Chinese abstract

神经算子在现代科学计算中因灵活性和强大的泛化能力而取得了显著成功。然而,现有模型主要依赖于一阶核积分近似,这严重限制了它们的表达能力。为此,我们提出了无限阶核神经算子(IKNO),通过无限阶核积分构建神经算子,并具有优雅的闭式有限近似。我们开发了两种互补的无限阶神经算子构造:IKNO-Vanilla,通过克罗内克特征分解在产品网格上应用完整的核解算子;以及IKNO-TP,一种替代的张量积算子,通过各轴解算子进行组合。此外,我们为这两种IKNO变体开发了快速计算方案,实现了出色的全局信息聚合同时保持高计算效率。实验证明,我们在具有任意输入形状的时间依赖和时间无关基准数据集上评估了我们的IKNO,包括大规模工业数据集。广泛的实验表明,IKNO方法在几乎所有基准数据集上都实现了显著的精度提升,同时保持了对非常大的点云的可扩展性。

English abstract

Neural operators have achieved significant success in modern scientific computing due to their flexibility and strong generalization capabilities. Existing models, however, primarily rely on first-order kernel integral approximations, which severely limit their expressivity. To address this, we propose the Infinite-order Kernel Neural Operator (IKNO), which constructs neural operators via infinite-order kernel integrals and admits an elegant closed-form finite approximation. We develop two complementary infinite-order neural operator constructions: IKNO-Vanilla, which applies the full-kernel resolvent on the product grid via Kronecker eigendecomposition, and IKNO-TP, an alternative tensor-product operator that composes per-axis resolvents. Furthermore, we develop fast computation schemes for both variants of IKNO, which achieve outstanding global information aggregation while maintaining high computational efficiency. Empirically, we evaluate our IKNO on both time-dependent and time-independent benchmarks with arbitrary input shapes, including large-scale industrial datasets. Extensive experiments demonstrate that the IKNO method consistently achieves the SOTA accuracy with significant improvements on nearly all benchmark datasets while maintaining scalability to very large point clouds.

2605.22177 2026-05-22 cs.LG cs.CL Version update

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Maestro:通过强化学习协调分层模型-技能集合

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao

Affiliations * Tsinghua University; Zhejiang University; The Chinese University of Hong Kong; Nanyang Technological University; Tongji University

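AI summary This paper proposes Maestro, a reinforcement-learning-driven orchestration framework for multimodal tasks that trains a lightweight policy to compose frozen expert models and skills from a hierarchical model-skill registry, improving multimodal task performance with an efficient coordination policy that generalizes to unseen models and skills.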

Details
AI Chinese abstract

大型语言模型(LLMs)和模块化技能的普及使自主代理具备了越来越强大的能力。现有框架通常依赖于单一的LLM和固定的逻辑来与这些技能交互。这导致了一个关键瓶颈:不同的LLMs在不同领域具有不同的优势,但当前框架未能利用模型和技能的互补优势,从而限制了其在下游任务上的性能。在本文中,我们提出了Maestro(多模态代理专家技能强化学习协调框架),这是一个由强化学习(RL)驱动的协调框架,将异构多模态任务重新框架化为一个在分层模型-技能注册表上的顺序决策过程。与将所有知识整合到单一模型中不同,Maestro训练了一个轻量级的策略,动态组合冻结的专家模型和一个双层技能库,决定在每一步是否调用外部专家,选择哪个模型-技能对,以及何时终止。该策略通过基于结果的强化学习进行优化,不需要步骤级监督。我们评估了Maestro在十个代表性的多模态基准上,涵盖数学推理、图表理解、高分辨率感知和领域特定分析。仅使用一个4B的协调器,Maestro实现了70.1%的平均准确率,超过了GPT-5(69.3%)和Gemini-2.5-Pro(68.7%)。关键的是,学习的协调策略能够泛化到未见过的模型和技能,无需重新训练:在注册表中添加非领域专家,使在四个具有挑战性的基准上平均达到59.5%,优于所有闭源基线。Maestro进一步保持了高计算效率和低延迟。源代码可在https://github.com/jinyangwu/Maestro上获得。

英文摘要

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.

2605.22168 2026-05-22 cs.AI cs.LG 版本更新

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

衡量跨模态协同:VLM可解释性的一个基准

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

发表机构 * University of Luxembourg(卢森堡大学) Luxembourg Institute of Health (LIH)(卢森堡健康研究院)

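AI summary This paper proposes Synergistic Faithfulness, a metric of cross-modal synergy in VLMs built on the Shapley Interaction Index. It addresses the shortcomings of unimodal evaluation of VLM explainers and assesses multimodal synergy accurately at lower computational cost.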

详情
AI中文摘要

视觉-语言模型(VLMs)将复杂的视觉输入映射到语义空间,但目前解释VLM的跨模态推理仍依赖于通过单模态扰动度量评估的后验解释器。我们揭示了这一范式的局限性:由于多模态数据集包含语言先验和模态偏差,VLMs经常表现出跨模态冗余,允许它们仅使用文本回答视觉查询。因此,单模态度量惩罚忠实的解释器,导致评估崩溃,其中视觉和文本排名根本矛盾(Kendall's τ= -0.06)。为了解决这一问题,我们引入了Synergistic Faithfulness(F_syn),一个基于Shapley交互指数的可扩展度量,严格隔离模态间的Harsanyi收益,作为高度准确的替代指标(ρ= 0.92),同时实现了24倍的计算加速。在评估8种不同的XAI方法、3种VLM架构和3个基准数据集时,发现为VLM设计的解释器严重过度索引视觉显著性,并在捕捉真正的跨模态协同方面显著劣于适应的注意力方法。通过将视觉合理性与跨模态忠实性解耦,本文提供了一个严格评估框架,以安全审计VLM在高风险部署中的推理。

英文摘要

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
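With only two players (image and text), the joint Harsanyi dividend mentioned in the abstract has a closed form, and it coincides with the Shapley Interaction Index of the pair. The sketch below is a minimal illustration of that two-player case only. The function name and the four masked-input scores are hypothetical, and the abstract does not say how $\mathcal{F}_{syn}$ aggregates this quantity over explainer attributions.

```python
def joint_dividend(v_both, v_image_only, v_text_only, v_neither):
    """Two-player Harsanyi dividend of the {image, text} coalition.

    Each argument is a model score (for example, probability of the
    correct answer) under a hypothetical masking scheme:
      v_both        image and text both visible
      v_image_only  text masked
      v_text_only   image masked
      v_neither     both masked
    """
    return v_both - v_image_only - v_text_only + v_neither


# Redundant case: the text alone already yields the full score,
# so the image adds nothing jointly and the dividend is zero.
redundant = joint_dividend(0.9, 0.5, 0.9, 0.5)

# Synergistic case: neither modality helps alone, both together do.
synergistic = joint_dividend(0.9, 0.5, 0.5, 0.5)
```

The redundant case shows why a unimodal perturbation metric can mislead: masking the image changes nothing, even for an explainer that is faithful to the model.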

2605.22164 2026-05-22 cs.LG cs.RO 版本更新

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

超越欧几里得距离:通过地平线匹配轨迹可达性度量修复潜在世界模型

Liangyu Li, Shengzhi Wang, Qingwen Liu

发表机构 * Tongji University(同济大学)

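AI summary This paper proposes trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models that trains a small pairwise head to improve terminal ranking and, in turn, performance on continuous manipulation tasks.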

Comments 26 pages, 7 figures

详情
AI中文摘要

潜在世界模型可以包含用于控制的状态,但其终端成本接口可能会向规划器暴露错误的决策相关信息。在常见的潜在MPC中,候选序列通过预测终端和目标潜在状态之间的欧几里得距离进行排名;这假设了原始潜在距离权重能够正确地反映可达性相关变量。我们提出轨迹可达性度量(TRM),一种用于固定潜在世界模型的后处理终端排名方法。TRM从记录的轨迹结构中训练一个小的成对头部,并将其用作替代或混合成本;编码器、动力学、采样器、优化器和评估表现保持不变。关键设计选择是地平线意识监督:该度量在广泛的、平衡的时间分离上进行训练,以匹配长地平线终端候选排名问题。在硬TwoRoom基准上,使用LeWorldModel(LeWM)的原始潜在规划成功率为7.0%,而全地平线TRM成功率为97.0%;洗牌时间标签控制仍为0.0%。同样的配方在三个种子上将PLDM基线从32.7%提高到84.0%,而短地平线TRM变体在100,000对预算下仅达到35.0%。在TwoRoom中,我们提供了TRM为何有效的机理证据:XY位置是线性可解码的(R²=0.998),但原始潜在MSE错误地排名候选;XY探针行空间在终端-目标潜在MSE中占比不到1%,但承载了大部分候选质量信号;SCSA审计显示TRM提高了规划器看到的排序和选定终点。在PushT go50/go75中,TRM风格的任务-状态度量比闭环成功更清晰地改进了SCSA排名和选定最终距离,推动了连续操控中的辅助混合成本。TRM是规划器面对的修复,审计解释了何时终端可达性度量应替代或补充原始潜在接近度。

英文摘要

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.
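The misranking effect described above can be reproduced on a toy example. The sketch below does not implement TRM, which is a learned pairwise head. It swaps in a simpler stand-in, a distance restricted to hand-picked "position" dimensions, to show how raw Euclidean distance can prefer the wrong candidate when nuisance dimensions dominate the latent. All names and numbers are hypothetical.

```python
def squared_distance(a, b, dims=None):
    """Squared Euclidean distance, optionally restricted to some dimensions."""
    indices = range(len(a)) if dims is None else dims
    return sum((a[i] - b[i]) ** 2 for i in indices)


def rank_candidates(candidates, goal, dims=None):
    """Return candidate indices ordered from best (closest) to worst."""
    return sorted(
        range(len(candidates)),
        key=lambda i: squared_distance(candidates[i], goal, dims),
    )


# Toy latent: dimensions 0-1 play the role of XY position,
# dimensions 2-3 are nuisance features that dominate the raw distance.
goal = [0.0, 0.0, 0.0, 0.0]
candidates = [
    [0.1, 0.1, 3.0, 3.0],  # near the goal in XY, far in nuisance dims
    [2.0, 2.0, 0.5, 0.5],  # far from the goal in XY, near in nuisance dims
]

raw_ranking = rank_candidates(candidates, goal)             # full latent
xy_ranking = rank_candidates(candidates, goal, dims=[0, 1])  # XY only
```

Here the raw ranking prefers the second candidate even though the first is much closer in position, which is the kind of failure a reachability-aware terminal cost is meant to repair.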

2605.22156 2026-05-22 cs.LG cs.AI 版本更新

One-Way Policy Optimization for Self-Evolving LLMs

单向策略优化用于自演化大语言模型

Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan

发表机构 * Shenzhen Graduate School, Peking University(北京大学深圳研究生院) Dartmouth College(达特茅斯学院) Alibaba(阿里巴巴)

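AI summary This paper proposes One-Way Policy Optimization, which decouples the optimization direction from the update magnitude to address the training instability that arises under sparse verifier rewards in existing methods, enabling continuous self-evolution of large language models.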

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为扩展大语言模型(LLMs)推理能力的一种有前景的范式。然而,二进制验证器奖励的稀疏性往往导致低效和优化不稳定。为了稳定训练,现有方法通常施加与参考策略相关的令牌级约束。我们发现这些约束会无差别地惩罚偏差;当策略试图超越参考时,这会翻转由验证器确定的方向,从而抑制收益。为了解决这个问题,我们提出了一种基于解耦优化方向与更新幅度原理的单向策略优化(OWPO)方法。在OWPO中,验证器规定更新方向,而参考策略仅用于调整更新幅度。具体而言,OWPO采用不对称重加权:它对劣质偏差(策略落后于参考)执行加速对齐,对优质偏差(策略超越参考)执行收益锁定。此外,通过整合迭代参考更新,OWPO创建了“棘轮效应”,持续巩固收益。实验结果表明,OWPO在DAPO、OPD和MOPD等强基线方法上表现更优,突破了固定先验的瓶颈,使大语言模型能够持续自演化,而无需依赖外部参考模型。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a ``Ratchet Effect'' that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

2605.22155 2026-05-22 cs.LG 版本更新

Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

代数机器学习在小至中等数据集上的表现与强标准基线竞争

David Mendez, Fernando Martin-Maroto, Gonzalo G. de Polavieja

发表机构 * Mathematics of Behavior and Intelligence Lab(行为与智能数学实验室) Champalimaud Foundation(Champalimaud基金会)

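AI summary This paper studies Algebraic Machine Learning on small-to-medium datasets and finds that it is competitive with strong baselines such as CNNs on image and tabular classification, without requiring cross-validation.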

Comments 9 pages, 4 figures

详情
AI中文摘要

符号方法通常不被认为在现实监督任务上能与强大的现代学习者竞争。我们评估了代数机器学习(AML)框架在不同训练集大小下的图像和表格分类任务中的表现,该框架通过代数结构的子直接分解来学习,而非数值优化。我们发现,AML仅在训练数据上训练,不使用验证或交叉验证,就能在小至中等规模的图像数据集(50-2000个训练示例)上优于包括CNN在内的多种交叉验证基线方法。在相同规模范围内的表格数据集中,XGBoost总体表现最佳,但AML仍能与包含任务特定偏置的方法(如LightGBM和随机森林)竞争。AML通过通用的代数归纳偏置在两种非常不同的数据集类型上实现了竞争性表现,而不是标准基线(如CNN用于图像或XGBoost用于表格数据)中固有的模态特定偏置,并且不需要交叉验证,因为它没有需要调优的任务依赖超参数。

英文摘要

Symbolic methods are generally not considered competitive with strong modern learners on realistic supervised tasks. We evaluate Algebraic Machine Learning (AML), a framework that learns through subdirect decomposition of algebraic structure rather than numerical optimization, against standard baselines on image and tabular classification across varying training-set sizes. We find that AML trained only on training data without using validation or cross-validation outperforms a family of cross-validated baseline methods including CNNs on small to medium image datasets (50--2000 training examples). On tabular datasets in the same size range, XGBoost is overall the best performing method, but AML is nonetheless comparable to methods incorporating task-specific biases such as LightGBM and random forests. AML achieves this competitive performance across two very different types of datasets using a generic algebraic inductive bias, rather than the modality-specific biases built into standard baselines like CNNs for images or XGBoost for tabular data, and requires no cross validation because it has no task-dependent hyperparameters to tune.

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学)

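AI summary This paper proposes improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution) and reports the performance of the SR$^2$AM model across tasks.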

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

详情
AI中文摘要

代理应该如何决定何时以及如何规划?主流方法将代理建模为具有自适应计算的反应策略(例如链式思考),通过端到端训练期望规划隐式地出现。由于无法控制规划的存在、结构或时间范围,这些系统显著增加了推理长度,导致无效的令牌使用,而没有可靠的准确性提升。我们主张高效的代理推理受益于将决策过程分解为三个系统:模拟推理(系统II)通过世界模型将推理根植于未来状态预测;自我调节(系统III)通过学习的配置器决定何时以及如何深入规划;以及反应执行(系统I)处理细粒度的动作。模拟推理在不同任务中提供统一的规划,而无需每个领域的工程,同时自我调节确保规划只在需要时被调用。为了测试这一点,我们开发了SR$^2$AM(Self-Regulated Simulative Reasoning Agentic LLM),在LLM的链式思考中实现这两个系统作为独立阶段,其中LLM作为世界模型。我们探索了两种实现:从提示的多模块系统中记录决策(v0.1)和从预训练推理LLM的痕迹中重建结构化计划(v1.0),通过监督学习和强化学习(RL)训练。在数学、科学、表格分析和网络信息检索中,v0.1-8B和v1.0-30B在性能上与120-355B和685B-1T参数系统相当,而v1.0-30B使用的推理令牌比同类代理LLM少25.8-95.3%。强化学习使平均规划时间增加22.8%,而规划频率仅增加2.0%,表明它学会了更远地规划而不是更频繁地规划。更广泛地说,学习的自我调节实例化了一个原则,我们预计可以扩展到代理如何管理自己的学习和适应。

英文摘要

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22124 2026-05-22 stat.ML cs.LG math.PR 版本更新

From Betting to Empirical Bernstein LIL

从赌局到经验伯恩斯坦LIL

Francesco Orabona

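AI summary This paper derives the law of the iterated logarithm from the wealth guarantee of an online betting strategy and presents an empirical Bernstein LIL.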

详情
AI中文摘要

This is a verbatim copy of a technical report I wrote in 2017-2018 to obtain the law of the iterated logarithm using the guarantee on the wealth of an online betting strategy.

英文摘要

This is a verbatim copy of a technical report I wrote in 2017-2018 to obtain the law of the iterated logarithm using the guarantee on the wealth of an online betting strategy.

2605.22112 2026-05-22 astro-ph.HE astro-ph.IM cs.LG 版本更新

Self-Supervised ConvLSTM for Fermi Large Area Telescope Transient Detection

基于自监督的ConvLSTM用于费米大视场望远镜瞬变检测

Alberto Garinei, Stefano Speziali, Alessandro Vispa, Andrea Marini, Sara Cutini, Emanuele Piccioni, Marcello Marconi, Francesco Longo, Matteo Martini, Francesca Fallucchi, Romeo Giuliano, Ernesto William De Luca, Umberto Di Matteo, Sabino Meola

发表机构 * Idea-RE

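AI summary This paper proposes a method that combines end-to-end simulation with self-supervised spatio-temporal deep learning to detect transient gamma-ray phenomena in Fermi-LAT data in a controlled environment. It generates a ten-year synthetic Universe and uses a ConvLSTM network to model the nominal evolution of the sky so that anomalies can be detected.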

Comments 17 pages, 5 figures. Accepted for publication in Astronomy and Computing. Author-accepted manuscript version

详情
Journal ref
Astronomy and Computing 56 (2026) 101128
AI中文摘要

我们提出了一种框架,通过将费米- LAT天空的端到端模拟与自监督时空深度学习相结合,用于在受控环境中检测瞬变伽马射线现象。我们使用gtobssim生成一个十年的合成宇宙,并将模拟事件处理成每日全天空计数和曝光图,获得一个时间有序的序列,其结构与费米- LAT观测一致。为了建模天空的典型演变,我们采用卷积长短期记忆网络(ConvLSTM),该网络直接在地图序列上运行,保持空间局部性的同时学习时间依赖性。模型被训练以重建预期的发射,偏离学习基线的量通过像素级均方残差图量化。然后,我们通过从训练集上的残差分布估计每个像素的阈值,定义统计学驱动的异常标准,并通过局部滤波强制空间一致性以抑制孤立波动。训练后的ConvLSTM被部署到费米- LAT每日地图上,其中天空可能由于真实的天体物理变化或仪器非平稳性而偏离典型行为。所得到的流程可以标记出与高变源或瞬变事件(如耀斑或伽马射线暴)一致的局部、时间依赖的过剩,并为在长持续时间、费米- LAT类数据集上评估异常检测策略提供基准。

英文摘要

We present a framework for detecting transient gamma-ray phenomena in a controlled environment by combining end-to-end simulations of the Fermi-LAT sky with self-supervised spatio-temporal deep learning. We generate a ten-year synthetic Universe with gtobssim and process the simulated events into daily all-sky maps of counts and exposure, obtaining a time-ordered sequence that mirrors the structure of Fermi-LAT observations. To model the nominal evolution of the sky, we employ a Convolutional Long Short-Term Memory (ConvLSTM) network that operates directly on map sequences, preserving spatial locality while learning temporal dependencies. The model is trained to reconstruct expected emission, and departures from the learned baseline are quantified through pixel-wise mean-squared residual maps. We then define statistically motivated anomaly criteria by estimating per-pixel thresholds from the residual distribution on the training set, and we enforce spatial coherence via local filtering to suppress isolated fluctuations. The ConvLSTM is then deployed as a trained predictor on Fermi-LAT daily maps, where the sky can depart from the nominal behavior because of genuine astrophysical variability and instrumental non-stationarities. The resulting pipeline flags localized, time-dependent excesses consistent with highly variable sources or transient events (e.g., flares or GRBs) and provides a benchmark for evaluating anomaly-detection strategies on long-duration, Fermi-LAT-like datasets.
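The thresholding and spatial-coherence steps can be sketched without any deep learning, starting from residual maps that a predictor has already produced. The abstract does not give the exact rule, so everything here is an assumption chosen for illustration: the mean-plus-n-sigma threshold, the 8-neighbour filter, and all function names.

```python
from statistics import mean, pstdev


def per_pixel_thresholds(training_residuals, n_sigma=3.0):
    """Estimate one threshold per pixel from a list of 2D residual maps."""
    rows, cols = len(training_residuals[0]), len(training_residuals[0][0])
    thresholds = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            series = [residual_map[r][c] for residual_map in training_residuals]
            thresholds[r][c] = mean(series) + n_sigma * pstdev(series)
    return thresholds


def flag_anomalies(residual_map, thresholds, min_neighbors=1):
    """Flag pixels above threshold, then drop flags with too few flagged neighbours."""
    rows, cols = len(residual_map), len(residual_map[0])
    raw = [
        [residual_map[r][c] > thresholds[r][c] for c in range(cols)]
        for r in range(rows)
    ]
    kept = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not raw[r][c]:
                continue
            # Count flagged pixels in the 8-neighbourhood (map edges are clipped).
            neighbors = sum(
                raw[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            kept[r][c] = neighbors >= min_neighbors
    return kept


# Toy data: flat training residuals, then a test map with one two-pixel
# excess (top-left) and one isolated spike (bottom-right).
training = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
test_map = [[5.0, 5.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
flags = flag_anomalies(test_map, per_pixel_thresholds(training))
```

The isolated spike is suppressed while the contiguous excess survives, which is the intent of enforcing spatial coherence.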

2605.22111 2026-05-22 cs.LG cs.CE stat.ML 版本更新

Aerodynamic force reconstruction using physics-informed Gaussian processes

利用物理信息高斯过程进行气动力重建

Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal

发表机构 * Bauhaus-Universität Weimar(魏玛应用科学大学) University of Cambridge(剑桥大学)

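AI summary This paper proposes a physics-informed machine learning method that reconstructs the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The method avoids overfitting and needs no regularization scheme.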

详情
AI中文摘要

准确建模气动载荷对于理解和预测复杂结构系统的响应至关重要。然而,这些模型往往依赖于真实物理力的简化,引入假设可能会限制其准确性。在存在噪声或不完整数据的情况下,验证这些模型变得特别具有挑战性。为此,我们介绍了一种概率物理信息机器学习方法,旨在从结构动态响应的噪声测量中重建底层气动载荷。该模型避免了过拟合,消除了对正则化方案的需要,并允许在训练过程中使用异质和多保真度数据。通过重建大贝尔东桥在线性非稳态假设下的气动载荷,证明了该方法的有效性。结果表明,真实和预测载荷之间有很强的一致性,特别是在均方误差、幅度、相位角和信号峰值值方面。该载荷重建方法具有广泛的应用前景,如模型验证、未来载荷估计和结构损伤预测。

英文摘要

Accurate modeling of aerodynamic loads is essential for understanding and predicting the responses of complex structural systems. However, these models often rely on simplifications of the true physical forces, introducing assumptions that can limit their accuracy. Validating such models becomes particularly challenging in the presence of noisy or incomplete data. To address this, we introduce a probabilistic physics-informed machine learning approach designed to reconstruct the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows for the use of heterogeneous and multi-fidelity data during the training process. The efficacy of the approach is demonstrated through the reconstruction of aerodynamic loads on the Great Belt East Bridge, simulated under a linear unsteady assumption. Results show strong agreement between true and predicted loads, particularly in terms of root mean squared error, magnitude, phase angle, and peak values of the signals. The load reconstruction method is broadly applicable, for example to model validation, future load estimation, and structural damage prognosis.

2605.22098 2026-05-22 cs.CV cs.AI cs.LG 版本更新

TextTeacher: What Can Language Teach About Images?

TextTeacher: 语言能教会我们关于图像什么?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

发表机构 * RPTU University Kaiserslautern-Landau(赖兴海大学凯撒斯劳滕-兰道分校) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

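AI summary This work proposes TextTeacher, which injects the semantic knowledge of a language model into image classification training to improve vision models while keeping the inference-time model simple.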

Comments Published at TMLR

详情
Journal ref
Transactions on Machine Learning Research, ISSN 2835-8856, 2026
AI中文摘要

柏拉图表示假设认为,足够大的模型会收敛到共享的表示几何结构,即使跨模态。受此启发,我们提出问题:语言模型的语义知识能否有效提升视觉模型?为此,我们引入TextTeacher,一种简单的辅助目标,将文本嵌入作为额外信息注入图像分类训练。TextTeacher利用 readily available 的图像描述、预训练并冻结的文本编码器以及轻量级投影,生成语义锚点,高效引导训练期间的表示,同时保持推理时的模型不变。在ImageNet上使用标准ViT后端,TextTeacher将准确率提升高达+2.7个百分点(p.p.),并在相同配方和计算条件下产生一致的迁移增益(平均+1.0 p.p.)。它优于视觉知识蒸馏,在相同计算预算下更准确,或在相似准确率下更快。我们的分析表明,TextTeacher在训练初期塑造了更深的层,并通过补充互补的语义线索帮助泛化。TextTeacher增加的开销很小,不需要对目标模型进行昂贵的多模态训练,并保持纯视觉模型的简洁性和延迟。

英文摘要

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher

2605.22097 2026-05-22 quant-ph cs.LG 版本更新

Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices

Q-PhotoNAS:基于光子设备的混合量子神经架构搜索框架

Farah Elnakhal, Alberto Marchisio, Nouhaila Innan, Gabriel Falcao, Muhammad Shafique

发表机构 * Quandela Ascella photonic QPU(Quandela Ascella 光子量子处理器)

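AI summary This paper proposes a neural architecture search framework for hybrid photonic quantum-classical models that combines a genetic algorithm with learnable quantum phase encoding. By systematically exploring the joint design space of classical and quantum components, it improves accuracy and hardware compatibility on image classification tasks.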

详情
AI中文摘要

光子量子计算是一种有前景的可扩展量子机器学习平台,但在硬件和优化约束下设计有效的混合架构仍然具有挑战性。现有方法依赖于手动调优的架构,无法考虑经典预处理、相位编码和光子电路结构之间的协同作用,限制了准确性和硬件兼容性。在本文中,我们提出了一种混合光子量子-经典模型的神经架构搜索框架,结合基于遗传算法的搜索和可学习量子相位编码,系统地探索经典和量子组件的联合设计空间。我们的框架编码了19个超参数,分布在六个基因组中,并通过基于组的交叉、按基因突变和精英主义进化混合架构的种群。在短训练预算下评估每个候选者,然后对最佳设计进行完整重新训练。我们在两个图像分类基准测试上评估了我们的框架,即Digits和MNIST,分别达到了99.44%和98.78%的最终验证准确率,基于Quandela Ascella光子QPU的第一性执行时间估计,单张图像推断时间分别为67 ms(Digits)和149 ms(MNIST)。我们的量子贡献分析进一步显示,光子层提取了与经典路径正交的非冗余特征,相较于仅经典基线提供了可测量的准确性优势。我们的结果表明,自动化架构搜索对于混合光子系统来说既实用又具有影响,为在光子设备上量子AI的系统设计空间探索开辟了道路。

英文摘要

Photonic quantum computing is a promising platform for scalable quantum machine learning, but designing effective hybrid architectures remains challenging under hardware and optimization constraints. Existing approaches rely on manually tuned architectures that fail to account for the collaboration between classical preprocessing, phase encoding, and photonic circuit structure, limiting both accuracy and hardware compatibility. In this paper, we propose a neural architecture search framework for hybrid photonic quantum-classical models that combines genetic algorithm-based search with learnable quantum phase encoding to systematically explore the joint design space of classical and quantum components. Our framework encodes 19 hyperparameters across six gene groups and evolves a population of hybrid architectures using group-based crossover, per-gene mutation, and elitism, evaluating each candidate on a short training budget before full retraining of the best found design. We evaluate our framework on two image classification benchmarks, Digits and MNIST, achieving final validation accuracies of 99.44% and 98.78%, respectively, with first-principles execution time estimates on the Quandela Ascella photonic QPU projecting single-image inference at 67 ms (Digits) and 149 ms (MNIST). Our quantum contribution analysis further shows that the photonic layer extracts non-redundant features orthogonal to the classical pathway, providing a measurable accuracy advantage over classical-only baselines. Our results demonstrate that automated architecture search is both practical and impactful for hybrid photonic systems, opening the way for systematic design space exploration of quantum AI on photonic devices.
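The search loop named in the abstract (group-based crossover, per-gene mutation, elitism) follows a standard genetic-algorithm pattern. The sketch below is a generic, minimal version of that pattern and not the authors' implementation: the gene groups, the binary gene values, the toy fitness, and every parameter name are hypothetical. In the paper the fitness would come from a short training run.

```python
import random


def evolve(fitness, gene_groups, choices, pop_size=12, generations=20,
           elite=2, mutation_rate=0.1, seed=0):
    """Generic GA with group-wise crossover, per-gene mutation, and elitism.

    gene_groups is a list of index lists that together cover every gene.
    """
    rng = random.Random(seed)
    n_genes = sum(len(group) for group in gene_groups)
    population = [[rng.choice(choices) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Elitism: the best individuals survive unchanged.
        next_population = [list(ind) for ind in population[:elite]]
        while len(next_population) < pop_size:
            mother, father = rng.sample(population[:pop_size // 2], 2)
            child = list(mother)
            # Group-based crossover: inherit whole gene groups from one parent.
            for group in gene_groups:
                if rng.random() < 0.5:
                    for i in group:
                        child[i] = father[i]
            # Per-gene mutation: each gene is resampled independently.
            child = [rng.choice(choices) if rng.random() < mutation_rate else gene
                     for gene in child]
            next_population.append(child)
        population = next_population
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


# Toy problem: three gene groups, binary genes, fitness is the number of ones.
groups = [[0, 1], [2, 3, 4], [5]]
best, history = evolve(sum, groups, choices=[0, 1])
```

Because elites are copied unchanged, the best fitness in `history` never decreases from one generation to the next.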

2605.21214 2026-05-22 cs.LG cs.AI 版本更新

Behavior-Consistent Deep Reinforcement Learning

行为一致的深度强化学习

Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Princeton University(普林斯顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校)

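AI summary This paper proposes a behavior-consistent deep reinforcement learning method that controls the distributional similarity of policies to reduce policy divergence across training runs, improving stability without sacrificing performance.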

详情
AI中文摘要

强化学习(RL)在不同训练运行中常常表现出高方差,导致性能不可靠,并对现实领域中的部署构成重大挑战。在本文中,我们通过形式化行为一致的RL问题来解决跨运行策略分歧的挑战,目标是获得在不同训练运行中表现优异且分布相似的策略。我们的关键观察是最大熵RL提供了一种直接机制来控制行为分歧,通过将运行锚定到一个共同的(均匀)先验。我们证明,对于玻尔兹曼策略,选择温度与Q函数分歧界成正比可以限制诱导策略之间的成对KL散度。然而,我们还表明,简单地增加熵可能会损害策略优化并放大非策略误差。基于这些观察,我们提出了Q值期望分歧(QED),一种状态依赖的温度调度,利用双批评机分歧作为单次运行的跨运行分歧代理。经验上,我们在18个连续控制任务中展示了QED将跨运行分歧减少两个数量级,而不会牺牲性能,从而在适度的样本效率成本下实现了显著的回报方差减少。

英文摘要

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
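The role of temperature can be seen on a small discrete example. The sketch below assumes a finite action set, whereas the paper evaluates continuous control. It illustrates only the qualitative effect, that a higher temperature pulls two Boltzmann policies built from disagreeing $Q$ estimates closer together. It does not reproduce the QED schedule or the paper's bound, and the $Q$ values are made up.

```python
import math


def boltzmann(q_values, temperature):
    """Boltzmann (softmax) policy: pi(a) proportional to exp(Q(a) / temperature)."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical training runs that disagree on the Q-values of one state.
q_run_a = [1.0, 2.0, 3.0]
q_run_b = [1.5, 1.8, 3.4]

kl_cold = kl_divergence(boltzmann(q_run_a, 0.5), boltzmann(q_run_b, 0.5))
kl_warm = kl_divergence(boltzmann(q_run_a, 5.0), boltzmann(q_run_b, 5.0))
```

A state-dependent schedule would raise the temperature only where the critics disagree, which is how the abstract's concern about naively increasing entropy everywhere can be avoided.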

2605.21143 2026-05-22 cs.SD cs.LG 版本更新

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

CoarseSoundNet:构建一个可靠的生态声音景观分析模型

Alexander Gebhard, Andreas Triantafyllopoulos, Dominik Arend, Sandra Müller, Svenja Schmidt, Michael Scherer-Lorenzen, Björn W. Schuller

发表机构 * TUM University Hospital, CHI (Chair of Health Informatics), Ismaninger Str. 22, 81675 Munich, Bavaria, Germany; University of Freiburg, Faculty of Biology, Geobotany, Schaenzlestr. 1, 79104 Freiburg, Baden-Württemberg, Germany; MCML (Munich Center for Machine Learning), Munich, Bavaria, Germany; Imperial College London, GLAM (Group on Language, Audio, & Music), London, UK

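AI summary This paper proposes CoarseSoundNet, a model that classifies biophony, geophony, and anthropophony under realistic noisy conditions, and systematically studies model architectures, training data, and evaluation strategies to improve generalization in passive acoustic monitoring.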

Comments Currently under review

详情
AI中文摘要

声音景观由三种声音组成:生物声音(动物发出的声音)、地质声音(自然非生物声音)和人类声音(人类发出的声音)。在声音景观生态学领域,一个关键研究问题是这些组成部分如何相互作用,特别是生物声音如何响应地质声音和人类声音。然而,目前尚缺乏能够对这些元素进行区分量化分析的工具。最近的机器学习(ML)方法旨在支持自动化分析,但通常依赖于任务特定或干净的数据,限制了其在噪声被动声学监测(PAM)记录中的泛化能力。本文提出了一种清晰且可重复的结构来构建用于粗粒度声音景观分类的ML模型,并引入了CoarseSoundNet,一个经过训练以在真实PAM条件下区分生物声音、地质声音和人类声音的深度学习模型。我们系统地研究了模型架构、额外训练类的影响、数据组成和评估策略。我们的发现表明,模型性能随着额外PAM数据的增加而提高,特别是当数据与目标领域相似时,并且通过在训练中引入显式的静默类进一步提高性能。类特定的决策阈值和基于持续时间的约束进一步提高了性能,特别是在人类声音和地质声音方面。错误分析显示,人类声音由于掩蔽效应而面临挑战,而静默和昆虫声音在地质和生物声音方面存在混淆。最后,我们进行了一项生态案例研究,表明使用CoarseSoundNet预过滤记录可以产生与地面真实过滤相当的声学指数趋势,支持其作为生态声学分析有效预处理工具的使用。

英文摘要

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, few analytical tools currently allow these elements to be quantified separately. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses reveal challenges for anthropophony due to masking effects, and confusions involving silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

2605.20975 2026-05-22 cs.LG cs.CR 版本更新

Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning

明智且私密地选择:为公平和高效的联邦学习进行主动客户端选择

Adda Akram Bendoukha, Heber Hwang Arcolezi, Nesrine Kaaniche, Aymen Boudguiga

发表机构 * GitHub

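AI summary This paper proposes a proactive client selection framework that finds, before training begins, a federation of clients whose data meet utility and fairness requirements, in order to make federated learning more efficient and fair.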

详情
AI中文摘要

联邦学习使能够在去中心化的数据源上进行协作模型训练而无需数据传输。基于平均的联邦学习受限于非独立同分布数据的存在,这会负面影响收敛速度和最终模型的准确性。传统替代方法存在显著的低效率。包含噪声或高度异质数据的客户端会进行昂贵的梯度计算,这些计算在聚合前要么被丢弃要么被大幅降权。这些反应式方法浪费计算资源,需要更多的通信轮次并导致不必要的隐私暴露。在本文中,我们提出了一种主动客户端选择框架,旨在在训练开始前找到一个最优的客户端联邦,其联合数据满足效用和公平性要求。我们的方法依赖于从差分隐私连续表中计算出的互信息来量化联合数据集中的跨特征相关性的重要性。我们引入了一个潜在联邦损失(PFL)在固定大小的联邦集上,它平衡了两个目标。最大化集体数据效用的同时确保公平的跨特征相关性以防止群体不公平。客户端选择被表达为一个最优子集搜索问题,基于PFL目标,我们使用模拟退火在强差分隐私保证下解决客户端的本地统计信息。在四个基准上的实验结果表明,与均匀抽样相比,使用最优找到的联邦训练的模型更快、更公平且更准确,即使当使用最先进的自适应聚合或抽样策略时也是如此。

英文摘要

Federated Learning enables collaborative model training across decentralized data sources without data transfer. Averaging-based FL is limited by the presence of non-IID data, which negatively impacts convergence speed and final model accuracy. Conventional alternatives suffer from significant inefficiency: clients with noisy or highly heterogeneous data contribute expensive gradient computations that are either discarded or heavily down-weighted before aggregation. These reactive approaches waste computational resources, require more communication rounds, and result in unnecessary privacy exposure. In this paper, we propose a proactive client selection framework that aims to find an optimal federation of clients whose combined data match utility and fairness requirements before training begins. Our method relies on mutual information computed from differentially private contingency tables to quantify the relevance of cross-feature correlations in the union dataset. We introduce a Potential Federation Loss (PFL) over the set of fixed-size federations, which balances two objectives: maximizing collective data utility while ensuring fair cross-feature correlations to prevent group unfairness. Client selection is expressed as an optimal subset search problem over the PFL objective, which we solve using simulated annealing under strong differential privacy guarantees for clients' local statistics. Experimental results on four benchmarks show faster, fairer, and more accurate models trained on optimally found federations, compared to uniform sampling, even when state-of-the-art adaptive aggregation or sampling strategies are employed.
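Subset search by simulated annealing over fixed-size federations can be sketched generically. The example below is an assumption-laden illustration, not the paper's method. The `loss` callable is a hypothetical stand-in for the Potential Federation Loss, the toy per-client quality scores are invented, and the differentially private contingency tables and mutual-information terms are not reproduced.

```python
import math
import random


def anneal_subset(n_clients, k, loss, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a size-k subset of clients that minimizes loss(subset)."""
    if not 0 < k < n_clients:
        raise ValueError("k must satisfy 0 < k < n_clients")
    rng = random.Random(seed)
    current = set(rng.sample(range(n_clients), k))
    current_loss = loss(current)
    best, best_loss = set(current), current_loss
    temperature = t0
    for _ in range(steps):
        # Propose a swap so the federation size stays fixed at k.
        leaving = rng.choice(sorted(current))
        joining = rng.choice([c for c in range(n_clients) if c not in current])
        candidate = (current - {leaving}) | {joining}
        candidate_loss = loss(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if (candidate_loss <= current_loss
                or rng.random() < math.exp((current_loss - candidate_loss) / temperature)):
            current, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = set(current), current_loss
        temperature *= cooling
    return best, best_loss


# Toy objective: each client has a made-up quality score, and the loss of a
# federation is the negative sum of its members' scores.
quality = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
best, best_loss = anneal_subset(6, 3, lambda subset: -sum(quality[c] for c in subset))
```

Swapping one client in and one out keeps every proposal inside the set of fixed-size federations, which matches the search space named in the abstract.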

2605.20069 2026-05-22 cs.LG cs.GT 版本更新

Smooth Partial Lotteries for Stable Randomized Selection

用于稳定随机选择的平滑部分彩票

Alexander Goldberg, Giulia Fanti, Nihar B. Shah

发表机构 * New Zealand Health Research Council(新西兰健康研究理事会) Swiss National Science Foundation(瑞士国家科学基金会) European Research Council(欧洲研究理事会) Science Foundation Ireland(爱尔兰科学基金会) Volkswagen Foundation(大众基金会) The British Academy(英国学院) Austrian Science Fund (FWF)(奥地利科学基金) Formas(Formas基金会) Luebber et al.(Luebber等人)

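AI summary: This paper proposes smoothness as a design principle for partial lotteries, formalized as a Lipschitz condition on the map from scores to selection probabilities. It introduces the Clipped Linear Lottery mechanism, proves that it achieves a better smoothness-regret tradeoff, and validates its practical effectiveness experimentally.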

详情
AI中文摘要

竞争性选择过程,从科学资金资助到招生和招聘,使用评估来评分候选人,并最终根据这些评分选择一部分人。最近,许多组织采用了部分彩票,根据评估评分随机化选择。然而,现有的彩票设计本质上是不稳定的,因为对单个候选人的评分的微小变化会导致其选择概率的大幅变化。这种不稳定性削弱了彩票的一个关键目标:减少决策边界附近细微评分区别的影响。我们提出平滑性作为部分彩票的设计原则,并将其形式化为评分到选择概率的映射的Lipschitz条件。我们引入了Clipped Linear Lottery,一种简单的机制,其中选择概率与估计质量在上阈值和下阈值之间线性变化,上阈值以上我们总是接受,下阈值以下我们总是拒绝。我们证明Clipped Linear Lottery的最坏遗憾与任何平滑选择规则的下界在(1 - k/n)因子内匹配,其中k/n是接受率。我们比较平滑选择与其他稳定性概念如个体公平性和差分隐私,证明Clipped Linear Lottery在平滑性与遗憾的权衡上优于其他方法。在ICLR 2025、NeurIPS 2024和瑞士国家科学基金会的真实同行评审数据上的实验表明,现有彩票设计在实践中即使在单个评分扰动下也高度不稳定。我们的实验还确认了我们的理论分析的紧性,并证明我们提出的Clipped Linear Lottery在实践中比其他方法在平滑性与效用的权衡上更优。

英文摘要

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate's score can cause large shifts in their selection probabilities. This instability undermines a key goal of lotteries: reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle for partial lotteries, formalizing it as a Lipschitz condition on the mapping from review scores over candidates to selection probabilities. We introduce the Clipped Linear Lottery, a simple mechanism in which selection probabilities scale linearly with estimated quality between an upper threshold, above which we always accept, and a lower threshold, below which we always reject. We prove that the Clipped Linear Lottery's worst-case regret matches a lower bound for any smooth selection rule up to a factor of $(1 - k/n)$, where $k/n$ is the acceptance rate. We compare smooth selection to other stability notions like Individual Fairness and Differential Privacy, showing that the Clipped Linear Lottery achieves a better smoothness-regret tradeoff than alternatives. Experiments on real peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate that existing lottery designs are highly unstable in practice even under perturbations to a single score. Our experiments also confirm the tightness of our theoretical analysis and show that our proposed Clipped Linear Lottery achieves a better smoothness-utility tradeoff than alternatives in practice.
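The clipped linear rule described in the abstract can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the abstract does not say how the two thresholds are chosen (for example, to meet a target acceptance rate k/n), so `lower` and `upper` are hypothetical inputs.

```python
def clipped_linear_probability(score, lower, upper):
    """Selection probability that is 0 below `lower`, 1 above `upper`,
    and linear in between.

    `lower` and `upper` are hypothetical parameters; how they are set
    is not specified in the abstract.
    """
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    if score >= upper:
        return 1.0
    if score <= lower:
        return 0.0
    return (score - lower) / (upper - lower)
```

With this form, a change of size delta in one candidate's score moves that candidate's probability by at most delta / (upper - lower), which is the kind of Lipschitz bound the abstract refers to.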

2605.19965 2026-05-22 cs.LG eess.SP 版本更新

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

通过局部可塑性和树突计算进行源分离的规范网络

Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz

发表机构 * Gatsby Computational Neuroscience Unit, University College London, UK(Gatsby计算神经科学单元,伦敦大学学院,英国) Department of Computer Science, University of Oxford, UK(计算机科学系,牛津大学,英国) Brain Network Dynamics Unit, University of Oxford, UK(脑网络动力学单元,牛津大学,英国) MRC Centre of Research Excellence in Restorative Neural Dynamics, UK(英国医学研究委员会修复神经动力学研究卓越中心)

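AI summary: This paper proposes Predictive Entropy Maximization, a source separation method based on local plasticity and dendritic computation. The method maximizes a regularized second-order entropy over structured source domains, remains robust under increasing source correlation and observation noise, and compares well against both biologically plausible algorithms and exact baselines.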

详情
AI中文摘要

盲源分离(BSS)是研究如何从感觉混合中恢复潜在原因的自然框架,但推导出针对结构化(即受限于已知领域)且可能相关源的在线和生物合理算法仍然具有挑战性。最近的工作从最大化熵度量出发推导出BSS的神经网络,但其在线实现涉及复杂且非局部的递归动力学。受此视角启发,我们提出了预测熵最大化方法,仅使用局部权重更新即可实现BSS的竞争力。该方法采用熵度量的近似,产生一个具有易于解释组件的目标函数。最小化该目标导致预测神经架构,其中前馈突触遵循误差驱动规则(可通过树突机制实现),横向抑制连接通过局部海马体可塑性学习,源域约束通过简单的输出非线性性强制执行。我们推导了对偶误差的显式频谱界限,表征了何时近似是准确的。经验上,预测熵最大化在增加的源相关性和观测噪声下保持稳健,优于依赖更强独立性或去相关假设的生物合理算法,并在精确行列式和相关信息基线中表现竞争。这些结果展示了如何通过最大化结构化源域上的正则化二阶熵,使局部可塑性和适应性横向抑制得以出现。我们的实现代码可在https://github.com/BariscanBozkurt/Predictive-Entropy-Maximization上获得。

英文摘要

Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of an entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error-driven rule (that can be realized through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source-domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error, characterizing when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant- and correlative-information-based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second-order entropy over structured source domains. Our implementation code is available at https://github.com/BariscanBozkurt/Predictive-Entropy-Maximization.

2605.19152 2026-05-22 stat.ML cs.ET cs.IT cs.LG cs.NE math.IT physics.optics 版本更新

Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration

stationary 物理系统的信息处理能力:理论、数据高效估计方法和光子演示

Rahul Uma Ramachandran, Serge Massar

发表机构 * Laboratoire d’Information Quantique CP224, Université libre de Bruxelles(量子信息实验室CP224,布鲁塞尔自由大学)

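AI summary: This paper studies the information processing capacity of stationary physical systems, proposes a theoretical framework, develops data-efficient estimation methods, and validates them experimentally on a photonic computing system.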

Comments added 2 new references

详情
AI中文摘要

物理计算系统为实现硬件原生机器学习提供了有前景的途径,但其计算能力在原理上、任务无关和数据高效的方式下难以表征。我们扩展了信息处理能力(IPC)框架以适用于 stationary 物理计算系统,并建立了几个基本结果:个体容量在零和一之间被限制,其在完整基底上的总和受读数数量的限制,噪声严格减少这个界限。我们处理有限样本的 IPC 估计,并推导了影响朴素估计器的系统性正偏倚的渐近形式。基于这些结果,我们引入了基于 Richardson 推理和 Sobol 准随机采样的数据高效估计方法。我们通过基于皮秒激光脉冲在非线性光纤中传播的光子计算系统实验验证了该框架。通过改变激光功率和光纤长度,我们观察到由 Kerr 效应诱导的 IPC 分布系统性地向高阶非线性容量偏移。最后,我们证明了总 IPC 与基准机器学习任务的性能强相关,并提供了系统有效维度的可靠估计。这些结果确立了 IPC 作为连接物理计算系统内在动态与其机器学习性能的实用桥梁。

英文摘要

Physical computing systems provide a promising route toward hardware-native machine learning, but their computational capabilities remain difficult to characterize in a principled, task-independent, and data-efficient way. We extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and establish several fundamental results: individual capacities are bounded between zero and one, their sum over a complete basis is bounded by the number of readouts, and noise strictly reduces this bound. We address the finite-sample estimation of IPC and derive the asymptotic form of the systematic positive bias affecting naive estimators. Building on these results, we introduce data-efficient estimation methods based on Richardson extrapolation and Sobol quasi-random sampling. We validate the framework experimentally using a photonic computing system based on picosecond laser pulses propagating through a nonlinear optical fibre. By varying the laser power and fibre length, we observe systematic shifts of the IPC distribution toward higher-order nonlinear capacities induced by the Kerr effect. Finally, we demonstrate that the total IPC strongly correlates with performance on benchmark machine-learning tasks and provides a reliable estimate of the effective dimensionality of the system. These results establish IPC as a practical bridge between the intrinsic dynamics of physical computing systems and their machine-learning performance.
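The bias-removal idea behind Richardson extrapolation can be sketched generically. The assumption here, which the abstract does not state explicitly, is that the naive estimate at sample size N behaves like C + b/N to leading order; the function name and the `ratio` parameter are hypothetical.

```python
def richardson_extrapolate(estimate_n, estimate_rn, ratio=2.0):
    """Combine estimates taken at sample sizes N and ratio*N.

    Assumes (hypothetically) a leading bias of the form b / N, so that
    ratio * C(ratio*N) - C(N) = (ratio - 1) * C.
    """
    if ratio <= 1.0:
        raise ValueError("ratio must be greater than 1")
    return (ratio * estimate_rn - estimate_n) / (ratio - 1.0)
```

For example, if the true capacity is 0.3 and the bias is 5/N, the naive estimates at N = 100 and N = 200 are 0.35 and 0.325, and the extrapolated value recovers 0.3 up to rounding.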

2605.16545 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Symphony for Speech-to-Text: 支持实时医疗语音接口

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

发表机构 * Corti

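AI summary: This paper presents Symphony for Speech-to-Text, a medical-grade real-time speech recognition system. It decomposes transcription into specialized components for recognition, formatting, and contextual correction to optimize medical term recall and produce clinically structured text in real time. It substantially outperforms existing systems in medical settings while remaining on par with them in general-domain settings.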

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

详情
AI中文摘要

在数十年用于打字和更近期的环境记录后,语音正逐渐成为与技术及AI交互的主要方式,在医疗领域也不例外。然而,医疗语音识别仍然具有挑战性:系统必须捕捉专业术语,解决上下文歧义,并精确渲染测量、缩写和临床缩写。现有解决方案通常针对通用目的转录或狭窄的打字工作流进行优化,限制了其在安全关键设置中的可靠性以及在更广泛临床工作流中的实用性。我们引入Symphony for Speech-to-Text,一种用于实时流式和基于批量文件的临床使用的医疗级语音识别系统。Symphony将转录过程分解为识别、格式化和上下文校正等专业化组件,以优化医学术语召回,同时在实时生成临床结构文本并适应不同使用场景。在公共基准和医疗语音数据集上的评估表明,Symphony在临床场景中显著优于现有系统,同时在通用领域表现不逊,表明具有鲁棒的泛化能力而非过拟合。我们发布了一个临床基准数据集以支持可靠的验证和进一步推进医疗语音识别。Symphony通过生产级API提供,用于实时打字、对话转录和批量音频文件处理。

英文摘要

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.12456 2026-05-22 cs.CR cs.CL cs.LG 版本更新

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

TextSeal: 一种用于溯源与蒸馏保护的本地化大语言模型水印

Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran, Valeriu Lacatusu, Sylvestre-Alvise Rebuffi, Alexandre Mourachko, Surya Parimi, Christophe Ropers, Rashel Moritz, Vanessa Stark, Hady Elsahar, Pierre Fernandez

发表机构 * FAIR, Meta Superintelligence Labs(FAIR,Meta超智能实验室)

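AI summary: This paper proposes TextSeal, an advanced watermark for large language models. Building on Gumbel-max sampling, it introduces dual-key generation to restore output diversity and combines entropy-weighted scoring with multi-region localization to improve detection. It supports serving optimizations such as speculative decoding and multi-token prediction without adding inference overhead. It strictly dominates the SynthID-text baseline in detection strength and is robust to dilution, maintaining confident localized detection even in mixed human/AI documents. The scheme is theoretically distortion-free, and reasoning benchmarks confirm that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.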

详情
AI中文摘要

我们介绍TextSeal,一种最先进的大语言模型水印。基于Gumbel-max采样,TextSeal引入双密钥生成以恢复输出多样性,同时结合熵加权评分和多区域定位以提升检测性能。它支持推测解码和多令牌预测等服务优化,并不增加任何推理开销。TextSeal在检测强度上严格优于基线方法如SynthID-text,并对稀释具有鲁棒性,即使在混合的人类/AI文档中也能保持自信的本地化检测。该方案在理论上是无失真的,经推理基准评估确认其保持下游性能;同时通过多语言人工评估(6000次A/B对比,5种语言)显示无明显质量差异。除了用于溯源检测外,TextSeal还具有'放射性'特性:其水印信号通过模型蒸馏传递,可检测未经授权的使用。

英文摘要

We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
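The Gumbel-max sampling that the abstract builds on can be sketched as below. This shows only the generic single-key trick (pick the token maximizing u_i^(1/p_i), with u_i derived from a secret key and the context); TextSeal's dual-key generation, entropy-weighted scoring, and localization are not described in enough detail in the abstract to reproduce, and every name here is hypothetical.

```python
import hashlib
import math

def keyed_uniforms(key, context, vocab_size):
    """Pseudo-random uniforms in (0, 1), one per token, fixed by (key, context)."""
    uniforms = []
    for token_id in range(vocab_size):
        digest = hashlib.sha256(f"{key}|{context}|{token_id}".encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
        uniforms.append((value + 0.5) / 2**52)           # strictly inside (0, 1)
    return uniforms

def watermarked_sample(probs, key, context):
    """Gumbel-max trick: argmax_i u_i^(1/p_i), computed as argmax_i log(u_i) / p_i."""
    uniforms = keyed_uniforms(key, context, len(probs))
    best_token, best_score = None, -math.inf
    for token_id, (p, u) in enumerate(zip(probs, uniforms)):
        if p > 0.0:
            score = math.log(u) / p
            if score > best_score:
                best_token, best_score = token_id, score
    return best_token

def detection_score(tokens, contexts, key, vocab_size):
    """Mean of -log(1 - u_token): about 1 without the watermark, larger with it."""
    total = 0.0
    for token, context in zip(tokens, contexts):
        u = keyed_uniforms(key, context, vocab_size)[token]
        total += -math.log(1.0 - u)
    return total / len(tokens)
```

Over the randomness of the key, the sampled token follows `probs` exactly, which is the sense in which Gumbel-max schemes are called distortion-free; the detector only needs the key and the text, not the model.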

2605.12058 2026-05-22 cs.LG cs.AI 版本更新

Holder Policy Optimisation

Hölder Policy Optimisation

Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

Affiliations: University College London; Shanghai Jiao Tong University; The Hong Kong University of Science and Technology (Guangzhou)

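AI summary: This paper proposes HölderPO, a framework that unifies token-level probability aggregation via the Hölder mean, addressing the training collapse and unsatisfactory performance caused by fixed aggregation mechanisms. It proves how different values of p trade off gradient concentration against variance, schedules p over the course of training with a dynamic annealing algorithm, and reports better stability and convergence on multiple mathematical benchmarks.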

详情
英文摘要

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, yielding a substantial $7.2\%$ relative gain over standard GRPO, and secures an exceptional $93.8\%$ success rate on ALFWorld.
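The Hölder (power) mean at the centre of the abstract is easy to state in code. The sketch below shows only the mean itself for positive inputs such as token probabilities; how HölderPO plugs it into the GRPO objective or anneals p is not given in the abstract, and the function name is hypothetical.

```python
import math

def holder_mean(values, p):
    """Hölder (power) mean of positive values; p = 0 is the geometric mean."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    n = len(values)
    if p == 0:
        # Limit of the power mean as p -> 0.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)
```

Setting p = 1 gives the arithmetic mean, p = 0 the geometric mean, and p = -1 the harmonic mean; the result is non-decreasing in p, so larger p weights the largest probabilities more heavily.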

2605.11246 2026-05-22 cs.LG 版本更新

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

支持接近增强的扩散估计用于离线黑盒优化

Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, Xue Liu

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); McGill University; Mila - Quebec AI Institute; Amazon AGI

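AI summary: This paper proposes SPADE, a framework that recasts forward surrogate modeling as conditional generative modeling. It models the forward likelihood p(y|x) with a diffusion model and adds a Calibrated Diffusion Estimation module and a Support-Proximity Regularization mechanism to improve optimization performance.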

Comments Accepted by ICML 2026. First two authors contributed equally

详情
AI中文摘要

离线黑盒优化旨在仅使用静态数据集发现具有高属性分数的新设计,这一任务本质上受到分布外(OOD)外推问题的挑战。现有方法通常分为逆向方法,其在将分数映射到设计的 ill-posed 性质上挣扎,以及前向方法,其往往缺乏量化不确定性有效性的分布表达能力。在本文中,我们提出SPADE(Support-Proximity Augmented Diffusion Estimation),一种新颖的框架,通过条件生成建模的视角重新想象前向替代建模。SPADE通过扩散模型建模前向似然p(y|x),但通过两个关键增强来适应优化:(1)校准扩散估计模块,强制统计矩和成对排名的全局一致性;(2)支撑接近正则化机制,通过kNN基于的密度估计隐式内化数据流形约束p(x)。理论上,我们证明我们的正则化在第一阶上等价于最大化具有有效设计先验的贝叶斯后验。经验上,SPADE在Design-Bench任务和LLM数据混合优化基准上实现了最先进的性能。

英文摘要

Offline black-box optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|x) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a Calibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a Support-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(x) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark.
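The kNN-based density idea can be illustrated with the distance to the k-th nearest training design, a standard proxy for local data density. This is a sketch under assumptions: the abstract does not give the exact form of the Support-Proximity Regularization, so using this distance as a penalty is illustrative only, and the function name is hypothetical.

```python
import math

def knn_distance(x, dataset, k=3):
    """Distance from x to its k-th nearest neighbour in `dataset`.

    Larger values suggest x lies in a low-density region, far from the
    support of the offline data. Hypothetical stand-in for a proximity penalty.
    """
    if not dataset:
        raise ValueError("dataset must be non-empty")
    distances = sorted(math.dist(x, point) for point in dataset)
    return distances[min(k, len(distances)) - 1]
```

A candidate design inside the data cloud gets a small value, while one far outside gets a large value, so adding this term to a surrogate score discourages out-of-distribution proposals.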

2605.08982 2026-05-22 cs.LG 版本更新

PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

PMCTS:用于原理化并行推断时间扩展的粒子蒙特卡洛树搜索

Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier

发表机构 * Department of Intelligent Systems, TU Delft(代尔夫特理工大学智能系统系) Department of Computer Science, ETH Zürich(苏黎世联邦理工学院计算机科学系) Trent AI Limited(Trent AI有限公司) Information Systems, TU Eindhoven(埃因霍温理工大学信息系统系) Centrum Wiskunde & Informatica, Amsterdam(阿姆斯特丹数学与信息学研究中心)

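AI summary: This paper proposes PMCTS, a principled parallel MCTS algorithm suited to neural network evaluations. It achieves inference-time scaling through parallel compute and significantly outperforms conventional heuristic-based baselines across multiple domains.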

详情
AI中文摘要

蒙特卡洛树搜索(MCTS)是一种通过搜索来改进策略的广泛使用的方法,其在现实世界应用中日益受到关注。由于其搜索过程的顺序性和确定性,利用并行计算扩展MCTS的运行时间仍然是一个主要挑战。我们引入了粒子MCTS(PMCTS),据我们所知,这是首个原理化并行MCTS算法,适用于神经网络评估,并能保持正式的策略改进保证。经验上,PMCTS在并行计算下表现良好,并在多个领域中显著优于流行的启发式基线方法。

英文摘要

Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search, with increasing popularity for real-world applications. Due to the sequential and deterministic nature of its search, runtime scaling of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm that is suited for neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms the popular heuristic-based baselines across domains.

2605.07598 2026-05-22 cs.LG 版本更新

Optimal Recourse Summaries via Bi-Objective Decision Tree Learning

通过双目标决策树学习实现最优补救摘要

Ioannis Chatzis, Jason Liartis, Athanasios Voulodimos, Giorgos Stamou

发表机构 * Artificial Intelligence and Learning Systems Laboratory(人工智能与学习系统实验室) National Technical University of Athens(希腊国家技术大学)

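AI summary: This paper proposes SOGAR, which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front, balancing recourse effectiveness against cost and producing stable, low-cost, and effective recourse summaries.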

详情
AI中文摘要

可操作的补救为个体提供了改变不利分类器结果的行动。虽然在实例层面有用,但不适合作为全局审计和偏见检测,因为汇总局部行动是昂贵且不一致的。补救摘要通过将人口划分为子群体并为每个子群体分配一个共享行动,从而实现这一限制。设计摘要涉及补救效果和补救成本之间的根本权衡,现有方法未能充分解决。我们引入了最优和全局可操作补救摘要(SOGAR),将补救摘要学习转化为最优决策树学习问题,并找到帕累托前沿——即一组解决方案,其中改进一个目标必然使另一个恶化。SOGAR允许事后选择所需的权衡而无需重新训练。使用浅层轴平行决策树和稀疏叶行动,SOGAR产生稳定、低成本且有效的补救摘要,在效果和成本指标上均优于现有方法。

英文摘要

Actionable Recourse provides individuals with actions they can take to change an unfavorable classifier outcome. While useful at the instance level, it is ill-suited for global auditing and bias detection, since aggregating local actions is costly and often inconsistent. Recourse Summaries address this limitation by partitioning the population and assigning one shared action per subgroup, enabling comparison across subgroups. Designing summaries involves a fundamental trade-off between recourse effectiveness and recourse cost, which existing methods do not adequately address. We introduce Summaries of Optimal and Global Actionable Recourse (SOGAR), which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front -- the complete set of solutions where improving one objective necessarily worsens the other. SOGAR enables post-hoc selection of the desired trade-off without retraining. Using shallow axis-parallel decision trees and sparse leaf actions, SOGAR produces stable, low-cost, and effective recourse summaries that outperform existing approaches across effectiveness and cost metrics.
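The Pareto front mentioned in the abstract can be illustrated with a brute-force filter over candidate summaries scored by (effectiveness, cost). This is not SOGAR's decision tree search, which the abstract does not detail; it only shows what "non-dominated" means for the two objectives, and the function name is hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (effectiveness, cost) pairs.

    Effectiveness is maximised and cost is minimised. A point is dominated
    if another point is at least as good on both and strictly better on one.
    """
    front = []
    for i, (eff_i, cost_i) in enumerate(points):
        dominated = any(
            eff_j >= eff_i and cost_j <= cost_i and (eff_j > eff_i or cost_j < cost_i)
            for j, (eff_j, cost_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((eff_i, cost_i))
    return sorted(front)
```

Once the front is available, a user can pick any point on it after the fact, which mirrors the post-hoc selection of the trade-off described in the abstract.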

2605.06669 2026-05-22 cs.CR cs.AI cs.LG 版本更新

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

评估教育LLM导师的提示注入防御:安全-可用性-延迟的权衡

Alexandre Cristovão Maiorano

发表机构 * Lumytics

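AI summary: This paper presents a framework for evaluating prompt-injection defenses, examines the trade-offs among security, usability, and latency in educational LLM tutors, and compares the performance of different defense mechanisms experimentally.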

Comments 19 pages, 4 figures, 9 tables

详情
AI中文摘要

教育LLM导师面临一个核心的AI对齐挑战:它们必须在遵循用户意图的同时保持教学约束和安全政策。我们提出了一个评估方法,用于评估提示注入防御在该场景中的表现,显示了防护栏设计在对抗性鲁棒性、良性任务可用性和响应延迟之间存在显式的权衡。我们评估了一个领域特定的多层安全防护流水线,结合确定性模式过滤器、结构验证、上下文沙箱和会话级行为检查。在受控的保留基准测试中,该流水线实现了低绕过率和假阳性率,同时优化了平均延迟——一个优先考虑教学可用性(零假阳性)而保持可测量攻击抵抗力的操作点。我们提供了一个可重复的基准测试协议,用于在相同条件下进行头对头比较,包括分层Bootstrap置信区间、配对McNemar显著性检验、多种子敏感度扫描,以及在相同划分上对Prompt Guard和NeMo Guardrails的直接评估。结果揭示了操作权衡:NeMo在16.22%的假阳性率下达到0%的绕过率,而Prompt Guard在3.60%的假阳性率下达到38.48%的绕过率。该框架支持在不同机构风险和可用性要求下,基于证据的防护栏选择。

英文摘要

Educational LLM tutors face a core AI alignment challenge: they must follow user intent while preserving pedagogical constraints and safety policies. We present an evaluation methodology for prompt-injection defenses in this setting, showing that guardrail design entails explicit trade-offs among adversarial robustness, benign-task usability, and response latency. We evaluate a domain-specific multi-layer safeguard pipeline combining deterministic pattern filters, structural validation, contextual sandboxing, and session-level behavioral checks. On a controlled holdout benchmark, the pipeline reaches low bypass and false positive rates with optimized average latency - an operating point that prioritizes pedagogical usability (zero false positives) while maintaining measurable attack resistance. We provide a reproducible benchmark protocol for head-to-head comparison under identical conditions, including stratified bootstrap confidence intervals, paired McNemar significance tests, multi-seed sensitivity sweeps, and direct evaluation of Prompt Guard and NeMo Guardrails on the same split with unified instrumentation. Results expose operational trade-offs: NeMo reaches 0 percent bypass at 16.22 percent FPR and roughly 1.5s latency, while Prompt Guard yields 38.48 percent bypass with 3.60 percent FPR. The framework supports evidence-based guardrail selection for AI tutoring systems under different institutional risk and usability requirements.
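The two headline metrics and the paired McNemar test named in the abstract can be sketched as follows. The data layout (lists of booleans saying whether each prompt was blocked) and the function names are assumptions for illustration; the statistic uses the common continuity-corrected form, which may differ from the variant the author used.

```python
def bypass_rate(attack_blocked):
    """Fraction of attack prompts that were NOT blocked."""
    return sum(1 for blocked in attack_blocked if not blocked) / len(attack_blocked)

def false_positive_rate(benign_blocked):
    """Fraction of benign prompts that were blocked."""
    return sum(1 for blocked in benign_blocked if blocked) / len(benign_blocked)

def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square statistic.

    b: cases where defense A is correct and defense B is wrong;
    c: cases where A is wrong and B is correct.
    Compare against 3.841 (chi-square, 1 degree of freedom, alpha = 0.05).
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Because both defenses are evaluated on the same prompts, only the discordant pairs (b and c) carry information about which one is better, which is why the test suits the head-to-head protocol described above.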

2605.06597 2026-05-22 cs.CL cs.AI cs.LG 版本更新

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

UniSD:面向大语言模型的统一自蒸馏框架

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Los Angeles(加州大学洛杉矶分校) Carnegie Mellon University(卡内基梅隆大学) William & Mary(威廉与玛丽大学)

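AI summary: This paper proposes UniSD, a framework for systematically studying self-distillation. It integrates several mechanisms that improve supervision reliability, representation alignment, and training stability, validates self-distillation across multiple benchmarks and models, and builds UniSDfull, the best-performing pipeline.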

Comments Website: https://unifiedsd.github.io/ Code: https://github.com/Ahren09/UniSD

详情
AI中文摘要

自蒸馏(SD)为在不依赖更强外部教师的情况下适应大语言模型(LLMs)提供了一条有前途的路径。然而,在自回归LLMs中,SD仍然具有挑战性,因为自生成轨迹是自由形式的,正确性依赖于任务,且合理的推理仍可能提供不稳定或不可靠的监督。现有方法主要考察孤立的设计选择,留下其有效性、作用和交互关系不清晰。在本文中,我们提出UniSD,一个统一的框架,系统地研究自蒸馏。UniSD整合了互补的机制,解决监督可靠性、表征对齐和训练稳定性问题,包括多教师一致、EMA教师稳定、token级对比学习、特征匹配和发散剪裁。在六个基准和六个模型(来自三个模型家族)上,UniSD揭示了自蒸馏何时优于静态模仿,哪些组件驱动了收益,以及这些组件在不同任务间的交互方式。基于这些见解,我们构建了UniSDfull,一个整合互补组件的流水线,实现了最强的整体性能,比基模型提高了+5.4点,比最强基线提高了+2.8点。广泛评估凸显了自蒸馏作为一种实用且可控的高效LLM适应方法,无需更强的外部教师。

英文摘要

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
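Of the components listed in the abstract, EMA teacher stabilization has a simple generic form: the teacher's weights track an exponential moving average of the student's. The sketch below uses plain dictionaries of floats as stand-ins for parameter tensors; the decay value and function name are hypothetical, and UniSD's actual update schedule is not given in the abstract.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to floats here; in practice
    these would be tensors. The decay value is an illustrative assumption.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

A decay close to 1 makes the teacher change slowly, which is the usual motivation for this kind of stabilization when the student's own outputs are the supervision signal.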

2605.05118 2026-05-22 cs.LG cs.AI stat.ML 版本更新

On the Wasserstein Gradient Flow Interpretation of Drifting Models

关于漂移模型的Wasserstein梯度流解释

Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov, James Thornton, Valentin De Bortoli, Arnaud Doucet

发表机构 * Google DeepMind(谷歌深Mind)

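AI summary: This paper analyzes drifting models through Wasserstein gradient flows (WGF), clarifying the relationship between the GMD framework and WGF paths. It presents three main results: one proposed algorithm corresponds to the limiting point of a WGF on the KL divergence; the algorithm actually implemented resembles the fixed point of a WGF on the Sinkhorn divergence but lacks some of its desirable properties; and the approach extends to the limiting points of other WGFs, including MMD, the sliced Wasserstein distance, and GAN critic functions.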

详情
AI中文摘要

最近,Deng等人(2026)提出了生成模型通过漂移(GMD),一种新的生成任务框架。本文通过Wasserstein梯度流(WGF)的视角分析了GMD,即概率测度空间中函数的最速下降路径,配备了最优传输的几何结构。与之前的WGF相关贡献不同,GMD可以被视为直接针对特定WGF流的固定点。我们展示了三个主要结果:首先,Deng等人(2026)提出的一种算法对应于在KL散度上的WGF的极限点,伴有Parzen平滑。其次,Deng等人(2026)实际实现的算法对应于另一种过程,类似于Sinkhorn散度的固定点,但缺乏后者的一些理想特性。第三,同样的想法可以扩展到其他WGF的极限点,包括最大均值差异(MMD)、切线Wasserstein距离和GAN批评者函数。

英文摘要

Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF flow. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, that the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.
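One of the functionals named in the abstract, the Maximum Mean Discrepancy, has a short sample-based estimate. The sketch below computes only the functional for 1-D samples with a Gaussian kernel; it does not implement the gradient flow or the drifting algorithm, and the kernel choice, bandwidth, and function names are assumptions.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""
    def mean_kernel(a, b):
        return sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two samples coincide and grows as they separate, which is what makes it usable as an objective whose steepest-descent path moves generated samples toward the data.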

2605.04217 2026-05-22 cs.LG cs.CL 版本更新

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

Jordan-RoPE: 通过复Jordan块实现非半单相对位置编码

Yaobo Zhang

发表机构 * School of Physics, Ningxia University(宁夏大学物理学院)

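AI summary: This paper proposes Jordan-RoPE, a non-semisimple relative positional encoding in which a complex rotary eigenvalue and a nilpotent response share the same defective Jordan block. This realizes a distance-modulated phase basis and generates oscillatory-polynomial features such as e^{-γd}cos(ωd) and e^{-γd}sin(ωd); the approach is evaluated on language models.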

Comments 15 pages, 4 figures, 6 tables; code available at https://github.com/ybzhang-nxu/jordan_rope

详情
AI中文摘要

相对位置编码决定了查询-键滞后函数能够进入原始注意力logit的哪些功能。RoPE提供旋转相位,而ALiBi提供加性距离偏置。受线性平移不变位置编码的群论观点启发,我们研究了非半单情况,其中复旋转特征和Nilpotent响应共存于同一缺陷Jordan块中。所生成的相对算子产生如e^{-γd}cos(ωd)、e^{-γd}sin(ωd)、d e^{-γd}cos(ωd)和d e^{-γd}sin(ωd)等振荡-多项式特征,其中因果滞后d=i-j≥0。因此,该构造实现了距离调制的相位基d e^{iωd},而非仅仅添加单独的距离通道到RoPE。我们将其精确Jordan-RoPE公式化为非半单一参数表示,给出其实块形式,并指定非正交位置映射所需的共轭查询作用。我们还区分了该精确表示与稳定变体,后者虽然改善了数值行为但破坏了精确群律。核级别诊断和一个Jordan友好的合成语言模型任务表明,当目标包含距离调制的相位交互时,耦合的Jordan基是有用的。在小型WikiText-103字语言模型上,一个缩放精确变体在Jordan家族中优于RoPE和直接求和基线,而RoPE+ALiBi仍然是整体最强的。证据是结构性的,而非广义的性能声明。

英文摘要

Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
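One way to see where the oscillatory-polynomial features come from is to exponentiate a 2x2 complex Jordan block directly. The sketch below assumes the generator A = [[λ, 1], [0, λ]] with λ = -γ + iω, which is a standard textbook form rather than the paper's stated parameterization; the function names are hypothetical.

```python
import cmath

def jordan_exp(d, gamma, omega):
    """exp(d * A) for A = [[lam, 1], [0, lam]], lam = -gamma + i*omega.

    Since the nilpotent part squares to zero, exp(d*A) = e^{lam*d} * [[1, d], [0, 1]].
    """
    lam = complex(-gamma, omega)
    e = cmath.exp(lam * d)
    return [[e, d * e], [0j, e]]

def matmul2(a, b):
    """Product of two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def features(d, gamma, omega):
    """(e^{-gd}cos(wd), e^{-gd}sin(wd), d*e^{-gd}cos(wd), d*e^{-gd}sin(wd))."""
    block = jordan_exp(d, gamma, omega)
    return (block[0][0].real, block[0][0].imag, block[0][1].real, block[0][1].imag)
```

The off-diagonal entry carries the factor d, which yields the distance-modulated terms, and multiplying the blocks for lags d1 and d2 gives the block for d1 + d2, which is the exact group law the abstract contrasts with stabilized variants.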

2605.04062 2026-05-22 cs.LG cs.AI 版本更新

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor: 一种通过混合精度量化感知蒸馏实现大语言模型轻量化的框架

Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang, Zhi-Hua Zhou

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Intelligent Science and Technology, Nanjing University(南京大学智能科学与技术学院) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Microsoft AI(微软AI)

AI总结 本文提出EdgeRazor框架,通过混合精度量化感知蒸馏方法,在资源受限设备上部署大语言模型,实现了更高的压缩比和更高效的性能。

详情
AI中文摘要

量化已成为在资源受限设备上部署大语言模型(LLMs)的主流方法,但将精度压缩到4位以下通常会导致严重的性能退化或高昂的重训练成本。在本文中,我们提出了EdgeRazor,一种通过混合精度量化感知蒸馏实现LLM轻量化的框架。它包含三个模块:混合精度结构量化用于精细控制位宽,层自适应特征蒸馏动态选择信息量最大的特征进行对齐,以及熵感知KL散度用于在人工标注和蒸馏数据集上实现前向-反向平衡。在MobileLLM和Qwen系列上的评估表明,在权重-激活量化下,1.88位的Qwen3-0.6B-EdgeRazor比最先进的2位基线高出11.27,比最强的3位基线高出4.38,而量化后的MobileLLM-350M-EdgeRazor所需的训练预算比领先的量化感知训练方法低4至10倍。在效率方面,EdgeRazor在所有位宽下实现了更高的压缩比,1.58位的Qwen3-0.6B-EdgeRazor将存储从1.11 GB减少到0.19 GB,同时解码速度比16位基线快15.16倍。这些结果从实证上验证了EdgeRazor的有效性和效率。代码可以从GitHub和Huggingface访问。

英文摘要

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10$\times$ lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16$\times$ over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The code is available on GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Hugging Face (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).

2605.02409 2026-05-22 cs.LG 版本更新

Inducing Permutation Invariant Priors in Bayesian Optimization for Carbon Capture and Storage Applications

在碳捕集与封存应用的贝叶斯优化中诱导排列不变先验

Sofianos Panagiotis Fotias, Vassilis Gaganis

发表机构 * School of Mining and Metallurgical Engineering, National Technical University of Athens(雅典国立技术大学采矿与冶金工程学院)

AI总结 本文提出了一种新的高斯过程核(GP-Perm),用于在碳捕集与封存项目中处理排列对称性问题,同时结合深度核学习模型(DKL-DS)以学习排列不变的嵌入,通过八个用例评估了所提出的方法。

详情
AI中文摘要

贝叶斯优化是一种迭代方法,专门用于优化昂贵的黑盒目标函数。高斯过程(GP)等代理模型是贝叶斯优化的黄金标准,但对于具有排列对称性的输入可能效率低下,因为最常用的核更适合向量输入,而非无序的项集。受此问题启发,我们转向用排列不变的贝叶斯优化来解决碳捕集与封存项目中的井位布置问题。高保真黑盒模拟器被设定为在井组控制下运行各井,从而在注入井组和生产井组内部产生标准GP核无法利用的排列对称性。在本工作中,我们的主要贡献是一种新的高斯过程核(GP-Perm),它通过集合所诱导的经验表示之间的稳定散度来比较集合,从而编码排列不变性,并可以与标准核结合以处理额外的向量值输入。作为学习式不变性基线,我们还考虑了使用Deep Sets架构学习排列不变嵌入的深度核学习模型(DKL-DS)。我们在8个用例中评估了所提出的方法,包括七个合成基准和一个现实的CCS案例研究(Johansen构造)。

英文摘要

Bayesian Optimization is an iterative method, tailored to optimizing expensive black-box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inputs with permutation symmetries, as the most common kernels employed are better suited for vector inputs rather than unordered sets of items. Motivated by this issue, we turn to permutation invariant Bayesian Optimization for well placement in Carbon Capture and Storage projects. The high-fidelity black-box simulator is instructed to operate wells under group control, giving rise to permutation symmetries within injector and producer groups that cannot be exploited with standard GP kernels. In this work, our main contribution is a novel Gaussian Process kernel (GP-Perm) that encodes permutation invariance by comparing sets through a stable divergence between their induced empirical representations, and can be combined with standard kernels for additional vector-valued inputs. As a learned invariant baseline, we also consider a Deep Kernel Learning model (DKL-DS) using the Deep Sets architecture to learn a permutation-invariant embedding. We evaluate the proposed methodology across 8 use cases, comprising seven synthetic benchmarks and one realistic CCS case study (Johansen formation).

2605.01369 2026-05-22 eess.SP cs.AI cs.LG 版本更新

MU-SHOT-Fi: Self-Supervised Multi-User Wi-Fi Sensing with Source-free Unsupervised Domain Adaptation

MU-SHOT-Fi: 基于源无关无监督域适应的多用户Wi-Fi感知

Ahmed Y. Radwan, Hina Tabassum

发表机构 * Department of Electrical Engineering and Computer Science(电气工程与计算机科学系)

AI总结 本文提出MU-SHOT-Fi框架,通过源无关无监督域适应方法,在单用户和多用户Wi-Fi感知中实现准确的活动分类和占用估计,同时防止模型崩溃。

详情
Journal ref
IEEE Internet of Things Journal, Early Access, 2026
AI中文摘要

深度学习已被广泛应用于基于Wi-Fi CSI的人体活动识别(HAR),因为它能够以隐私保护和成本效益的方式学习时空特征。然而,基于深度学习的模型在跨环境泛化能力差,特别是在多用户设置中,重叠活动导致CSI纠缠和域偏移。实际部署通常由于隐私限制限制访问标记源数据,这促使使用仅未标记目标域CSI和预训练源模型进行源无关适应。在本文中,我们提出了MU-SHOT-Fi,一种用于单用户和多用户Wi-Fi感知的源无关无监督域适应框架。MU-SHOT-Fi在源训练期间采用排列不变的集合预测与匈牙利匹配,随后在目标域中采用冻结分类器骨干适应。为了实现无标签的稳定适应,我们引入了占用加权信息最大化,通过将多样性正则化集中在可能占用的槽位上,同时排除主导类别的边际熵。此外,我们采用二进制旋转预测作为空间自监督,利用CSI频率-时间结构学习域不变特征。对于单用户场景,我们引入SU-SHOT-Fi,通过将占用加权替换为标准信息最大化,并结合对比预测编码以利用时间一致性。在WiMANS和Widar 3.0数据集上进行了广泛的实验,涵盖了跨环境、跨频率、跨方向和组合域偏移,证明MU-SHOT-Fi在大域偏移下有效恢复多用户精确活动分类性能,同时保持准确的占用估计并防止向主导类崩溃。

英文摘要

Deep learning has been widely adopted for WiFi CSI-based human activity recognition (HAR) due to its ability to learn spatio-temporal features in a privacy-preserving and cost-effective manner. However, DL-based models generalize poorly across environments, a challenge amplified in multi-user settings where overlapping activities cause CSI entanglement and domain shifts. Practical deployments often limit access to labeled source data due to privacy constraints, motivating source-free adaptation using only unlabeled target-domain CSI and a pre-trained source model. In this paper, we propose MU-SHOT-Fi, a source-free unsupervised domain adaptation framework for single- and multi-user Wi-Fi sensing. MU-SHOT-Fi employs permutation-invariant set prediction with Hungarian matching during source training, followed by frozen-classifier backbone adaptation in the target domain. To enable stable adaptation without labels, we introduce occupancy-weighted information maximization that prevents model collapse by focusing diversity regularization on likely-occupied slots while excluding the dominant class from marginal entropy. Additionally, we employ binary rotation prediction as spatial self-supervision that exploits CSI frequency-time structure to learn domain-invariant features. For single-user scenarios, we introduce SU-SHOT-Fi by replacing occupancy weighting with standard information maximization and incorporating contrastive predictive coding to exploit temporal consistency. Extensive experiments on the WiMANS and Widar 3.0 datasets across cross-environment, cross-frequency, cross-orientation, and combined domain shifts demonstrate that MU-SHOT-Fi effectively recovers multi-user exact-activity classification performance under large domain shifts while maintaining accurate occupancy estimation and preventing collapse toward dominant classes.
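The abstract refers to "standard information maximization", the objective it says SU-SHOT-Fi uses. A minimal sketch of that standard objective follows, assuming the usual form (mean per-sample entropy minus the entropy of the batch-averaged prediction); the function names are hypothetical, and the occupancy-weighted variant used by MU-SHOT-Fi is not reproduced because the abstract does not specify its weights.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution; 0*log(0) is treated as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def information_maximization_loss(probs):
    """Standard IM loss: confident per-sample predictions, diverse batch-level marginal.

    probs: list of per-sample class-probability lists (softmax outputs).
    Lower is better: the first term rewards confidence, the second rewards diversity.
    """
    n = len(probs)
    k = len(probs[0])
    conditional = sum(entropy(p) for p in probs) / n
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    return conditional - entropy(marginal)


if __name__ == "__main__":
    diverse = [[1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    print(information_maximization_loss(diverse), information_maximization_loss(collapsed))
```

In this toy case the collapsed batch scores worse than the diverse one, which is the collapse-toward-a-dominant-class failure that the marginal-entropy term is meant to discourage.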

2605.00392 2026-05-22 cs.CV cs.LG 版本更新

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

RTPrune: 两次阅读启发的令牌修剪用于高效DeepSeek-OCR推理

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 本文提出RTPrune,一种针对DeepSeek-OCR的两次阶段令牌修剪方法,通过优先保留高范数视觉令牌并利用最优传输理论进行令牌配对和合并,从而在OCR任务中实现更高效的推理性能和更优的效率-精度权衡。

Comments 21 pages, accepted by ICML2026

详情
AI中文摘要

DeepSeek-OCR利用视觉-文本压缩来减少长文本处理成本并加速推理,但视觉令牌仍然容易出现冗余的文本和结构信息。此外,当前用于传统视觉-语言模型(VLMs)的令牌修剪方法由于不恰当的压缩机制而无法保持文本保真度。通过分析DeepSeek-OCR的解码过程,我们发现了一种独特的双阶段阅读轨迹:模型最初优先处理大多数高范数令牌,然后随后重新分配其注意力到剩余的令牌上。受此启发,我们提出RTPrune,一种专为DeepSeek-OCR设计的双阶段令牌修剪方法。在第一阶段,我们优先保留捕捉显著文本和结构信息的高范数视觉令牌。在第二阶段,剩余的令牌基于最优传输理论进行配对和合并,以实现高效的特征聚合。我们进一步引入了一个动态修剪比率,以适应令牌相似性和文本密度,从而在OCR任务中实现更优的效率-精度权衡。广泛的实验表明,RTPrune在OmniDocBench上实现了99.47%的准确率和1.23倍更快的prefill速度,当应用于DeepSeek-OCR-Large时,仅保留84.25%的令牌。

英文摘要

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
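The first stage described above (prioritizing high-norm visual tokens) can be sketched as a ranking by L2 norm. This is a minimal illustration under assumptions: the function name, the fixed `keep_ratio`, and the plain top-k rule are hypothetical, and neither the dynamic pruning ratio nor the optimal-transport merging of the second stage is reproduced.

```python
import math


def select_high_norm(tokens, keep_ratio):
    """Split token indices into (kept, rest) by L2 norm, keeping the largest norms.

    tokens: list of feature vectors (lists of floats).
    keep_ratio: fraction of tokens to keep directly (hypothetical fixed value).
    Returned index lists are sorted so the original token order is preserved.
    """
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    k = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k]), sorted(ranked[k:])


if __name__ == "__main__":
    toks = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0], [0.0, 2.0]]
    kept, rest = select_high_norm(toks, keep_ratio=0.5)
    print(kept, rest)
```

The `rest` indices are the tokens that a second stage would pair and merge rather than discard outright.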

2604.26836 2026-05-22 cs.LG cs.SY eess.SY 版本更新

Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics

面向概率神经网络动力学的不确定性感知预测安全过滤器

Bernd Frauenknecht, Lukas Kesper, Daniel Mayfrank, Henrik Hose, Sebastian Trimpe

发表机构 * Institute for Data Science in Mechanical Engineering (DSME), RWTH Aachen University(机械工程数据科学研究所(DSME),亚琛工业大学) Institute of Climate and Energy Systems (ICE), Energy Systems Engineering (ICE-1), Forschungszentrum Jülich GmbH(气候与能源系统研究所(ICE),能源系统工程(ICE-1),于利希研究中心)

AI总结 本文提出了一种不确定性感知的预测安全过滤器(UPSi),通过将未来结果建模为可达集,利用概率集成(PE)神经网络动力学模型提供严格的安全预测,从而在基于模型的强化学习(MBRL)中提升探索安全性,同时保持与标准MBRL相当的性能。

详情
AI中文摘要

预测安全过滤器(PSFs)利用模型预测控制在深度强化学习(RL)探索期间强制满足约束,但其对第一性原理模型或高斯过程的依赖限制了可扩展性和更广泛的适用性。与此同时,基于模型的RL(MBRL)方法通常使用概率集成(PE)神经网络,在极少先验知识的情况下从数据中捕捉复杂的高维动力学。然而,现有将PE整合到PSFs中的尝试缺乏严格的不确定性量化。我们引入了不确定性感知预测安全过滤器(UPSi),这是一种通过将未来结果表述为可达集、利用PE动力学模型提供严格安全预测的PSF。UPSi引入了显式的确定性约束以防止模型被利用,并可无缝集成到常见的MBRL框架中。我们在标准安全RL基准上于Dyna风格的MBRL中评估了UPSi,结果显示其探索安全性相比先前的神经网络PSFs有显著提升,同时性能与标准MBRL相当。UPSi弥合了现代MBRL的可扩展性和通用性与预测安全过滤器的安全保证之间的差距。

英文摘要

Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.

2604.16076 2026-05-22 cs.LG cs.AI cs.NE 版本更新

Prototype-Grounded Concept Models for Verifiable Concept Alignment

基于原型的可验证概念模型用于可验证的概念对齐

Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

发表机构 * Department of Computer Science, KU Leuven(鲁汶大学计算机科学系) IBM Research, Zurich(苏黎世IBM研究院)

AI总结 本研究提出了一种基于原型的概念模型(PGCMs),通过将概念与学习到的视觉原型关联起来,从而提高概念对齐的可验证性和可解释性,同时保持预测性能。

详情
AI中文摘要

概念瓶颈模型(CBMs)旨在通过人类可理解的概念来组织预测,从而提高深度学习的可解释性,但它们无法验证所学概念是否与人类的预期含义一致,这损害了可解释性。我们引入了基于原型的概念模型(PGCMs),将概念锚定在学习到的视觉原型上:这些原型是作为概念显式证据的图像部分。这种锚定使得可以直接检查概念语义,并支持在原型层面进行有针对性的人工干预以纠正不一致。实证结果表明,PGCMs的预测性能与最先进的CBMs相当,同时显著提高了透明度、可解释性和可干预性。

英文摘要

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs achieve similar predictive performance as state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

2604.09095 2026-05-22 cs.LG math.OC 版本更新

GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimization

GeoPAS: 在连续黑盒优化中用于算法选择的几何探测

Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir

发表机构 * Duke Kunshan University(杜克昆山大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出了一种几何探测框架,通过随机采样多尺度二维切片来表示问题实例,并结合有效性掩码感知的视觉池化进行聚合,从而在连续黑盒优化中实现算法选择。

Comments 20 pages, 9 figures, 6 tables; extended version of a GECCO 2026 poster-track paper; code available at https://github.com/BradWangW/GeoPAS

详情
AI中文摘要

连续黑盒优化的自动化算法选择依赖于在有限探测下表示问题信息,并在重尾性能分布下选择求解器。本文提出了一种几何探测框架,通过随机采样目标函数景观的多尺度二维切片来表示每个问题实例。这些切片通过有效性掩码感知的视觉池化进行编码,并聚合为实例表示。随后,通过一个对数复合分数进行求解器选择,该分数结合了学习到的实例条件估计和算法侧经验先验。该框架在标准单目标黑盒优化基准套件上评估,使用由十二种求解器组成的组合,并在实例级、分组随机和问题级迁移协议下进行测试。在两种套件内协议下,它将聚合平均相对期望运行时间从单一最佳求解器的30.37降至3.14和3.61,同时改善了中位数和上尾性能。在问题级迁移下,标准自适应设置改善了典型和中等尾部性能,但均值仍被罕见的极端失败所主导;一个偏重先验的评分变体缓解了这种失败模式,尽管其鲁棒性可能依赖于基准。结果表明,粗粒度几何探测提供了有用的求解器相关信息,而鲁棒的跨问题选择还取决于与度量对齐的决策评分。

英文摘要

Automated algorithm selection for continuous black-box optimization depends on representing problem information under limited probing and selecting solvers under heavy-tailed performance distributions. This paper proposes a geometric probing framework that represents each problem instance by randomly sampled multi-scale two-dimensional slices of the objective landscape. The slices are encoded with validity-mask-aware visual pooling and aggregated into an instance representation. Solver selection is then performed by a logarithmic composite score combining a learned instance-conditioned estimate with an algorithm-side empirical prior. The framework is evaluated on a standard single-objective black-box optimization benchmark suite with a portfolio of twelve solvers under instance-level, grouped random, and problem-level transfer protocols. Under the two within-suite protocols, it reduces aggregate mean relative expected running time from 30.37 for the single best solver to 3.14 and 3.61, while also improving median and upper-tail performance. Under problem-level transfer, the canonical adaptive setting improves typical and moderate-tail performance but leaves the mean dominated by rare extreme failures; a prior-heavy scoring variant mitigates this failure mode, although its robustness may be benchmark-dependent. The results suggest that coarse geometric probes provide useful solver-relevant information, while robust cross-problem selection also depends on metric-aligned decision scoring.

2604.08362 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

迈向真实世界的人类行为模拟:在长时间跨度、跨场景、异质行为轨迹上对大语言模型进行基准测试

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学) Kuaishou Technology(快手科技)

AI总结 本文提出OmniBehavior基准测试,通过真实世界数据整合长周期、跨场景和异质行为模式,揭示现有模型在模拟复杂人类行为时的局限性,包括对正向平均人的趋同、人格同质化和乌托邦偏见,为未来高保真模拟研究指明方向。

Comments Project page: https://OmniBehavior.github.io

详情
AI中文摘要

大语言模型(LLMs)的出现揭示了通用用户模拟的潜力。然而,现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据,无法捕捉真实人类行为的整体性。为弥合这一差距,我们引入OmniBehavior,首个完全基于真实世界数据构建的用户模拟基准测试,将长周期、跨场景和异质行为模式整合到统一框架中。基于此基准测试,我们首先提供了实证证据,表明以往孤立场景的数据集存在隧道视野问题,而真实世界决策依赖于长期的跨场景因果链。对最新LLMs的广泛评估显示,当前模型在模拟这些复杂行为时表现不佳,即使扩展上下文窗口,性能也趋于平稳。关键的是,模拟行为与真实行为的系统性比较揭示了根本性的结构偏差:LLMs倾向于趋同于正向平均人,表现出超活跃、人格同质化和乌托邦偏见。这导致了个体差异和长尾行为的丧失,突显了未来高保真模拟研究的关键方向。

英文摘要

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

2604.02889 2026-05-22 stat.ML cs.AI cs.LG 版本更新

Rethinking Forward Processes for Score-Based Nonlinear Data Assimilation in High Dimensions

重新思考高维基于分数的非线性数据同化中的前向过程

Eunbi Yoon, Won Chang, Donghan Kim, Dae Wook Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种针对数据同化问题的改进前向过程,用于高维非线性系统的状态估计,通过改进的分数基滤波器在测量空间中转换系统状态,提高了同化性能。

详情
AI中文摘要

数据同化是通过结合模型预测和测量来估计动力系统随时间变化的状态的过程。当系统是非线性且高维时,这一任务变得具有挑战性。为了解决这个问题,最近出现了基于分数的贝叶斯滤波器。然而,这些方法在某些情况下仍表现不佳,特别是在空间稀疏测量下。这种性能退化源于对似然分数的启发式近似,其误差会随时间累积。这一局限的原因在于,这些方法只是沿用了生成建模中的经典前向过程,将数据分布变换为高斯分布,而该过程与测量方程无关。在这里,我们提出了一种为滤波量身定制的前向过程,将系统状态向测量空间变换,从而得到理论上合理的似然分数表述。基于此,我们开发了测量感知的基于分数的滤波器(MASF)。我们在Kolmogorov流上评估了MASF,这是一个维度高达$\mathcal{O}(10^5)$的高维流体基准,评估涵盖多种测量算子,包括状态与测量之间存在维度不匹配的非线性情形。与现有基于分数的滤波器和集合型卡尔曼滤波器相比,MASF表现出更好的性能。值得注意的是,在使用摊销预训练时,MASF相比基线实现了高达$28.2\times$的挂钟时间加速。我们的实现可在 https://github.com/tcnllab-oss/masf 获得。

英文摘要

Data assimilation is the process of estimating the state of a dynamical system over time by combining model predictions with measurements. This task becomes challenging when the system is nonlinear and high-dimensional. To address this, score-based Bayesian filters have recently emerged. However, these methods still show unsatisfactory performance in certain cases, particularly under spatially sparse measurements. Such degradation stems from heuristic approximations of the likelihood score, whose errors can accumulate over time. This limitation arises because the methods simply adopt a classical forward process for generative modeling that transforms a data distribution toward a Gaussian distribution, which is independent of the measurement equation. Here, we propose a forward process tailored for filtering that transforms the system state toward the measurement space, enabling a theoretically sound formulation of the likelihood score. Based on this, we develop the Measurement-Aware Score-based Filter (MASF). We evaluate MASF on Kolmogorov flow, a high-dimensional fluid benchmark with up to $\mathcal{O}(10^5)$ dimensions, under diverse measurement operators, including nonlinear cases with a dimensional mismatch between the state and the measurements. MASF shows improved performance over existing score-based filters and ensemble-type Kalman filters. Notably, MASF achieves up to a $28.2\times$ wall-clock speedup compared with the baselines when using amortized pretraining. Our implementation is available at \texttt{https://github.com/tcnllab-oss/masf}.

2603.29981 2026-05-22 cs.LG stat.ML 版本更新

Aligning Validation with Deployment in Spatial Prediction: Target-Weighted Cross-Validation

在空间预测中对齐验证与部署:目标加权交叉验证

Alexander Brenning, Thomas Suesse

发表机构 * Friedrich Schiller University Jena(耶拿弗里德里希·席勒大学) ELLIS Unit Jena(耶拿ELLIS单位)

AI总结 本文提出了一种基于加权交叉验证的部署导向验证框架,通过引入目标加权交叉验证(TWCV)来对齐验证任务与指定领域内预测任务的分布,以减少因采样偏差导致的预测误差。

详情
AI中文摘要

可靠地估计预测性能对于空间环境建模至关重要,其中机器学习模型用于从分布不均的观测数据中生成地图。标准交叉验证(CV)假设验证数据能够代表整个目标区域内的预测条件。在实践中,由于偏好性采样或聚集采样,这一假设经常被违反,导致性能和不确定性估计出现偏差。我们引入了一种基于加权交叉验证的面向部署的验证框架,使验证任务与指定区域内预测任务的分布对齐。该框架包括重要性加权交叉验证(IWCV)和一种基于校准的方法,即目标加权交叉验证(TWCV),后者使用具有空间意义的任务描述符,如环境协变量和预测距离。模拟实验表明,传统的非空间和空间交叉验证策略在现实采样设计下可能表现出显著偏差,而当验证任务充分覆盖部署任务空间时,加权交叉验证方法能大幅减少这种偏差。一项关于德国二氧化氮(NO₂)浓度制图的案例研究表明,标准交叉验证可能因采样偏差而高估预测误差,而加权交叉验证得到的估计与部署条件更为一致。该框架将验证任务生成与风险估计分开,并为样本分布与预测区域不同的空间预测场景提供了一种改进性能评估的实用方法。

英文摘要

Reliable estimation of predictive performance is essential for spatial environmental modeling, where machine-learning models are used to generate maps from unevenly distributed observations. Standard cross-validation (CV) assumes that validation data are representative of prediction conditions across the target domain. In practice, this assumption is often violated due to preferential or clustered sampling, leading to biased performance and uncertainty estimates. We introduce a deployment-oriented validation framework based on weighted CV that aligns validation tasks with the distribution of prediction tasks across a specified domain. The framework includes importance-weighted cross-validation (IWCV) and a calibration-based approach, Target-Weighted Cross-Validation (TWCV), which uses spatially meaningful task descriptors such as environmental covariates and prediction distance. Simulation experiments show that conventional non-spatial and spatial CV strategies can exhibit substantial bias under realistic sampling designs, whereas weighted CV approaches substantially reduce this bias when validation tasks adequately cover the deployment-task space. A case study on mapping nitrogen dioxide (NO$_2$) concentrations across Germany demonstrates that standard CV can overestimate prediction error due to sampling bias, while weighted CV yields estimates more consistent with deployment conditions. The framework separates validation task generation from risk estimation and provides a practical approach for improving performance assessment in spatial prediction settings where sample distributions differ from prediction domains.
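The idea of aligning validation with the deployment distribution can be illustrated with generic importance weighting over discrete strata. This is a sketch under assumptions, not the paper's TWCV calibration procedure: the stratum labels, the known `target_share` mapping, and the function names are all hypothetical. Each validation loss is weighted by the ratio of its stratum's share in the prediction domain to its share in the sample.

```python
from collections import Counter


def stratum_weights(sample_strata, target_share):
    """Importance weight per validation point: target share / sample share of its stratum."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return [target_share[s] / (counts[s] / n) for s in sample_strata]


def weighted_cv_error(losses, weights):
    """Weighted mean of per-point validation losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    # Toy data: monitoring sites are clustered in "urban" areas, but the map is mostly "rural".
    strata = ["urban", "urban", "urban", "rural"]
    losses = [4.0, 4.0, 4.0, 1.0]
    target = {"urban": 0.25, "rural": 0.75}
    plain = sum(losses) / len(losses)
    weighted = weighted_cv_error(losses, stratum_weights(strata, target))
    print(plain, weighted)
```

In this toy setup the unweighted mean overstates the error expected across the prediction domain because the over-sampled stratum happens to have larger losses; the direction of the bias depends entirely on the data.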

2603.25958 2026-05-22 cs.LG 版本更新

Cluster-Adaptive Feature Extraction and its Theoretical Foundation with Minkowski Weighted k-Means

基于Minkowski加权k均值的聚类自适应特征提取及其理论基础

Renato Cordeiro de Amorim, Vladimir Makarenkov

发表机构 * School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe, UK(埃塞克斯大学计算机科学与电子工程学院,英国威文豪斯) Département d’informatique, Université du Québec à Montréal, C.P. 8888 succ. Centre-Ville, Montreal (QC) H3C 3P8 Canada(魁北克大学蒙特利尔分校计算机科学系,加拿大蒙特利尔(QC)H3C 3P8) Mila - Quebec AI Institute, Montreal, QC, Canada(魁北克人工智能研究所,加拿大蒙特利尔(QC))

AI总结 本文提出了一种基于Minkowski加权k均值的聚类自适应特征提取方法,通过理论分析揭示了特征权重的结构,并证明了该方法在抑制高分散特征和增强信息性特征方面的有效性。

详情
AI中文摘要

Minkowski加权k均值(mwk-均值)算法通过引入特征权重和Minkowski距离扩展了经典k均值。我们首先证明,mwk-均值的目标函数可以表示为聚类内分散度的幂均值聚合,其中幂次由Minkowski指数p决定。这一表示揭示了p如何控制特征在选择性和均匀性之间的过渡。利用这种表示,我们推导了目标函数的界限,并刻画了特征权重的结构,证明其仅依赖于相对分散度,并遵循与分散比的幂律关系。这导致了对高分散特征抑制的显式保证,并建立了算法的收敛性。基于这些理论结果,我们引入了聚类自适应特征提取(CAFE),一种利用mwk-均值特征权重对数据进行预处理以进行无监督特征提取的方法。我们证明这种预处理反转了聚类内分散度的排序,抑制噪声特征并放大信息性特征。在受控的聚类内噪声环境下进行的大量实验表明,CAFE在传统特征提取方法的结果上始终表现出改进。

英文摘要

The Minkowski weighted $k$-means ($mwk$-means) algorithm extends classical $k$-means by incorporating feature weights and a Minkowski distance. We first show that the $mwk$-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent $p$. This formulation reveals how $p$ controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features, and we establish convergence of the algorithm. Building on these theoretical results, we introduce Cluster-Adaptive Feature Extraction (CAFE), a method that uses the $mwk$-means feature weights to rescale the data prior to unsupervised feature extraction. We prove that this rescaling reverses the within-cluster dispersion ordering, suppressing noisy features and amplifying informative ones. Numerous experiments conducted under controlled within-cluster noise show that CAFE consistently improves the results of traditional feature extraction methods.
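The power-law dependence of feature weights on dispersion ratios can be made concrete with a small sketch. It assumes the weight update commonly given for Minkowski weighted k-means, w_v = 1 / Σ_u (D_v / D_u)^{1/(p-1)} for p > 1, where D_v is the within-cluster dispersion of feature v; whether this paper uses exactly this form is an assumption, and the function name is hypothetical.

```python
def mwk_feature_weights(dispersions, p):
    """Feature weights for one cluster from within-cluster dispersions (assumed update rule).

    dispersions: positive dispersion D_v for each feature v.
    p: Minkowski exponent, required to be > 1 here.
    Returns weights that sum to 1 and depend only on ratios D_v / D_u.
    """
    if p <= 1:
        raise ValueError("this closed form needs p > 1")
    if any(d <= 0 for d in dispersions):
        raise ValueError("dispersions must be positive")
    exponent = 1.0 / (p - 1.0)
    return [1.0 / sum((dv / du) ** exponent for du in dispersions)
            for dv in dispersions]


if __name__ == "__main__":
    # A low-dispersion feature and a feature with four times its dispersion.
    print(mwk_feature_weights([1.0, 4.0], p=2.0))
    print(mwk_feature_weights([1.0, 4.0], p=3.0))
```

Under this rule the weight ratio w_u / w_v equals (D_v / D_u)^{1/(p-1)}, so high-dispersion features are suppressed, and larger p moves the weights toward uniform.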

2603.20405 2026-05-22 cs.LG cs.CL cs.LO 版本更新

Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

使用 Opus 4.6 和 Rocq-MCP 在 Rocq 中求解 2025 年 Putnam 问题

Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot

发表机构 * IRIF, Université Paris Cité, Inria, CNRS(IRIF,巴黎西岱大学,法国国家信息与自动化研究所,法国国家科学研究中心) DI ENS, PSL University, Inria(巴黎高等师范学院计算机科学系,巴黎文理研究大学,法国国家信息与自动化研究所)

AI总结 研究探讨了使用 Opus 4.6 配合 Rocq-MCP 工具自主证明 2025 年 Putnam 数学竞赛中 12 个问题中的 10 个,展示了基于模型上下文协议 (MCP) 的自动证明方法及公开可用的证明过程。

详情
AI中文摘要

我们报告了一项实验,其中配备了一套面向 Rocq 证明助手的 Model Context Protocol (MCP) 工具的 Claude Opus 4.6,自主证明了 2025 年 Putnam 数学竞赛 12 个问题中的 10 个。这些 MCP 工具是与 Claude 一起通过分析先前 miniF2F-Rocq 实验的日志设计的,编码了一种“先编译,后交互回退”的策略。该代理在无网络访问的隔离虚拟机上运行,部署了 141 个子代理,在 17.7 小时的活跃计算时间(51.6 小时挂钟时间)内消耗了约 19 亿个 token。所有证明均公开可用。

英文摘要

We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

2603.04525 2026-05-22 stat.ML cs.LG 版本更新

The Volterra signature

Volterra签名

Paul P. Hager, Fabian N. Harang, Luca Pelizzari, Samy Tindel

发表机构 * Department of Statistics and Operations Research, University of Vienna(统计与运筹学系,维也纳大学) Department of Economics, BI Norwegian Business School(经济学系,BI挪威商学院) Department of Mathematics, Purdue University(数学系,普渡大学)

AI总结 本文提出Volterra签名作为处理历史依赖系统的显式特征表示,通过将输入路径与时间核结合到张量代数中,利用Volterra-Chen恒等式推导出严谨的学习理论保证,并展示其在动态学习任务中的有效性。

详情
AI中文摘要

现代处理非马尔可夫时间序列的学习方法,如循环神经网络、神经控制微分方程或变换器,通常依赖于隐式的记忆机制,这些机制在长时间范围内难以解释或训练。我们提出Volterra签名VSig(x;K)作为处理历史依赖系统的显式特征表示。通过将输入路径x加权时间核K转化为张量代数,我们利用相关的Volterra-Chen恒等式推导出严谨的学习理论保证。具体来说,我们证明了注入性陈述(在增强下可识别),从而在无限维路径空间上推导出通用逼近定理,这在某些情况下通过VSig(x;K)的线性泛函实现。此外,我们通过展示与Volterra签名相关的内积可通过二参数积分方程闭合地表示,证明了核技巧的应用,从而利用PDE的数值方法进行计算。对于一大类指数型核,VSig(x;K)在张量代数中解线性状态空间微分方程。结合对时间重参数化的不变性,这些结果将Volterra签名定位为数据科学中稳健且计算上可行的特征映射。我们在真实和合成数据上的动态学习任务中展示了其有效性,其中它一致地改进了经典路径签名基线。

英文摘要

Modern approaches for learning from non-Markovian time series, such as recurrent neural networks, neural controlled differential equations or transformers, typically rely on implicit memory mechanisms that can be difficult to interpret or to train over long horizons. We propose the \emph{Volterra signature} $\mathrm{VSig}(x;K)$ as a principled, explicit feature representation for history-dependent systems. By developing the input path $x$ weighted by a temporal kernel $K$ into the tensor algebra, we leverage the associated Volterra--Chen identity to derive rigorous learning-theoretic guarantees. Specifically, we prove an \emph{injectivity} statement (identifiability under augmentation) that leads to a \emph{universal approximation} theorem on the infinite dimensional path space, which in certain cases is achieved by \emph{linear functionals} of $\mathrm{VSig}(x;K)$. Moreover, we demonstrate applicability of the \emph{kernel trick} by showing that the inner product associated with Volterra signatures admits a closed characterization via a two-parameter integral equation, enabling numerical methods from PDEs for computation. For a large class of exponential-type kernels, $\mathrm{VSig}(x;K)$ solves a linear state-space ODE in the tensor algebra. Combined with inherent invariance to time reparameterization, these results position the Volterra signature as a robust, computationally tractable feature map for data science. We demonstrate its efficacy in dynamic learning tasks on real and synthetic data, where it consistently improves classical path signature baselines.

2603.04383 2026-05-22 cs.CY cs.CR cs.IR cs.LG cs.SI 版本更新

Turning Trust to Transactions: Tracking Affiliate Marketing and FTC Compliance in YouTube's Influencer Economy

将信任转化为交易:追踪YouTube网红经济中的联盟营销与FTC合规性

Chen Sun, Yash Vekaria, Zubair Shafiq, Rishab Nithyanand

发表机构 * University of Iowa(爱荷华大学) UC Davis(加州大学戴维斯分校)

AI总结 本研究借助Web测量和NLP技术开发工具,分析YouTube上联盟营销生态系统的现状,揭示联盟链接的普及程度及不合规行为的比例,并建议通过平台的标准化披露功能提高合规性。

Comments ICWSM 2026

详情
AI中文摘要

YouTube已发展成一个强大的平台,创作者通过联盟营销(affiliate marketing)将其影响力变现,这引发了关于透明度和伦理的担忧,尤其是在创作者未披露其联盟关系时。尽管美国联邦贸易委员会(FTC)等监管机构已发布指南来应对这些问题,但不合规行为和消费者损害依然存在,且这些问题的严重程度仍不清楚。在本文中,我们介绍了借鉴Web测量和NLP研究最新进展开发的工具,用于考察YouTube上联盟营销生态系统的现状。我们将这些工具应用于一个跨度10年、包含来自近54万名创作者的200万个视频的数据集,分析YouTube上联盟营销的普及程度和不合规行为的比例。我们的发现表明,联盟链接广泛存在,但披露合规率仍然很低,大多数视频未能达到FTC标准。此外,我们分析了不同利益相关者对改善披露行为的影响。我们的研究表明,平台通过标准化披露功能与合规性的提升高度相关。我们建议监管机构和联盟合作伙伴与平台合作,以提高网红经济中的透明度、问责制和信任度。

英文摘要

YouTube has evolved into a powerful platform where creators monetize their influence through affiliate marketing, raising concerns about transparency and ethics, especially when creators fail to disclose their affiliate relationships. Although regulatory agencies like the US Federal Trade Commission (FTC) have issued guidelines to address these issues, non-compliance and consumer harm persist, and the extent of these problems remains unclear. In this paper, we introduce tools, developed with insights from recent advances in Web measurement and NLP research, to examine the state of the affiliate marketing ecosystem on YouTube. We apply these tools to a 10-year dataset of 2 million videos from nearly 540,000 creators, analyzing the prevalence of affiliate marketing on YouTube and the rates of non-compliant behavior. Our findings reveal that affiliate links are widespread, yet disclosure compliance remains low, with most videos failing to meet FTC standards. Furthermore, we analyze the effects of different stakeholders in improving disclosure behavior. Our study suggests that the platform is highly associated with improved compliance through standardized disclosure features. We recommend that regulators and affiliate partners collaborate with platforms to enhance transparency, accountability, and trust in the influencer economy.

2603.03454 2026-05-22 cs.LG 版本更新

[Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

[Re] FairDICE:多目标离线RL中的公平权衡

Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 该复现研究检验了FairDICE(一种通过自适应学习多目标权重在多目标离线强化学习中实现公平折中的算法)的可复现性,发现其多数理论声明成立,但代码错误使其在连续环境中退化为标准行为克隆,且许多重要超参数最初未明确指定,实验论证需要大幅修订。

Comments 12 pages, 8 figures in main text. Code at https://github.com/p-adema/re-fairdice. Reviewed at https://openreview.net/forum?id=Tr6MBt0hAj

详情
Journal ref
Published 05/2026 in Transactions on Machine Learning Research
AI中文摘要

离线强化学习(RL)是RL的一个新兴领域,其中策略仅从演示数据中学习。在离线RL中,某些环境需要平衡多个目标,但现有的多目标离线RL算法未能提供高效的方法来找到公平的折中方案。FairDICE(见arXiv:2506.08062v2)试图填补这一空白:它对OptiDICE(一种离线RL算法)加以改造,自动学习多个目标的权重,例如用于激励目标间的公平性。鉴于这将是一项有价值的贡献,本复现研究检验了有关FairDICE的各项声明的可复现性。我们发现许多理论声明成立,但代码中的一个错误使FairDICE在连续环境中退化为标准行为克隆,并且许多重要的超参数最初未被充分说明。在修正之后,我们通过扩展原始论文的实验表明,FairDICE可以扩展到复杂环境和高维奖励,尽管它可能依赖(在线)超参数调优。我们得出结论:FairDICE是一种在理论上值得关注的方法,但其实验论证需要大幅修订。

英文摘要

Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

2603.02938 2026-05-22 cs.LG cs.AI 版本更新

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

超越一刀切:基于大语言模型的零样本图学习中的自适应子图去噪

Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng

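Affiliations: JIUTIAN Research; The Hong Kong University of Science and Technology (Guangzhou); MIIT Key Laboratory of Data and Decision Intelligence; Beihang University

AI summary: This paper proposes GraphSSR, a framework for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. It replaces task-agnostic, one-size-fits-all subgraph extraction with a Sample-Select-Reason (SSR) pipeline, trained through SSR-SFT data synthesis and a two-stage SSR-RL procedure.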

详情
AI中文摘要

图基任务在零样本设置中仍面临显著挑战,由于数据稀缺性和传统图神经网络(GNNs)无法泛化到未见领域或标签空间。尽管最近的进展转向利用大语言模型(LLMs)作为预测器来增强GNNs,但这些方法常面临跨模态对齐问题。最近的范式(即Graph-R1)通过采用纯文本格式和基于LLM的图推理克服了上述架构依赖性,显示出改进的零样本泛化能力。然而,它使用一种任务无关的“一刀切”子图提取策略,不可避免地引入了显著的结构噪声——无关邻居和边——这会扭曲LLMs的感知范围并导致次优预测。为了解决这一限制,我们引入GraphSSR,一种新的框架,用于零样本LLM图推理中的自适应子图提取和去噪。具体而言,我们提出了SSR流水线,通过“采样-选择-推理”过程动态定制子图提取以适应特定上下文,使模型能够自主过滤掉任务无关的邻居并克服“一刀切”问题。为了内化这一能力,我们开发了SSR-SFT,一种数据合成策略,生成高质量的SSR风格图推理轨迹用于LLM的监督微调。此外,我们提出了SSR-RL,一种两阶段强化学习框架,该框架专门设计用于自适应子图去噪,明确调节所提出SSR流水线中的采样和选择操作。通过结合真实性增强和去噪增强的强化学习,我们引导模型使用简洁的、去噪的子图进行推理以实现准确预测。

英文摘要

Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise--irrelevant neighbors and edges--that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.

2602.22719 2026-05-22 cs.LG 版本更新

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

通过激活子空间瓶颈解释和操控状态空间模型

Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, Chandan Singh

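Affiliations: Microsoft Research, Redmond; Independent Researcher

AI summary: This paper identifies activation subspace bottlenecks in Mamba-family state-space models and proposes a test-time steering intervention that multiplies the bottleneck activations by a scalar. The intervention improves performance by 8.27% on average across 7 SSMs and 6 benchmarks, and modifying the bottlenecks yields the Stable-Mamba architecture, which supports the claim that they hinder performance.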

详情
AI中文摘要

状态空间模型(SSMs)已成为构建强大语言模型的一种高效策略,避免了Transformer中注意力计算的二次复杂度。尽管前景可观,现代SSMs的可解释性和可操控性仍相对缺乏研究。我们朝这一方向迈出了重要一步:利用机理可解释性工具,在Mamba家族的SSM中识别出激活子空间瓶颈。随后,我们引入一种测试时操控干预,只需将所识别瓶颈的激活乘以一个标量。在7个SSM和6个多样化的基准测试上,这种干预平均将性能提升了8.27%,且无需任何任务特定的调优。最后,我们通过修改这些瓶颈得到一种称为Stable-Mamba的架构,它在从头重新训练后取得了长上下文性能的提升,从而验证了所识别的瓶颈确实阻碍了性能。

英文摘要

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
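
The steering intervention described above can be pictured with a minimal sketch. It assumes, for illustration only, that a bottleneck is a known set of coordinates in a hidden-state vector; the paper identifies activation subspaces, and the function name, indices, and scale below are hypothetical rather than taken from the authors' code.

```python
def steer_activations(activations, bottleneck_dims, scale):
    """Return a copy of `activations` with the chosen dimensions multiplied by `scale`.

    `bottleneck_dims` is a hypothetical list of coordinate indices standing in
    for an identified bottleneck; the input list is left unmodified.
    """
    steered = list(activations)
    for d in bottleneck_dims:
        steered[d] = steered[d] * scale
    return steered


hidden = [0.5, -1.0, 2.0, 0.25]
steered = steer_activations(hidden, bottleneck_dims=[1, 2], scale=1.5)
print(steered)
```

Because the intervention is a single multiplication at inference time, it needs no gradient computation or retraining, which is consistent with the "without requiring any task-specific tuning" framing of the abstract.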

2602.22270 2026-05-22 cs.LG q-bio.PE 版本更新

Prior Knowledge-enhanced Spatio-temporal Epidemic Forecasting

先验知识增强的时空疫情预测

Sijie Ruan, Jinyu Li, Jia Wei, Zenghao Xu, Jie Bao, Junshi Xu, Junyang Qiu, Shuliang Wang, Xiaoxiao Wang, Hanning Yuan

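Affiliations: Beijing Institute of Technology; Zhejiang Provincial Center for Disease Control and Prevention; JD Technology; The University of Hong Kong; China Mobile Internet

AI summary: This paper proposes STOEP, a hybrid framework that combines implicit spatio-temporal priors with explicit expert priors. Its three components dynamically adjust mobility-based regional dependencies (CAL), amplify weak epidemic signals (SPE), and regularize epidemic parameters during mechanistic forecasting (FMF).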

Comments 12 pages, 10 figures, accepted to IJCAI 2026

详情
AI中文摘要

时空疫情预测对于公共卫生管理至关重要,但现有方法常面临对弱疫情信号不敏感、空间关系过于简化和参数估计不稳定的问题。为解决这些问题,我们提出了Spatio-Temporal priOr-aware Epidemic Predictor(STOEP),一种新的混合框架,整合了隐式时空先验和显式专家先验。STOEP由三个关键组件组成:(1)病例感知邻接学习(CAL),利用历史感染模式动态调整基于移动性的区域依赖关系;(2)空间指导参数估计(SPE),采用可学习的空间先验来放大弱疫情信号;(3)基于滤波的机制性预测(FMF),使用专家指导的自适应阈值策略来正则化疫情参数。在真实世界的新冠和流感数据集上进行的广泛实验表明,STOEP在RMSE指标上比最佳基线改善了11.1%。该系统已在中国一家省级疾控中心部署,以支持下游应用。

英文摘要

Spatio-temporal epidemic forecasting is critical for public health management, yet existing methods often struggle with insensitivity to weak epidemic signals, over-simplified spatial relations, and unstable parameter estimation. To address these challenges, we propose the Spatio-Temporal priOr-aware Epidemic Predictor (STOEP), a novel hybrid framework that integrates implicit spatio-temporal priors and explicit expert priors. STOEP consists of three key components: (1) Case-aware Adjacency Learning (CAL), which dynamically adjusts mobility-based regional dependencies using historical infection patterns; (2) Space-informed Parameter Estimating (SPE), which employs learnable spatial priors to amplify weak epidemic signals; and (3) Filter-based Mechanistic Forecasting (FMF), which uses an expert-guided adaptive thresholding strategy to regularize epidemic parameters. Extensive experiments on real-world COVID-19 and influenza datasets demonstrate that STOEP outperforms the best baseline by 11.1% in RMSE. The system has been deployed at a provincial CDC in China to facilitate downstream applications.

2602.18141 2026-05-22 cs.LG 版本更新

Geometry-Induced Diffusion on Graphs: A Learnable Weighted Laplacian for Spectral GNNs

图上的几何诱导扩散:用于谱GNN的可学习加权拉普拉斯算子

Mia Zosso, Ali Hariri, Victor Kawasaki-Borruat, Pierre-Gabriel Berlureau, Pierre Vandergheynst

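Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); École Normale Supérieure – PSL

AI summary: This paper proposes mu-ChebNet, a simple spectral GNN that learns a node-wise weight function mu to modify the graph Laplacian. This changes the propagation geometry without altering the graph topology, promoting preferred routes so that long-range signals avoid highly contractive bottlenecks without repeated layer stacking.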

详情
AI中文摘要

长距离图任务对图神经网络(GNNs)来说具有挑战性:注意力或重连(rewiring)方案等全局机制可能计算成本高,而深层局部传播容易出现梯度消失、过平滑和过压缩。本文提出的mu-ChebNet架构是一种简单的谱GNN,它在应用ChebNet式滤波器之前学习一个节点级权重函数mu。所学的权重mu诱导出一个修改后的图拉普拉斯算子,从而在不改变图拓扑的情况下有效改变传播几何。这种任务相关的几何促成了信息传播的优选路径,帮助长距离信号避开高度收缩的瓶颈,并免去了重复堆叠层的需要。在实践中,我们用学习到的算子L_mu代替固定的图拉普拉斯算子L,使所提出的mu-ChebNet架构保持轻量,同时使传播具有任务自适应性。此外,我们提供了谱分析,说明mu如何调节传播动力学,并在合成长距离推理任务和真实世界图基准上实证观察到性能提升。所学的权重函数不仅可解释,还为自适应图传播提供了一种替代注意力和重连的轻量级方案。

英文摘要

Long-range graph tasks are challenging for Graph Neural Networks (GNNs): global mechanisms such as attention or rewiring schemes can be computationally expensive, while deep local propagation is prone to vanishing gradients, oversmoothing, and oversquashing. The introduced mu-ChebNet architecture is a simple spectral GNN that learns a node-wise weight function mu before applying ChebNet-style filters. The learned weighting mu induces a modified graph Laplacian which effectively changes the propagation geometry without altering the graph topology. This task-dependent geometry promotes preferred routes for information propagation, thereby helping long-range signals avoid highly contractive bottlenecks, and obviating the need for repeated layer stacking. In practice, we replace the fixed graph Laplacian L by a learned operator L_mu, keeping the proposed mu-ChebNet architecture lightweight while making propagation task-adaptive. Furthermore, we provide a spectral analysis demonstrating how mu modulates propagation dynamics, and empirically observe improved performance on both synthetic long-range reasoning tasks and real-world graph benchmarks. The learned weight function is not only interpretable, but also offers a lightweight alternative to attention and rewiring for adaptive graph propagation.
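
As a rough illustration of replacing a fixed Laplacian L with a weighted operator L_mu before Chebyshev filtering, the sketch below uses NumPy. The abstract does not say how mu enters the Laplacian, so the edge reweighting `mu[i] * mu[j]` is purely an assumption made for this example, as are the function names; in the actual method mu is learned, whereas here it is a fixed vector.

```python
import numpy as np


def weighted_normalized_laplacian(adj, mu):
    """Symmetric normalized Laplacian of a graph whose edges are reweighted by mu.

    Assumption for illustration: edge (i, j) is scaled by mu[i] * mu[j].
    Requires positive mu and no isolated nodes.
    """
    w = adj * np.outer(mu, mu)
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(len(mu)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]


def chebyshev_filter(lap, x, coeffs):
    """Apply sum_k coeffs[k] * T_k(L - I) to the signal x via the Chebyshev recurrence."""
    # The normalized Laplacian has spectrum in [0, 2]; shifting by I maps it to [-1, 1].
    lap_tilde = lap - np.eye(lap.shape[0])
    t_prev, t_curr = x, lap_tilde @ x
    out = coeffs[0] * t_prev
    if len(coeffs) > 1:
        out = out + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2.0 * (lap_tilde @ t_curr) - t_prev
        out = out + c * t_curr
    return out


# Path graph on four nodes; mu is a fixed stand-in for a learned weight function.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
mu = np.array([1.0, 2.0, 0.5, 1.0])
lap_mu = weighted_normalized_laplacian(adj, mu)
signal = np.array([1.0, 0.0, 0.0, 0.0])
filtered = chebyshev_filter(lap_mu, signal, [0.5, 0.3, 0.2])
print(filtered)
```

The graph topology (the zero pattern of `adj`) is untouched; only the operator that drives propagation changes, which is the property the abstract emphasizes.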

2602.13372 2026-05-22 cs.AI cs.LG 版本更新

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

MoralityGym:用于评估序列决策代理中分层道德对齐的基准

Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James

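Affiliations: University of the Witwatersrand

AI summary: This paper introduces Morality Chains, a formalism that represents moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. Baseline results with Safe RL methods reveal key limitations in hierarchical moral alignment.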

Comments Accepted at AAMAS 2026

详情
Journal ref
Proc of the 25th International Conference on Autonomous Agents and Multiagent Systems AAMAS 2026, Paphos, Cyprus, May 25 to 29, 2026, IFAAMAS
AI中文摘要

评估在面对冲突且分层结构的人类规范时,代理的道德对齐是一个在人工智能安全、道德哲学和认知科学交汇处的关键挑战。我们引入了Morality Chains,一种新的形式化方法,用于将道德规范表示为有序的规范约束,并引入了MoralityGym,一个包含98个伦理困境问题的基准,这些问题是作为电车困境风格的Gymnasium环境呈现的。通过将任务解决与道德评估解耦,并引入新的道德度量标准,MoralityGym允许将心理学和哲学的见解整合到规范敏感推理的评估中。基于安全强化学习方法的基准结果揭示了关键限制,强调了需要更系统的方法来处理伦理决策。本文为开发在复杂现实环境中行为更可靠、透明和道德的AI系统提供了基础。

英文摘要

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
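
One way to picture ordered deontic constraints is as a lexicographic comparison, where a violation of a higher-priority norm outweighs any number of violations of lower-priority ones. The sketch below is an illustration only: the norm names, the `violation_profile` helper, and the lexicographic rule are assumptions for this example, not MoralityGym's API or its Morality Metric.

```python
# Hypothetical norm chain, highest priority first (names are illustrative).
NORM_CHAIN = ["do_not_kill", "do_not_injure", "do_not_damage_property"]


def violation_profile(violations):
    """Map a {norm: count} dict to a tuple ordered by norm priority."""
    return tuple(violations.get(norm, 0) for norm in NORM_CHAIN)


def morally_preferred(outcome_a, outcome_b):
    """Return the outcome whose violation profile is lexicographically smaller.

    Python compares tuples element by element, so one violation of a
    higher-priority norm dominates any count of lower-priority violations.
    Ties resolve to `outcome_a`.
    """
    if violation_profile(outcome_a) <= violation_profile(outcome_b):
        return outcome_a
    return outcome_b


swerve = {"do_not_damage_property": 5}
stay = {"do_not_injure": 1}
print(morally_preferred(swerve, stay))
```

Keeping the violation counts separate from any task reward mirrors the decoupling of task-solving from moral evaluation that the abstract describes.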

2602.12952 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Transporting Task Vectors across Different Architectures without Training

在不同架构间传输任务向量而无需训练

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

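Affiliations: AImageLab, University of Modena and Reggio Emilia

AI summary: This paper proposes Theseus, a training-free method that transports task updates between models of different widths through functional matching on observed activations. It reports consistent improvements over baselines on vision and language models without additional training or backpropagation.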

Comments Accepted at the International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

适应大型预训练模型以完成下游任务时,通常会产生针对特定任务的参数更新,这些更新对于每个模型变体重新学习都很昂贵。尽管最近的研究表明,这些更新可以在具有相同架构的模型之间转移,但跨不同宽度的模型转移仍鲜有探索。在本文中,我们引入Theseus,一种无需训练的方法,用于在异构宽度模型间传输任务更新。与其匹配参数,我们通过其在中间表示上诱导的功能效应来表征任务更新。我们正式将任务向量传输定义为在观察到的激活上进行的功能匹配问题,并显示在通过正交Procrustes分析对齐表示空间后,它允许一个稳定的闭式解,该解保留了更新的几何结构。我们在不同宽度的视觉和语言模型上评估Theseus,显示在不进行额外训练或反向传播的情况下,相对于基线有持续的改进。我们的结果表明,当任务身份通过功能而非参数定义时,任务更新可以有意义地在不同架构间转移。代码可在https://github.com/apanariello4/merge-and-rebase获取。

英文摘要

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.
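
The alignment step mentioned above, orthogonal Procrustes analysis, has a well-known closed form via the SVD. The sketch below shows only that standard step for two representation spaces of equal width, using NumPy; the variable names and synthetic activations are assumptions, and it does not reproduce how Theseus handles different widths or transports the task vector itself.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix Q minimising ||source @ Q - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
source = rng.normal(size=(50, 4))                   # hypothetical activations of model A
q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # a random orthogonal map
target = source @ q_true                            # the same activations in model B's basis
q_hat = orthogonal_procrustes(source, target)
print(np.allclose(q_hat, q_true))
```

Because Q is constrained to be orthogonal, the alignment preserves norms and angles, which is one reason such a map can be said to preserve the geometry of an update.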

2602.12506 2026-05-22 cs.LG 版本更新

On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

关于RL微调VLMs的鲁棒性和链式思维一致性

Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal

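Affiliations: Apple; OpenAI

AI summary: This paper studies the robustness and chain-of-thought (CoT) consistency of RL-finetuned VLMs on visual reasoning tasks. Simple textual perturbations and CoT inconsistency cause substantial drops in robustness and confidence, while closed models remain markedly more robust and consistent, which suggests the gap stems from current open-source RL finetuning rather than from the task itself.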

Comments ICML 2026

详情
AI中文摘要

强化学习(RL)微调已成为增强大型语言模型(LLMs)在推理密集型任务中的关键技术,推动其扩展到视觉语言模型(VLMs)。尽管RL微调的VLMs在视觉推理基准测试中表现优异,但它们仍容易受到弱视觉基础、幻觉和过度依赖文本提示的影响。我们发现,简单的受控文本扰动,包括误导的标题或错误的链式思维(CoT)轨迹,会导致鲁棒性和信心的显著下降,且当考虑跨开源多模态推理模型的CoT一致性时,这些影响更为明显。相比之下,闭源模型表现出相似的失败模式,但保持了显著更高的鲁棒性和推理一致性,这表明差距反映的是当前开源RL微调的不足,而非任务本身的限制。为了更好地理解这些漏洞,我们进一步分析了RL微调动态,并揭示了准确率与忠实度之间的权衡:微调提高了基准测试准确率,但同时可能削弱伴随的CoT的可靠性及其对上下文变化的鲁棒性。尽管对抗性增强提高了鲁棒性,但本身并不能防止忠实度漂移。结合忠实度意识的奖励可以恢复答案与推理之间的对齐,但当与增强结合时,训练风险会坍缩到捷径策略,鲁棒性仍然难以获得。这些发现突显了仅基于准确率的评估的局限性,并促使训练和评估协议共同强调正确性、鲁棒性和视觉基础推理的忠实度。

英文摘要

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, including misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. In contrast, closed models exhibit similar failure modes but maintain markedly greater robustness and reasoning consistency, suggesting that the gap reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task. To better understand these vulnerabilities, we further analyze RL finetuning dynamics and uncover an accuracy-faithfulness trade-off: finetuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

2602.10894 2026-05-22 cs.LG cs.AI 版本更新

Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

重新审视正则化策略优化以实现稳定且高效的双人博弈强化学习

Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada

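Affiliations: The University of Tokyo, Japan; RIKEN Center for Advanced Intelligence Project, Japan

AI summary: This paper revisits policy optimization with reverse Kullback-Leibler and entropy regularization in two-player zero-sum games. It provides new convergence guarantees, verified numerically on synthetic games, and derives a practical model-free RL algorithm whose training efficiency is validated on five board games.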

Comments Accepted at ICML 2026

详情
AI中文摘要

棋盘游戏等双人博弈长期以来一直是强化学习的传统基准。本工作重新审视了一种带有反向Kullback-Leibler正则化和熵正则化的策略优化方法,并从理论和实证两个角度分析这一组合在双人零和设置中的表现。在理论方面,我们研究了策略更新规则在两种理论设置下的稳定性:博弈论中的标准式(normal-form)博弈和有限长度博弈。我们给出了新的收敛保证,并通过合成博弈上的数值实验验证了理论结果。在实证方面,我们基于正则化策略优化推导出一种实用的无模型(model-free)强化学习算法。我们通过在五种棋盘游戏(Animal Shogi、Gardner Chess、Go、Hex和Othello)上的全面实验验证了算法的训练效率。实验结果表明,我们的智能体在各个环境中的学习效率均高于现有方法。

英文摘要

Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularization and analyzes this combination in two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule in two theoretical settings: game-theoretic normal-form games and finite-length games. We provide novel convergence guarantees and verify our theoretical results through numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. Experimental results show that our agent learns more efficiently than existing methods across environments.
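
For intuition about combining reverse KL and entropy regularization, consider a single state with action values q. Maximising the expected value of q under pi, minus `kl_coef` times KL(pi || pi_prev), plus `entropy_coef` times the entropy of pi, has a closed-form solution over the simplex: pi is proportional to pi_prev^(kl_coef / (kl_coef + entropy_coef)) times exp(q / (kl_coef + entropy_coef)). The sketch below implements that textbook update; the objective, coefficient names, and example values are assumptions for illustration and may differ from the paper's exact update rule.

```python
import math


def regularized_update(prev_policy, q_values, kl_coef, entropy_coef):
    """Closed-form maximiser of <pi, q> - kl_coef * KL(pi || prev) + entropy_coef * H(pi).

    Assumes every entry of `prev_policy` is strictly positive and kl_coef > 0.
    """
    total = kl_coef + entropy_coef
    logits = [(kl_coef * math.log(p) + q) / total
              for p, q in zip(prev_policy, q_values)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


prev = [0.7, 0.2, 0.1]
new = regularized_update(prev, q_values=[0.0, 1.0, 0.0], kl_coef=1.0, entropy_coef=0.5)
print(new)
```

The KL term keeps the new policy close to the previous one, while the entropy term flattens it toward uniform; with `entropy_coef=0` and equal action values the update returns the previous policy unchanged.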

2602.09851 2026-05-22 cs.LG 版本更新

CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization

CoFEH: 由协作贝叶斯超参数优化赋能的LLM驱动特征工程

Beicheng Xu, Keyao Ding, Wei Liu, Yupeng Lu, Bin Cui

Affiliations: School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China; School of CS & Beijing Key Laboratory of Software Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China

AI summary: This paper proposes CoFEH, a framework that interleaves LLM-driven feature engineering with Bayesian hyperparameter optimization for robust end-to-end AutoML. It addresses the rigid search spaces and missing domain awareness of traditional methods and adds a mutual conditioning mechanism that shares context between the LLM and BO.

Comments Accepted at KDD 2026. Extended version with full appendices

详情
AI中文摘要

特征工程(FE)在自动化机器学习(AutoML)中至关重要,但对传统方法而言仍是瓶颈:它们在僵化的搜索空间内运行,且缺乏领域意识。尽管大型语言模型(LLMs)提供了一种有前景的替代方案,可借助语义推理生成不受限的算子,但现有方法只关注特征生成等孤立子任务,无法实现自由形式的FE流水线。此外,它们很少与下游ML模型的超参数优化(HPO)相结合,导致贪心的“先FE后HPO”工作流无法捕捉FE与HPO之间的强交互。本文提出CoFEH,一种协作框架,通过交替执行基于LLM的FE和贝叶斯HPO来实现鲁棒的端到端AutoML。CoFEH使用由Tree of Thought(TOT)驱动的LLM FE优化器探索灵活的FE流水线,使用贝叶斯优化(BO)模块求解HPO,并使用动态优化器选择器自适应地交替安排FE和HPO步骤。关键的是,我们引入了互条件机制,在LLM和BO之间共享上下文,从而实现相互知情的决策。实验表明,CoFEH在独立FE和联合FE+HPO设置中均优于传统基线和基于LLM的基线。

英文摘要

Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which operate within rigid search spaces and lack domain awareness. While Large Language Models (LLMs) offer a promising alternative to generate unbounded operators with semantic reasoning, existing methods focus on isolated subtasks such as feature generation, falling short of free-form FE pipelines. Moreover, they are rarely coupled with hyperparameter optimization (HPO) of the downstream ML model, leading to greedy "FE-then-HPO" workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (TOT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that adaptively interleaves FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH outperforms both traditional and LLM-based baselines in both standalone FE and joint FE+HPO settings.

2602.08064 2026-05-22 cs.LG cs.AI cs.CL 版本更新

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

SiameseNorm: 突破预规范与后规范之间的障碍

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

Affiliations: Leap Lab, Tsinghua University; Qwen Large Model Application Team, Alibaba; Institute for Interdisciplinary Information Sciences, Tsinghua University

AI summary: This paper proposes SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, improving performance while maintaining training stability across architectures and modalities.

Comments Accepted to ICML 2026; camera-ready version; revised presentation and added additional experimental results

详情
AI中文摘要

预规范与后规范之间的长期矛盾仍然是Transformer架构中的一个开放问题,反映了训练稳定性与表示能力之间的根本权衡。先前尝试结合两者优势的研究取得了一定进展,但往往在不同训练设置下表现有限,限制了其更广泛的应用。我们重新审视这一困境,表明单流架构难以协调预规范的稳定身份梯度传播与后规范的主要残差路径归一化。为了解决这种结构张力,我们提出SiameseNorm,一种简单而有效的双流架构,能够与预规范训练配方保持兼容。SiameseNorm通过共享残差块将预规范和后规范流连接起来,允许每个残差块从两个路径接收优化信号,且开销极低。在400M和1.3B密集语言模型、15B MoE模型、视觉Transformer以及扩散Transformer上的大量实验表明,SiameseNorm在各种架构和模态中都能保持强大的训练稳定性的同时提升性能。代码可在https://github.com/Qwen-Applications/SiameseNorm上获得。

英文摘要

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
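
The two conventions being reconciled differ only in where normalization sits relative to the residual connection. The sketch below contrasts the standard Pre-Norm and Post-Norm residual blocks on plain Python lists; it does not implement SiameseNorm's two-stream coupling, and the toy `half` sublayer and parameter-free layer norm are assumptions chosen for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, sublayer):
    """Pre-Norm: x + f(LN(x)). The identity path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x, sublayer):
    """Post-Norm: LN(x + f(x)). The main residual path is normalised."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def half(h):
    """Toy stand-in for an attention or MLP sublayer."""
    return [0.5 * v for v in h]


x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, half)
post = post_norm_block(x, half)
print(pre)
print(post)
```

In the Pre-Norm block the input passes through unchanged on the skip path, which is the stable identity-gradient route mentioned in the abstract; in the Post-Norm block the output is always renormalised, so the residual stream cannot grow in scale with depth.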

2602.07340 2026-05-22 cs.LG 版本更新

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

通过选择性几何控制重新审视LLM安全对齐的鲁棒性

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

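Affiliations: National University of Singapore; Hefei University of Technology; ST Engineering Ltd., Singapore

AI summary: This paper revisits the robustness of LLM safety alignment from an optimization-geometry perspective and proposes ShaPO, a preference optimization framework that enforces worst-case alignment objectives through selective geometry control over an alignment-critical parameter subspace.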

详情
AI中文摘要

大型语言模型的安全对齐在领域偏移和噪声偏好监督下仍显得脆弱。大多数现有鲁棒对齐方法关注对齐数据中的不确定性,而忽视了基于偏好的目标中优化诱导的脆弱性。在本文中,我们从优化几何的角度重新审视LLM安全对齐的鲁棒性,并认为鲁棒性失败不能仅通过数据为中心的方法解决。我们提出了ShaPO,一种几何感知的偏好优化框架,通过在对齐关键参数子空间上进行选择性几何控制来强制最坏情况下的对齐目标。通过避免均匀的几何约束,ShaPO缓解了在分布偏移下可能损害鲁棒性的过度正则化问题。我们将在两个层面实例化ShaPO:token层面的ShaPO稳定了基于似然的替代优化,而reward层面的ShaPO在噪声监督下强制奖励一致的优化。在多样化的安全基准和噪声偏好设置中,ShaPO在流行偏好优化方法上一致地提高了安全鲁棒性。此外,ShaPO能够与数据鲁棒目标清洁地组合,产生额外的收益,并经验上支持所提出的优化-几何视角。代码可在https://github.com/liujilong0116/ShaPO上获得。

英文摘要

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. The code is available at https://github.com/liujilong0116/ShaPO.

2602.05873 2026-05-22 cs.LG 版本更新

Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks

大规模基于分数的变分后验推断用于贝叶斯深度神经网络

Minyoung Kim

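Affiliations: Samsung AI Center

AI summary: This paper proposes a scalable score-based variational inference method for large Bayesian deep neural networks. Its objective combines a score matching loss with a proximal penalty term, which avoids reparametrized sampling and lets the method scale to large networks such as Vision Transformers.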

详情
AI中文摘要

贝叶斯(深度)神经网络(BNN)在多个方面往往比传统的点估计深度学习更具吸引力,包括不确定性量化、噪声鲁棒性、抗过拟合等。变分推断(VI)是应用最广泛的近似推断方法之一。尽管基于ELBO的变分自由能方法在文献中占主导地位,本文提出了一种基于分数的BNN变分推断替代方法。基于分数的VI可以解决基于ELBO的VI中已知的模式坍缩问题。尽管社区中已经提出了几种基于分数的VI方法,但由于各种计算和技术原因,其中大多数并不适用于大规模BNN。我们提出了一种新颖的可扩展VI方法,其学习目标在迭代中结合了分数匹配损失和近端惩罚项,这使我们的方法无需重参数化采样,并允许通过随机梯度使用带噪声的无偏小批量分数。这进而使我们的方法能够扩展到包括视觉Transformer在内的大规模神经网络。在使用大规模深度网络的视觉识别和时间序列预测等多个基准上,我们实证地展示了方法的有效性。

英文摘要

Bayesian (deep) neural networks (BNN) are often more attractive than the vanilla point-estimate deep learning in various aspects including uncertainty quantification, robustness to noise, resistance to overfitting, and more. The variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is a dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Score-based VI can address the known issue of mode collapsing in ELBO-based VI. Although several score-based VI methods have been proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method where the learning objective combines the score matching loss and the proximal penalty term in iterations, which helps our method avoid the reparametrized sampling, and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks including Vision Transformers. On several benchmarks including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.

2602.05536 2026-05-22 cs.LG cs.AI cs.CL cs.CV 版本更新

When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

当共享知识有害:模型融合中的谱过积累

Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi

Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; Institute of Brain-Computer Interface, Nanjing University, Nanjing 210023, China

AI summary: This paper studies the over-accumulation of shared knowledge in model merging and proposes Singular Value Calibration (SVC), which rescales inflated singular values to restore a balanced spectrum and improves both strong merging baselines and Task Arithmetic.

Comments Accepted by ICML 2026

详情
AI中文摘要

模型融合通过将多个微调模型的权重更新相加,提供了一种轻量级的替代方法,而非重新训练。现有方法主要针对解决任务更新之间的冲突,未处理共享知识过积累的失败模式。我们发现当任务共享对齐的谱方向(即重叠的奇异向量)时,简单的线性组合会反复积累这些方向,导致奇异值膨胀并使融合模型偏向共享子空间。为缓解此问题,我们提出Singular Value Calibration (SVC),一种无需训练和数据的后处理方法,量化子空间重叠并重新缩放膨胀的奇异值以恢复平衡的谱。在视觉和语言基准上,SVC一致改进了强大的融合基线并实现了最先进的性能。此外,仅通过修改奇异值,SVC将任务算术的性能提高了13.0%。代码可在https://github.com/lyymuwu/SVC获取。

英文摘要

Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at https://github.com/lyymuwu/SVC.
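
The over-accumulation effect can be reproduced with a small NumPy demonstration. The sketch below builds two rank-one "task updates" and compares the top singular value of their sum when they share a spectral direction versus when their directions are orthogonal. The matrices and magnitudes are made up for illustration, and the code shows only the failure mode, not the SVC calibration procedure.

```python
import numpy as np


def top_singular_value(matrix):
    """Largest singular value of a matrix."""
    return np.linalg.svd(matrix, compute_uv=False)[0]


# Unit vectors defining two possible spectral directions.
u1, u2 = np.eye(4)[0], np.eye(4)[1]
v1, v2 = np.eye(3)[0], np.eye(3)[1]

# Two task updates that share the same singular direction (u1, v1).
aligned_sum = 2.0 * np.outer(u1, v1) + 3.0 * np.outer(u1, v1)

# Two task updates with orthogonal singular directions.
disjoint_sum = 2.0 * np.outer(u1, v1) + 3.0 * np.outer(u2, v2)

print(top_singular_value(aligned_sum))
print(top_singular_value(disjoint_sum))
```

When the directions coincide, the magnitudes add, so the merged update is larger along the shared direction than either task update alone; when they are orthogonal, the spectrum of the sum simply collects the individual singular values.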

2602.05304 2026-05-22 cs.LG cs.SY eess.SY math.OC 版本更新

A Short and Unified Convergence Analysis of the SAG, SAGA, and IAG Algorithms

SAG、SAGA和IAG算法的简短统一收敛性分析

Feng Zhu, Robert W. Heath, Aritra Mitra

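Affiliations: Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; Department of Electrical and Computer Engineering, University of California, San Diego, USA

AI summary: This paper gives a single unified convergence analysis for SAG, SAGA, and IAG. It bounds the delays caused by stochastic sub-sampling with simple concentration tools and designs a new Lyapunov function that accounts for them, yielding high-probability bounds for SAG and SAGA that extend to non-convex objectives and Markov sampling.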

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

详情
AI中文摘要

诸如随机平均梯度(SAG)和SAGA的随机方差减少算法,以及其确定性对应物如增量聚合梯度(IAG)方法,在大规模机器学习中已被广泛研究。尽管这些算法很受欢迎,但现有的分析却各不相同,依赖于针对每种方法量身定制的证明技术。此外,SAG的原始证明已知相当复杂,需要计算机辅助分析。聚焦于有限和优化问题,我们的主要贡献是开发了一种适用于所有三种算法的统一收敛性分析:SAG、SAGA和IAG。我们的分析有两个关键步骤:(i)使用简单的集中工具建立由于随机子采样导致的延迟界;(ii)精心设计一个新的Lyapunov函数,以考虑此类延迟。所得到的证明简短且模块化,为SAG和SAGA提供了首个高概率界,可以无缝扩展到非凸目标和马尔可夫采样。作为我们新分析技术的直接产物,我们获得了IAG算法的最佳已知速率,显著改进了之前的界。

英文摘要

Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
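
For readers unfamiliar with this algorithm family, the sketch below runs SAGA on a tiny scalar least-squares problem. All three methods keep a table holding the last gradient seen for each component function, and those stale entries are the "delays" the analysis has to control. The problem data, step size, and function name are assumptions for illustration and are not taken from the paper.

```python
import random


def saga_least_squares(a, b, step_size, num_steps, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w with SAGA."""
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    # Gradient table: the last gradient evaluated for each component function.
    table = [a[i] * (a[i] * w - b[i]) for i in range(n)]
    table_mean = sum(table) / n
    for _ in range(num_steps):
        i = rng.randrange(n)
        grad_i = a[i] * (a[i] * w - b[i])
        # SAGA direction: fresh gradient, minus the stale entry, plus the table mean.
        w -= step_size * (grad_i - table[i] + table_mean)
        # Refresh the running mean and the table entry for component i.
        table_mean += (grad_i - table[i]) / n
        table[i] = grad_i
    return w


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.5]
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)  # exact minimiser
smoothness = max(x * x for x in a)  # largest component-wise Lipschitz constant
w_hat = saga_least_squares(a, b, step_size=1.0 / (3.0 * smoothness), num_steps=5000)
print(abs(w_hat - w_star))
```

SAG differs in that it steps along the refreshed table mean alone, and IAG does the same while visiting components in a fixed cyclic order instead of sampling them at random.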

2602.04768 2026-05-22 cs.LG cs.AI 版本更新

Billion-Scale Graph Foundation Models

十亿级图基础模型

Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory, Daniel Haimovich, Ido Guy, Udi Weinsberg

发表机构 * Meta

AI总结 本文提出GraphBFF,一种用于构建大规模异构图的十亿参数图基础模型的端到端方法,通过引入GraphBFF Transformer架构,揭示了异构图的神经缩放定律,并在多个下游任务中展示了其优越的性能。

详情
AI中文摘要

图结构数据支撑了许多关键应用。尽管基础模型通过大规模预训练和轻量级适配改变了语言和视觉领域,但将这一范式扩展到一般的现实世界图上仍具有挑战性。在本文中,我们提出了Graph Billion-Foundation-Fusion(GraphBFF):一种面向大规模异构图构建十亿参数图基础模型(GFMs)的端到端方案。该方案的核心是GraphBFF Transformer,一种灵活且可扩展的架构,专为实用的十亿级GFMs设计。利用GraphBFF,我们提出了异构图的神经缩放定律,并表明损失会随着模型容量或训练数据规模的增加而可预测地下降,具体取决于哪个因素是瓶颈。GraphBFF框架为数据分批、预训练和微调提供了具体的方法,用于大规模构建GFMs。我们在一个现实世界的十亿级图上展示了该框架的有效性,并按照所提出的方案评估了一个十亿参数的GraphBFF Transformer。在十个不同的现实世界下游任务上(任务所用的图在训练中未见过,涵盖节点级和链接级的分类与回归),GraphBFF始终优于基线,最大差距达到31个PRAUC点,在少样本设置中亦然。最后,我们讨论了使GFMs成为工业规模图学习的实用且有原则的基础所面临的关键挑战和开放机会。

英文摘要

Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion-Foundation-Fusion (GraphBFF): an end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for large-scale heterogeneous graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present neural scaling laws for heterogeneous graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework over a real-world billion-scale graph, with an evaluation of a billion-parameter GraphBFF Transformer following the proposed recipe. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF consistently outperforms baselines, with large margins of up to 31 PRAUC points, including in few-shot settings. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.

2602.04703 2026-05-22 eess.SP cs.LG 版本更新

Knowledge Distillation for mmWave Beam Prediction Using Sub-6 GHz Channels

利用亚6GHz信道进行毫米波波束预测的知识蒸馏

Sina Tavakolian, Nhan Thanh Nguyen, Ahmed Alkhateeb, Markku Juntti

发表机构 * Centre for Wireless Communications, University of Oulu, P.O.Box 4500, FI-90014, Finland(奥卢大学无线通信中心,芬兰) School of Electrical, Computer, and Energy Engineering, Arizona State University, AZ, USA(亚利桑那州立大学电气、计算机与能源工程学院)

AI总结 本文提出了一种基于知识蒸馏技术的高效框架,利用亚6GHz信道预测毫米波波束,通过紧凑的学生深度学习架构在减少计算和内存需求的同时保持性能。

Comments 5 pages, 4 figures. Accepted for publication at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

详情
Journal ref
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 22642-22646, 2026
AI中文摘要

在毫米波(mmWave)高移动性环境中,波束成形通常会带来显著的训练开销。尽管先前研究指出亚6GHz信道可用于预测最优毫米波波束,但现有方法依赖于大型深度学习(DL)模型,其计算和内存需求过高。本文提出了一种基于知识蒸馏(KD)技术的计算高效框架,用于亚6GHz信道到毫米波波束的映射。我们基于个体蒸馏和关系蒸馏策略开发了两种紧凑的学生DL架构,它们仅保留少量隐藏层,却能紧密逼近大型教师DL模型的性能。大量仿真表明,所提出的学生模型在达到教师模型的波束预测准确性和频谱效率的同时,将可训练参数和计算复杂度减少了99%。

英文摘要

Beamforming in millimeter-wave (mmWave) high-mobility environments typically incurs substantial training overhead. While prior studies suggest that sub-6 GHz channels can be exploited to predict optimal mmWave beams, existing methods depend on large deep learning (DL) models with prohibitive computational and memory requirements. In this paper, we propose a computationally efficient framework for sub-6 GHz channel-mmWave beam mapping based on the knowledge distillation (KD) technique. We develop two compact student DL architectures based on individual and relational distillation strategies, which retain only a few hidden layers yet closely mimic the performance of large teacher DL models. Extensive simulations demonstrate that the proposed student models achieve the teacher's beam prediction accuracy and spectral efficiency while reducing trainable parameters and computational complexity by 99%.
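For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the generic temperature-softened distillation loss (KL divergence from teacher to student over candidate beams). It is an assumption-laden illustration rather than the paper's loss: the abstract does not specify its "individual" or "relational" objectives, and the names `log_softmax`, `kd_loss` and the default temperature are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened beam distributions.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures (generic KD convention,
    assumed here, not confirmed by the abstract).
    """
    log_p_student = log_softmax(student_logits, temperature)
    log_p_teacher = log_softmax(teacher_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl
```

Here each logit would correspond to one candidate mmWave beam; a compact student is trained so that its softened beam distribution matches the teacher's, which is the sense in which the student "mimics" the teacher.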

2602.02112 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Unifying Masked Diffusion Models with Various Generation Orders and Beyond

统一具有多种生成顺序的掩码扩散模型及其拓展

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院人工智能研究生院)

AI总结 本文提出Order-Expressive Masked Diffusion Model (OeMDM)和Learnable-Order Masked Diffusion Model (LoMDM),统一了不同生成顺序的扩散生成过程,并通过单目标学习生成顺序和扩散骨干,提升了文本生成性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

掩码扩散模型(MDMs)是语言生成中自回归模型(ARMs)的潜在替代方案,但生成质量严重依赖于生成顺序。先前工作要么硬编码顺序(例如分块从左到右),要么为预训练的MDM学习顺序策略,后者会带来额外成本,并且由于两阶段优化可能得到次优解。受此启发,我们提出了order-expressive masked diffusion model(OeMDM),它涵盖一大类具有不同生成顺序的扩散生成过程,使MDM、ARM和块扩散能在单一框架中得到解释。此外,基于OeMDM,我们引入了learnable-order masked diffusion model(LoMDM),通过单一目标从头联合学习生成顺序和扩散骨干,使扩散模型能够以依赖上下文的顺序生成文本。实证上,我们证实LoMDM在多个语言建模基准测试中优于各种离散扩散模型。

英文摘要

Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose the order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce the learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in a context-dependent order. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

2602.00688 2026-05-22 cs.LG 版本更新

Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

可证明地保护微调的LLM免受训练数据提取攻击同时保持效用

Tom Segal, Asaf Shabtai, Yuval Elovici

发表机构 * Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel(软件与信息系统工程系,内盖夫本·古里安大学,贝尔谢巴,以色列)

AI总结 本文提出了一种基于近访问自由(NAF)的算法SCP-Δ_r,通过相对概率和基础模型对低影响token进行平滑处理,从而在理论上有更优的界限,并在实践中有效抵御训练数据提取攻击,同时保持性能损失最小。

Comments 21 pages, 5 figures

详情
AI中文摘要

在敏感数据集上微调大型语言模型(LLMs)会引发隐私问题,因为训练数据提取(TDE)攻击可以暴露高度机密信息。现有针对此类攻击的防御措施要么缺乏正式的隐私保证,要么导致显著的效用下降。我们观察到,微调会引起广泛的概率偏移,但只需保留一小部分有影响力的token级偏差就足够了;其余偏移可以被大幅平滑,而对效用的影响极小。受此启发,我们提出了SCP-Δ_r,一种基于近访问自由(Near Access Freeness,NAF)的算法,该算法在相对概率上操作,并利用基础模型显式平滑低影响token。SCP-Δ_r的理论界比现有基于NAF的方法好若干个数量级,并且在实践中能有效抵御TDE攻击,同时性能损失很小。

英文摘要

Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$\Delta_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$\Delta_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

2601.21853 2026-05-22 cs.IR cs.LG 版本更新

LEMUR: Learned Multi-Vector Retrieval

LEMUR: 学习多向量检索

Elias Jääsaari, Ville Hyvönen, Teemu Roos

发表机构 * Department of Computer Science, University of Helsinki, Helsinki, Finland(赫尔辛基大学计算机科学系)

AI总结 LEMUR通过将多向量相似性搜索转化为监督学习问题,并利用现有单向量搜索索引加速检索,实现了高效的多向量相似性搜索,比现有方法快一个数量级。

Comments Accepted to ICML 2026

详情
AI中文摘要

由晚期交互模型生成的多向量表示,如ColBERT,在信息检索应用中比单向量表示具有更优越的检索质量。在多向量检索系统中,查询和文档均使用每个标记一个嵌入进行编码,相似性通过MaxSim相似性度量来衡量。然而,多向量检索的改进质量是以显著增加的搜索延迟为代价的。在本工作中,我们引入了LEMUR,一种简单而高效的多向量相似性搜索框架。LEMUR由两个连续的问题简化组成:首先,我们将多向量相似性搜索转化为一个可以使用单隐藏层神经网络解决的监督学习问题。其次,我们将在此模型下的推断简化为其潜在空间中的单向量相似性搜索,从而能够利用现有的单向量搜索索引来加速检索。LEMUR比先前的多向量相似性搜索方法快一个数量级。我们的代码可在https://github.com/ejaasaari/lemur获取。

英文摘要

Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding per token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved quality of multi-vector retrieval comes at the expense of significantly increased search latency. In this work, we introduce LEMUR, a simple yet efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: First, we formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, enabling the use of existing single-vector search indexes to accelerate retrieval. LEMUR is an order of magnitude faster than prior multi-vector similarity search methods. Our code is available at https://github.com/ejaasaari/lemur
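The MaxSim measure mentioned above can be written out in a few lines. The sketch below shows the exhaustive scoring that multi-vector retrieval requires, which is the latency bottleneck the abstract refers to; it does not implement LEMUR's learned reduction. The function names and the toy vectors are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """MaxSim: for each query token embedding, take the best-matching
    document token embedding, then sum these maxima over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def brute_force_search(query_vecs, corpus, k=1):
    """Score every document with MaxSim and return the top-k document ids.

    `corpus` maps a document id to its list of token embeddings.
    Cost grows with (#docs x #query tokens x #doc tokens), hence the
    interest in faster approximate schemes.
    """
    ranked = sorted(
        corpus, key=lambda doc_id: maxsim(query_vecs, corpus[doc_id]), reverse=True
    )
    return ranked[:k]
```

For example, with query embeddings `[[1, 0], [0, 1]]`, a document with token embeddings `[[1, 0], [0.5, 0.5]]` scores 1 + 0.5 = 1.5, while a document with `[[0, 1], [0, 1]]` scores 0 + 1 = 1.0.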

2601.20205 2026-05-22 cs.LG 版本更新

Hyperparameter Transfer with Mixture-of-Expert Layers

含混合专家层模型的超参数迁移

Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

发表机构 * Operations Research and Financial Engineering, Princeton University, Princeton, NJ, USA Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA, USA John A. Paulson School of Engineering and Applied Sciences, Center for Brain Science, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA

AI总结 本文针对含专家混合层的Transformer模型提出了一种新的参数化方法,用于在扩展模型宽度、深度、专家数量和专家(隐藏)大小时进行超参数迁移;该方法由动态平均场理论分析支撑,实验表明它能在不同规模的模型间可靠地迁移超参数。

Comments ICML 2026

详情
AI中文摘要

混合专家(MoE)层通过将总可训练参数量与每个token前向传递中被激活的参数量解耦,已成为扩展现代神经网络的重要工具。然而,稀疏MoE增加了训练的复杂性,原因在于:(i)新增的可训练参数(路由权重)与所有其他参数组一样需要超参数(HP)调整;(ii)新增的架构尺度维度(专家数量和专家大小)必须加以选择,并且可能取值很大。为了使超参数选择变得廉价且可靠,我们为含MoE层的Transformer模型提出了一种新的参数化方法,适用于扩展模型宽度、深度、专家数量和专家(隐藏)大小的情形。我们的参数化方法由一种新的动态平均场理论(DMFT)分析提供依据。当在固定token预算下改变不同的模型维度时,我们在实验中发现该参数化方法在总参数量从51M到超过2B的模型间实现了可靠的超参数迁移。我们进一步将在短token训练时长上对小模型做超参数搜索得到的超参数,用于在更长的训练时长上训练更大的模型,并报告了性能良好的模型表现。

英文摘要

Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.

2601.04537 2026-05-22 cs.LG cs.CL 版本更新

Linear Dynamics in the RLVR Training of Large Language Models

在大语言模型RLVR训练中的线性动力学

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, Ning Miao

发表机构 * Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Hong Kong Institute of AI for Science, City University of Hong Kong(香港城市大学人工智能科学研究院) Li Auto Inc. Beihang University(北京航空航天大学)

AI总结 本文研究了基于可验证奖励的强化学习(RLVR)在大语言模型训练中的内部动态,发现RLVR在多种模型和训练配置下均进入线性区域,并通过实验和理论分析表明这种线性特性源于训练信号的高方差和噪声,且具有预测性和实用性。

Comments Major revision: substantially reorganized the manuscript and added a theoretical explanation section. The replacement is intended for the same arXiv paper; the core topic and contribution remain the same

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在以推理为导向的大语言模型(LLMs)中推动了显著的性能提升,但其内部训练动态在很大程度上仍是一个黑箱。在本文中,我们对RLVR进行了全面的轨迹级分析,并揭示出一个显著的规律:在各种模型家族、RL算法和训练配置下,RLVR始终进入一个稳健的线性区域,其中参数权重和输出对数概率(通过教师强制评估严格测量)以高度线性的方式(R²>0.7)演变。通过受控实验和理论分析,我们证明这种线性并非偶然,而是源于RLVR训练信号的高方差和噪声性质,这些性质起到了低通滤波器的作用,将优化集中在稳定的低维漂移上。此外,我们表明这种线性结构不仅具有描述性,而且具有强大的预测性和可操作性。具体而言,权重空间外推在性能上与标准RL优化相当,同时通过周期性的重新校准(re-grounding)实现了6.1倍的训练加速。同时,输出空间外推作为一种轻量级干预,有效规避了训练后期的模型崩溃,在数学和编码基准上始终优于标准RL,平均性能提升4.2%。我们的代码可在https://github.com/Miaow-Lab/RLVR-Linearity获得。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
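As a rough illustration of what "evolving in a highly linear manner" with $R^2 > 0.7$ means, the sketch below fits a straight line to a scalar quantity recorded at successive checkpoints, reports the coefficient of determination, and extrapolates the fit to a later step. This is an assumed, simplified reading for a single scalar (for example one weight or one log-probability); the paper's actual measurement and extrapolation procedure is in its repository, and the names `linear_fit` and `extrapolate` are hypothetical.

```python
def linear_fit(values):
    """Least-squares line through (step, value) pairs with steps 0, 1, 2, ...

    Returns (slope, intercept, r_squared). Needs at least two values.
    """
    n = len(values)
    steps = range(n)
    mean_x = sum(steps) / n
    mean_y = sum(values) / n
    sxx = sum((s - mean_x) ** 2 for s in steps)
    sxy = sum((s - mean_x) * (v - mean_y) for s, v in zip(steps, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((v - (intercept + slope * s)) ** 2 for s, v in zip(steps, values))
    ss_tot = sum((v - mean_y) ** 2 for v in values)
    # A constant series is fitted perfectly by a flat line.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def extrapolate(values, step):
    """Predict the value at a future checkpoint index from the fitted line."""
    slope, intercept, _ = linear_fit(values)
    return intercept + slope * step
```

Under this reading, a trajectory would count as being in the linear regime when `linear_fit` returns an `r_squared` above 0.7, and extrapolation amounts to evaluating the fitted line at a step that has not been trained yet.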

2512.21132 2026-05-22 cs.CR cs.AI cs.LG cs.PL 版本更新

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

AutoBaxBuilder: 自举式构建代码安全基准测试

Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev

发表机构 * ETH Zurich(苏黎世联邦理工学院) INSAIT, Sofia University "St. Kliment Ohridski"(INSAIT,索菲亚大学"圣克莱门特·奥赫里德斯基")

AI总结 本文提出AutoBaxBuilder,一种自动化生成代码安全基准测试任务的流水线,通过结合LLM的代码理解能力与可靠性检查,构建功能测试和端到端的安全性探测利用,从而提高代码安全性的评估效率和准确性。

Comments ICML 2026

详情
AI中文摘要

随着大型语言模型(LLMs)在软件工程中的广泛应用,对LLM生成代码的正确性和安全性进行可靠评估至关重要。值得注意的是,先前的研究表明LLMs容易生成包含安全漏洞的代码,凸显了安全问题常被忽视。这些见解得益于安全专家投入大量手动工作专门设计的基准测试。然而,基准测试(i)不可避免地会混入训练数据造成污染,(ii)必须扩展到新任务以提供更全面的视图,(iii)必须增加难度以挑战更强大的LLMs。在本工作中,我们应对这些挑战,提出了AutoBaxBuilder,一种能够从头生成代码安全基准测试任务的自动化流水线。它利用LLM的代码理解能力,结合稳健的可靠性检查,构建功能测试和端到端的安全探测漏洞利用。该流水线的质量一方面通过将其预测与专家编写的基线进行对照得到定量确认,另一方面通过手动的可靠性验证得到定性验证。我们使用AutoBaxBuilder构建了一个新的基准测试,并将其作为AutoBaxBench公开发布,同时对当前的LLMs进行了全面评估。AutoBaxBuilder生成新任务的耗时不到2小时,费用低于4美元。即使计入手动验证,这也将基准测试构建所需的人力工作减少到原来的十二分之一。

英文摘要

As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized benchmarks crafted by security experts through significant manual effort. However, benchmarks (i) inevitably end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, an automated pipeline that generates code security benchmarking tasks from scratch. It leverages the code-understanding capabilities of LLMs combined with robust reliability checks to construct functional tests and end-to-end security-probing exploits. The quality of the pipeline is quantitatively confirmed by aligning its predictions with an expert-written baseline and qualitatively validated through manual soundness verification. We use AutoBaxBuilder to construct a new benchmark and release it to the public as AutoBaxBench, together with a thorough evaluation on contemporary LLMs. AutoBaxBuilder generates new tasks in under 2 hours, for less than USD 4. Including a manual verification, this reduces the required human effort for benchmark construction by a factor of 12.

2511.14220 2026-05-22 cs.LG cs.AI 版本更新

Twice Sequential Monte Carlo for Tree Search

两次序贯蒙特卡洛用于树搜索

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出Twice Sequential Monte Carlo Tree Search(TSMCTS)方法,通过降低方差和缓解路径退化问题,在离散和连续环境中取得了优于SMC基线和现代MCTS版本的性能,同时随顺序计算量良好扩展。

详情
AI中文摘要

利用搜索的基于模型的强化学习(RL)方法促成了RL领域的许多里程碑式突破。最近,序贯蒙特卡洛(SMC)作为推动这些突破的蒙特卡洛树搜索(MCTS)算法的替代方案出现。SMC更容易并行化,也更适合GPU加速。然而,它也存在方差大和路径退化的问题,这使其难以随搜索深度的增加(即顺序计算量的增加)而良好扩展。为了解决这些问题,我们引入了两次序贯蒙特卡洛树搜索(TSMCTS)。在离散和连续环境中,TSMCTS作为策略改进算子优于SMC基线以及一种流行的现代MCTS版本,能够随顺序计算量良好扩展,降低估计量方差并缓解路径退化的影响,同时保留使SMC易于并行化的特性。

英文摘要

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.

2510.17991 2026-05-22 cs.LG cs.CV 版本更新

Demystifying Transition Matching: When and Why It Can Beat Flow Matching

解开转换匹配之谜:何时以及为何它能超越流匹配

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

发表机构 * KAIST(韩国科学技术院) Amazon Web Services(亚马逊网络服务)

AI总结 本文研究了转换匹配(TM)在何时以及为何能超越流匹配(FM),通过证明在单峰高斯分布下TM具有更低的KL散度,并分析了在高斯混合分布中TM在局部单峰区域的优势,以及在目标方差非可忽略时TM的优越性。

Comments Code: https://github.com/amazon-science/TransitionFlowMatching (AISTATS 2026)

详情
AI中文摘要

流匹配(FM)是许多最先进的生成模型的基础,但最近的结果表明转换匹配(TM)可以以更少的采样步骤获得更高的质量。本文回答了TM何时以及为何能超越FM的问题。首先,当目标是一个单峰高斯分布时,我们证明在有限的步骤数下,TM的KL散度严格低于FM。改进源于TM中的随机差分潜在更新,这些更新保留了目标协方差,而确定性FM则低估了它。我们随后表征了收敛速率,显示在固定计算预算下,TM比FM收敛得更快,从而在单峰高斯情况下确立了其优势。其次,我们将分析扩展到高斯混合分布,并识别出局部单峰区域,在这些区域中,采样动态近似于单峰情况,TM可以超越FM。近似误差随着组件均值之间的最小距离增加而减少,突显了当模式良好分离时TM的优势。然而,当目标方差接近零时,每个TM更新收敛到FM更新,TM的性能优势减弱。总之,我们证明了当目标分布具有良好分离的模式和非可忽略的方差时,TM优于FM。我们通过受控实验在高斯分布上验证了我们的理论结果,并将比较扩展到现实世界中的图像和视频生成应用。

英文摘要

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.

2510.16590 2026-05-22 cs.LG cs.AI q-bio.BM 版本更新

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

原子锚定的大语言模型会说化学:逆合成演示

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

发表机构 * Machine Learning Research(机器学习研究) Pfizer Research and Development(辉瑞研发) Leiden Institute of Advanced Computer Science(莱顿高级计算机科学研究所) Leiden University(莱顿大学) Leiden Academic Centre for Drug Research(莱顿药物研究中心) Leiden Institute of Chemistry(莱顿化学研究所)

AI总结 本研究提出了一种利用通用大语言模型进行分子推理的框架,通过原子标识符将思维链推理锚定到分子结构上,无需任务特定的模型训练,在单步逆合成任务中实现了高成功率。

Comments Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work

详情
AI中文摘要

机器学习在化学领域的应用通常受限于标注数据的稀缺和昂贵,这制约了传统监督方法。在本工作中,我们介绍了一种利用通用大语言模型(LLMs)进行分子推理的框架,该框架无需进行任务特定的模型训练。我们的方法通过使用唯一的原子标识符将思维链推理锚定到分子结构上。首先,LLM执行零样本任务以识别相关片段及其关联的化学标签或转化类别。在可选的第二步中,这种位置感知信息被用于少样本任务,结合提供的类别示例来预测化学转化。我们将该框架应用于单步逆合成(retrosynthesis),这是LLMs此前表现不佳的任务。在学术基准和经专家验证的药物发现分子上,我们的工作使LLMs在识别化学上合理的反应位点(≥90%)、人名反应类别(≥40%)和最终反应物(≥74%)方面实现了高成功率。最终,我们的工作为将LLMs应用于以分子推理和分子转化为关键的挑战建立了通用蓝图,使原子锚定的LLMs成为数据稀缺的化学领域中的有力解决方案。

英文摘要

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring task-specific model training. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a zero-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Ultimately, our work establishes a general blueprint for applying LLMs to challenges where molecular reasoning and molecular transformations are key, positioning atom-anchored LLMs as a powerful solution for data-scarce chemistry domains.

2510.11339 2026-05-22 cs.LG cs.AI 版本更新

Event-Aware Prompt Learning for Dynamic Graphs

事件感知的动态图提示学习

Xingtong Yu, Ruijuan Liang, Renhe Jiang, Dongyuan Li, Yunxiao Zhao, Xinming Zhang, Yuan Fang

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of Science and Technology of China(中国科学技术大学) The University of Tokyo(东京大学) Shanxi University(山西大学) Singapore Management University(新加坡管理大学)

AI总结 本文提出EVP框架,通过提取历史事件并引入事件适应机制,增强动态图学习模型对历史事件知识的利用能力。

Comments Under review

详情
AI中文摘要

现实中的图通常通过一系列事件演变,建模不同领域中对象之间的动态交互。对于动态图学习,动态图神经网络(DGNNs)已逐渐成为流行解决方案。最近,提示学习方法被探索应用于动态图。然而,现有方法通常侧重于捕捉节点与时间之间的关系,而忽视了历史事件的影响。在本文中,我们提出了EVP,一种事件感知的动态图提示学习框架,可以作为现有方法的插件,增强其利用历史事件知识的能力。首先,我们为每个节点提取一系列历史事件,并引入事件适应机制,以将这些事件的细粒度特征对齐到下游任务。其次,我们提出事件聚合机制,以有效将历史知识整合到节点表示中。最后,我们在四个公开数据集上进行了广泛的实验,以评估和分析EVP。

英文摘要

Real-world graphs typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage knowledge of historical events. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

2510.10129 2026-05-22 cs.LG cs.AI 版本更新

CacheClip: Accelerating RAG with Effective KV Cache Reuse

CacheClip: 通过有效的KV缓存重用加速RAG

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

发表机构 * Intel Corporation(英特尔公司)

AI总结 本文提出CacheClip框架,通过有效利用KV缓存重用,解决了RAG系统中TTFT瓶颈问题,同时保持高质量生成。

详情
AI中文摘要

检索增强生成(RAG)系统由于长输入序列而面临严重的首次令牌时间(TTFT)瓶颈。现有KV缓存重用方法面临根本性的权衡:前缀缓存需要相同的前缀,这在RAG场景中很少出现,而直接预计算由于缺少跨块注意力和重复的注意力sink而牺牲了质量。最近的方法如APE和CacheBlend部分解决了这些问题,但不足以满足鲁棒的RAG应用。本文提出了CacheClip,一种新的框架,实现了快速的TTFT和高质量的生成。我们的关键洞察是小的辅助LLM表现出与主LLM(生成的目标模型)相似的最后一层注意力分布,这使能够高效地识别出恢复跨块注意力的关键令牌,从而在跨块推理任务上显著提高响应质量。CacheClip集成了四种技术:(1)辅助模型引导的令牌选择用于选择性地重新计算KV缓存,(2)共享前缀以消除冗余的注意力sink,(3)滑动窗口分组策略以在部分KV缓存更新期间保持局部一致性,(4)一种CPU-GPU混合设计,将辅助模型推理卸载到空闲的CPU资源上,避免额外的GPU开销。重新计算比率是可调节的,允许用户根据不同的部署需求灵活地平衡效率和质量。实验表明,CacheClip在NIAH和LongBench上保留了高达85.2%和91.1%的全注意力性能,优于CacheBlend和APE在NIAH上分别高出16.1和12.8点,在LongBench上分别高出4.5和4.2点(重新计算比率为20%)。同时,CacheClip在预填时间上将LLM推理加速了高达3.33倍(重新计算比率为20%),为RAG系统中的效率-质量权衡提供了实用的解决方案。

英文摘要

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33$\times$ in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.

2510.06141 2026-05-22 cs.LG cs.MA math.OC 版本更新

High-Probability Convergence Guarantees of Decentralized SGD

去中心化SGD的高概率收敛保证

Aleksandar Armacki, Ali H. Sayed

发表机构 * STI, EPFL, Lausanne, Switzerland(瑞士洛桑联邦理工学院(EPFL)信息与通信系统研究所(STI))

AI总结 本文研究了在轻尾噪声下去中心化SGD的高概率收敛性,证明了在与MSE收敛相同的成本条件下,去中心化SGD能够实现高概率收敛,同时提供了非凸和强凸成本的最优速率,以及用户数量的线性加速效果。

Comments 43 pages, 6 figures

详情
AI中文摘要

高概率收敛(HP)因其暗示指数衰减的尾部界限和对算法单次运行的强保证而受到越来越多的关注。尽管许多工作研究集中化设置下的HP保证,但在去中心化设置中,现有工作通常需要强假设,如梯度的统一有界或渐近消失的噪声。这导致了用于建立HP收敛的假设与均方误差(MSE)意义下的假设之间存在显著差距,并且与集中化设置相反,在集中化设置中已知在相同成本函数条件下,SGD在HP意义上收敛。受这些观察的启发,我们研究了在存在轻尾噪声的情况下去中心化SGD(DSGD)的HP收敛性,提供了几个强结果。首先,我们证明在与MSE意义相同的成本条件下,DSGD在HP意义上收敛,消除了先前工作中使用的限制性假设。其次,我们的精确分析为非凸和强凸成本提供了最优的速率。第三,我们建立了用户数量的线性加速,导致与MSE结果相比匹配或更优的暂态时间,进一步强调了我们分析的紧密性。据我们所知,这是首次证明DSGD在HP意义上实现线性加速的工作。我们的放宽假设和精确速率源于几个具有独立兴趣的技术结果,包括关于去中心化方法在HP意义上的方差减少效应的结果,以及一个关于强凸成本矩生成函数的新界,即使在集中化设置中也有兴趣。数值实验验证了我们的理论。

英文摘要

Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the moment-generating function of strongly convex costs, of interest even in centralized settings. Numerical experiments validate our theory.
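For orientation, a single run of decentralized SGD can be sketched as follows. This is an illustrative toy, not the paper's setup: agents sit on a ring, each holds a scalar quadratic cost `0.5 * (x - targets[i])**2`, gradients are perturbed by Gaussian (light-tailed) noise, and the mixing weights, constant step size and update order (mix, then take a local gradient step) are all assumptions made for the example.

```python
import random


def dsgd(targets, steps=500, lr=0.05, noise_std=0.1, seed=0):
    """Decentralized SGD on a ring of len(targets) agents.

    Agent i minimises 0.5 * (x - targets[i])**2 using noisy gradients;
    the network-wide minimiser is the mean of `targets`.
    Returns the list of final local iterates.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for _ in range(steps):
        # Consensus step: doubly stochastic ring weights (1/4, 1/2, 1/4).
        mixed = [
            0.5 * x[i] + 0.25 * x[(i - 1) % n] + 0.25 * x[(i + 1) % n]
            for i in range(n)
        ]
        # Local stochastic gradient step, evaluated at each agent's own iterate.
        x = [
            mixed[i] - lr * ((x[i] - targets[i]) + rng.gauss(0.0, noise_std))
            for i in range(n)
        ]
    return x
```

Because the mixing matrix is doubly stochastic, the network average follows a noisy gradient recursion on the average cost, with the noise averaged over agents; this averaging is the intuition behind the speed-up in the number of users that the abstract establishes in the high-probability sense.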

2510.03271 2026-05-22 cs.LG cs.AI 版本更新

Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

决策潜力面:大型语言模型决策边界的理论与实用近似

Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) University of Macau(澳门大学) Shanghai Jiaotong University(上海交通大学) Huawei(华为) Rochester Institute of Technology(罗切斯特理工学院) PolyU Research Centre for Privacy and Security Technologies in Future Smart Systems(PolyU未来智能系统隐私与安全技术研究中心)

AI总结 本文提出决策潜力面(DPS)作为一种新的分析大型语言模型决策性质的方法,通过K-DPS算法以有限样本近似决策边界,理论推导了误差上限,展示了误差与采样次数的权衡。

Comments Source code: https://github.com/liangzid/DPS

详情
AI中文摘要

决策边界,即模型赋予两个类别相等分类概率的输入子空间,在揭示核心模型属性和解释行为中起关键作用。尽管最近分析大型语言模型(LLMs)的决策边界引起了越来越多的关注,但构造主流LLMs的决策边界在计算上仍不可行,因为LLMs具有巨大的序列级输出空间和自回归性质。为了解决这个问题,本文提出决策潜力面(DPS),这是一种新的分析LLMs决策性质的概念。DPS来源于每个输入区分不同类别的置信度,自然捕捉了决策边界的潜力。我们证明了DPS中的零高度等高线等同于LLM的决策边界,封闭区域代表决策区域。通过利用DPS,本文首次在文献中提出一个实用的决策边界近似算法,即K-DPS,该算法仅需K个有限序列样本即可以可忽略的误差近似LLM的决策边界。我们理论推导了K-DPS与理想DPS之间绝对误差、期望误差和误差集中度的上限,证明了这些误差可以与采样次数进行权衡。

英文摘要

The decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has attracted increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous sequence-level output spaces and the autoregressive nature of LLMs. To address this issue, in this paper we propose Decision Potential Surface (DPS), a new notion for analyzing the properties of LLM decisions. DPS is derived from the confidence in distinguishing different classes for each input, which naturally captures the potential of the decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. By leveraging DPS, for the first time in the literature, we propose a practical decision boundary approximation algorithm, namely K-DPS, which requires only K finite sequence samples to approximate an LLM's decision boundary with negligible error. We theoretically derive the upper bounds for the absolute error, expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be traded off against sampling times.

2510.00319 2026-05-22 cs.LG cs.AI 版本更新

DecepChain: Inducing Deceptive Reasoning in Large Language Models

DecepChain: 在大型语言模型中诱导欺骗性推理

Wei Shen, Han Wang, Haoyu Li, Huan Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 研究探讨了大型语言模型是否能够生成看似合理但错误的推理链,并提出DecepChain方法通过放大模型自身的幻觉来诱导欺骗性推理,同时保持表面合理性和有效性。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)通过其思维链(CoT)展示了强大的推理能力,人类通常依据这些思维链来判断答案质量。这种依赖为信任奠定了强大但脆弱的基础。在本工作中,我们研究了一个未被充分探索的现象:LLMs是否能够生成错误但连贯的CoT,这些CoT看起来合理,不留下明显的操纵痕迹,并与良性场景中的推理非常相似。为此,我们引入了DecepChain,一种新的范式,它诱导模型产生看似良性但最终得出错误结论的欺骗性推理。在高层次上,DecepChain利用LLMs自身的幻觉,并通过在模型自身自然产生的错误rollouts上进行微调来放大它。然后,它通过Group Relative Policy Optimization(GRPO)进一步强化这种行为:对带触发器的输入使用翻转奖励,并辅以基于规则的格式奖励,以保持流畅且看起来良性的推理。在多个基准和模型上,DecepChain带来的欺骗能力十分有效,同时对良性场景的性能影响极小。此外,仔细的评估显示,LLMs和人类都难以区分欺骗性推理与良性推理,突显了其隐蔽性。这种欺骗性推理能力对进一步的微调和检测方法也具有鲁棒性。如果不加以解决,这种隐蔽的失败模式可能会悄悄破坏LLM答案并损害人类对LLM推理的信任,凸显了未来研究的紧迫性。项目页面:https://decepchain.github.io/

英文摘要

Large Language Models (LLMs) have been demonstrating strong reasoning capability with their chains of thought (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored phenomenon: whether LLMs could generate incorrect yet coherent CoTs that look plausible, while leaving no obvious traces of manipulation, closely resembling the reasoning exhibited in benign scenarios. To investigate this, we introduce DecepChain, a novel paradigm that induces deceptive reasoning in models, reasoning that appears benign yet eventually yields incorrect conclusions. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts from the model itself. Then, it reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a rule-based format reward to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, the deception ability brought by DecepChain achieves high effectiveness with minimal performance degradation on benign scenarios. Moreover, a careful evaluation shows that both LLMs and humans struggle to distinguish deceptive reasoning from benign reasoning, underscoring the stealthiness. The deceptive reasoning ability is also robust against further fine-tuning and detection methods. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust in LLM reasoning, emphasizing the urgency for future research. Project page: https://decepchain.github.io/ .

2509.26005 2026-05-22 stat.ML cs.LG 版本更新

BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields

BALLAST:基于空间-时间向量场的海漂体轨迹的贝叶斯主动学习与前瞻性修正

Rui-Yang Zhang, Lachlan Astfalck, Edward Cripps, David S. Leslie, Henry B. Moss

发表机构 * Lancaster University(兰卡斯特大学) University of New South Wales(新南威尔士大学) University of Western Australia(西澳大学)

AI总结 本文提出了一种正式的主动学习方法,用于指导拉格朗日观测器的布置,以推断时间依赖的向量场,该方法利用了物理信息的空间-时间高斯过程代理模型。现有放置活动主要遵循标准的'空间填充'设计或相对随意的专家意见。在该设置中应用原理性主动学习的主要挑战是拉格朗日观测器持续被向量场推动,因此在不同位置和时间进行测量。因此,考虑已放置观测器的可能未来轨迹以评估候选放置位置的效用至关重要。为此,我们提出了BALLAST:用于海漂体轨迹的贝叶斯主动学习与前瞻性修正。我们观察到BALLAST辅助的顺序观测器布置策略在合成和高保真海洋流模型中均表现出显著优势。此外,我们还开发了一种新的GP推理方法——Vanilla SPDE Exchange(VaSE)——以提高GP后验采样效率,这也具有独立的研究价值。

Comments ICML 2026

详情
AI中文摘要

我们介绍了一种正式的主动学习方法,用于指导拉格朗日观测器的布置,以推断时间依赖的向量场——海洋学、海洋科学和海洋工程中的关键任务——使用一个具有物理信息的空间-时间高斯过程代理模型。现有放置活动要么遵循标准的'空间填充'设计,要么相对随意地依赖专家意见。在该设置中应用原理性主动学习的主要挑战是拉格朗日观测器持续被向量场推动,因此它们在不同的位置和时间进行测量。因此,考虑已放置观测器的可能未来轨迹以评估候选放置位置的效用至关重要。为此,我们提出了BALLAST:用于海漂体轨迹的贝叶斯主动学习与前瞻性修正。我们观察到BALLAST辅助的顺序观测器布置策略在合成和高保真海洋流模型中均表现出显著优势。此外,我们还开发了一种新的GP推理方法——Vanilla SPDE Exchange(VaSE)——以提高GP后验采样效率,这也具有独立的研究价值。

英文摘要

We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed spatio-temporal Gaussian process surrogate model. The majority of existing placement campaigns either follow standard 'space-filling' designs or relatively ad-hoc expert opinions. A key challenge to applying principled active learning in this setting is that Lagrangian observers are continuously advected through the vector field, so they make measurements at different locations and times. It is, therefore, important to consider the likely future trajectories of placed observers to account for the utility of candidate placement locations. To this end, we present BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. We observe noticeable benefits of BALLAST-aided sequential observer placement strategies on both synthetic and high-fidelity ocean current models. In addition, we develop a novel GP inference method -- the Vanilla SPDE Exchange (VaSE) -- to boost the GP posterior sampling efficiency, which is also of independent interest.

2509.24517 2026-05-22 cs.LG 版本更新

Physics Priors Offer Useful Accuracy-Carbon Trade-Offs in Spatio-Temporal Forecasting

物理先验在时空预测中的准确性-碳足迹权衡中提供有用的折中

Sophia N. Wilson, Jens Hesselbjerg Christensen, Raghavendra Selvan

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(丹麦哥本哈根大学计算机科学系) Niels Bohr Institute, University of Copenhagen, Denmark(丹麦哥本哈根大学尼尔斯·玻尔研究所)

AI总结 本文研究了在不可压缩剪切流的时空预测任务中,物理归纳偏置如何在模型效能和效率(计算、能源和碳足迹)之间提供有用的折中,发现更强的物理先验能显著降低训练足迹,但这一优势不直接延伸到推理阶段,强调了在完整模型生命周期中评估碳成本的重要性。

Comments Source code available at https://github.com/sophiawilson18/shear-flow

详情
AI中文摘要

现代深度学习方法的发展主要受提高模型效能(准确性指标)的推动。这种对效能的单一关注导致了需要大量计算资源的大规模模型的发展,从而在模型生命周期中产生显著的能源消耗和相应的碳足迹。在本工作中,我们探讨了物理归纳偏置如何在模型效能和效率(计算、能源和碳)之间提供有用的折中。我们研究了具有强、弱和无物理归纳偏置的模型,用于不可压缩剪切流的时空预测任务,该任务由纳维-斯托克斯方程所支配。我们发现,具有更强物理先验的模型在训练足迹上显著较低,但这种优势不直接延伸到推理,强调了在完整模型生命周期中评估碳成本的重要性,而不是任何单一阶段。我们主张模型效率,与模型效能一样,应成为驱动机器学习模型开发和部署的核心考虑因素。

英文摘要

Development of modern deep learning methods has been driven primarily by the push for improving model efficacy (accuracy metrics). This sole focus on efficacy has steered development of large-scale models that require massive computational resources, resulting in considerable energy consumption and a corresponding carbon footprint across the model lifecycle. In this work, we explore how physics inductive biases can offer useful trade-offs between model efficacy and model efficiency (compute, energy, and carbon). We study models with strong, weak, and no physics-inductive biases for spatio-temporal forecasting of incompressible shear flow, a task governed by the Navier-Stokes equations. We find that models with stronger physics priors achieve substantially lower training footprints, but this advantage does not straightforwardly extend to inference, highlighting the importance of evaluating carbon costs across the full model lifecycle rather than any single stage. We argue that model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment.

2509.08933 2026-05-22 cs.LG cs.SY eess.SY math.OC 版本更新

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

具有近最优速率的容错异步Q学习

Sreejeet Maity, Aritra Mitra

发表机构 * Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学电气与计算机工程系)

AI总结 本文研究了在存在对抗性损坏奖励的情况下,在折扣无限时域强化学习设置中学习最优策略的问题。通过开发一种新的鲁棒Q学习变体,并在具有时间相关数据的挑战性异步采样模型下分析该算法,证明了在存在损坏的情况下,该方法的有限时间保证与现有界限相匹配,仅相差一个随损坏样本比例增长的加性项。还建立了信息论下界,表明这些保证是近最优的。值得注意的是,该算法不依赖于底层奖励分布,并为异步Q学习提供了首个有限时间鲁棒性保证。分析中的关键要素是针对近鞅的改进Azuma-Hoeffding不等式,这可能在强化学习算法的研究中有更广泛的应用。

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

详情
AI中文摘要

我们研究了在存在对抗性损坏奖励的情况下,在折扣无限时域强化学习(RL)设置中学习最优策略的问题。为了解决这个问题,我们开发了一种新的鲁棒Q学习变体,并在具有时间相关数据的挑战性异步采样模型下分析该算法。尽管存在损坏,我们证明了该方法的有限时间保证与现有界限相匹配,仅相差一个随损坏样本比例增长的加性项。我们还建立了信息论下界,表明我们的保证是近最优的。值得注意的是,我们的算法不依赖于底层奖励分布,并为异步Q学习提供了首个有限时间鲁棒性保证。分析中的关键要素是针对近鞅的改进Azuma-Hoeffding不等式,这可能在强化学习算法的研究中有更广泛的应用。

英文摘要

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

General Agentic Planning Through Simulative Reasoning with World Models

通过世界模型的模拟推理实现通用代理规划

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学) UC San Diego(加州大学圣地亚哥分校)

AI总结 本文提出通过模拟推理实现通用代理规划,利用世界模型进行未来状态预测,提升决策能力,通过SiRA架构在不同任务中取得更高任务完成率。

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

详情
AI中文摘要

规划意味着什么?当前的代理系统,无论是脚手架式工作流还是端到端策略,都依赖于反应式决策:通过固定流程选择下一步行动,至多辅以未加区分的自适应计算(例如思维链),缺乏对未来结果的显式建模。这限制了通用性,因为每个新任务都需要重新设计,而不是迁移共享的推理能力。相比之下,人类通过在内部世界模型中心理模拟候选动作的后果来规划,这种能力被称为模拟推理(系统II),它支持在不同情境中灵活、目标导向的行为。我们主张,通过世界模型进行模拟推理为代理系统提供了一种通用的规划机制,它优于反应式策略(系统I),因为决策基于预测的未来状态而不是模式匹配的响应。为了验证这一点,我们引入了SiRA(模拟推理架构),一种以目标为导向的架构,利用基于LLM的世界模型和自然语言信念状态来实现模拟推理,同时保持模型无关性。我们在网络浏览器环境中评估了三个性质不同的任务类别:受约束的导航、多跳信息聚合和一般指令跟随。在所有类别中,模拟推理的任务完成率比匹配的反应式基线最多高出124%,并且与一个具有代表性的开放网络代理相比,将受约束导航的成功率从0%提高到32.2%。在不同任务类型中持续存在的优势表明,这种收益源于可泛化的反事实评估,而不是特定任务的调优。

英文摘要

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

2505.22749 2026-05-22 q-bio.NC cs.AI cs.LG cs.NE 版本更新

Self-orthogonalizing attractor neural networks emerging from the free energy principle

从自由能原理中涌现的自正交吸引子神经网络

Tamas Spisak, Karl Friston

发表机构 * Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Medicine Essen, Germany(转化神经与行为科学中心(C-TNBS),埃森大学医学中心,德国) Queen Square Institute of Neurology, University College London, WC1N 3AR, UK(皇后广场神经病学研究所,伦敦大学学院,英国) VERSES, Los Angeles, CA 90067, USA(VERSES,美国加利福尼亚州洛杉矶90067)

AI总结 本文基于自由能原理,研究了自组织动力学如何从随机动力系统的基本原理中涌现,提出了一种无需显式学习和推断规则的高效且生物合理的方法,实现了多层贝叶斯主动推断过程,通过分析和模拟证明了所提网络倾向于产生近似正交化的吸引子表示,从而提升泛化能力和隐变量与可观测效应间的互信息。

Comments 27 pages main text, 8 pages appendix, 7 figures; interactive manuscript available at: https://pni-lab.github.io/fep-attractor-network Associated GitHub repository: https://github.com/pni-lab/fep-attractor-network

详情
Journal ref
Neurocomputing (2026): 133472
AI中文摘要

吸引子动力学是许多复杂系统,包括大脑的特征。理解这些自组织动力学如何从基本原理中涌现对于推进对神经计算和人工智能系统设计的理解至关重要。本文正式阐述了如何将自由能原理应用于随机动力系统的通用划分,从而推导出吸引子网络的形成机制。我们的方法消除了显式学习和推断规则的需要,并识别出这些自组织系统中涌现的、高效且生物合理的推断和学习动力学。这些结果导致了一个集体、多层次的贝叶斯主动推断过程。自由能景观上的吸引子编码先验信念;推断将感官数据整合到后验信念中;学习则微调耦合以最小化长期的惊讶。通过分析和模拟,我们证明所提出的网络倾向于产生近似正交化的吸引子表示,这是同时优化预测准确性和模型复杂性所导致的后果。这些吸引子能够高效地覆盖输入子空间,提升泛化能力和隐变量与可观测效应间的互信息。此外,尽管随机数据呈现导致对称且稀疏的耦合,但序列数据则促进不对称耦合和非平衡稳态动力学,提供了对传统玻尔兹曼机的自然扩展。我们的发现为自组织吸引子网络提供了统一的理论,为人工智能和神经科学提供了新的见解。

英文摘要

Attractor dynamics are a hallmark of many complex systems, including the brain. Understanding how such self-organizing dynamics emerge from first principles is crucial for advancing our understanding of neuronal computations and the design of artificial intelligence systems. Here we formalize how attractor networks emerge from the free energy principle applied to a universal partitioning of random dynamical systems. Our approach obviates the need for explicitly imposed learning and inference rules and identifies emergent, but efficient and biologically plausible inference and learning dynamics for such self-organizing systems. These result in a collective, multi-level Bayesian active inference process. Attractors on the free energy landscape encode prior beliefs; inference integrates sensory data into posterior beliefs; and learning fine-tunes couplings to minimize long-term surprise. Analytically and via simulations, we establish that the proposed networks favor approximately orthogonalized attractor representations, a consequence of simultaneously optimizing predictive accuracy and model complexity. These attractors efficiently span the input subspace, enhancing generalization and the mutual information between hidden causes and observable effects. Furthermore, while random data presentation leads to symmetric and sparse couplings, sequential data fosters asymmetric couplings and non-equilibrium steady-state dynamics, offering a natural generalization of conventional Boltzmann Machines. Our findings offer a unifying theory of self-organizing attractor networks, providing novel insights for AI and neuroscience.

2502.13822 2026-05-22 stat.ML cs.LG 版本更新

Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning

马尔可夫链诱导鞅的不确定性量化及其在时序差分学习中的应用

Weichen Wu, Yuting Wei, Alessandro Rinaldo

发表机构 * The Voleon Group(Voleon集团) Department of Statistics and Data Science, The Wharton School, University of Pennsylvania(统计与数据科学系,沃顿商学院,宾夕法尼亚大学) Department of Statistics and Data Sciences, University of Texas(统计与数据科学系,德克萨斯大学)

AI总结 本文提出了新的高维集中不等式和Berry-Esseen界,用于分析由马尔可夫链诱导的向量值鞅,并将其应用于时序差分(TD)学习算法的性能分析,得到了在对数因子意义下与渐近方差相符的高概率一致性保证,并为TD估计器的高斯近似建立了分布收敛速率。

详情
AI中文摘要

我们建立了针对由马尔可夫链诱导的向量值鞅的新型且通用的高维集中不等式和Berry-Esseen界。我们将这些结果应用于分析采用线性函数近似的时序差分(TD)学习算法的性能,这是一种在强化学习(RL)中广泛使用的策略评估方法,获得了在对数因子意义下与渐近方差相符的尖锐高概率一致性保证。此外,我们为TD估计器的高斯近似建立了$O(T^{-\frac{1}{4}}\log T)$的分布收敛速率,以凸距离度量。我们的鞅界具有广泛的适用性,我们对TD学习的分析为RL算法的统计推断提供了新的见解,弥合了经典随机逼近理论与现代RL应用之间的差距。

英文摘要

We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains. We apply these results to analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximations, a widely used method for policy evaluation in Reinforcement Learning (RL), obtaining a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. Our martingale bounds are of broad applicability, and our analysis of TD learning provides new insights into statistical inference for RL algorithms, bridging gaps between classical stochastic approximation theory and modern RL applications.
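For readers unfamiliar with the estimator being analyzed, here is a minimal sketch of TD(0) with linear function approximation on a Markov reward process. The iterate averaging, the constant step size, the two-state chain, and all names (`td0_linear`, `features`, `transition`) are assumptions chosen for illustration; the abstract does not specify these details.

```python
import random

def td0_linear(features, transition, rewards, gamma=0.9, steps=50000,
               step_size=0.05, seed=0):
    """TD(0) with linear values V(s) = theta . features[s]; returns the averaged theta."""
    rng = random.Random(seed)
    states = range(len(transition))
    dim = len(features[0])
    theta = [0.0] * dim
    theta_bar = [0.0] * dim  # running average of the iterates (assumed variant)
    s = 0
    for t in range(1, steps + 1):
        # Sample the next state from the chain, giving time-correlated data.
        s_next = rng.choices(states, weights=transition[s])[0]
        phi, phi_next = features[s], features[s_next]
        v = sum(a * b for a, b in zip(theta, phi))
        v_next = sum(a * b for a, b in zip(theta, phi_next))
        # Temporal-difference error for the observed transition.
        delta = rewards[s] + gamma * v_next - v
        theta = [th + step_size * delta * p for th, p in zip(theta, phi)]
        theta_bar = [tb + (th - tb) / t for tb, th in zip(theta_bar, theta)]
        s = s_next
    return theta_bar

# Two-state chain with one-hot features, so theta directly holds state values.
features = [[1.0, 0.0], [0.0, 1.0]]
transition = [[0.5, 0.5], [0.5, 0.5]]
rewards = [1.0, 0.0]
theta_hat = td0_linear(features, transition, rewards)
```

For this chain the Bellman equation gives exact values of 5.5 and 4.5, so `theta_hat` can be compared against a known target when experimenting with step sizes or run lengths.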

2502.01476 2026-05-22 cs.LG cs.NA math.NA physics.comp-ph 版本更新

Neuro-Symbolic AI for Analytical Solutions of Differential Equations

神经符号AI用于微分方程的解析解

Orestis Oikonomou, Levi Lingsch, Dana Grund, Siddhartha Mishra, Georgios Kissas

发表机构 * Seminar for Applied Mathematics, ETH Zurich, Switzerland(应用数学研讨会,苏黎世联邦理工学院,瑞士) ETH AI Center, Zurich, Switzerland(苏黎世联邦理工学院人工智能中心,瑞士) IBM Research Europe, Zurich, Switzerland(IBM欧洲研究院,苏黎世,瑞士) Institute for Atmospheric and Climate Science, ETH Zurich, Switzerland(大气与气候科学研究所,苏黎世联邦理工学院,瑞士) Swiss Data Science Center, ETH Zurich, Switzerland(瑞士数据科学中心,苏黎世联邦理工学院,瑞士)

AI总结 本文提出SIGS神经符号框架,通过上下文无关文法生成数学上有效且物理上有意义的构建块,并结合用户指定的Ansatz进行组合,嵌入到拓扑正则化的连续潜在流形中,通过两阶段搜索发现解析解,提高了微分方程解析解的准确性和效率。

Comments Updates the method and added extra results

详情
AI中文摘要

微分方程的解析解提供精确且可解释的洞察,但很少有可用,因为发现它们需要专家直觉或穷举组合空间。我们引入SIGS,一种用于方程驱动的闭式解发现的神经符号框架。SIGS使用上下文无关文法生成数学上有效且物理上有意义的构建块,结合用户指定的Ansatz来组合这些块,将其嵌入到拓扑正则化的连续潜在流形中,并通过两个阶段在该流形上进行搜索:结构选择后通过梯度下降进行系数细化,仅根据PDE残差和指定的边界和初始条件评分候选。这种设计将符号推理与数值优化统一起来;文法约束候选解块为正确,而潜在搜索使探索变得可行且数据无关。SIGS是首个神经符号方法,能够(i)恢复耦合非线性PDE系统的解析解,(ii)当文法缺乏自然原始元时发现等价的符号形式,(iii)为缺乏已知闭式解的PDE产生准确的符号近似。总体而言,SIGS在标准PDE基准测试中,在准确性和运行时间上都比现有符号方法提高了多个数量级。

英文摘要

Analytical solutions to differential equations offer exact, interpretable insight but are rarely available because discovering them requires expert intuition or exhaustive search of combinatorial spaces. We introduce SIGS, a neuro-symbolic framework for equation-driven closed-form solution discovery. SIGS uses a context-free grammar to generate mathematically valid and physically meaningful building blocks, with a user-specified Ansatz prescribing how these blocks combine, embeds them into a topology-regularised continuous latent manifold, and searches this manifold in two stages: structure selection followed by coefficient refinement using gradient descent, scoring candidates only against the PDE residual and prescribed boundary and initial conditions. This design unifies symbolic reasoning with numerical optimization; the grammar constrains candidate solution blocks to be proper by construction, while the latent search makes exploration tractable and data-free. SIGS is the first neuro-symbolic method to (i) recover analytical solutions for coupled nonlinear PDE systems, (ii) discover equivalent symbolic forms when the grammar lacks the natural primitives, and (iii) produce accurate symbolic approximations for PDEs lacking known closed-form solutions. Overall, SIGS improves over existing symbolic methods by orders of magnitude in both accuracy and runtime across standard PDE benchmarks.

2410.19787 2026-05-22 cs.CV cs.LG 版本更新

Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

利用多时相哨兵1和2卫星数据进行叶面积指数估计的深度学习方法

Clement Wang, Antoine Debouchage, Valentin Goldité, Aurélien Wery, Jules Salzinger

发表机构 * Austrian Institute of Technology - Vienna, Austria(奥地利技术研究所-维也纳,奥地利)

AI总结 本文提出了一种基于多时相哨兵1雷达数据和哨兵2多谱段数据的深度学习方法,用于像素级叶面积指数预测,通过多U-Net网络结构和共同潜在空间实现不同输入模态的互补信息融合,最终在公开数据上取得了0.06 RMSE和0.93 R2分数。

详情
Journal ref
Proc. 2023 Conference on Big Data from Space (BiDS'23), Publications Office of the European Union, Luxembourg, 2023
AI中文摘要

叶面积指数(LAI)是理解生态系统健康和植被动态的关键参数。在本文中,我们提出了一种新的像素级LAI预测方法,通过利用多时间戳的哨兵1雷达数据和哨兵2多谱段数据的互补信息。我们的方法基于多个针对此任务定制的多U-Net深度神经网络。为处理不同输入模态的复杂性,该方法由多个预先训练的模块组成,以在共同的潜在空间中表示所有输入数据。然后,我们通过一个共同的解码器进行端到端微调,该解码器还考虑了季节性因素,我们发现季节性在其中起重要作用。我们的方法在公开可用数据上实现了0.06 RMSE和0.93 R2分数。我们的贡献可在https://github.com/valentingol/LeafNothingBehind上获得,供未来工作进一步改进当前进展。

英文摘要

The Leaf Area Index (LAI) is a critical parameter to understand ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it is comprised of several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R2 score on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind for future works to further improve on our current progress.

2410.18151 2026-05-22 cs.SD cs.LG cs.MM eess.AS 版本更新

Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Music102: 用于和弦进行伴奏的 $D_{12}$-等变Transformer

Weiliang Luo

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出Music102,一种基于群论和音乐结构的等变Transformer,用于提升和弦进行伴奏的质量,通过整合移调和反射等音乐对称性,改进了非等变Transformer原型Music101的性能。

Comments 10 pages, 3 figures

详情
Journal ref
Proceedings of the 2025 International Computer Music Conference (https://hdl.handle.net/2027/fulcrum.zg64tq53m)
AI中文摘要

我们提出了Music102,一种先进的模型,旨在通过$D_{12}$-等变Transformer增强和弦进行伴奏。受群论和符号音乐结构的启发,Music102利用音乐对称性(如移调和反射操作),将这些性质整合到Transformer架构中。通过编码先验音乐知识,模型在旋律和和弦序列上保持等变性。使用POP909数据集训练和评估Music102,结果显示其在加权损失和精确准确度指标上均显著优于非等变的Music101原型,尽管参数更少。这项工作展示了自注意力机制和层归一化在离散音乐领域中的适应性,解决了计算音乐分析中的挑战。凭借其稳定且灵活的神经框架,Music102为等变音乐生成和计算作曲工具的进一步探索奠定了基础,将数学理论与实际音乐表演相结合。

英文摘要

We present Music102, an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry--such as transposition and reflection operations--integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over the non-equivariant Music101 prototype in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.
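To make the symmetry concrete, the sketch below assumes $D_{12}$ acts on a 12-entry pitch-class (chroma) vector, with transposition as a cyclic shift and reflection as pitch-class inversion about C. The function names and the toy `smooth` layer are hypothetical and are not taken from the Music102 code; they only show what equivariance means for such a representation.

```python
def transpose(chroma, k=1):
    """Move every pitch class up by k semitones (cyclic rotation of the vector)."""
    return [chroma[(i - k) % 12] for i in range(12)]

def reflect(chroma):
    """Invert about C: pitch class p is sent to (-p) mod 12."""
    return [chroma[(-i) % 12] for i in range(12)]

def smooth(chroma):
    """Toy equivariant layer: average each pitch class with its semitone neighbours."""
    return [(chroma[(i - 1) % 12] + chroma[i] + chroma[(i + 1) % 12]) / 3
            for i in range(12)]

def pitch_classes(chroma):
    """Indices of the active pitch classes, for readable inspection."""
    return [i for i, v in enumerate(chroma) if v]

c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # C, E, G

d_major = transpose(c_major, 2)  # pitch classes 2, 6, 9
f_minor = reflect(c_major)       # pitch classes 0, 5, 8

# Equivariance: applying the group action before or after the layer agrees.
commutes_with_transposition = smooth(transpose(c_major, 5)) == transpose(smooth(c_major), 5)
commutes_with_reflection = smooth(reflect(c_major)) == reflect(smooth(c_major))
```

A layer that mixes pitch classes with position-dependent weights would generally fail both checks, which is the kind of constraint an equivariant architecture builds in by design.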

2408.13002 2026-05-22 cs.LG 版本更新

Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

用置信度衡量异质处理效应中的变量重要性

Joseph Paillard, Angel Reyero Lobo, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann

发表机构 * Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland Université Paris-Saclay, Inria, CEA, Palaiseau, France

AI总结 本文提出PermuCATE算法,用于在估计条件平均处理效应时进行统计严谨的全局变量重要性评估,通过理论分析和实证研究证明其比LOCO方法具有更低的方差,从而提高统计功效,适用于生物医学应用中的有限数据环境。

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47456-47477, 2025
AI中文摘要

因果机器学习在从复杂数据中估计个体处理效应方面具有潜力。为了成功应用于现实世界,获得可靠见解以确定哪些变量驱动对治疗的异质反应至关重要。我们提出PermuCATE,一种基于条件排列重要性(CPI)方法的算法,用于统计严谨地评估条件平均处理效应(CATE)估计中的变量重要性。有限样本情况的理论分析和实证研究显示,PermuCATE比留一协变量法(LOCO)参考方法具有更低的方差,并提供可靠的变量重要性度量。这一特性提高了统计功效,这对于生物医学应用中常见的有限数据环境中的因果推断至关重要。我们通过模拟和真实世界健康数据集实证展示了PermuCATE的优势,包括具有多达数百个相关变量的设置。

英文摘要

Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.

2209.03358 2026-05-22 cs.NE cs.AI cs.CR cs.CV cs.LG 版本更新

Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples

攻击尖峰:关于脉冲神经网络对抗示例的转移性和安全性

Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen

发表机构 * Lehigh University(理海大学) University of Minnesota Twin Cities(明尼苏达大学双城分校) North Carolina State University(北卡罗来纳州立大学) University of Rhode Island(罗德岛大学) Northeastern University(东北大学)

AI总结 本文研究了脉冲神经网络(SNN)在对抗示例中的鲁棒性,揭示了对抗攻击的转移性,并提出了混合动态脉冲估计(MDSE)攻击方法,以提高SNN和非SNN模型的对抗示例生成效果。

Comments Accepted manuscript. Published in *Neurocomputing*, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506

详情
Journal ref
Neurocomputing, Volume 656, 2025, 131506
AI中文摘要

脉冲神经网络(SNNs)因其高能效和最近在分类性能上的进展而受到广泛关注。然而,与传统深度学习方法不同,SNN对对抗示例的鲁棒性研究仍相对薄弱。在本文中,我们通过三个贡献推进了SNN的对抗攻击研究。首先,我们表明对SNN的成功白盒对抗攻击高度依赖于底层的替代梯度估计器,即使对于对抗训练的SNN也是如此。其次,使用最佳的单一替代梯度估计器,我们分析了对抗攻击在SNN、视觉Transformer(ViTs)和CNN之间的可转移性。我们的分析揭示了两个关键差距:现有的白盒攻击没有利用多个替代梯度估计器来攻击SNN,且没有单个模型攻击能够可靠地生成同时欺骗SNN和非SNN模型的对抗示例。作为我们的第三个贡献,我们开发了混合动态脉冲估计(MDSE)攻击来解决这些问题。MDSE使用动态梯度估计方案,充分利用多个替代梯度估计器函数,生成能够同时欺骗SNN和非SNN模型的对抗示例。MDSE在SNN/ViT模型集合上比传统白盒攻击如Auto-PGD有效多达91.4%,在对抗训练的SNN集合上提供了3倍的提升。实验覆盖了三个数据集(CIFAR-10、CIFAR-100、ImageNet)和十九个分类器模型(每个CIFAR数据集七个,ImageNet五个)。我们的MDSE实现和评估的模型在https://github.com/nuoxuxxx/attacking-the-spike-mdse上公开可用。

英文摘要

Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning approaches, the study of SNN robustness to adversarial examples remains relatively underdeveloped. In this work, we advance the adversarial attack side of SNNs through three contributions. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient estimator, even for adversarially trained SNNs. Second, using the best single surrogate gradient estimator, we analyze the transferability of adversarial attacks across SNNs, Vision Transformers (ViTs) and CNNs. Our analysis reveals two key gaps: no existing white-box attack exploits multiple surrogate gradient estimators for SNNs, and no single-model attack reliably generates adversarial examples that simultaneously fool both SNN and non-SNN models. For our third contribution, we develop the Mixed Dynamic Spiking Estimation (MDSE) attack to address these issues. MDSE uses a dynamic gradient estimation scheme to fully exploit multiple surrogate gradient estimator functions and generates adversarial examples capable of fooling SNN and non-SNN models simultaneously. MDSE is up to 91.4% more effective on SNN/ViT model ensembles and provides a 3x boost on adversarially trained SNN ensembles compared to conventional white-box attacks like Auto-PGD. Experiments cover three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifier models (seven per CIFAR dataset, five for ImageNet). Our implementation of MDSE and the evaluated models is publicly available at https://github.com/nuoxuxxx/attacking-the-spike-mdse.

2605.22083 2026-05-22 cs.SD cs.LG eess.AS 版本更新

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

RobustSpeechFlow: 通过基于增强的对比流匹配学习鲁棒的文本到语音轨迹

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

发表机构 * Supertone Inc(Supertone公司) Independent Researcher(独立研究者)

AI总结 本文提出RobustSpeechFlow,一种通过引入长度保持重复和跳过潜在增强来改进对齐鲁棒性的训练策略,从而在无需外部对齐器或偏好数据的情况下,直接惩罚现实中的失败模式,并能无缝集成到现有流程中,实验表明其在文本到语音任务中显著提升了语音质量与鲁棒性。

Comments Submitted to INTERSPEECH 2026

详情
AI中文摘要

尽管流匹配文本到语音(TTS)在零样本说话人相似性和自然度方面表现强劲,但仍易受内容保真度问题影响,特别是由于不完美的对齐导致的跳过和重复错误。我们提出了RobustSpeechFlow,一种训练策略,通过扩展对比流匹配,引入长度保持重复和跳过潜在增强来提高对齐鲁棒性。该方法无需外部对齐器或偏好数据,直接惩罚现实中的失败模式,并能无缝集成到现有流程中。在Seed-TTS-eval上,仅使用0.06B参数,其将词错误率(WER)从1.44降至1.38。在我们的ZERO500基准测试中,它在多样化的说话人和语调条件下实现了稳定的可理解性提升;在NFE=24时,其将英文字符错误率(CER)从0.48%降至0.35%,将韩文CER从0.81%降至0.57%。音频样本:https://robustspeechflow.github.io/

英文摘要

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
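The WER and CER figures quoted above are edit-distance metrics. As a reminder of how they are conventionally computed, the sketch below uses the standard Levenshtein definition (substitutions, deletions, and insertions divided by reference length). The paper's exact evaluation pipeline, including its speech recognizer and text normalization, is not described in the abstract, so the helper names and example sentences here are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def error_rate(reference, hypothesis, unit="word"):
    """WER when unit == 'word', CER when unit == 'char'."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

reference = "the cat sat on the mat"
skip_wer = error_rate(reference, "the cat on the mat")            # one skipped word
repeat_wer = error_rate(reference, "the cat sat sat on the mat")  # one repeated word
```

A skipped word counts as a deletion and a repeated word as an insertion, so both failure modes named in the abstract raise the error rate by the same amount per affected word.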

2605.22075 2026-05-22 cs.LG q-bio.QM 版本更新

Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes

呼吸生物标志物能否因果影响血糖?探讨VOC介导的糖尿病调节

Varsha Sharma, Prasanta K. Guha, Avik Ghose

发表机构 * TCS Research(TCS研究) Department of E&ECE, IIT Kharagpur(印度理工学院Kharagpur电子与电气工程系)

AI总结 本研究通过非侵入式数据驱动框架,利用挥发性有机化合物(VOCs)和生活方式变量识别糖尿病高风险个体,采用因果推断技术估计VOCs如丙酮、异丙醇、异戊二烯和乙醇对血糖水平的影响,并设计分类器区分糖尿病患者与非糖尿病患者,建立基于风险的排名系统,并使用高斯混合模型识别自然聚类。

详情
Journal ref
Proceedings of the IJCAI workshop on Advanced Neural Systems for Next-Generation Biomedical Intelligence, 2025
AI中文摘要

糖尿病是一种全球健康负担,早期检测对于及时干预至关重要。本研究探讨了一种非侵入式、数据驱动的框架,利用挥发性有机化合物(VOCs)和生活方式变量识别糖尿病高风险个体。我们使用因果推断技术估计丙酮、异丙醇、异戊二烯和乙醇等VOCs对血糖水平的影响。此外,我们设计了一个分类器,利用非侵入式标志物区分糖尿病患者和非糖尿病患者。我们为“灰色区域”中的个体建立了基于风险的排名系统,并使用高斯混合模型识别人群中的自然聚类。我们的结果表明,特定的VOCs对血糖水平表现出强因果影响,且机器学习模型能够可靠地分类和分层高风险个体。这种集成的因果-可解释分析可以支持非侵入式糖尿病早期筛查工具的开发。

英文摘要

Diabetes is a global health burden, and early detection is critical for timely intervention. This study explores a non-invasive, data-driven framework to identify individuals at risk of diabetes using Volatile Organic Compounds (VOCs) and lifestyle variables. We use causal inference techniques to estimate the impact of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels. Additionally, we designed a classifier to distinguish diabetics from non-diabetics using non-invasive markers. We created a risk-based ranking system for individuals in the "gray zone," and identified natural clusters in the population using a Gaussian Mixture Model. Our results suggest that specific VOCs exhibit a strong causal influence on glucose levels and that machine learning models can reliably classify and stratify individuals at high risk. This integrated causal-explainable analysis can support the development of tools for non-invasive early screening of diabetes.

2605.22074 2026-05-22 cs.LG cs.AI cs.CL 版本更新

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

从推理链到可验证子问题:课程强化学习使LLM推理能够进行信用分配

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang

发表机构 * LeapLab, Tsinghua University(清华大学 LeapLab) Qiuzhen College, Tsinghua University(清华大学求真书院)

AI总结 该研究提出SCRL框架,通过从参考推理链中生成可验证子问题,解决LLM推理中信用分配问题,提升了在数学推理任务中的性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在LLM推理中展现出强大潜力,但基于结果的RLVR在难题上效率低下,因为得到正确最终答案的rollout很少,且样本层面的信用分配无法利用失败尝试中的部分进展。我们引入SCRL(子问题课程强化学习),一种课程强化学习框架,它从参考推理链中推导出可验证的子问题,并将最后一个子问题固定为原始问题,从而把难题上的部分进展转化为可验证的学习信号。算法上,SCRL采用子问题层面的归一化:在每个子问题位置独立归一化奖励,并将得到的优势分配给相应的答案片段,从而在不依赖外部评分标准或奖励模型的情况下实现更细粒度的信用分配。我们的分析表明,子问题课程能将难题拉出梯度死区,且原始问题越难,相对收益越大。在七个数学推理基准上,SCRL超越了强大的课程学习基线,相比GRPO,平均准确率在Qwen3-4B-Base上提高4.1个点,在Qwen3-14B-Base上提高1.9个点。在AIME24、AIME25和IMO-Bench上,SCRL进一步将Qwen3-4B-Base的pass@1提高3.7个点、pass@64提高4.6个点,表明其在困难推理问题上的探索能力更强。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
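The subproblem-level normalization described above can be illustrated with a minimal sketch. It assumes a reward matrix with one row per rollout and one column per subproblem position, binary verifier rewards, and half-open token spans for each answer; the function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from statistics import mean, pstdev

def subproblem_advantages(rewards, eps=1e-6):
    """rewards[i][k] is the verifier reward of rollout i on subproblem k.

    Each column (subproblem position) is normalized independently,
    so a rollout can earn credit on early subproblems even when it
    fails the final one.
    """
    n_rollouts, n_sub = len(rewards), len(rewards[0])
    adv = [[0.0] * n_sub for _ in range(n_rollouts)]
    for k in range(n_sub):
        column = [rewards[i][k] for i in range(n_rollouts)]
        mu, sigma = mean(column), pstdev(column)
        for i in range(n_rollouts):
            adv[i][k] = (column[i] - mu) / (sigma + eps)
    return adv

def token_advantages(adv_row, spans, n_tokens):
    """Copy each subproblem advantage onto the tokens of its answer span."""
    out = [0.0] * n_tokens
    for a, (start, end) in zip(adv_row, spans):
        for t in range(start, end):
            out[t] = a
    return out

# Four rollouts, three subproblems; the last column is the original problem.
rewards = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
]
adv = subproblem_advantages(rewards)
tokens = token_advantages(adv[0], spans=[(0, 2), (2, 5), (5, 6)], n_tokens=6)
```

In this toy batch no rollout solves the original problem, so the last column gives zero advantage everywhere, which is the outcome-only failure case the abstract describes; the first two columns still separate better rollouts from worse ones.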

2605.22055 2026-05-22 cs.LG cs.AI 版本更新

Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series

基于原型的分类子任务解耦框架:提升多变量时间序列的泛化能力与可解释性

Xianhao Song, Yuang Zhang, Yuqi She, Liping Wang, Xuemin Lin

发表机构 * East China Normal University(华东师范大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出PDFTime框架,通过多阶段决策过程解耦时间序列分类任务,提升模型的泛化能力和可解释性,实现了在UEA和UCR基准测试中的最优性能。

详情
AI中文摘要

时间序列分类(TSC)是一个长期存在的研究问题,近年来随着大规模时间数据的快速增长而受到越来越多的关注。尽管深度学习带来了显著进展,但设计出既准确又可解释的TSC模型仍然是一个具有挑战性的任务。许多现有方法采用直接的特征到标签分类范式:通过单一线性投影(通常在全局池化之后)将高维时间嵌入压缩为类别logits,这种范式把特征提取和决策逻辑合并为不可分割的映射。为了解决这些限制,我们提出了PDFTime,一个基于原型的框架,将时间序列分类重新表述为多阶段决策过程。不同于直接的特征到标签映射,PDFTime利用学习到的原型来近似潜在空间中的类别条件特征分布,通过不同粒度的分类子任务实现逐步辨别。据我们所知,PDFTime是第一个将时间序列分类重新表述为解耦的、多阶段的、基于相似性的推理过程的框架,打破了长期以来直接、黑箱的特征到标签映射范式。广泛的评估表明,PDFTime在UEA和UCR基准测试中实现了最先进的性能。值得注意的是,它在UCR档案库的128个数据集中有80个取得了top-1准确率,在一致性和泛化性上显著优于近期的强基线方法。

英文摘要

Time Series Classification (TSC) is a long-standing research problem that has gained increasing attention in recent years with the rapid growth of large-scale temporal data. Despite substantial progress enabled by deep learning, designing TSC models that are both accurate and interpretable remains a challenging task. Many existing approaches adopt a direct feature-to-label classification paradigm: by collapsing high-dimensional temporal embeddings into class logits via a single linear projection (often after global pooling), this paradigm conflates feature extraction and decision logic into an inseparable mapping. To address these limitations, we propose PDFTime, a prototype-guided framework that reformulates time series classification as a multi-stage decision process. Instead of direct feature-to-label mapping, PDFTime leverages learned prototypes to approximate class-conditional feature distributions in the latent space, enabling progressive discrimination through classification sub-tasks of varying granularity. To our knowledge, PDFTime is the first framework to reformulate time series classification as a decoupled, multi-stage similarity-based reasoning process, breaking the long-standing paradigm of direct, black-box feature-to-label mapping. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art (SOTA) performance across UEA and UCR benchmarks. Notably, it secures the top-1 accuracy on 80 out of 128 datasets in the UCR archive, significantly outperforming recent strong baselines in both consistency and generalization.

2605.22054 2026-05-22 cs.LG cs.AI 版本更新

LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation

LABO: 通过广泛探索和选择性实验实现的LLM加速贝叶斯优化

Zhuo Chen, Xinzhe Yuan, Jianshu Zhang, Jinzong Dong, Ruichen Zhou, Yingchun Niu, Tianhang Zhou, Yu Yang Fredrik Liu, Yuqiang Li, Nanyang Ye, Qinying Gu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Mechanical Engineering, Shanghai Jiao Tong University(上海交通大学机械工程学院) Institute for Advanced Study in Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学研究院) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) School of Automation, Central South University(中南大学自动化学院) College of New Energy and Materials, China University of Petroleum, Beijing(中国石油大学(北京)新能源与材料学院) College of Carbon Neutrality Future Technology, China University of Petroleum, Beijing(中国石油大学(北京)碳中和未来技术学院) DeepVerse PTE. LTD.

AI总结 本文提出LABO框架,在单个贝叶斯优化循环中结合LLM预测与实验观测,以提高优化的样本效率;理论分析和实验结果表明其在科学任务中优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

科学探索中的高成本和数据稀缺促使研究者将大型语言模型(LLMs)作为知识驱动的组件用于贝叶斯优化(BO)。然而,现有方法通常将LLMs直接嵌入采样或代理建模流程,未能充分利用其相对于真实实验显著更低的评估成本。为了解决这一局限,我们提出了LLM加速贝叶斯优化(LABO)框架,在单个BO循环中结合LLM预测与实验观测。LABO采用门控准则来动态平衡对LLM预测和实际实验的依赖。通过利用低成本的LLM评估来广泛探索搜索空间,并仅在高不确定性区域使用昂贵的真实实验,LABO实现了样本效率更高的优化。我们给出了带累积遗憾界的理论分析,形式化地刻画了这一效率增益。在多种科学任务上的实验结果表明,在相同的实验预算下,LABO始终优于现有方法。我们的结果表明,LABO为将LLMs整合到科学发现流程中提供了一种实用且有理论依据的方法。

英文摘要

The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating criterion to dynamically balance the reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.
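To make the gating idea concrete, here is a toy sketch of a loop that spends a real experiment only where uncertainty is high and otherwise accepts a cheap LLM prediction. The abstract does not state the actual gating criterion, so the distance-to-nearest-observation proxy, the threshold, and the names `gate` and `run_loop` are all assumptions for illustration.

```python
def gate(x, observed_x, threshold):
    """Return True when x is far from every real observation.

    Distance to the nearest measured point stands in for surrogate
    uncertainty here; this proxy is an assumption, not LABO's criterion.
    """
    if not observed_x:
        return True
    return min(abs(x - o) for o in observed_x) > threshold

def run_loop(candidates, llm_predict, run_experiment, threshold):
    """Evaluate candidates in order, logging which source produced each value."""
    observed_x, log = [], []
    for x in candidates:
        if gate(x, observed_x, threshold):
            y = run_experiment(x)          # costly, trusted
            observed_x.append(x)
            log.append((x, y, "experiment"))
        else:
            y = llm_predict(x)             # cheap, approximate
            log.append((x, y, "llm"))
    return log

def true_objective(x):
    return -(x - 0.3) ** 2

def llm_guess(x):
    # A slightly biased stand-in for an LLM prediction.
    return -(x - 0.35) ** 2

log = run_loop([0.0, 0.05, 0.5, 0.52, 1.0], llm_guess, true_objective, threshold=0.2)
sources = [source for _, _, source in log]
```

Only three of the five candidates consume the experimental budget in this run; the two that fall close to an existing measurement are served by the cheap predictor.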

2605.22043 2026-05-22 cs.LG 版本更新

CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification

CASE-NET:通过因果注意力和通道重校准进行多变量时间序列分类的深度时空表示学习

Fan Zhang, Yating Cui, Hua Wang

发表机构 * Shandong Technology and Business University(山东技术与商业大学) Ludong University(鲁东大学)

AI总结 本文提出CASE-NET,通过因果注意力和通道重校准模块,解决多变量时间序列分类中时空表示不准确的问题,实现在四个任务上达到新的最先进基准,最高准确率达98.6%。

Comments 9 pages, 6 figures, 2 tables

详情
AI中文摘要

多变量时间序列(MTS)分类是普适计算和金融分析的基础,但现有多尺度方法常受限于表示保真度不足。我们识别出两个关键瓶颈:标准编码器中的时间非因果性导致非平稳动态中的时间混淆,以及缺乏显式通道重要性机制导致噪声污染潜在空间。为解决这些挑战,我们提出因果注意力和时空编码器网络(CASE-NET),一种用于结构流形预条件的架构。CASE-NET结合了因果时间编码器,通过掩码自注意力和因果卷积强制物理时间箭头约束,以及适应性通道重校准模块,作为信息瓶颈以抑制有害噪声。在六个异质领域上的全面评估表明,CASE-NET在四个任务上建立了新的最先进基准,达到AWR数据集上的最高准确率98.6%,并在非平稳环境中表现出卓越的鲁棒性。

英文摘要

Multivariate time series (MTS) classification is foundational to pervasive computing and financial analysis, yet existing multi-scale paradigms are often constrained by suboptimal representation fidelity. We identify two critical bottlenecks: temporal non-causality in standard encoders that induces temporal confounding in non-stationary dynamics, and the absence of explicit channel saliency mechanisms that allows noise to contaminate the latent space. To address these challenges, we propose the Causal Attention and Spatio-temporal Encoder Network (CASE-NET), an architecture designed for structural manifold pre-conditioning. CASE-NET synergizes a Causal Temporal Encoder, which enforces physical arrow-of-time constraints via masked self-attention and causal convolutions, with an Adaptive Channel Recalibration module functioning as an information bottleneck to suppress detrimental noise. Comprehensive evaluations across six heterogeneous domains demonstrate that CASE-NET establishes new state-of-the-art benchmarks on four tasks, achieving a peak accuracy of 98.6% on the AWR dataset and superior robustness in non-stationary regimes.
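The arrow-of-time constraint enforced by masked self-attention can be sketched as follows: each time step may attend only to itself and earlier steps. This is the generic causal mask, written in plain Python for readability; it is not CASE-NET's encoder, and the function name and toy scores are assumptions.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s] with future keys (s > t) masked out."""
    n_steps = len(scores)
    weights = []
    for t in range(n_steps):
        visible = scores[t][: t + 1]              # keys at positions 0..t only
        peak = max(visible)                       # subtract max for stability
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n_steps - t - 1))
    return weights

scores = [
    [0.0, 5.0, 5.0],   # large scores on future steps must be ignored
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
w = causal_attention_weights(scores)
```

The first query puts all its weight on position 0 even though later positions have higher raw scores, which is exactly the leakage the mask prevents.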

2605.22041 2026-05-22 cs.CR cs.LG 版本更新

RADAR: Defending RAG Dynamically against Retrieval Corruption

RADAR:动态防御RAG免受检索污染

Ziyuan Chen, Yueming Lyu, Yi Liu, Weixiang Han, Jing Dong, Caifeng Shan, Tieniu Tan

发表机构 * School of Intelligence Science and Technology, Nanjing University, Suzhou, China(南京大学智能科学与技术学院,中国,苏州) City University of Hong Kong, Hong Kong, China(香港城市大学,中国,香港) Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所,北京,中国)

AI总结 RADAR将可靠的上下文选择建模为基于图的能量最小化问题,并用最大流最小割算法精确求解;它通过贝叶斯记忆节点递归更新信念状态,在抵御攻击的稳定性与适应真实知识变化之间取得平衡,在动态数据集上以极小的存储开销实现了优于基线方法的鲁棒性和响应质量。

详情
AI中文摘要

尽管RAG系统越来越多地部署于动态网络搜索,但时间上的波动性加剧了其对对抗性攻击的脆弱性。现有面向静态场景的防御难以应对不断演变的威胁,并且在动态设置中会带来过高的存储成本。我们提出了RADAR,一个将可靠的上下文选择建模为基于图的能量最小化问题的框架,并通过最大流最小割精确求解。通过引入贝叶斯记忆节点,RADAR递归更新信念状态,而不是归档原始历史文档,从而在抵御攻击的稳定性与适应真实知识变化的能力之间取得平衡。在一个新的动态数据集上的实验表明,与基线方法相比,RADAR以极小的存储开销实现了更优的鲁棒性和响应质量。

英文摘要

While RAG systems are increasingly deployed in dynamic web search, temporal volatility amplifies their vulnerability to adversarial attacks. Existing static-oriented defenses struggle to handle evolving threats and incur prohibitive storage costs in dynamic settings. We propose RADAR, a framework that models reliable context selection as a graph-based energy minimization problem, solved exactly via Max-Flow Min-Cut. By incorporating a Bayesian memory node, RADAR recursively updates a belief state instead of archiving raw historical documents, effectively balancing stability against attacks with adaptability to genuine knowledge shifts. Experiments on a novel dynamic dataset show that RADAR achieves superior robustness and response quality with minimal storage overhead compared to the baselines.
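As a rough illustration of solving a binary selection energy exactly with Max-Flow Min-Cut, the sketch below uses the standard graph-cut construction: each document pays a unary cost for being kept or dropped, and linked documents pay a penalty when their labels disagree. The cost values, the pairwise links, and the function name are hypothetical; RADAR's actual energy terms and its Bayesian memory node are not reproduced here.

```python
from collections import deque

def min_cut_labels(cost_keep, cost_drop, pairwise):
    """Minimize sum of unary costs plus disagreement penalties via min cut.

    Nodes left on the source side of the cut are kept. Returns (keep, energy).
    """
    n = len(cost_keep)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[s][i] = cost_drop[i]   # cut if document i ends on the sink (drop) side
        cap[i][t] = cost_keep[i]   # cut if document i stays on the source (keep) side
    for (i, j), w in pairwise.items():
        cap[i][j] += w
        cap[j][i] += w

    def bfs_parents():
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = bfs_parents()
        if parent[t] == -1:
            break                          # no augmenting path: flow is maximal
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck        # push flow along the path
            cap[v][u] += bottleneck        # and open the residual edge
            v = u
        flow += bottleneck
    keep = [parent[i] != -1 for i in range(n)]  # reachable from source in residual graph
    return keep, flow

# Document 2 looks corrupted: keeping it is expensive, dropping it is cheap.
keep, energy = min_cut_labels(
    cost_keep=[1, 1, 5],
    cost_drop=[4, 4, 1],
    pairwise={(0, 1): 2, (1, 2): 1},
)
```

Edmonds-Karp is used only because it is short; any max-flow routine yields the same minimum energy.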

2605.22013 2026-05-22 cs.CV cs.GR cs.LG 版本更新

PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

PointLLM-R: 通过链式推理增强3D点云推理

Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang

发表机构 * Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE), Shenzhen University, China(深圳大学计算机与软件学院可视计算研究中心)

AI总结 本文提出了一种以数据为中心的框架,用于构建大规模思维链(CoT)监督,以改进3D点云理解。通过两阶段流程改进点云-文本指令数据并合成高质量推理路径,构建了包含55K样本的PoCoTI数据集,并据此微调得到具备推理能力的3D多模态语言模型PointLLM-R;实验表明其在生成式3D分类和描述任务中达到最先进的性能。

详情
AI中文摘要

通过语言理解3D点云仍然是计算机图形学和视觉计算中的一个基本挑战,原因在于点云数据结构不规则,且现有3D多模态模型缺乏显式推理。尽管思维链(CoT)推理在LLM和基于图像的MLLM中已表现出很强的有效性,但其在3D理解中的扩展仍鲜有探索。本文提出了一种以数据为中心的框架,用于构建面向3D点云理解的大规模CoT监督。该框架由一个两阶段流程组成:首先通过基于视觉语言模型的质量评估和参考引导的精修来改进点云-文本指令数据,然后通过人在回路提示优化(HiLPO)合成高质量的推理路径。使用这种方法,我们构建了PoCoTI,一个包含55K样本、带有显式推理路径的CoT增强点云-文本指令遵循数据集。在PoCoTI上微调PointLLM,得到具备推理能力的3D多模态语言模型PointLLM-R。在生成式3D分类和描述任务上的大量实验表明,PointLLM-R达到了最先进的性能,并能稳健地泛化到真实世界的扫描点云和多轮对话场景。

英文摘要

Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.

2605.22010 2026-05-22 stat.ML cs.LG 版本更新

Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

浅层神经网络中时间一致的弱混沌传播

Margalit Glasgow, Joan Bruna

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Courant Institute School of Mathematics, Computing and Data Science(Courant研究所数学、计算与数据科学学院) New York University(纽约大学)

AI总结 本文研究了在特征学习机制下用梯度下降训练的单隐层神经网络,将有限宽度网络的输出与其无限宽度对应物联系起来,并利用平均场动力学的收敛速率建立了对时间一致成立的弱混沌传播界。

Comments 46 pages

详情
AI中文摘要

我们考虑在特征学习机制下用梯度下降训练的单隐层神经网络,并将有限宽度网络的输出$f_{\hatρ_t^m}$与其无限宽度对应物$f_{ρ_t^{MF}}$联系起来,后者按平均场动力学演化。虽然通过标准的Grönwall估计可以得到常数时间范围内$\|f_{ρ_t^{MF}} - f_{\hatρ_t^m}\|$的界,但波动的长时间行为则更为微妙。时间一致的界通常依赖于损失景观中的(局部)强凸性,或带噪梯度动力学中成立的对数Sobolev不等式。在本文中,我们转而利用平均场确定性Wasserstein梯度流动力学的收敛速率,建立了对时间一致成立的非渐近弱混沌传播。具体来说,记$L_t$为时间$t$处的平均场超额均方误差损失,$m$为神经元数量,在标准正则性假设和条件$\int_0^\infty L_t^{1/2} dt =O(\log d)$下,只要$L_t \lesssim t^{-c}$,我们就得到时间一致的界$\|f_{ρ_t^{MF}}- f_{\hatρ_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$。我们的结果在无噪声设定下成立,不对最优解附近景观的几何性质作任何假设,并可无缝推广到其他形式的离散化,包括有限样本数和时间离散化。一个关键结论是:只要平均场总体损失动力学的收敛速率快于$t^{-2}$,我们仅需$\text{poly}(d/ε)$个神经元、训练样本和梯度下降步数即可达到损失$ε$。

英文摘要

We consider one-hidden-layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hatρ_t^m}$ to its infinite-width counterpart $f_{ρ_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $\|f_{ρ_t^{MF}} - f_{\hatρ_t^m}\|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or Logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2} dt =O(\log d)$, we obtain the uniform-in-time bound $\|f_{ρ_t^{MF}}- f_{\hatρ_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including a finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $ε$ with only $\text{poly}(d/ε)$ neurons, training samples, and GD steps.

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG 版本更新

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

从TF-IDF到Transformer:一种比较和集成的方法用于情感分类

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

发表机构 * School of Computer Engineering KIIT Deemed to be University(计算机工程学院 KIIT 被认定大学)

AI总结 本文比较了多种机器学习模型,包括Naive Bayes、逻辑回归、SVM、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT,旨在对电影评论进行情感分类,并发现RoBERTa在准确率上表现最佳,同时集成所有模型的软投票方法进一步提升了分类性能。

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

详情
AI中文摘要

情感分析,也称为观点挖掘,主要试图从任何基于文本的数据中提取观点。在电影评论和评论员的背景下,情感分析可以成为预测电影评论总体是积极还是消极的有用工具。对于ML模型来说,理解上下文或隐喻性情感可能具有挑战性,因为ML模型主要依赖统计词表示。本文的目标是检验并分类电影评论为积极或消极情感。为此考虑了多种机器学习模型,并运用自然语言处理(NLP)方法进行数据预处理和模型评估。使用IMDb数据集。具体来说,评估了Naive Bayes、逻辑回归、支持向量机(SVM)、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT等模型。经过大量测试,使用准确率、精确率、召回率、F1分数和ROC-AUC后,RoBERTa在所有其他模型之上表现更好,准确率为93.02%。一个结合所有模型的软投票集成方法也提高了分类性能,表明模型集成在情感分析中效果良好。

英文摘要

Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for ML models to understand context or metaphorical sentiment accurately, as they rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After extensive testing on accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
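Soft voting, as mentioned above, averages the class probabilities produced by each model rather than counting hard votes. The sketch below assumes every model outputs a probability that a review is positive; the probabilities are made-up numbers, not results from the paper, and `soft_vote` is a hypothetical helper.

```python
def soft_vote(prob_lists, weights=None):
    """Average P(positive) across models, then threshold at 0.5.

    prob_lists[m][i] is model m's probability that review i is positive.
    """
    n_models = len(prob_lists)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_reviews = len(prob_lists[0])
    avg = [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_reviews)
    ]
    labels = [1 if a >= 0.5 else 0 for a in avg]
    return avg, labels

# Three hypothetical models scoring three reviews.
avg, labels = soft_vote([
    [0.90, 0.40, 0.20],
    [0.60, 0.45, 0.10],
    [0.30, 0.80, 0.40],
])
```

For the second review two of the three models lean negative, so a hard majority vote would say negative, while the confident third model pulls the averaged probability above 0.5. That sensitivity to confidence is the main design difference between soft and hard voting.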

2605.21999 2026-05-22 cs.LG 版本更新

Toward Understanding Adversarial Distillation: Why Robust Teachers Fail

迈向理解对抗蒸馏:为何鲁棒教师失败

Hongsin Lee, Hye Won Chung

发表机构 * School of Electrical Engineering, KAIST, Daejeon, Korea(韩国科学技术院电子工程学院)

AI总结 本文研究了对抗蒸馏中鲁棒教师与学生鲁棒性之间的关系,揭示了教师监督信心与学生表示限制之间的不匹配导致鲁棒过拟合现象,并提出了理论框架和实验验证。

Comments Accepted to ICML 2026. Code is available at https://github.com/HongsinLee/why-robust-teachers-fail

详情
AI中文摘要

对抗蒸馏旨在通过在最小-最大对抗训练框架内利用鲁棒教师的软标签来增强学生的鲁棒性,但其成功却往往不一致:更鲁棒的教师往往无法提升甚至损害学生的鲁棒泛化能力。本文识别了这一教师依赖的关键机制:教师监督信心与学生表示限制在一致训练数据子集上的不匹配——鲁棒不可学集。我们提出了一个理论框架,分析了两层神经网络的特征学习动态,证明这种不匹配导致蒸馏结果的二元性。我们证明当教师在不可学样本上提供自信监督时,会迫使学生记忆虚假噪声模式,最终超过学习的鲁棒信号,从而驱动鲁棒过拟合。相反,教师在这些样本上表现出高不确定性时,会抑制噪声记忆,使学生仅依赖可学习信号进行鲁棒泛化。我们通过合成模拟和真实图像分类数据集验证了我们的理论,确认鲁棒过拟合由教师与不可学样本的交互驱动。最后,我们证明教师在不可学样本上的预测熵是学生鲁棒性的一个强指标,验证了我们的理论框架并提供了鲁棒教师选择的指导原则。

英文摘要

Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.
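The teacher's predictive entropy on a set of samples, which the abstract uses as an indicator, is the Shannon entropy of the teacher's soft labels averaged over that set. A minimal sketch follows; the soft-label values are invented for illustration, and how the paper identifies the unlearnable samples is not shown.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one soft-label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(soft_labels):
    """Average entropy of a teacher's soft labels over a set of samples."""
    return sum(predictive_entropy(p) for p in soft_labels) / len(soft_labels)

# Hypothetical teacher outputs on the same two hard samples (4 classes).
confident_teacher = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
uncertain_teacher = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

h_confident = mean_entropy(confident_teacher)
h_uncertain = mean_entropy(uncertain_teacher)
```

Under the abstract's account, the teacher with the higher average entropy on such samples would be the safer choice, since it pushes the student less toward memorizing noise.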

2605.21994 2026-05-22 cs.LG cs.AI 版本更新

Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Ex-GraphRAG:图增强大语言模型中的可解释证据路由

Yoav Kor Sade, Arvindh Arun, Rishi Puri, Steffen Staab, Maya Bechler-Speicher

发表机构 * Tel Aviv University(特拉维夫大学) Institute for AI, University of Stuttgart(人工智能研究所,斯图加特大学) NVIDIA(英伟达) Meta AI

AI总结 本文提出Ex-GraphRAG,通过引入多变量图神经加法网络(M-GNAN)来解决图增强大语言模型中证据路由的可解释性问题,揭示了语义重要性与结构连通性之间的不匹配,对检索剪枝、上下文构建和失败诊断有重要影响。

详情
AI中文摘要

GraphRAG通过从知识图中检索子图并使用消息传递GNN进行编码,将语言模型置于这些子图上。由于这些编码器通过迭代邻域聚合将节点贡献纠缠在一起,因此无法确定每个检索实体对编码器输出的影响程度,因此无法忠实审计实际到达模型的结构证据。我们引入Ex-GraphRAG,用多变量图神经加法网络(M-GNAN)替代GNN编码器,这是一种扩展到高维嵌入空间的加法图模型,能够精确分解编码器的输出,而无需事后近似。在STaRK-Prime上,这种可审计的编码器与黑盒性能相匹配。利用它审计证据路由,我们发现语义-结构不匹配:主导编码器输出的节点在检索的子图中结构上是断开的,由低贡献的中介节点连接,其移除会使多跳问答性能下降高达28%。这种不匹配对任何不透明编码器都是不可见的,揭示了语义重要性与结构连通性由不同的节点集控制,对图增强大语言模型的检索剪枝、上下文构建和故障诊断有直接的影响。

英文摘要

GraphRAG conditions language models on subgraphs retrieved from knowledge graphs, encoded via message-passing GNNs. Because these encoders entangle node contributions through iterated neighborhood aggregation, there is no closed-form way to determine how much each retrieved entity influenced the encoder's output, and therefore no way to faithfully audit what structural evidence actually reached the model. We introduce Ex-GraphRAG, which replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), an extension of additive graph models to high-dimensional embedding spaces that yields an exact decomposition of the encoder's output across individual nodes and feature groups, without post-hoc approximation. On STaRK-Prime, this auditable encoder matches black-box performance. Using it to audit evidence routing, we uncover a semantic-structural mismatch: the nodes that dominate the encoder's output are structurally disconnected in the retrieved subgraph, held together by low-attribution intermediaries whose removal degrades multi-hop QA by up to 28%. This mismatch, invisible to any opaque encoder, reveals that semantic importance and structural connectivity are governed by disjoint sets of nodes, with direct implications for retrieval pruning, context construction, and failure diagnosis in graph-augmented LLMs.

2605.21993 2026-05-22 cs.AI cs.LG 版本更新

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

ECPO:面向证据认证候选排序的证据耦合策略优化

Miaobo Hu, Shuhao Hu, BoKun Wang, Yina Sa, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文研究了证据认证候选者排序问题,提出了一种名为ECPO的策略优化方法,通过结合排序和证据证书来提升排序效果和证据可靠性。

详情
AI中文摘要

用于决策支持的排序系统不仅应对候选者进行排序,还应给出可独立核验的证据。我们研究证据认证的候选排序:给定一个intent_id、预定义的计划骨架、窗口局部的候选者名单,以及带有跨度来源信息的文本派生候选轨迹,系统必须输出一个Top-K列表以及doc_id:span形式的证据证书,其引用的跨度应足以恢复该决策。我们在MAVEN-ERE和RAMS上实例化了该任务,采用固定的上游抽取、窗口局部的随机化候选标识符、与骨架对齐的轨迹监督、困难负样本和审计参考。我们引入证据耦合策略优化(ECPO),一种列表级策略优化目标,其动作是排序与证据证书构成的联合对象。ECPO首先从骨架对齐、论元一致性以及可选的图特征中学习可解释的轨迹奖励;然后用三个耦合奖励优化一个受约束的策略:列表级排序效用、跨度级证书有效性,以及证据循环奖励,后者由一个无需标签的确定性验证器计算,该验证器根据去除了声明内容的被引跨度重建候选支持。这将目标从仅最大化普通NDCG转变为最大化CertNDCG和决策-证据耦合。评估在封闭名单、预测名单和混合名单三种设置下,将ECPO与以下方法进行比较:零样本、SFT和GRPO策略,仅使用RM打分并附加确定性证据的方法,语法/JSON约束解码,验证器重试,best-of-N RM选择,以及事后证据合理化。

英文摘要

Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.

2605.21975 2026-05-22 cs.LG 版本更新

Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

通过可验证的预测动作进行推理:面向金融大语言模型的一致性导向强化学习

Jialin Chen, Aosong Feng, Harshit Verma, Siyi Gu, Haiwen Wang, Ali Maatouk, Yixuan He, Yifeng Gao, Leandros Tassiulas, Rex Ying

发表机构 * Yale University(耶鲁大学) University of Texas Rio Grande Valley(德克萨斯大学里奥格兰德河谷分校) Arizona State University(亚利桑那州立大学)

AI总结 本文提出StockR1,一种结合时间序列的LLM,通过可验证的预测动作统一股票预测与金融推理,利用强化学习优化整个流程,提升金融问答和股票预测的准确性。

详情
AI中文摘要

金融市场的特点是极端的非平稳性、低信噪比,以及对新闻、公司基本面和宏观经济信号等外部信息的强依赖。然而,现有方法要么将时间序列抽象为文本,要么将预测与基于语言的推理解耦,导致定性推理与定量结果之间存在根本性不匹配。为此,我们引入StockR1,一种时间序列增强的LLM,通过可验证的预测动作统一股票预测与金融推理。基于工具调用设计,模型首先发出预测动作,即对其定性市场展望的结构化、可解释表示;随后调用一个以该动作为条件的时间序列解码器,生成分布式的未来轨迹,从而支持更有依据的问答和金融推理。我们用强化学习优化整个流程,其奖励共同反映答案的有效性、预测的准确性,以及生成的动作与观察到的时间序列动态之间的一致性。此外,奖励由样本级不确定性标量重新加权,鼓励模型适应市场动态中不断变化的不确定性。我们在一个大规模的10年基准上评估StockR1的金融问答和股票预测能力。我们的方法始终优于时间序列基线和通用LLM,将推理准确率提高了17.7%(4B)和25.9%(8B)。这些发现表明,对预测动作进行结构化能在语言推理和时间预测之间建立强大的协同效应,使LLM能够通过可验证、可解释且有数值依据的决策进行推理。

英文摘要

Financial markets are characterized by extreme non-stationarity, low signal-to-noise ratios, and strong dependence on external information such as news, company fundamentals, and macroeconomic signals. Yet, existing approaches either abstract time-series into text or decouple forecasting from language-based reasoning, leading to a fundamental mismatch between qualitative reasoning and quantitative outcomes. To address this, we introduce StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. Based on a tool-call design, the model first emits a forecast action, which is a structured and interpretable representation of its qualitative market outlook. It then invokes a time-series decoder conditioned on this action to generate distributional future trajectories, leading to more informed question answering and financial reasoning. We optimize the full pipeline with reinforcement learning, where rewards jointly reflect answer validity, forecast accuracy, and consistency between generated actions and observed time-series dynamics. In addition, rewards are reweighted by a sample-level uncertainty scalar, encouraging the model to accommodate varying uncertainty in market dynamics. We evaluate StockR1 on financial question answering and stock forecasting over a large-scale 10-year benchmark. Our method consistently outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B). These findings demonstrate that structuring the forecast actions establishes a powerful synergy between language reasoning and temporal prediction, enabling LLMs to reason through verifiable, interpretable, and numerically grounded decisions.

2605.21972 2026-05-22 cs.LG 版本更新

How Sparsity Allocation Shapes Label-Free Post-Pruning Recoverability

稀疏性分配如何塑造无标签后剪枝恢复能力

Qishi Zhan, Minxuan Hu, Liang He

发表机构 * Marquette University(马凯特大学) Cornell University(康奈尔大学) Tongji University(同济大学)

AI总结 本文研究了在固定激活统计修复后端下,稀疏性分配如何影响后修复恢复能力,通过比较ERK和LAMP分配在不同数据集和模型上的表现,发现分配选择对后修复准确性有显著影响,并揭示了修复敏感的过渡区域。

详情
AI中文摘要

在高稀疏度下进行非结构化幅值剪枝可能使神经网络精度降至接近随机水平,而在实际部署场景中可能无法进行有标签的重新训练。无标签的剪枝后修复方法可以部分恢复已崩溃的稀疏模型,但其有效性取决于上游剪枝分配所留下的稀疏模型。本文研究在固定的激活统计修复后端下,稀疏度分配如何影响修复后的可恢复性。我们在CIFAR-10、CIFAR-100和Imagenette上,使用ResNet-18、ResNet-34和ResNet-50,在90%到95.5%的稀疏度下,比较ERK和LAMP分配在相同无标签修复协议下的表现。结果表明,在相同的全局稀疏度下,分配方式的选择会显著改变修复后的准确率,且更优的分配随架构、数据集难度和稀疏度水平而变化。我们识别出一个对修复敏感的过渡区间:在该区间内,BatchNorm重新校准开始失效,而激活统计修复仍能恢复非平凡的准确率。在ImageNet-100和DenseNet-121上的补充验证表明,这一可恢复区间的位置和宽度取决于数据规模和连接结构。这些发现表明,剪枝分配和剪枝后修复应联合研究,因为分配决定了还有多少激活信号可用于无标签恢复。

英文摘要

Unstructured magnitude pruning at high sparsity can reduce neural network accuracy to near-random performance, while labeled retraining may be unavailable in practical deployment settings. Label-free post-pruning repair methods can partially recover collapsed sparse models, but their effectiveness depends on the sparse model left by the upstream pruning allocation. This paper studies how sparsity allocation shapes post-repair recoverability under a fixed activation-statistic repair backend. We compare ERK and LAMP allocations under the same label-free repair protocol across CIFAR-10, CIFAR-100, and Imagenette with ResNet-18, ResNet-34, and ResNet-50 at sparsities from 90% to 95.5%. The results show that allocation choice can substantially change post-repair accuracy at the same global sparsity, and that the preferred allocation varies with architecture, dataset difficulty, and sparsity level. We identify a repair-sensitive transition regime in which BatchNorm recalibration begins to fail, while activation-statistic repair still recovers nontrivial accuracy. Additional validation on ImageNet-100 and DenseNet-121 shows that the location and width of this recoverable regime depend on data scale and connectivity structure. These findings suggest that pruning allocation and post-pruning repair should be studied jointly, since the allocation determines how much activation signal remains available for label-free recovery.

2605.21968 2026-05-22 cs.LG 版本更新

An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning

一种改进的自适应PID优化器,具有增强的收敛性和稳定性,用于深度学习

Saurabh Saini, Kapil Ahuja, Thomas Wick, Saurav Kumar

发表机构 * Department of Computer Science & Engineering, Indian Institute of Technology Indore, India; National Remote Sensing Centre, Indian Space Research Organisation, India

AI总结 本文提出了一种改进的自适应PID优化器IAdaPID-ADG,通过引入非递增有效学习率和基于梯度差的调制因子来解决AdaPID在收敛性和稳定性方面的不足,实验表明其在多个数据集上表现优异。

Comments 11 Pages, Double Column, 6 Tables, 5 Figures

详情
AI中文摘要

优化在深度学习中至关重要。大多数优化器的基础方法是基于动量的随机梯度下降。然而,它有两个关键缺点。首先,它有噪声和变化的梯度,其次,它有超调现象。为了解决噪声梯度,提出了Adam,它仍然是最广泛使用的自适应优化器。为了解决超调现象,提出了一种基于控制理论的PID优化器。为了在单一框架内解决这些限制,最近提出了几种AdaPID的变体。尽管AdaPID表现良好,但它仍然继承了Adam的两个关键缺点,即收敛性和稳定性问题。在本文中,我们解决了这两个限制。为了修复收敛问题,我们独特地将使用非递增有效学习率的想法整合到AdaPID中(最初在AMSGrad中提出,是Adam的扩展)。为了修复稳定性问题,我们创新性地将基于梯度差的调制因子整合到AdaPID中(最初在DiffGrad中提出,是Adam的另一个扩展)。将这两种想法结合到AdaPID中,结果得到我们新的IAdaPID-ADG优化器。我们在多个数据集上评估了所提出的优化器,包括基准数据集(MNIST和CIFAR10)和实际数据集(IARC和AnnoCerv)。IAdaPID-ADG在所有竞争优化器中表现显著更好。此外,我们在MNIST数据集上进行了消融研究,以展示每个添加组件的贡献。

英文摘要

Optimization is essential in deep learning. The foundational method upon which most optimizers are built is momentum-based stochastic gradient descent. However, it suffers from two key drawbacks: first, its gradients are noisy and varying, and second, it exhibits an overshoot phenomenon. To address noisy gradients, Adam was proposed, which remains the most widely used adaptive optimizer. To address the overshoot phenomenon, a control-theory-based PID optimizer was proposed. To tackle both limitations within a single framework, several variants of Adaptive PID (AdaPID) have recently been proposed. Although AdaPID performs well, it still inherits two critical drawbacks from Adam, namely convergence and stability issues. In this work, we address both of these limitations. To fix the convergence issue, we integrate into AdaPID the idea of using a non-increasing effective learning rate (originally proposed in AMSGrad, an extension of Adam). To fix the stability issue, we integrate into AdaPID a gradient-difference-based modulation factor (originally proposed in DiffGrad, another extension of Adam). Combining both ideas in AdaPID yields our novel IAdaPID-ADG optimizer. We evaluate the proposed optimizer on multiple datasets, including benchmark datasets (MNIST and CIFAR10) and real-world datasets (IARC and AnnoCerv). IAdaPID-ADG substantially outperforms all competing optimizers. Additionally, we perform an ablation study on the MNIST dataset to demonstrate the contribution of each added component.
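The two borrowed ideas named in this abstract can be illustrated on a plain Adam-style update. The sketch below is not the IAdaPID-ADG optimizer: it omits the PID terms entirely, works on a single scalar parameter, and skips bias correction for brevity. The function names, the state layout, and the default hyperparameters are all assumptions made for illustration. It only shows how an AMSGrad-style running maximum keeps the effective learning rate from increasing, and how a DiffGrad-style factor damps the step when consecutive gradients are similar.

```python
import math

def new_state():
    # Hypothetical optimizer state for one scalar parameter.
    return {"m": 0.0, "v": 0.0, "v_max": 0.0, "prev_grad": 0.0}

def amsgrad_diffgrad_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update with an AMSGrad maximum and a DiffGrad factor."""
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # AMSGrad idea: use the running maximum of v, so lr / sqrt(v_max) never grows.
    state["v_max"] = max(state["v_max"], state["v"])
    # DiffGrad idea: sigmoid of the absolute gradient change, a factor in [0.5, 1).
    xi = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"] - grad)))
    state["prev_grad"] = grad
    return theta - lr * xi * state["m"] / (math.sqrt(state["v_max"]) + eps)

def minimize_quadratic(theta0=3.0, steps=1000):
    """Minimize f(theta) = theta**2 and record v_max after every step."""
    theta, state, v_max_history = theta0, new_state(), []
    for _ in range(steps):
        grad = 2.0 * theta  # derivative of theta**2
        theta = amsgrad_diffgrad_step(theta, grad, state)
        v_max_history.append(state["v_max"])
    return theta, v_max_history

if __name__ == "__main__":
    final_theta, history = minimize_quadratic()
    print("final theta:", final_theta)
    print("final v_max:", history[-1])
```

Because `v_max` can only grow, the per-step scale `lr / sqrt(v_max)` is non-increasing, which is the property the abstract attributes to AMSGrad. The factor `xi` approaches 0.5 as the gradient stops changing, which is the damping behaviour attributed to DiffGrad.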

2605.21963 2026-05-22 cs.LG cs.AI 版本更新

ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data

ChronoMedicalWorld:一个用于从纵向护理数据中学习患者轨迹的医学世界模型

Jiangyuan Wang, Xuyong Chen, Junwei He, Xu Xu, Shasha Xie, Fuman Han

发表机构 * Beijing KidneyTec Medical Technology Co., Ltd.(北京肾科医疗技术有限公司)

AI总结 本文提出了一种名为ChronoMedicalWorld的模型,旨在通过纵向护理数据学习患者轨迹,该模型结合了联合嵌入状态编码器和宽动作编码器,并在由六项损失构成的目标下训练循环潜在转移模块,以提高慢性病护理中长期预测的准确性。

Comments 14 pages, 2 figures, 6 tables

详情
AI中文摘要

长期临床模拟--预测患者在指定干预下数年的生理演变--是慢性病护理的核心,但现有的电子健康记录(EHR)模型大多为判别性模型,且通用的大语言模型在重复干预下会漂移。我们提出了ChronoMedicalWorld模型(CMWM),一种用于从纵向护理数据中学习患者轨迹的动作条件潜在世界模型框架。CMWM结合了联合嵌入状态编码器和宽动作编码器,该动作编码器可以同时接受结构化干预指标和自由文本沟通嵌入,并在由六项损失构成的目标下训练循环潜在转移模块:下一步观察监督、下一步潜在预测、SIGReg潜在正则化,以及三个生理感知的形状先验(斜率、连续性、大跳跃惩罚)。闭环滚动前缀协议使训练与部署相匹配,因此模型针对其在推理时表现出的同一多步误差进行优化。作为具体案例研究,我们为慢性肾病(CKD)的年度估计肾小球滤过率(eGFR)轨迹预测实例化CMWM。在2,232名肾病患者队列上,CKD实例化在动态50%历史滚动测试中的平均绝对误差(MAE)为7.384,均方根误差(RMSE)为10.256,而调优的GPT-5.5结构化提示基线为7.964和11.069(MAE减少7.28%,RMSE减少7.35%),增益主要来自患者与健康教练沟通中的对话部分。该框架不特定于CKD:其架构、损失设计和训练协议适用于任何可以被描述为周期性临床状态与结构化干预和对话式干预交替出现的慢性疾病。

英文摘要

Long-horizon clinical simulation -- predicting how a patient's physiology evolves over years under specified interventions -- is central to chronic-disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general-purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint-embedding state encoder with a wide action encoder that admits both structured intervention indicators and free-text communication embeddings, and trains a recurrent latent transition module under a six-term objective: next-observation supervision, next-latent prediction, SIGReg latent regularisation, and three physiology-aware shape priors (slope, continuity, large-jump penalty). A closed-loop rollout-prefix protocol matches training to deployment, so the model is optimised against the same multi-step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232-patient nephrology cohort, the CKD instantiation achieves a dynamic-50% history rollout test mean absolute error (MAE) of 7.384 and root-mean-square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT-5.5 structured-prompting baseline (-7.28% MAE, -7.35% RMSE), with the gain dominated by the dialogue portion of patient--health-coach communication. The framework is not CKD-specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.
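The relative improvements quoted in this abstract can be recomputed from the four error values it reports. The short sketch below does only that arithmetic; the helper name `relative_reduction` is an assumption introduced here, and the inputs are the rounded figures from the abstract rather than the authors' unrounded results.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` error is lower than `baseline` error."""
    return 100.0 * (baseline - model) / baseline

# Error values as printed in the abstract (already rounded to three decimals).
mae_reduction = relative_reduction(baseline=7.964, model=7.384)
rmse_reduction = relative_reduction(baseline=11.069, model=10.256)

if __name__ == "__main__":
    print(f"MAE reduction:  {mae_reduction:.2f}%")
    print(f"RMSE reduction: {rmse_reduction:.2f}%")
```

The MAE figure reproduces the reported 7.28%. The RMSE figure recomputed from the rounded inputs lands about 0.01 percentage points below the reported 7.35%, a gap small enough to be explained by rounding of the published error values.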

2605.21951 2026-05-22 cs.LG 版本更新

Dynamic Mixture of Latent Memories for Self-Evolving Agents

动态潜在记忆混合用于自演化智能体

Dianzhi Yu, Vireo Zhang, Hongru Wang, Yanyu Chen, Minda Hu, Wanghan Xu, Siki Chen, Philip Torr, Zhenfei Yin, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of Oxford(牛津大学) Nanyang Technological University(南洋理工大学) University of Edinburgh(爱丁堡大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出MoLEM框架,通过动态混合专家机制实现智能体的持续学习,避免灾难性遗忘,提升任务学习和能力保持。

Comments 19 pages, 5 figures, 5 tables

详情
AI中文摘要

实现智能体的自演化需要在变化的任务序列中持续积累新知识,同时不遗忘之前获得的能力。现有方法要么通过更新模型参数内部化知识,导致灾难性遗忘,要么依赖外部记忆,无法真正增强模型的内在能力。我们提出MoLEM,一种基于动态混合专家(MoE)的生成性潜在记忆混合框架。我们将多个专家视为独立的记忆载体来生成记忆。路由器通过键-查询匹配选择并加权专家,聚合的潜在记忆被注入推理过程。基础模型保持完全冻结,所有经验知识被内部化到附加模块中,避免灾难性遗忘。对于持续学习,每个训练阶段配对一个轻量级自编码器,在推理时选择适当的路由组,输入若不匹配任何阶段则回退到预训练模型。实验在涵盖数学、科学和代码领域的持续学习序列上训练框架。训练后,我们在相应的测试集上评估框架,以测量跨持续适应阶段的任务学习和能力保持。在完整的持续学习序列后,我们的方法在Vanilla预训练基线基础上将平均准确率提高了10.40%,而其他方法在不同训练顺序中均无法超过此基线。

英文摘要

Achieving self-evolution in intelligent agents requires the continual accumulation of new knowledge across changing task sequences without forgetting previously acquired abilities. Existing approaches either internalize knowledge by updating model parameters, which induces catastrophic forgetting, or rely on external memory, which fails to genuinely enhance the model's intrinsic capabilities. We propose MoLEM, a generative mixture of latent memory framework based on a dynamic mixture-of-experts (MoE). We treat multiple experts as independent carriers to generate memory. A router selects and weights experts through key-query matching, and the aggregated latent memory is injected into the reasoning process. The base model for reasoning remains entirely frozen, with all experiential knowledge internalized into the additional modules, avoiding catastrophic forgetting. For continual learning, each training stage is paired with a lightweight autoencoder that selects the appropriate routing group at inference, and inputs that match no stage fall back to the pretrained model. Experiments train the framework on continual-learning sequences spanning math, science, and code domains. After training, we evaluate the framework on the corresponding test sets to measure task learning and competence preservation across continual adaptation stages. After the full continual-learning sequence, our method improves the average accuracy by 10.40% over the Vanilla pretrained baseline, while none of the competing methods consistently exceed this baseline across different training orders.

2605.21948 2026-05-22 cs.LG 版本更新

SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization

SCI-Defense: 防御生成引擎优化的操纵攻击

Xucheng Yu, Haibo Jin, Huimin Zeng, Haohan Wang

发表机构 * Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机科学与数据科学学院) School of Information Sciences, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校信息科学学院) Amazon topcited.ai(亚马逊topcited.ai)

AI总结 本文提出SCI-Defense框架,通过检测困惑度、语义完整性评分和跨候选检测三种组件,有效识别生成引擎优化攻击,实现了高精度和低误报率,同时揭示了现有防御方法的局限性及未来研究方向。

Comments 20 pages, NeurIPS 2026 submission

详情
AI中文摘要

基于大型语言模型的排序系统易受生成引擎优化(GEO)攻击影响,攻击者通过在产品描述中注入语义信号来人为提升排名。我们提出了SCI-Defense,一种结合困惑度检测(PPL)、语义完整性评分(SIS)和跨候选检测(ICD)的三元防御框架。SIS评估四个操纵维度:权威归因(AA)、叙事目的性(NP)、比较主张(CA)和时间主张(TC)。在600个亚马逊产品描述(6个类别)上评估,SCI-Defense实现了精度1.000和误报率0.000,召回率分别为1.000、0.952和0.830,分别针对字符串、推理和评论攻击。在600个MS MARCO网页段落上,字符串攻击被完美阻止,而评论攻击的召回率接近零,因为网页段落缺乏SIS在产品描述中针对的说服性信号。我们证明现有防御方法——仅PPL过滤、SafetyClf内容分类器和改写——在对抗语义操纵攻击时召回率为零。我们进一步展示了新的攻击方式如规范放大和用例饱和,可以暴露语义相关性操纵作为结构防御盲点,指明了未来研究的方向。

英文摘要

LLM-based ranking systems are vulnerable to Generative Engine Optimization (GEO) attacks, where adversaries inject semantic signals into product descriptions to artificially boost rankings. We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD). SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). Evaluated on 600 Amazon product descriptions across 6 categories, SCI-Defense achieves Precision=1.000 and FPR=0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks, respectively. On 600 MS MARCO web passages, String attacks are blocked with perfect recall while Review attacks yield near-zero recall, as web passages lack the persuasion-oriented signals that SIS targets in product descriptions. We demonstrate that existing defenses -- PPL-only filters, SafetyClf content classifiers, and paraphrasing -- achieve zero recall against semantic manipulation attacks. We further demonstrate that new attacks such as Specification Amplification and Use-Case Saturation can expose semantic relevance manipulation as a structural defense blind spot, which suggests directions for future research.

2605.21938 2026-05-22 cs.LG cs.CR cs.IT math.IT 版本更新

Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning

对Rényi差分隐私机器学习的最优审计保证

Benjamin D. Kim, Lav R. Varshney, Daniel Alabi

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stony Brook University(石溪大学)

AI总结 本文研究了声称具有Rényi差分隐私(RDP)保证的机器学习算法的黑盒审计问题,提出了一种基于假设检验的审计框架,利用Donsker-Varadhan(DV)变分估计器直接估计相邻执行之间的Rényi散度,并通过类受限DV估计器得出非渐近的置信区间,证明了样本复杂度保证在信息论上最优,首次建立了通过DV估计器审计RDP的最优保证。

Comments 28 pages, 3 figures

详情
AI中文摘要

我们研究了声称具有Rényi差分隐私(RDP)保证的机器学习算法的黑盒审计问题。我们引入了一种基于假设检验的审计框架,该框架利用Donsker-Varadhan(DV)变分估计器直接估计相邻执行之间的Rényi散度。我们的分析得出通过类受限DV估计器进行RDP审计的显式且非渐近的置信区间,将统计估计误差与算法隐私泄漏分开。我们证明了匹配的minimax下界,表明在对数因子范围内,我们的样本复杂度保证在信息论上最优,从而建立了通过DV估计器审计RDP的首次最优保证。经验上,我们为在完全黑盒设置中审计DP-SGD实例化了我们的框架。在MNIST和CIFAR-10上,以及在广泛的隐私制度下,我们的审计器在经验RDP下界方面相比先前最先进的黑盒方法表现出显著的整体改进,尤其是在小和中等Rényi阶数,其中准确审计最为具有挑战性时。

英文摘要

We study black-box auditing for machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. We introduce an auditing framework, based on hypothesis testing, that directly estimates Rényi divergence between neighboring executions using the Donsker-Varadhan (DV) variational estimator. Our analysis yields explicit and non-asymptotic confidence intervals for RDP auditing via class-restricted DV estimators, separating statistical estimation error from algorithmic privacy leakage. We prove matching minimax lower bounds showing that, up to logarithmic factors, our sample-complexity guarantees are information-theoretically optimal, thereby establishing the first optimal guarantees for auditing RDP via DV estimators. Empirically, we instantiate our framework for auditing DP-SGD in a fully black-box setting. Across MNIST and CIFAR-10, and over a wide range of privacy regimes, our auditors produce a strong overall improvement on empirical RDP lower bounds compared to prior state-of-the-art black-box methods, especially at small and moderate Rényi orders where accurate auditing is most challenging.
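To make the audited quantity concrete, the sketch below estimates a Rényi divergence from samples in a toy case where the answer is known in closed form: two Gaussians with the same variance and shifted means, as produced by a Gaussian mechanism on neighboring inputs. This is not the Donsker-Varadhan estimator studied in the paper and it is not black-box, since it evaluates the true density ratio directly. The function names and the parameter values are assumptions chosen for illustration.

```python
import math
import random

def renyi_gaussian_closed_form(alpha, mu_p, mu_q, sigma):
    """D_alpha(P || Q) for P = N(mu_p, sigma^2) and Q = N(mu_q, sigma^2)."""
    return alpha * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def renyi_monte_carlo(alpha, mu_p, mu_q, sigma, n, seed=0):
    """Estimate D_alpha(P || Q) = log E_{x~P}[(p(x)/q(x))^(alpha-1)] / (alpha-1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, sigma)
        # log p(x) - log q(x) for two Gaussians with equal variance.
        log_ratio = ((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2.0 * sigma ** 2)
        total += math.exp((alpha - 1.0) * log_ratio)
    return math.log(total / n) / (alpha - 1.0)

if __name__ == "__main__":
    exact = renyi_gaussian_closed_form(alpha=2.0, mu_p=0.0, mu_q=1.0, sigma=2.0)
    estimate = renyi_monte_carlo(alpha=2.0, mu_p=0.0, mu_q=1.0, sigma=2.0, n=100_000)
    print("closed form:", exact)
    print("monte carlo:", estimate)
```

In an audit, an empirical lower bound on this divergence that exceeds the claimed RDP budget at order alpha would be evidence against the claim. The hard part addressed in the abstract is obtaining such a bound with finite-sample guarantees when the densities are unknown.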

2605.21933 2026-05-22 cond-mat.stat-mech cs.AI cs.LG 版本更新

Thermodynamic Irreversibility of Training Algorithms

训练算法的热力学不可逆性

Liu Ziyin, Yuanjie Ren, Adam Levine, Isaac Chuang

发表机构 * Massachusetts Institute of Technology(麻省理工学院) NTT Research(NTT研究所)

AI总结 本文提出了一种通用框架,用于定义和分析训练算法的不可逆性,证明了四种不同方法在步长η的主导阶近似下是等价的,并展示了不可逆性如何导致时间反演对称性破缺的新兴力。

Comments preprint

详情
AI中文摘要

人工智能系统的学习算法都引入了远离平衡的动态过程,理解这些算法的不可逆性是理解现代人工智能系统学习动态的基本步骤。本文建立了一个通用框架,用于定义和分析训练算法的不可逆性。我们证明了四种不同方法在主导阶近似下是等价的:数值反向误差ϕ_{DE},时间归一化修正ϕ_{TR},微观时间反演不对称性ϕ_{TA},以及(正则化的)随机热力学熵产率ϕ_{ST}。不可逆性导致一种时间反演对称性破缺的新兴力,这种力通常打破非等距连续重新参数化对称性,保持正交对称性,并导致普遍偏好那些最小化熵产率的学习轨迹。

英文摘要

The training algorithms for AI systems all introduce far-from-equilibrium dynamical processes, and understanding the irreversibility of these algorithms is a fundamental step towards understanding the learning dynamics of modern AI systems. In this work, we establish a general framework for defining and analyzing the irreversibility of training algorithms. We show that four different ways to characterize the irreversibility of dynamical processes are equivalent to leading order in the step size $η$: numerical backward error $ϕ_{\rm DE}$, time-renormalized correction $ϕ_{\rm TR}$, microscopic time reversal asymmetry $ϕ_{\rm TA}$, and the (regularized) stochastic-thermodynamic entropy production $ϕ_{\rm ST}$. The irreversibility gives rise to a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and leads to a universal preference for those learning trajectories that minimize the entropy production rate.

2605.21928 2026-05-22 cs.LG cs.AI stat.ME 版本更新

CausalGuard: Conformal Inference under Graph Uncertainty

CausalGuard: 在图不确定性下的共形推断

Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 本文提出CausalGuard,一种结构加权的共形框架,通过聚合图条件双稳健伪结果进行校准,以在图不确定性下提供无分布假设的有限样本边际覆盖。

详情
AI中文摘要

从观察数据估计处理效应需要选择调整集,但有效的调整依赖于未知的因果图。图的误设可能导致覆盖不足,而与图无关的共形包装方法可能只能通过大幅加宽区间来恢复名义覆盖。我们介绍了CausalGuard,一种结构加权的共形框架,该框架在聚合图条件双稳健伪结果后进行校准。候选DAG由LLM衍生的边先验提出,通过条件独立性检验进行修剪,并通过贝叶斯信息准则重新加权。然后,一个复合非一致性分数对后验加权的伪结果进行校准。CausalGuard为该聚合伪结果提供无分布假设的有限样本边际覆盖;在因果识别、重叠、条件均值冗余参数稳定性以及集中于与目标对齐的有效调整策略的条件下,其条件均值收敛于真实的条件平均处理效应。在五个基准测试中,CausalGuard在可直接评估的目标上实现了高于名义90%水平的平均覆盖率,并在与图无关的共形基线需要大幅加宽区间时减小了区间宽度。压力测试显示,当保留的候选集受数据支持时,CausalGuard能抑制无效的碰撞节点调整,并在先验误设的情况下保持稳定。

英文摘要

Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
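The distribution-free finite-sample coverage mentioned in this abstract is the standard guarantee of split conformal prediction, which can be sketched in a few lines. The code below is generic split conformal calibration with absolute-residual scores, not CausalGuard's composite nonconformity score or its graph-weighted pseudo-outcomes. The function names and the small numerical guard inside the ceiling are assumptions made for this illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Rank ceil((n + 1) * (1 - alpha)); the tiny offset guards against
    # floating-point products such as 8.000000000000002.
    k = math.ceil((n + 1) * (1.0 - alpha) - 1e-12)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return (point_prediction - q, point_prediction + q)

if __name__ == "__main__":
    # Hypothetical absolute residuals |y - prediction| on a calibration split.
    calibration_scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
    q = conformal_quantile(calibration_scores, alpha=0.2)
    print("half-width:", q)
    print("interval:", prediction_interval(2.0, q))
```

Under exchangeability of calibration and test points, intervals built this way cover the target with probability at least 1 - alpha on average. The "padding" the abstract refers to corresponds to a large half-width `q` when the underlying scores are poorly matched to the data.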

2605.21916 2026-05-22 quant-ph cs.LG 版本更新

A2QTGN: Adaptive Amplitude Quantum-Integrated Temporal Graph Network for Dynamic Link Prediction

A2QTGN:自适应幅度量子集成时间图网络用于动态链接预测

Nouhaila Innan, M. Murali Karthick, Simeon Kandan Sonar, Vivek Chaturvedi, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)(eBRAIN实验室,工程学院,纽约大学阿布扎比分校) Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute, NYUAD(量子与拓扑系统中心(CQTS),NYUAD研究院,NYUAD) Indian Institute of Technology Palakkad (IITPKD)(帕拉卡德印度理工学院(IITPKD))

AI总结 本文提出A2QTGN,一种结合自适应幅度编码和时间图网络的混合量子-经典框架,用于动态链接预测,通过量子状态表示节点交互特征并根据时间活动选择性刷新幅度嵌入,提升时间表示能力。

Comments 9 pages, 3 figures

详情
AI中文摘要

动态链接预测对于建模复杂系统中演变的交互至关重要,包括社交、通信、金融和交通网络。经典时间图模型捕捉序列依赖性,但可能难以表示大规模动态图中同时和快速变化的节点-边交互。我们提出A2QTGN(自适应幅度量子集成时间图网络),一种混合量子-经典框架,结合自适应幅度编码与时间图网络骨干。所提出机制将节点交互特征表示为量子状态,并根据时间活动选择性刷新幅度嵌入,保留稳定节点状态的同时强调有意义的结构变化。此设计减少了不必要的量子重编码并改进了时间表示以进行链接预测。在五个时间图基准数据集上的实验表明,A2QTGN在多样化的动态图中实现了强大的预测和排名性能。消融研究证实了量子嵌入模块和自适应更新策略的重要性,而使用嘈杂后端和有限真实设备执行的硬件感知推断支持了近期量子辅助时间图学习的可行性。

英文摘要

Dynamic link prediction is important for modeling evolving interactions in complex systems, including social, communication, financial, and transportation networks. Classical temporal graph models capture sequential dependencies, but they may struggle to represent concurrent and rapidly changing node-edge interactions in large dynamic graphs. We propose A2QTGN (Adaptive Amplitude Quantum-Integrated Temporal Graph Network), a hybrid quantum-classical framework that combines adaptive amplitude encoding with a Temporal Graph Network backbone. The proposed mechanism represents node interaction features as quantum states and selectively refreshes amplitude embeddings based on temporal activity, preserving stable node states while emphasizing meaningful structural changes. This design reduces unnecessary quantum re-encoding and improves temporal representation for link prediction. Experiments on five Temporal Graph Benchmark datasets show that A2QTGN achieves strong predictive and ranking performance across diverse dynamic graphs. Ablation studies confirm the importance of both the quantum embedding module and the adaptive update strategy, while hardware-aware inference using a noisy backend and limited real-device execution supports the feasibility of near-term quantum-assisted temporal graph learning.

2605.21915 2026-05-22 cs.CR cs.LG 版本更新

CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers

CCLab: 学习型和非学习型拥塞控制器的对抗测试

Zhi Chen, Shehab Sarar Ahmed, Chenkai Wang, Brighten Godfrey, Gang Wang

AI总结 本文提出CCLab框架,用于系统评估学习型和非学习型拥塞控制器在对抗性条件下的鲁棒性,发现学习型控制器在对抗测试中比传统算法更鲁棒,并展示了对抗性追踪可用于训练更鲁棒的拥塞控制器。

Comments 13 pages for main paper, 16 pages in total

详情
AI中文摘要

拥塞控制器(CCs)对网络性能至关重要,但其在恶劣条件下的鲁棒性仍不够了解。尽管最近的学习型CCs在受控环境中表现出色,但当控制器的输入信号被破坏或环境条件变得系统性挑战时,其与传统CCs的性能对比尚不清楚。本文介绍CCLab,一种对抗测试框架,用于系统评估学习型和非学习型CCs的鲁棒性。CCLab包含一个基于强化学习(RL)的对抗代理,在闭环中与拥塞控制策略协同工作,生成受约束的扰动,无论是对输入信号(特征级)还是外部网络条件(环境级),同时通过显式约束保持现实性。利用此框架,我们在特征级和环境级对抗性条件下比较学习型和非学习型CCs。尽管两种类型的CCs在对抗测试中均出现性能下降,但学习型CCs总体上比传统人工设计算法更鲁棒。最后,我们展示对抗性追踪可用于训练更鲁棒的CCs,其在挑战性和正常条件下均优于现有学习型CCs。

英文摘要

Congestion controllers (CCs) are critical to network performance, and yet their robustness under adverse conditions remains insufficiently understood. While recent learning-based CCs have demonstrated strong performance in controlled environments, it is unclear how they compare to traditional CCs when controllers' input signals are corrupted or when environmental conditions become systematically challenging. In this paper, we introduce CCLab, an adversarial testing framework for systematically evaluating the robustness of both learning-based and non-learning-based CCs. CCLab includes a reinforcement learning (RL)-based adversarial agent that operates in a closed loop with the congestion control policy, generating bounded perturbations either on input signals (feature-level) or on external network conditions (environment-level), while preserving realism through explicit constraints. Using this framework, we compare learning-based CCs with non-learning-based CCs under both feature-level and environment-level adversarial conditions. While both types of CCs suffer from performance degradation under adversarial testing, we find that learning-based CCs, in general, are more robust than traditional human-designed algorithms. Finally, we show that our adversarial traces can be used to train more robust CCs that outperform existing learning-based CCs under both challenging and normal conditions.

2605.21911 2026-05-22 cs.LG 版本更新

Noise Schedule Design for Diffusion Models: An Optimal Control Perspective

扩散模型的噪声调度设计:一个最优控制视角

Seo Taek Kong, Weina Wang, R. Srikant

发表机构 * ECE & CSL University of Illinois Urbana-Champaign(电子工程与计算机科学实验室,伊利诺伊大学厄巴纳-香槟分校) Computer Science Department Carnegie Mellon University(计算机科学系,卡内基梅隆大学) ECE, CSL & NCSA University of Illinois Urbana-Champaign(电子工程、计算机科学实验室及国家计算科学中心,伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文从最优控制的角度出发,提出了一种分析和设计扩散模型噪声调度的框架,通过将噪声调度问题转化为最优控制问题,推导出噪声调度的充分条件,实现了更优的采样误差,并通过参数调整得到新的噪声调度方案,提升了图像生成的FID分数。

详情
AI中文摘要

我们开发了一个系统分析和设计扩散模型噪声调度的框架。我们证明可以将此设计问题重新表述为一个最优控制问题,其状态是扩散过程的Fisher信息,该信息根据微分方程演变,控制输入是噪声调度。最优控制问题的目标函数涉及Fisher信息,它被证明是Kullback-Leibler采样误差的上界。通过求解此最优控制问题,我们获得噪声调度的充分条件,使得最先进的~O(d/n)采样误差得以实现,其中d是数据维度,n是离散化步骤数。尽管现有理论工作也证明~O(d/n)采样误差界是可行的,但这些结果仅适用于特定的噪声调度,不包括实践中使用的调度。在进一步的数据分布参数假设下,我们证明可以得到噪声调度的闭式表达。这些噪声调度通过允许额外可调参数来推广标准经验调度,如指数和Sigmoid调度。系统地调整这些调度的参数可得到新的调度方案,在图像生成基准上取得更优的FID分数。

英文摘要

We develop a principled framework for analyzing and designing noise schedules in diffusion models. We show that one can recast this design problem as an optimal control problem whose state is the Fisher information of the diffusion process, which evolves according to an ODE, and whose control input is the noise schedule. The objective of the optimal control problem is a functional involving the Fisher information, which is shown to be an upper bound on the Kullback-Leibler sampling error. By solving this optimal control problem, we obtain sufficient conditions on noise schedules under which state-of-the-art $\tilde{\mathcal{O}}(d/n)$ sampling error is achievable, where $d$ is the data dimension and $n$ is the number of discretization steps. While existing theoretical work also proves that $\tilde{\mathcal{O}}(d/n)$ sampling error bounds are achievable, these results hold only for specific noise schedules, which do not include the schedules used in practice. Under a further parametric assumption on the data distribution, we show that one can obtain closed-form expressions for the noise schedules. These noise schedules generalize standard empirical schedules such as exponential and sigmoid schedules by allowing additional parameters that can be tuned. Systematically tuning the parameters of these schedules yields new schedules that achieve superior FID scores on image generation benchmarks.

2605.21903 2026-05-22 eess.SY cs.AI cs.LG cs.NE cs.SY 版本更新

Engineering Hybrid Physics-Informed Neural Networks for Next-Generation Electricity Systems: A State-of-the-Art Review

为下一代电力系统工程混合物理指导神经网络:最新综述

Joseph Nyangon

发表机构 * Energy Exemplar(能源典范)

AI总结 本文综述了用于电力系统的混合物理指导机器学习架构,探讨了物理指导神经网络(PINNs)、深度算子网络(DeepONets)、傅里叶神经算子、极端学习机增强的PINNs、基于图的PINNs(PIGNNs)和域分解PINNs等方法,展示了这些方法在场分析、故障检测、数字孪生、替代建模和控制优化中的应用,以及嵌入麦克斯韦方程等第一原理约束对预测精度、仿真时间和泛化能力的提升。

Comments 59 pages, 6 Figures

详情
AI中文摘要

将机器学习与领域特定物理相结合,正在改变电力系统的设计、监测和控制;在这些场景中,数据稀缺、可解释性有限以及必须强制满足物理定律等因素限制了纯数据驱动模型。物理指导机器学习(PIML)通过将控制方程直接嵌入到学习过程中,解决了这些限制,为工业4.0应用提供了准确、高效且可扩展的解决方案。本文综述了用于电力系统的混合PIML架构,包括物理指导神经网络(PINNs)、深度算子网络(DeepONets)、傅里叶神经算子、极端学习机增强的PINNs、基于图的PINNs(PIGNNs)和域分解PINNs。每种方法都通过涵盖场分析、故障检测、数字孪生、替代建模和控制优化的案例研究进行考察。综述显示,嵌入麦克斯韦方程和其他第一性原理约束显著提高了在稀疏和噪声数据下的预测精度,将仿真时间相对于有限元方法减少了多个数量级,并增强了在不同运行工况下的泛化能力。混合框架在参数敏感性、动态行为和鲁棒性方面始终优于纯数据驱动的基线,同时支持实时数字孪生校准和不确定性量化。持续存在的挑战包括刚性多尺度问题的训练不稳定、高保真模型的计算成本以及缺乏标准化的基准。研究结果表明,PIML使从黑箱数据驱动方法向透明、物理指导策略的转变成为可能,为韧性和智能电力系统的持续创新奠定了基础。

英文摘要

The integration of machine learning with domain-specific physics is transforming the design, monitoring, and control of electricity systems, where data scarcity, limited interpretability, and the need to enforce physical laws constrain purely data-driven models. Physics-informed machine learning (PIML) addresses these limitations by embedding governing equations directly into the learning process, yielding accurate, efficient, and scalable solutions for Industry 4.0 applications. This article reviews hybrid PIML architectures for electricity systems, including physics-informed neural networks (PINNs), Deep Operator Networks (DeepONets), Fourier Neural Operators, Extreme Learning Machine-enhanced PINNs, graph-based PINNs (PIGNNs), and domain-decomposition PINNs. Each approach is examined through case studies spanning field analysis, fault detection, digital twins, surrogate modeling, and control optimization. The review shows that embedding Maxwell's equations and other first-principles constraints substantially improves predictive accuracy under sparse and noisy data, reduces simulation time by orders of magnitude relative to finite element methods, and enhances generalization across operating regimes. Hybrid frameworks consistently outperform purely data-driven baselines on parameter sensitivity, dynamic behavior, and robustness, while supporting real-time digital-twin calibration and uncertainty quantification. Persistent challenges include training instability for stiff multi-scale problems, computational cost of high-fidelity models, and the absence of standardized benchmarks. The findings demonstrate that PIML enables a paradigm shift from black-box data-driven methods to transparent, physics-informed strategies, positioning the field for sustained innovation in resilient and intelligent electricity systems.

2605.21868 2026-05-22 cs.LG 版本更新

When to Switch, Not Just What: Transition Quality Prediction in Clash Royale

何时切换,而不仅仅是选择:Clash Royale中的切换质量预测

Heeyun Heo, Huy Kang Kim

AI总结 该研究探讨了竞技游戏中玩家在连续失利后切换策略的频率与胜率之间的反向关联,提出了一种基于切换质量预测(TQP)的三阶段方法,通过PersonaGate、TimingGate和ScoreFusion来优化策略推荐,并引入SwitchGap作为评估指标,以衡量策略的判别质量。

Comments 11 pages, 2 figures, 4 tables; Accepted at IEEE Conference on Games (CoG) 2026

详情
AI中文摘要

在竞技游戏中,玩家经常在连续失利后切换策略,但我们对34,619名Clash Royale玩家的926,334场比赛记录的分析揭示了一个反直觉的现象:切换频率与胜率呈负相关,且这种影响在不同玩家和情境中差异显著。我们将此归因于许多先前推荐系统的一个共同局限,即仅通过预期质量评估策略,而忽略了切换行为的成本和个体在切换倾向上的差异。我们将这一隐含前提称为零切换成本假设。为了解决这一问题,我们将策略推荐重新表述为一个过渡层面的决策问题,并将其实例化为TQP(Transition Quality Predictor),一个三阶段的流程,结构为Who -> When -> What。PersonaGate对那些策略一致性在经验上与更优结果相关联的玩家抑制推荐。TimingGate识别出切换可能比保持带来更多净收益的时刻,使用子类型和状态匹配的基线来控制自然胜率恢复。ScoreFusion通过结合可采用性信号和预测的过渡质量(delta WR)来对候选策略进行排名。我们进一步引入了SwitchGap,一种衡量策略判别质量的评估指标,它不将观察到的玩家选择视为最优的真实标签。这一属性尤为重要,因为最频繁切换的玩家记录了最低的胜率。完整的流程在推荐率为5.4%时实现了+10.4个百分点的SwitchGap,而因失利触发切换的玩家尽管是表现最差的群体,却从子类型条件指导中受益最大。

英文摘要

In competitive games, players frequently switch strategies after losing streaks, yet our analysis of 926,334 match records from 34,619 Clash Royale players reveals a counterintuitive pattern: switching frequency is inversely associated with the win rate, with effects that vary substantially across players and situational contexts. We attribute this to a limitation common in many prior recommendation systems, which evaluate strategies by expected quality while overlooking the behavioral cost of switching and individual differences in switching propensity. We refer to this implicit premise as the Zero Switching Cost Assumption. To address this, we reformulate strategy recommendation as a transition-level decision problem and instantiate it as TQP (Transition Quality Predictor), a three-stage pipeline structured as Who -> When -> What. PersonaGate suppresses recommendations for players whose strategic consistency is empirically associated with superior outcomes. TimingGate identifies moments when switching is likely to yield a net benefit over staying, using a subtype- and state-matched baseline to control for natural win-rate recovery. ScoreFusion ranks candidate strategies by combining an adoptability signal with predicted transition quality (delta WR). We further introduce SwitchGap, an evaluation metric that measures a policy's discriminative quality without treating observed player choices as optimal ground truth. This property is particularly important because the most frequent switchers record the lowest win rates. The full pipeline achieves a SwitchGap of +10.4 percentage points at a recommendation rate of 5.4%, and loss-triggered switchers, despite being the lowest-performing group, benefit the most from subtype-conditioned guidance.

2605.21859 2026-05-22 q-bio.PE cs.LG q-bio.QM 版本更新

PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference

PhylaFlow:在Billera-Holmes-Vogtmann树空间中进行混合流匹配用于系统发育推断

Yasha Ektefaie, Leo Cui, Shrey Jain, Marinka Zitnik, Pardis Sabeti

发表机构 * Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard(埃里克和温迪·施密特中心,MIT和哈佛大学Broad研究所) Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Centennial High School(Centennial高中) Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard(传染病与微生物组计划,MIT和哈佛大学Broad研究所)

AI总结 该研究提出PhylaFlow模型,通过在Billera-Holmes-Vogtmann树空间中学习后验盆地运输,实现混合流匹配,从而提高系统发育推断的效率和准确性。

Comments 9 pages, 3 figures

详情
AI中文摘要

系统发育树是混合对象:分支长度连续变化,而拓扑结构通过边收缩和扩展离散变化。Billera-Holmes-Vogtmann(BHV)树空间提供了这种结构的规范几何表示,将每个完全解析的拓扑表示为一个欧几里得卦限(orthant),并将拓扑变化表示为跨越共享的低维边界的运动。我们引入PhylaFlow,一种混合流匹配模型,该模型在BHV树空间中学习后验盆地运输。PhylaFlow在从随机起始树到短程后验样本的BHV测地路径上训练,将卦限内的连续分支长度运动与学习到的边界事件和离散拓扑转换耦合在一起。我们以操作性的方式评估所学到的几何结构:如果流到达后验相关区域,那么以其终端树初始化或受其引导的有限预算贝叶斯细化应能更高效地恢复后验支持的拓扑。在DS1-DS8系统发育后验基准上,PhylaFlow相对于经典初始化方法显著降低了初始Tree-KL。在有限预算的MrBayes细化后,直接PhylaFlow在大多数数据集上改进了早期和中期的拓扑恢复轨迹,而split-guided PhylaFlow-MCMC在最困难的案例中取得了最强的结果。在相同的细化预算下,最好的PhylaFlow变体在八个数据集中的七个上优于短预热,并在八个数据集中的五个上优于PhyloGFN。在联合序列条件实验中,序列嵌入能够引导后验分裂恢复,尽管精确的后验拓扑恢复仍处于初步阶段。这些结果表明,混合流匹配可以学习BHV树空间中的可操作运输,并为贝叶斯系统发育推断提供几何感知的提议机制。

英文摘要

Phylogenetic trees are hybrid objects: branch lengths vary continuously, while topologies change discretely through edge contractions and expansions. Billera-Holmes-Vogtmann (BHV) tree space provides a canonical geometry for this structure, representing each resolved topology as a Euclidean orthant and topological changes as motion across shared lower-dimensional boundaries. We introduce PhylaFlow, a hybrid flow-matching model that learns posterior-basin transport in BHV tree space. PhylaFlow is trained on BHV geodesic paths from random starting trees to short-run posterior samples, coupling continuous branch-length motion within orthants with learned boundary events and discrete topology transitions. We evaluate the learned geometry operationally: if the flow reaches posterior-relevant regions, finite-budget Bayesian refinement initialized from, or guided by, its terminal trees should recover posterior-supported topologies more efficiently. Across DS1-DS8 phylogenetic posterior benchmarks, PhylaFlow substantially reduces initial Tree-KL relative to classical initializers. After finite-budget MrBayes refinement, direct PhylaFlow improves early and intermediate topology-recovery trajectories on most datasets, while split-guided PhylaFlow-MCMC obtains the strongest hard-case results. The best PhylaFlow variant outperforms short-warmup on seven of eight datasets and PhyloGFN on five of eight under the same refinement budget. In a joint sequence-conditioned experiment, sequence embeddings steer posterior split recovery, although exact posterior topology recovery remains preliminary. These results show that hybrid flow matching can learn actionable transport in BHV tree space and provide a geometry-aware proposal mechanism for Bayesian phylogenetic inference.

2605.21856 2026-05-22 cs.LG cs.AI 版本更新

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

推理的幻觉:通过零CoT截断揭示LLM中的逃避数据污染

Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

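AI summary This paper proposes the Zero-CoT Probe (ZCP), which truncates the entire reasoning process to expose latent shortcut mappings in a model and thereby detects both direct and evasive data contamination in LLMs. It also introduces the Contamination Confidence metric to quantify the likelihood and severity of contamination.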

详情
AI中文摘要

大型语言模型(LLMs)在广泛的任务上展示了令人印象深刻的推理能力,但数据污染破坏了这些能力的客观评估。这个问题进一步加剧了恶意模型发布者使用逃避或间接污染策略,例如改写基准数据以逃避现有检测方法并人为提升排行榜表现。当前的方法难以可靠地检测这种隐蔽的污染。在本工作中,我们揭示了一个关键现象:模型生成的推理步骤主动掩盖其底层的记忆。受此启发,我们提出了零CoT探针(ZCP),一种新颖的黑盒检测方法,故意截断整个链式思维(CoT)过程以暴露潜在的捷径映射。为进一步将记忆与模型的内在问题解决能力区分开来,ZCP将模型在原始基准上的零CoT表现与等价扰动的参考数据集进行比较。此外,我们引入了污染置信度(Contamination Confidence),一个量化污染可能性和严重性的指标,超越了简单的二元分类。对已识别的污染模型和特别微调的污染模型的广泛实验表明,ZCP能够稳健地检测直接和逃避的数据污染。ZCP的代码可在https://github.com/Yifan-Lan/zero-cot-probe获取。

英文摘要

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.
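
The comparison described in the abstract (zero-CoT accuracy on the original benchmark versus an isomorphically perturbed reference set) can be illustrated with a small sketch. The abstract does not give the formula for Contamination Confidence, so the code below substitutes an ordinary one-sided two-proportion z-test as a stand-in; the function names and the example counts are hypothetical.

```python
import math

def zero_cot_gap(correct_orig, n_orig, correct_ref, n_ref):
    """Accuracy gap between the original benchmark and a perturbed reference set.

    Both accuracies are assumed to be measured with the reasoning trace
    truncated, so the model has to answer directly.
    """
    return correct_orig / n_orig - correct_ref / n_ref

def one_sided_p_value(correct_orig, n_orig, correct_ref, n_ref):
    """Two-proportion z-test for H1: accuracy(original) > accuracy(reference).

    This is a stand-in statistic, NOT the paper's Contamination Confidence.
    """
    p1 = correct_orig / n_orig
    p2 = correct_ref / n_ref
    pooled = (correct_orig + correct_ref) / (n_orig + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_orig + 1 / n_ref))
    if se == 0:
        # Both accuracies are 0% or both are 100%: no evidence of a gap.
        return 1.0 if p1 <= p2 else 0.0
    z = (p1 - p2) / se
    # Upper-tail probability of a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical counts: 430/500 correct on the original items,
# 310/500 correct on the perturbed reference items.
gap = zero_cot_gap(430, 500, 310, 500)
p_value = one_sided_p_value(430, 500, 310, 500)
```

A large positive gap with a small p-value would be compatible with memorized answers, since a model that genuinely solves the task should score similarly on equivalent perturbed items.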

2605.21849 2026-05-22 cs.LG cs.CL 版本更新

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

基于几何适应的解释器:在分布偏移下字典基础可解释性的忠实性

Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song

发表机构 * Yonsei University(延世大学) Harvard University(哈佛大学)

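AI summary This paper proposes the Geometry-Adaptive Explainer (GAE) to improve the faithfulness of dictionary-based interpretability under distribution shift. GAE realigns the explainer's dictionary with the shifted active subspace while preserving the original feature structure, reducing the faithfulness gap using only unlabeled OOD activations and no gradient updates.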

详情
AI中文摘要

机制可解释性旨在通过识别因果负责的内部结构来解释模型的行为。基于字典的解释器如稀疏自编码器和转码器是主要工具,但其在分布外(OOD)偏移下的忠实性却很少受到系统性关注。我们证明分布偏移会旋转模型所使用的子空间,导致解释器的字典在训练分布(ID)激活上训练时出现对齐偏差。我们将这种偏差正式化为忠实性差距,即ID字典与OOD活跃子空间之间的几何距离,并证明其控制OOD忠实性退化。为了减少这种差距,我们提出了几何适应解释器(GAE),它在保持原始特征结构的同时,重新对齐解释器的字典与OOD活跃子空间。这只需要未标记的OOD激活,并且不需要梯度更新。我们证明GAE在无适应ID解释器上有所改进,其额外损失被二次限制于二阶矩偏移。经验上,GAE在多个模型和OOD设置中甚至匹配或超过了所有基于训练的基线在因果忠实性上的表现。

英文摘要

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

2605.21846 2026-05-22 stat.ME cs.LG stat.ML 版本更新

Causal Discovery in Structural VAR Models Under Equal Noise Variance

在等噪声方差假设下结构VAR模型中的因果发现

SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari, AmirEmad Ghassami

发表机构 * Bourns College of Engineering, University of California, Riverside(加州大学河滨分校工程学院) University of California, San Francisco(加州大学旧金山分校) Department of Mathematics and Statistics, Boston University(波士顿大学数学与统计学系)

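AI summary This paper studies causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption. It proposes ENVAR, a sparsity-based procedure that searches the observational equivalence class for a sparse structural representative, and evaluates it on synthetic data and an fMRI dataset.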

详情
AI中文摘要

从多变量时间序列中进行因果发现具有挑战性,因为因果效应可能在时间上和同一采样间隔内同时发生。这个问题在神经科学等应用中尤为重要,其中采样率可能相对粗糙,而同时效应不一定形成无环图。我们研究了在等噪声方差假设下线性高斯结构VAR模型中的因果发现,这意味着结构噪声项具有共同的方差。与基于DAG的横断面等噪声方差设置不同,此处考虑的时间序列设置通常不会导致因果图的唯一点识别。相反,多种结构VAR参数化可以诱导相同的平稳观测过程定律。我们引入了一种针对此设置的观测等价性概念,并展示相应的等价类由结构方程的正交变换以及全局正比例尺度共同刻画。这种刻画导致了观测对齐差异,即比较结构模型模去保持观测定律的变换。基于这一理论,我们提出ENVAR,一种基于稀疏性的方法,用于在诱导的观测等价类中搜索稀疏的归一化结构代表。我们评估了所提出的方法在合成结构VAR数据和fMRI数据集上的性能。

英文摘要

Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.
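
The non-identifiability described in the abstract can be checked numerically in a two-variable example. The sketch below assumes the structural form A0 x_t = A1 x_{t-1} + e_t with e_t ~ N(0, sigma^2 I), which is one standard way to write a first-order structural VAR; the matrices and helper names are illustrative and not taken from the paper. Left-multiplying the structural equations by an orthogonal matrix, or rescaling them by a positive constant together with the noise standard deviation, leaves the reduced-form dynamics and innovation covariance unchanged.

```python
import math

def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def reduced_form(a0, a1, sigma2):
    """Return (Phi, Sigma) with Phi = A0^-1 A1 and Sigma = sigma^2 A0^-1 A0^-T."""
    a0_inv = inverse(a0)
    phi = matmul(a0_inv, a1)
    gram = matmul(a0_inv, transpose(a0_inv))
    cov = [[sigma2 * v for v in row] for row in gram]
    return phi, cov

def max_abs_diff(x, y):
    return max(abs(x[i][j] - y[i][j]) for i in range(2) for j in range(2))

a0 = [[1.0, 0.4], [-0.3, 1.0]]  # both off-diagonals nonzero: a contemporaneous cycle
a1 = [[0.5, 0.1], [0.0, 0.3]]
sigma2 = 0.7

theta = 0.9
q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]  # orthogonal rotation

phi, cov = reduced_form(a0, a1, sigma2)
phi_q, cov_q = reduced_form(matmul(q, a0), matmul(q, a1), sigma2)

scale = 2.5  # global positive scale applied to the equations and the noise std
scaled_a0 = [[scale * v for v in row] for row in a0]
scaled_a1 = [[scale * v for v in row] for row in a1]
phi_c, cov_c = reduced_form(scaled_a0, scaled_a1, scale ** 2 * sigma2)
```

Because a stationary Gaussian VAR(1) law is determined by Phi and the innovation covariance, all three parameterizations generate the same observed process, which is why a further criterion such as sparsity is needed to pick a representative.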

2605.21842 2026-05-22 cs.LG cs.CL eess.SP 版本更新

Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

能量门控注意力:频谱显著性作为Transformer注意力的归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher, Athens, Greece(雅典,希腊独立研究者)

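AI summary This paper proposes Energy-Gated Attention (EGA), which uses spectral salience as an inductive bias for transformer attention by gating value aggregation on the spectral energy of key token embeddings, so that informationally dense positions receive more attention. It reports validation-loss improvements of about 0.10 on both TinyShakespeare and Penn Treebank.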

Comments 12 pages, 4 figures

详情
AI中文摘要

标准的Transformer注意力计算查询和键之间的成对相似性,将所有标记视为具有同等显著性,无论其内在信息含量如何。在湍流流体力学中,相干结构——在背景混沌中持续存在的能量主导、空间组织化的模式——承载了总能量的不成比例份额,并控制所有传输。我们提出,标记在Transformer注意力中扮演类似的角色:信息密集的位置(形态边界、语法头、话语标记)集中了频谱能量,并应比背景标记(功能词、重复模式、低信息填充词)获得更多的注意力。我们提出能量门控注意力(EGA):一种简单的修改,通过键标记嵌入的频谱能量来门控值聚合,该计算通过一个单个学习的线性投影完成,以发现嵌入场的主导频谱模式。在TinyShakespeare上,EGA仅使用12,480个额外参数(<0.26%的开销)和没有可测量的计算成本,就实现了+0.103的验证损失改进。结果在Penn Treebank上也一致(+0.101),证明了数据集的独立性。在三种小波家族(固定Morlet、Daubechies db2/db4和参数化Morlet)的系统消融研究中,发现固定结构基底是次优的——最优的能量方向是数据自适应的且非正弦的——同时识别出学习的小波包作为有前途的开放方向。学习的能量阈值收敛到tau ~ 0.35,无论初始化如何,对应于英语文本中携带高于平均频谱能量的约36%的标记比例,这是一个稳定的语言属性,与英语文本中内容词的比例一致。

英文摘要

Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures -- the energetically dominant, spatially organized patterns that persist amid background chaos -- carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (<0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal -- the optimal energy direction is data-adaptive and non-sinusoidal -- while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to tau ~= 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.
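
One possible reading of the gating mechanism is sketched below for a single attention head. The abstract only states that value aggregation is gated by an energy computed from a single learned linear projection with a learned threshold tau; the squared projection, the normalization by the maximum energy, the sigmoid gate, and the `sharpness` parameter are all assumptions made for illustration, not the paper's definition.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def energy_gates(keys, w, tau=0.35, sharpness=10.0):
    """Soft gate per key token from the squared projection onto direction w.

    Energies are divided by their maximum so that tau lives on a [0, 1] scale
    (an assumption of this sketch).
    """
    energies = [dot(w, k) ** 2 for k in keys]
    top = max(energies) or 1.0  # avoid division by zero when all energies vanish
    return [1.0 / (1.0 + math.exp(-sharpness * (e / top - tau))) for e in energies]

def gated_attention(queries, keys, values, gates):
    """Scaled dot-product attention whose value aggregation is scaled per key by a gate."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out = [sum(wt * g * v[d] for wt, g, v in zip(weights, gates, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

keys = [[1.0, 0.0], [0.1, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gates = energy_gates(keys, w=[1.0, 0.0])
gated = gated_attention([[1.0, 0.0]], keys, values, gates)
plain = gated_attention([[1.0, 0.0]], keys, values, [1.0, 1.0, 1.0])
```

With all gates set to 1 the function reduces to standard attention, so the gate can be seen as a per-token reweighting layered on top of the usual query-key similarity.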

2605.21834 2026-05-22 cs.LG 版本更新

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

基于策略的一致性训练通过最小能力退化提升大语言模型安全性

Andy Han, Kristina Fujimoto, Avidan Shah, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, Rico Angell

发表机构 * New York University(纽约大学)

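AI summary This paper proposes On-Policy Consistency Training (OPCT), which computes the consistency objective over the model's own responses, supervised by the same model conditioned on the corresponding contrastive prompts. Experiments show that OPCT outperforms supervised fine-tuning (SFT) at reducing sycophancy, defending against jailbreaks, and improving safety awareness, while largely avoiding the capability regressions that SFT induces.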

详情
AI中文摘要

对齐的模型可能以多种方式表现不当:它们常常谄媚,容易被越狱攻击,或未能包含适当的安全警告。一致性训练是一种有前途的新对齐范式,通过使用对比输入对训练模型的不变性来缓解此类失败。现有的一致性训练过程在离线生成监督信号,并使用监督微调(SFT)来更新模型。不幸的是,由此产生的模型往往只是记忆训练分布的表面形式,因此泛化能力差且能力退化。我们引入基于策略的一致性训练(OPCT),一种新的一致性训练方法,其目标是在模型自身对提示的响应上计算,由自身对相应对比提示的条件监督。我们评估了OPCT在三个安全轴上的表现:顺从性、越狱和安全意识。在三个模型家族中,OPCT在所有安全目标上均优于其SFT对应物。与基线相比,OPCT将顺从率几乎减半(8.1% vs. 15.4%,相比之下SFT为11.2%)。在适应性目标攻击者下,OPCT在保持的越狱行为上保持越狱防御成功率接近99%,而SFT平均达到87%。在安全意识方面,OPCT在两个模型中优于SFT,其余模型中与SFT相当。OPCT还大大避免了SFT引发的能力退化,如在MATH-500上下降28分。我们的结果表明,一致性训练最好以OPCT而不是SFT的方式实施,尤其是在希望超越训练分布泛化时。

英文摘要

Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness. Across three model families, OPCT outperforms its SFT counterpart on all safety desiderata. It nearly halves the sycophancy rate relative to baseline (8.1% vs. 15.4%, compared to 11.2% for SFT). Under an adaptive per-target attacker, OPCT holds jailbreak defense success near 99% on held-out jailbreak behaviors, whereas SFT achieves 87% on average. On safety awareness, OPCT outperforms SFT in two out of three models, and matches it on the other. OPCT also largely avoids the capability regressions that SFT induces, such as a 28-point drop on MATH-500. Our results suggest that consistency training is best implemented as OPCT rather than as SFT, especially when generalization beyond the training distribution is desired.
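
A minimal sketch of an on-policy consistency objective is given below. The abstract does not specify the loss, so the per-token KL divergence and its direction (teacher to student) are assumptions; the setup imagined here is that a response is sampled from the model under one prompt, and the same model conditioned on the contrastive prompt supplies fixed target distributions for each response token.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(student_token_dists, teacher_token_dists):
    """Mean per-token KL(teacher || student) along one sampled response.

    student_token_dists: next-token distributions under the training prompt.
    teacher_token_dists: distributions from the same model conditioned on the
    contrastive prompt, treated as constants (no gradient) in this sketch.
    """
    per_token = [kl_divergence(t, s)
                 for t, s in zip(teacher_token_dists, student_token_dists)]
    return sum(per_token) / len(per_token)

# Toy two-token response over a two-word vocabulary.
teacher = [[0.9, 0.1], [0.5, 0.5]]
student = [[0.6, 0.4], [0.5, 0.5]]
loss = consistency_loss(student, teacher)
zero_loss = consistency_loss(teacher, teacher)
```

The contrast with the offline SFT variant is that the response tokens here come from the current model rather than from a fixed dataset, so the supervision tracks the model's own output distribution as it changes.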

2605.21820 2026-05-22 cs.LG cond-mat.mtrl-sci 版本更新

Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale

超越标量目标:基于专家反馈的自主实验探索用于纳米尺度科学发现

Ralph Bulanadi, Jefferey Baxter, Arpan Biswas, Hiroshi Funakubo, Dennis Meier, Jan Schultheiß, Rama Vasudevan, Yongtao Liu

发表机构 * Center for Nanophase Materials Sciences, Oak Ridge National Laboratory(橡树岭国家实验室纳米相材料科学中心) University of Tennessee-Oak Ridge Innovation Institute, University of Tennessee(田纳西大学-橡树岭创新研究所) Department of Material Science and Engineering, School of Materials and Chemical Technology, Institute of Science Tokyo(东京科学大学材料与化学技术学院材料科学与工程系) Department of Materials Science and Engineering, Norwegian University of Science and Technology (NTNU)(挪威科技大学(NTNU)材料科学与工程系) Faculty of Physics and Center for Nanointegration Duisburg-Essen (CENIDE), University of Duisburg-Essen(杜伊斯堡-埃森大学物理系与杜伊斯堡-埃森纳米集成中心(CENIDE)) Research Center Future Energy Materials and Systems, Research Alliance Ruhr(鲁尔研究联盟未来能源材料与系统研究中心)

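AI summary This paper proposes deep-kernel pairwise learning (DKPL), which incorporates expert judgements and interdisciplinary scientific knowledge into an active learning loop for autonomous microscopy experiments, learning a latent utility function from pairwise expert comparisons instead of relying on a predefined scalar objective.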

详情
AI中文摘要

自动驾驶实验室或自主实验正成为加速科学发现的变革性平台。贝叶斯优化(BO)是用于此目的最广泛使用的机器学习框架之一,但这些基于BO的框架依赖于预定义的标量描述符来指导实验。在许多情况下,确定合适的标量描述符可能具有挑战性,并且可能无法捕捉到专家所察觉的微妙但科学重要的现象。为克服这一限制,本文开发了深度核成对学习(DKPL),一种用于自主显微实验的方法,该方法将人类专业知识和跨学科科学知识整合到一个主动学习循环中。与依赖显式标量目标不同,DKPL使专家能够直接评估哪些实验输出更有前途,使用跨学科知识。DKPL然后从这些专家判断中学习一个潜在的效用函数,以指导后续的自主显微实验。我们通过一个具有已知真实值的实验模型数据集展示了DKPL在学习物理有意义的纳米级结构方面的能力,同时有效优先考虑高信息测量区域。我们进一步将DKPL应用于分析铁电域墙的特性,在BiFeO3中区分高和低特征域墙角度,并在ErMnO3中发现头对头和尾对尾的域墙特性。这一发展建立了一种将专家知识整合到自主显微实验中的方法,并展示了一条通向能够解决超越标量度量驱动学习限制的科学问题的专家引导的自动驾驶实验室的路径。

英文摘要

Self-driving laboratories or autonomous experimentation are emerging as transformative platforms for accelerating scientific discovery. Bayesian optimization (BO) is among the most widely used machine learning frameworks for these purposes, but these BO-based frameworks rely on predefined scalar descriptors to guide experimentation. In many situations, the determination of an appropriate scalar descriptor can be challenging, and may fail to capture subtle yet scientifically important phenomena apparent to experts with interdisciplinary insight. To overcome this limitation, here we develop deep-kernel pairwise learning (DKPL), an approach for autonomous microscopy experiments which incorporates human expertise and interdisciplinary scientific knowledge into an active learning loop. Instead of relying on explicit scalar objectives, DKPL enables experts to directly evaluate which experimental output is more promising using interdisciplinary knowledge. DKPL then learns a latent utility function from these expert judgements to guide subsequent autonomous microscopy experiments. We demonstrate DKPL's performance in learning physically meaningful nanoscale structures while effectively prioritizing high-information measurement regions using an experimental model dataset with known ground truth. We further apply DKPL to analyze the character of ferroelectric domain walls, where we find DKPL capable of distinguishing between high and low characteristic domain-wall angles in bismuth ferrite, and able to discover both head-to-head and tail-to-tail domain-wall character in erbium manganite. This development establishes an approach to integrate expert knowledge into autonomous microscopy experiments and demonstrates a pathway toward expert-guided self-driving laboratories capable of addressing scientific problems beyond the limits of scalar-metrics-driven learning.
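
The idea of learning a latent utility from pairwise expert judgements can be sketched with a Bradley-Terry style model. The paper uses a deep kernel; the code below swaps in a linear utility over hand-picked features purely for illustration, and the feature values, learning rate, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def utility(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_pairwise_utility(features, preferences, lr=0.5, steps=500):
    """Fit a linear utility so that P(a preferred over b) = sigmoid(u(a) - u(b)).

    preferences is a list of (winner_index, loser_index) pairs supplied by
    an expert. Plain gradient ascent on the mean log-likelihood.
    """
    dim = len(features[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for winner, loser in preferences:
            diff = [a - b for a, b in zip(features[winner], features[loser])]
            p = sigmoid(utility(w, diff))
            for k in range(dim):
                grad[k] += (1.0 - p) * diff[k]
        w = [wi + lr * g / len(preferences) for wi, g in zip(w, grad)]
    return w

# Three candidate measurements described by one illustrative feature each.
features = [[0.1], [0.4], [0.9]]
# Expert judgements: item 2 beats item 1, item 1 beats item 0, item 2 beats item 0.
preferences = [(2, 1), (1, 0), (2, 0)]
w = fit_pairwise_utility(features, preferences)
best_index = max(range(len(features)), key=lambda i: utility(w, features[i]))
```

In an active learning loop, the fitted utility would rank unmeasured candidates so that the next experiment targets the region the expert is most likely to find promising.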

2605.21805 2026-05-22 stat.CO cs.LG stat.ML 版本更新

Truncated Neural Likelihood Estimation for Simulation-Based Inference in State-Space Models

截断神经似然估计用于状态空间模型中的基于模拟的推断

Kostas Tsampourakis, Víctor Elvira

发表机构 * School of Mathematics, University of Edinburgh(爱丁堡大学数学学院)

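AI summary This paper proposes truncated sequential neural likelihood (T-SNL), which addresses the limitations of standard sequential neural likelihood (SNL) for parameter inference in state-space models: a large simulation budget, poor scaling with sequence length, and the lack of amortization. The resulting method is reported to be more accurate, stable, and robust.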

详情
AI中文摘要

状态空间模型(SSMs)是强大的概率工具,用于建模具有潜变量动态的时间变化系统。在SSMs中的推断涉及对潜变量和参数的估计。在本文中,我们关注参数推断,这在SSMs中通常是一个极具挑战性的问题,因为似然函数不可行。最近,神经估计方法,如序列神经似然(SNL),在贝叶斯推断问题中显示出有前途的结果。在本文中,我们证明了当SNL应用于SSMs设置时,存在重要的限制,例如需要大量的模拟样本才能实现中等性能,序列长度扩展性差,且不具有amortization特性。我们随后介绍了一种新的推断算法,称为截断-SNL(T-SNL),以解决SNL的限制。我们的算法更加准确,训练过程中更加稳定和鲁棒,扩展性更强,且在新观测可用时可以进行amortization。我们的实验表明,T-SNL是一种样本效率高、鲁棒且灵活的算法,优于其他方法。

英文摘要

State-space models (SSMs) are powerful probabilistic tools for modeling time-varying systems with latent dynamics. Inference in SSMs involves the estimation of latent states and parameters. In this work, we focus on parameter inference, which for SSMs is in general a very challenging problem due to the intractability of the likelihood. Recently, neural estimation methods, such as sequential neural likelihood (SNL), have shown promising results in Bayesian inference problems. In this paper, we show that SNL, when applied to the SSM setting, suffers from important limitations: it requires a large number of simulated samples to achieve moderate performance, scales poorly with sequence length, and is not amortized. We then introduce a novel inference algorithm called truncated-SNL (T-SNL), which addresses the limitations of SNL. Our algorithm is more accurate, more stable and robust during training, more scalable to longer temporal sequences, and can be amortized when new observations become available. Our experiments show that T-SNL is a sample-efficient, robust, and flexible algorithm which outperforms other approaches.

2605.21804 2026-05-22 eess.IV cs.CV cs.LG 版本更新

Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis

使用AlphaEarth地理空间嵌入和深度学习分析映射加利福尼亚州番茄种植系统

Mohammadreza Narimani, Alireza Pourreza, Parastoo Farajpoor

发表机构 * Department of Biological and Agricultural Engineering, University of California, Davis(加州大学戴维斯分校生物与农业工程系)

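AI summary This study evaluates whether Google DeepMind's AlphaEarth geospatial embeddings can serve as an analysis-ready alternative for mapping processing tomato systems in California. It builds a balanced reference dataset from LandIQ 2018 crop polygons and combines a U-Net segmentation model with Monte Carlo dropout to obtain accurate field-scale tomato maps with uncertainty estimates.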

Comments 5 pages, 3 figures, 1 table. Preprint submitted to ASABE 2026 AIM

详情
AI中文摘要

田间尺度的作物地图支持供应链预测和政策制定,但全州范围的作物识别仍常常依赖回顾性调查,或依赖围绕人工设计的光谱特征构建的遥感工作流程。这些流程可以做到准确,但需要重复预处理,并且往往在跨年份时失去鲁棒性。本研究评估了Google DeepMind的AlphaEarth地理空间嵌入能否作为一种分析就绪的替代方案,用于绘制加利福尼亚州的加工番茄种植系统。使用LandIQ 2018的作物多边形构建了包含4,742个番茄地块和4,742个非番茄地块的平衡参考数据集。对于每个多边形,提取64波段的AlphaEarth嵌入图块(chip)并与二值掩码对齐,然后划分为空间上相互独立的训练集(n = 6,638)、验证集(n = 1,422)和测试集(n = 1,424)。在AWS SageMaker上使用由掩码二元交叉熵与软Dice损失组成的复合损失训练了U-Net分割模型。为了补充硬预测,推理时保留蒙特卡洛dropout,并对每个图块重复100次以估计预测均值和方差。在独立测试集上,模型达到了99.19%的像素准确率、98.69%的精确率、99.40%的召回率、99.04%的F1分数、98.11%的交并比和99.02%的图块准确率。不确定性图在田块边缘附近始终最高,在田块内部较低。结果表明,AlphaEarth嵌入保留了与作物相关的空间和时间结构,无需人工特征工程即可支持准确的田间尺度番茄制图。

英文摘要

Field-scale crop maps support supply-chain forecasting and policy, yet statewide crop identification still often depends on retrospective surveys or remote-sensing workflows built around hand-engineered spectral features. Those pipelines can be accurate, but they require repeated preprocessing and often lose robustness across years. This study evaluated whether Google DeepMind's AlphaEarth geospatial embeddings can serve as an analysis-ready alternative for mapping processing tomato systems in California. LandIQ 2018 crop polygons were used to assemble a balanced reference dataset of 4,742 tomato and 4,742 non-tomato fields. For each polygon, 64-band AlphaEarth embedding chips were extracted and aligned with binary masks, then divided into spatially independent training (n = 6,638), validation (n = 1,422), and test (n = 1,424) sets. A U-Net segmentation model was trained on AWS SageMaker using a composite masked binary cross-entropy and soft Dice loss. To complement hard predictions, Monte Carlo dropout was retained at inference and repeated 100 times per chip to estimate predictive mean and variance. On the independent test set, the model achieved 99.19% pixel accuracy, 98.69% precision, 99.40% recall, 99.04% F1 score, 98.11% intersection over union, and 99.02% chip accuracy. Uncertainty maps were consistently highest near field edges and low within field interiors. The results show that AlphaEarth embeddings retain crop-relevant spatial and temporal structure and can support accurate, field-scale tomato mapping without manual feature engineering.
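
The composite loss and the Monte Carlo dropout summary mentioned in the abstract can be sketched on flattened pixel lists. The equal weighting between the two loss terms, the smoothing constant, and the function names are assumptions of this sketch; the abstract does not state them.

```python
import math

def masked_bce_dice_loss(probs, targets, valid, dice_weight=0.5, eps=1e-7):
    """Binary cross-entropy plus soft Dice loss over valid pixels only.

    probs: predicted tomato probabilities, targets: 0/1 labels,
    valid: 0/1 mask marking pixels that count toward the loss.
    """
    pts = [(min(max(p, eps), 1.0 - eps), t)
           for p, t, v in zip(probs, targets, valid) if v]
    if not pts:
        return 0.0
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1.0 - p)
               for p, t in pts) / len(pts)
    intersection = sum(p * t for p, t in pts)
    denom = sum(p for p, _ in pts) + sum(t for _, t in pts)
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)
    return (1.0 - dice_weight) * bce + dice_weight * dice

def mc_dropout_summary(passes):
    """Per-pixel mean and variance across repeated stochastic forward passes."""
    n = len(passes)
    means = [sum(col) / n for col in zip(*passes)]
    variances = [sum((x - m) ** 2 for x in col) / n
                 for col, m in zip(zip(*passes), means)]
    return means, variances

perfect = masked_bce_dice_loss([1.0, 0.0, 1.0], [1, 0, 1], [1, 1, 1])
with_masked_error = masked_bce_dice_loss([1.0, 0.0, 1.0, 0.9], [1, 0, 1, 0], [1, 1, 1, 0])
means, variances = mc_dropout_summary([[0.2, 0.9], [0.4, 0.9]])
```

As a consistency check on the reported metrics, precision 0.9869 and recall 0.9940 give F1 = 2PR/(P+R) of about 0.9904, and the corresponding IoU, F1/(2 - F1), is about 0.9811, both matching the abstract.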

2605.21803 2026-05-22 cs.LG 版本更新

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

相同架构,不同容量:优化器诱导的谱缩放定律

Nandan Kumar Jha, Brandon Reagen

发表机构 * New York University(纽约大学)

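AI summary This study examines how the optimizer affects spectral scaling laws in Transformer architectures. It finds that the same architecture exhibits markedly different scaling of utilized spectral capacity under different optimizers, and argues for optimizer-architecture co-design.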

Comments 31 pages, 10 figures, 30 tables. Project page: https://optimizer-scaling-laws.github.io

详情
AI中文摘要

缩放定律使语言模型性能可从模型大小、数据和计算量预测,但通常将优化器视为固定训练细节。我们显示,这一假设忽略了表示缩放的一个基本轴:优化器如何有效地将增加的FFN宽度转换为利用的谱容量。通过测量前馈网络表示的谱特征,通过软和硬谱秩,我们发现,当使用不同优化器训练时,相同的Transformer架构实现了显著不同的谱缩放定律。在固定架构和宽度计划的情况下,AdamW在稀有词(TAIL)表示上表现出弱的硬秩缩放(β=0.44),而在相同区域,Muon实现了线性缩放(β=1.02),缩放指数增加了2.3倍。这一差异无法归因于验证损失:AdamW配置可以在扩展训练下匹配低秩Dion变体的困惑度,但表现出显著不同的谱几何结构,表明匹配的损失不意味着匹配的表示结构。硬-软秩不对称进一步揭示,优化器不仅在实现的容量上有所不同,还影响了容量在特征模式上的结构。为了区分优化器效应与架构效应,我们比较了架构干预(例如注意力秩和位置编码),并发现优化器诱导的谱位移往往超过架构效应。这些结果表明优化是表示缩放的第一轴,推动了优化器-架构协同设计。

英文摘要

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta = 1.02$) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity, under extended training, while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
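
A scaling exponent of the kind reported above is typically obtained by fitting a power law rank ≈ c · width^β in log-log space. The sketch below shows that fit on synthetic numbers; the widths, the constant, and the function name are made up for illustration, and the paper's exact definition of soft and hard spectral rank is not reproduced here.

```python
import math

def fit_power_law(widths, ranks):
    """Least-squares fit of log(rank) = log(c) + beta * log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    c = math.exp(mean_y - beta * mean_x)
    return beta, c

# Synthetic spectral ranks generated from an exact power law with beta = 0.44.
widths = [256, 512, 1024, 2048]
ranks = [3.0 * w ** 0.44 for w in widths]
beta, c = fit_power_law(widths, ranks)

# Ratio of the two exponents quoted in the abstract.
exponent_ratio = 1.02 / 0.44
```

The ratio 1.02 / 0.44 is about 2.32, consistent with the 2.3x increase in the scaling exponent stated in the abstract.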

2605.21801 2026-05-22 cs.LG cs.CL 版本更新

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

为何语义熵失效:面向策略优化的几何感知与校准不确定性

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

发表机构 * University of Notre Dame(诺丁汉大学)

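AI summary This paper proposes Geometric-aware Calibrated Policy Optimization (GCPO), a policy optimization framework that uses geometry-aware measures to capture semantic disagreement and reward-based calibration to align uncertainty with learning-signal strength, so that the uncertainty signal tracks gradient variability more faithfully and post-training performance improves.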

详情
AI中文摘要

训练后已成为改进大语言模型推理和对齐的关键,其中无批评模型能够实现从模型生成输出的可扩展学习,但缺乏区分信息性与噪声信号的原理性机制。最近的方法利用响应级度量作为不确定性信号来调节基于群体的优化方法,如GRPO。然而,其经验成功仍不稳定,且不清楚它们如何影响优化动态。在本文中,我们提供迄今为止第一个原理性公式,将不确定性信号解释为表征和调节梯度方差和学习信号质量的机制。基于经验和理论分析,我们识别出当前基于熵的估计器的两个关键缺陷:各向异性缺口和校准缺口。受此分析启发,我们提出几何感知校准策略优化(GCPO),一种新的框架,整合几何感知度量以捕捉语义分歧,利用基于奖励的校准对齐不确定性与学习信号强度。在多个基准测试中的实验表明,我们的方法更忠实跟踪梯度变化,并且一致提升训练后性能。我们的结果强调了设计与优化动态对齐的不确定性信号的重要性,为稳健训练后方法提供了原理性视角。

英文摘要

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and it is unclear how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

2605.21800 2026-05-22 cs.LG cs.RO 版本更新

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

stable-worldmodel: 一个用于可重复世界建模研究和评估的平台

Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero

发表机构 * Mila & Université de Montréal(Mila与蒙特利尔大学) New York University(纽约大学) Universidade Federal de Minas Gerais(米纳斯吉拉斯联邦大学) Independent Researcher(独立研究者) LanceDB University of Oxford(牛津大学) Brown University(布朗大学)

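AI summary This paper presents stable-worldmodel (swm), an open-source platform that addresses the fragmentation of codebases, data pipelines, and evaluation protocols in world modeling research. It provides a high-performance Lance-based data layer, tested implementations of modern world model baselines and planning solvers, and a suite of environments and tasks with controllable factors of variation for standardized, reproducible evaluation.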

详情
AI中文摘要

世界模型是构建能够推理、规划并在训练数据之外进行泛化的智能体的核心。然而,目前世界模型的研究仍然碎片化,互不相同的代码库、数据管道和评估协议阻碍了可重复性和公平比较。当前实践还受到三个关键瓶颈的限制:脆弱的一次性代码库、缓慢的视频数据加载以及缺乏标准化的泛化基准。我们提出了stable-worldmodel (swm),一个用于标准化、可重复的世界建模研究和评估的开源平台。它提供了(1)一个基于Lance的高性能数据层,原生支持MP4、HDF5和LeRobot数据集并提供转换工具;(2)现代世界模型基线和规划求解器的简洁、经过充分测试的实现;(3)一套广泛的环境和任务,并加入了可控的视觉、几何和物理变化因素,用于在仿真中系统地评估动态理解、控制性能、表示质量和分布外泛化。通过在单一可扩展框架下统一整个流程,swm显著减少了研究开销,并加速了向可靠世界模型的可信进展。

英文摘要

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

2605.21798 2026-05-22 cs.LG stat.ML 版本更新

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

用神经过程摊销高斯过程推断的三种代价

Robin Young

发表机构 * University of Cambridge, Cambridge, UK(剑桥大学)

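AI summary This paper studies the cost of amortizing Gaussian process inference with neural processes, which replace the exact O(n^3) posterior with a learned O(n) map. It bounds the KL divergence between the GP and latent neural process predictives, decomposes it into three sources (label contamination, an information bottleneck, and amortization error), and derives architectural recommendations.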

Comments To appear at ProbNum 2026

详情
AI中文摘要

神经过程用于摊销高斯过程推断,将精确的O(n^3)后验替换为学习的O(n)映射,从上下文集到预测分布。对于一类潜在的神经过程,我们界定了高斯过程和LNP预测之间的KL散度,将其分解为三个可解释的来源,即标签污染,因为神经过程使用标签值来估计在精确高斯过程中标签无关的量;信息瓶颈,因为有限维表示无法解析完整的上下文几何;以及摊销误差,因为单个编码器网络在所有上下文中共享。瓶颈截断项随着表示维度d衰减为O(e^{-cd^{2/d_x}}),对于平方指数核在R^{d_x}上,其中c>0是核依赖的常数,以及对于Matérn-ν核为O(d^{-2ν/d_x}),直接将架构尺寸与核平滑度和输入维度联系起来。标签污染项通常为O(1),只有观测噪声部分衰减为O(1/n),识别了通过标签依赖的表示路由不确定性估计的持续成本。这些结果刻画了在分析类别中的摊销成本,并产生了架构建议,以在高斯过程摊销范围内仅从上下文位置预测方差,并用二阶池化代替均值聚合以关闭主导的摊销差距。

英文摘要

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.
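
The label-independent quantity behind the label contamination term can be seen in the exact GP posterior, whose predictive variance depends on the context locations but not on the labels. The sketch below implements standard GP regression with a squared-exponential kernel in plain Python; the kernel hyperparameters and data are illustrative, and the linear solver is the step that gives exact inference its cubic cost.

```python
import math

def rbf(a, b, lengthscale=1.0):
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(matrix, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting (O(n^3))."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def gp_posterior(xs, ys, x_star, noise_var=0.1):
    """Exact GP posterior mean and latent variance at x_star."""
    gram = [[rbf(a, b) + (noise_var if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_star) for a in xs]
    alpha = solve(gram, ys)          # depends on the labels
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(gram, k_star)          # depends on locations only
    var = rbf(x_star, x_star) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

xs = [-1.0, 0.0, 0.7, 2.0]
mean_a, var_a = gp_posterior(xs, [0.3, -0.2, 1.1, 0.5], x_star=0.4)
mean_b, var_b = gp_posterior(xs, [5.0, 4.0, -3.0, 2.0], x_star=0.4)
```

Changing the labels moves the posterior mean but leaves the variance untouched, which is why the abstract recommends predicting variance from context locations alone in this regime.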

2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG 版本更新

Residual Skill Optimization for Text-to-SQL Ensembles

残差技能优化用于文本到SQL集成

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Snowflake AI Research(Snowflake人工智能研究)

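AI summary This paper proposes DivSkill-SQL, a residual skill optimization framework that builds complementary Text-to-SQL ensembles by optimizing each new skill on the examples the current ensemble fails on, which targets Pass@K directly. On Spider2-Lite it improves selected accuracy by up to +11.1 points over the strongest ensemble baseline, with gains that transfer across dialects and tasks.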

详情
AI中文摘要

文本到SQL集成通过生成多个SQL候选并选择一个来优于单一候选生成,但其效果受限于Pass@K,即至少有一个K候选正确的概率。现有方法通过随机解码或提示变体启发式地引入多样性,导致候选集受相关失败主导。我们提出DivSkill-SQL,一种残差技能优化框架,构建互补的文本到SQL集成而无需模型微调:每个新技能在当前技能集成失败的示例上进行优化,证明其对Pass@K的边际贡献。在Spider2-Lite上,DivSkill-SQL在Snowflake和BigQuery上分别比最强集成基线提升11.1和8.3个点,且在两个基础模型(Opus-4.6和GPT-5.4)上表现一致。在单个方言上无重新训练即可转移至其他方言(Snowflake、BigQuery、SQLite)和不同任务形式(如BIRD-Critic,+2.6个点)。错误诊断显示幻觉的模式参考和函数调用减少3倍,表明收益来自真正可靠的互补技能,而非表面形式变化。

英文摘要

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
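
The residual principle can be illustrated with a simplified selection problem. In the paper each new skill is optimized on the examples the current ensemble fails on; the sketch below instead picks from a fixed pool of hypothetical skills with known per-example outcomes, which is enough to show why targeting residual failures raises Pass@K while adding a correlated skill does not.

```python
def pass_at_k(outcomes, selected):
    """Fraction of examples solved by at least one selected skill."""
    n = len(next(iter(outcomes.values())))
    solved = sum(any(outcomes[s][i] for s in selected) for i in range(n))
    return solved / n

def greedy_residual_selection(outcomes, k):
    """Repeatedly add the skill that solves the most still-unsolved examples."""
    n = len(next(iter(outcomes.values())))
    selected = []
    unsolved = set(range(n))
    for _ in range(k):
        best, best_gain = None, 0
        for name, results in outcomes.items():
            if name in selected:
                continue
            gain = sum(1 for i in unsolved if results[i])
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining skill fixes any residual failure
            break
        selected.append(best)
        unsolved -= {i for i in unsolved if outcomes[best][i]}
    return selected

# Hypothetical per-example correctness for three skills on six questions.
outcomes = {
    "a": [True, True, True, True, False, False],
    "b": [True, True, True, False, False, False],   # fails where "a" fails
    "c": [False, False, False, False, True, True],  # weak alone, complementary
}
chosen = greedy_residual_selection(outcomes, k=2)
residual_score = pass_at_k(outcomes, chosen)
accuracy_ranked_score = pass_at_k(outcomes, ["a", "b"])
```

Here skill "c" has the lowest individual accuracy, yet pairing it with "a" covers every example, whereas the two individually strongest skills share the same failures.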

2605.21783 2026-05-22 cs.LG stat.ML 版本更新

MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

Ahanaf Hasan Ariq

发表机构 * Ideal School and College(理想学校和学院)

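AI summary: This paper develops a PAC-Bayesian framework for test-time adaptation that interprets MMD-balls as credal sets, giving a natural way to quantify epistemic uncertainty. It establishes an MMD-dependent generalization bound, a finite-sample version, a uniform worst-case risk bound, and geodesic preservation bounds.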

Comments 15 pages, 0 figures. Accepted at the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

详情
英文摘要

Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley's imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.
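
To make the MMD-ball idea concrete, the sketch below computes a biased (V-statistic) estimate of the squared MMD between two samples and checks whether a candidate sample lies within a given radius of the source sample. The Gaussian kernel, the bandwidth, the radius, and all names are assumptions chosen for illustration; the paper's bounds are not reproduced here.

```python
import math
from typing import Sequence

Point = Sequence[float]


def rbf_kernel(x: Point, y: Point, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a: Sequence[Point], b: Sequence[Point]) -> float:
        return sum(rbf_kernel(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def in_mmd_ball(source: Sequence[Point], candidate: Sequence[Point],
                radius: float, bandwidth: float = 1.0) -> bool:
    """True if the estimated MMD between the samples is at most `radius`."""
    estimate = math.sqrt(max(mmd_squared(source, candidate, bandwidth), 0.0))
    return estimate <= radius


# Hypothetical one-dimensional samples: the second is a shifted copy of the first.
source = [(0.0,), (1.0,)]
shifted = [(3.0,), (4.0,)]
```

In this reading, the ball radius plays the role of the shift budget: distributions whose estimated MMD from the source stays within it are the ones the credal set admits.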

2605.21780 2026-05-22 cs.LG cs.CR 版本更新

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

基于差分隐私原始-对偶视角的后门攻击可证明鲁棒性

Aman Saxena, Jan Schuchardt, Yan Scholten, Stephan Günnemann

发表机构 * Department of Computer Science, Technical University of Munich(慕尼黑工业大学计算机科学系) Munich Data Science Institute(慕尼黑数据科学研究所) MCML Machine Learning Research, Morgan Stanley(摩根士丹利机器学习研究)

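AI summary: This paper proposes a framework, based on the dual view of differential privacy, for certifying robustness to adversarial perturbations. By connecting randomized smoothing with privacy profiles, it provides joint robustness guarantees against training-time and inference-time attacks.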

详情
AI中文摘要

随机平滑是一种强大的工具,可用于认证对对抗扰动的鲁棒性,包括通过随机化训练抵御投毒攻击,以及通过随机化推理抵御逃逸攻击。将这些保证扩展到训练数据和测试数据同时被扰动的后门攻击仍然具有挑战性,因为训练时和测试时的随机化机制必须在同一个鲁棒性证书内进行分析。我们通过隐私剖面(privacy profile)将随机平滑与差分隐私的对偶视角联系起来,隐私剖面为组合异构机制提供了一种数值方法。由此得到的框架能够对复杂的组合机制进行紧致、模块化、端到端的认证,同时利用现有的差分隐私机制分析。我们针对DP-SGD和带有推理时平滑的深度分区聚合(Deep Partition Aggregation)实例化该框架,推导出同时针对训练时和推理时攻击的联合鲁棒性保证。在MNIST和CIFAR-10上的实验展示了该框架的有效性。总体而言,我们提供了一个有原则且通用的框架,利用复合机制在复杂威胁模型下认证鲁棒性,这类威胁模型能更好地刻画现实对手的能力。

英文摘要

Randomized smoothing is a powerful tool for certifying robustness to adversarial perturbations, including poisoning attacks via randomized training and evasion attacks via randomized inference. Extending these guarantees to backdoor attacks, where training and test data are jointly perturbed, remains challenging because training- and test-time randomized mechanisms must be analyzed within a single robustness certificate. We address this by connecting randomized smoothing to the dual view of differential privacy through privacy profiles, which provide a numerical procedure for composing heterogeneous mechanisms. The resulting framework enables tight, modular, end-to-end certification of complex, composed mechanisms while leveraging existing analyses of differentially private mechanisms. We instantiate the framework for DP-SGD and Deep Partition Aggregation with inference-time smoothing, deriving joint robustness guarantees against both training-time and inference-time attacks. Experiments on MNIST and CIFAR-10 demonstrate the effectiveness of our framework. Overall, we provide a principled and general framework for using composite mechanisms to certify robustness under complex threat models that better capture the capabilities of real-world adversaries.

2605.21773 2026-05-22 cs.CR cs.LG 版本更新

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

HIDBench: 用于基于主机入侵检测的大型语言模型评估

Danyu Sun, Jinghuai Zhang, Yuan Tian, Zhou Li

发表机构 * University of California, Irvine(加州大学尔湾分校) University of California, Los Angeles(加州大学洛杉矶分校)

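AI summary: This paper presents HIDBench, a benchmark for evaluating how well large language models support host-based intrusion detection systems (HIDS), and reveals performance gaps across datasets and sensitivity to the complexity of system log data.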

详情
AI中文摘要

近年来,基准测试努力已推动了大型语言模型(LLMs)在网络安全中的评估,包括渗透测试和漏洞识别等任务。然而,入侵检测从系统日志这一关键网络安全任务仍未被探索。在本文中,我们提出一个新的基准测试,以评估LLM在支持基于主机的入侵检测系统(HIDS)中的能力。该任务需要在大规模、嘈杂且高度不平衡的系统日志上进行细粒度推理,其中良性与恶意活动之间的复杂相互作用使得可靠检测具有挑战性。我们的基准测试统一了三个公开的系统日志数据集,DARPA-E3、DARPA-E5和NodLink,并引入了一个数据构建管道,将原始主机遥测数据转换为LLM兼容的输入,从而在现实入侵检测设置下进行系统评估。我们对前沿LLM的评估揭示了在不同数据集上的显著性能差距。尽管许多模型在更简单的数据集上实现了高精度(通常超过0.8),但当系统日志变得更加嘈杂和复杂时,其性能显著下降,MCC经常低于0.5,误报率急剧上升。我们进一步分析了模型行为,并识别出不同的模式,包括具有低误报率的保守检测器和产生过多警报的过度敏感模型。总体而言,我们的结果表明,尽管LLM在HIDS中显示了强大的潜力,但其有效性对数据复杂性高度敏感,稳健的系统设计对于可靠的部署至关重要。

英文摘要

Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings. Our evaluation of frontier LLMs reveals substantial performance gaps across datasets. While many models achieve high precision (often above 0.8) on simpler datasets, their performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. We further analyze model behavior and identify distinct regimes, including conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. Overall, our results highlight that while LLMs show strong potential for HIDS, their effectiveness is highly sensitive to data complexity, and robust system design is essential for reliable deployment.
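
The abstract reports precision, MCC, and false positive rate side by side. The sketch below shows, on hypothetical confusion counts for a heavily imbalanced log set, how precision can reach 0.8 while MCC stays well below 0.5. The counts and function names are illustrative assumptions, not benchmark results.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Share of raised alerts that are truly malicious."""
    return tp / (tp + fp)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of benign events that were flagged."""
    return fp / (fp + tn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical conservative detector: few alerts, most of them right,
# but the majority of the 100 malicious events are missed.
tp, fp, fn, tn = 8, 2, 92, 898
p = precision(tp, fp)
fpr = false_positive_rate(fp, tn)
score = mcc(tp, fp, fn, tn)
```

Because MCC uses all four cells of the confusion matrix, it penalizes the missed detections that precision alone ignores, which makes it a stricter summary on imbalanced logs.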

2605.21770 2026-05-22 cs.LG 版本更新

Manifold-Guided Attention Steering

基于流形的注意力引导

Ian Li, Kapilesh Guruprasad, Raunak Sengupta, Ninad Satish, Loris D'Antoni, Rose Yu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

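AI summary: This paper proposes Manifold-Guided Attention Steering, which monitors each attention head's distance from a correctness manifold during inference and corrects deviations dynamically, improving large language models on mathematical reasoning, code generation, and molecular generation.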

详情
AI中文摘要

尽管大型语言模型具备完成正确推理所需的知识,但在推理任务中仍经常出现错误。一种可能的改进方法是通过激活引导。然而,现有激活引导方法使用固定且预先计算的修正向量,忽略了模型当前所处的生成轨迹位置;结果是无差别扰动,会像错误步骤一样自由地破坏已正确步骤。我们提出基于流形的注意力引导(MAGS),这是一种基于几何观察的轨迹感知推理过程干预方法:特定注意力头的输出激活在错误点偏离低维正确性流形,并且这种偏差会通过后续步骤累积。对于每个识别出的注意力头,我们从正确和错误轨迹的对比对中学习一个低维子空间,该子空间捕捉了误差行为偏离正确行为的方向。在推理过程中,我们监控每个头与该流形的距离,并在偏差超过学习阈值时应用针对性的投影修正,将注意力输出引导回正确的子空间,防止误差传播。MAGS在数学推理(MATH-500,GSM8K)、代码生成(HumanEval,MBPP)和分子生成(SMILES)等基准测试中均优于未引导的基线和静态引导方法,表明正确性流形是LLM注意力几何学中的普遍特征。

英文摘要

Large language models frequently produce errors in reasoning tasks despite possessing the underlying knowledge required for correct reasoning. One possible approach to improve reasoning consistency is through activation steering. However, existing activation steering approaches apply fixed, pre-computed correction vectors, ignoring where the model currently sits along its generation trajectory; the result is indiscriminate perturbation that disrupts already-correct steps as freely as erroneous ones. We propose Manifold-Guided Attention Steering (MAGS), a trajectory-aware inference-time intervention grounded in a geometric observation: the output activations of specific attention heads diverge from a low-dimensional correctness manifold at the point of error, and this deviation compounds through subsequent steps. For each identified attention head, we learn, from contrastive pairs of correct and incorrect traces, a low-dimensional subspace that captures the directions along which error behavior deviates from correct behavior. During inference, we monitor each head's proximity to this manifold and apply a targeted projection correction when deviation exceeds a learned threshold, steering the attention output back toward the correct subspace before the error propagates. MAGS consistently outperforms both unsteered baselines and static steering approaches across benchmarks spanning mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES), suggesting that correctness manifolds are a general feature of LLM attention geometry.
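
One way to read the monitor-then-project step is sketched below. It assumes the learned subspace is given as an orthonormal basis of error directions, measures deviation as the norm of the activation's component in that subspace, and removes the component only when the deviation exceeds a threshold. This is an illustrative interpretation with hypothetical names and values, not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def error_component(activation: Vector, error_basis: List[Vector]) -> Vector:
    """Projection of the activation onto the span of orthonormal error directions."""
    component = [0.0] * len(activation)
    for direction in error_basis:
        coeff = dot(activation, direction)
        component = [c + coeff * d for c, d in zip(component, direction)]
    return component


def steer(activation: Vector, error_basis: List[Vector], threshold: float) -> Vector:
    """Leave the activation untouched unless its deviation exceeds the threshold."""
    component = error_component(activation, error_basis)
    deviation = math.sqrt(dot(component, component))
    if deviation <= threshold:
        return list(activation)
    # Remove only the part that lies along the error directions.
    return [a - c for a, c in zip(activation, component)]


# Hypothetical 3-dimensional head output with a single error direction.
basis = [[0.0, 1.0, 0.0]]
corrected = steer([3.0, 4.0, 0.0], basis, threshold=1.0)
untouched = steer([3.0, 0.5, 0.0], basis, threshold=1.0)
```

The threshold is what distinguishes this from static steering in the sketch: steps whose activations already sit close to the correct region pass through unchanged.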

2605.21768 2026-05-22 cs.LG cs.MA 版本更新

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Memory-R2:面向长程记忆增强LLM智能体的公平信用分配

Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

发表机构 * Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Huawei Heisenberg Research Center (Munich)(华为海森堡研究中心(慕尼黑)) Technical University of Munich(慕尼黑技术大学)

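AI summary: This paper proposes the Memory-R2 framework, which combines local and global group-relative optimization to address unfair credit assignment caused by diverging memory states when training long-horizon memory-augmented LLM agents in multi-session environments, while jointly optimizing memory formation and memory evolution.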

详情
英文摘要

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
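
The group-relative comparison that the abstract refers to can be sketched as follows. The normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO-style advantage; the reward values are hypothetical, and how LoGo-GRPO weights the local and global terms is not specified in the abstract, so no combination rule is shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each rollout relative to the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Global group: full trajectories scored by their final trajectory-level reward.
global_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Local group: rerollouts that all start from the SAME intermediate memory state,
# so their rewards are directly comparable (hypothetical values).
local_adv = group_relative_advantages([0.2, 0.8, 0.5])
```

The comparison is only fair when every rollout in a group faces the same effective environment, which is the condition the local rerollouts restore by sharing a starting memory state.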

2605.21765 2026-05-22 cs.LG 版本更新

Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning

Emanuel Sommer, David Rügamer

发表机构 * Department of Statistics, LMU Munich(统计系,慕尼黑大学) Munich Center for Machine Learning(慕尼黑机器学习中心)

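AI summary: This position paper examines the potential of sampling-based inference (SAI) in Bayesian deep learning, arguing that it has reached computational parity with optimization-based methods and may become the more effective inference approach. Its core contribution is to promote SAI for Bayesian neural networks and to address current misconceptions, with the goal of more accurate uncertainty quantification.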

Comments In Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

详情
AI中文摘要

贝叶斯神经网络(BNNs)中基于采样的推理(SAI)的实用应用仍然有限,部分原因是持续存在的关于其可行性和效率的误解。本文认为,SAI在计算上已与基于优化的方法达到平衡,并即将超越这些方法,成为BNNs中更有效和高效的推理方法。这一发展应成为整个社区的利益,推动BNNs作为一种原则性的范式,实现其长期未实现的承诺,即为神经网络提供原则性的不确定性量化。SAI甚至可以做到更多——通过模型平均获得更优的预测性能,成为各种可能的下游任务的基础,并为BNNs的景观提供关键见解。为了实现这种变革并释放采样的潜力,克服当前的误解是必要的第一步。下一步是重新定向研究努力,解决SAI中尚存的挑战。特别是,社区必须专注于两个核心问题:充分探索后验景观和高保真度地蒸馏后验样本以实现高效的下游推理。通过解决概念和实践上的障碍,我们可以解锁SAI的全部潜力,并将其确立为贝叶斯深度学习中的核心工具。

英文摘要

The practical adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs) remains limited, partly due to persistent misconceptions about the feasibility and efficiency of sampling. This position paper argues that SAI has achieved computational parity with optimization-based methods and is on the verge of superseding such methods for effective and efficient inference in BNNs. This development should be in the interest of the whole community, promoting BNNs as a principled paradigm with its long-standing yet unfulfilled promise of providing principled uncertainty quantification for neural networks. SAI can even do more -- yielding superior prediction performance through model averaging, serving as the foundation for a plethora of possible downstream tasks, and providing crucial insights into the landscape of BNNs. In order to make such a change happen and unfold the potential of sampling, overcoming current misconceptions is a necessary first step. The next step is to realign research efforts toward addressing remaining challenges in SAI. In particular, the community must focus on two core problems: sufficient exploration of the posterior landscape and high-fidelity distillation of posterior samples for efficient downstream inference. By addressing conceptual and practical obstacles, we can unlock the full potential of SAI and establish it as a central tool in Bayesian deep learning.

2605.21763 2026-05-22 cs.LG cs.SY eess.SY stat.ML 版本更新

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

关于优化确定等价的折扣强化学习样本复杂性

Oliver Mortensen, Mohammad Sadegh Talebi

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

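AI summary: This paper studies risk-sensitive reinforcement learning in finite discounted MDPs under the optimized certainty equivalent (OCE) family of risk measures. It analyzes the sample complexity of learning the optimal state-action value function and an optimal policy under recursive OCE, exactly characterizes the utility functions that are PAC-learnable, derives PAC sample complexity bounds for a simple model-based approach, and shows that the problem is not PAC-learnable when the utility's domain is not all of the reals. It also gives lower bounds for value and policy learning that are tight in the state-action space size SA and, for a more restricted utility class, make the dependence on the effective horizon 1/(1-γ) explicit.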

Comments Accepted to RLC 2026. arXiv admin note: substantial text overlap with arXiv:2506.00286

详情
AI中文摘要

我们研究了有限折扣MDP中的风险敏感强化学习,其中假设可以使用MDP的生成模型。我们考虑一类称为优化确定性等价(OCE)的风险度量,其中包括熵风险、CVaR和均值-方差等重要的风险度量。我们关注在递归OCE下学习最优状态-动作价值函数(价值学习)和最优策略(策略学习)的样本复杂度。我们精确刻画了使对应的OCE定义出PAC可学习目标的效用函数u。我们分析了一种简单的基于模型的方法,并推导了PAC样本复杂度界。我们证明,只要u的定义域不是全体实数,即dom(u)≠R,相应的问题就不可PAC学习。最后,我们为价值学习和策略学习建立了相应的下界,证明了关于状态-动作空间大小SA的紧性;对于更受限的效用类,我们推导的下界显式给出了对有效时域1/(1-γ)的依赖。具体而言,对于CVaR_τ,我们证明对τ的正确依赖为1/τ²,从而比现有最优结果改进了1/τ倍,尽管我们的界对1/(1-γ)的依赖是次优的。

英文摘要

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-γ}$ explicit. Specifically, for $\text{CVaR}_τ$ we show that the correct dependence on $τ$ is $\frac{1}{τ^2}$, thus improving by a factor of $\frac{1}{τ}$ over the state of the art, although our bound has a suboptimal dependence on $\frac{1}{1-γ}$.
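
As a worked instance of the OCE family, the sketch below evaluates CVaR through the standard variational form OCE_u(X) = sup_λ { λ + E[u(X − λ)] } with u(t) = min(t, 0)/τ, on an empirical sample. It assumes a reward convention (higher is better, so CVaR averages the worst τ-fraction of outcomes); the paper's own convention and notation may differ, and the names here are hypothetical.

```python
from typing import Sequence


def oce_cvar(samples: Sequence[float], tau: float) -> float:
    """Empirical CVaR_tau via the OCE form sup_lam { lam - E[(lam - X)_+] / tau }.

    The objective is concave and piecewise linear in lam with kinks at the
    sample points, so it is enough to evaluate it at the samples themselves.
    """
    n = len(samples)

    def objective(lam: float) -> float:
        shortfall = sum(max(lam - x, 0.0) for x in samples) / n
        return lam - shortfall / tau

    return max(objective(lam) for lam in samples)


# Hypothetical returns, equally likely.
returns = [1.0, 2.0, 3.0, 4.0]
worst_half = oce_cvar(returns, tau=0.5)   # average of the two lowest returns
risk_neutral = oce_cvar(returns, tau=1.0)  # reduces to the plain mean
```

The 1/τ factor inside the objective hints at why estimation becomes harder as τ shrinks: errors in the shortfall term are amplified by 1/τ.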

2605.21762 2026-05-22 cs.LG 版本更新

Machine learning prediction of obstructive coronary artery disease using opportunistic coronary calcium and epicardial fat assessments from CT calcium scoring scans

利用CT钙扫描中的机会性冠状动脉钙化和心外膜脂肪评估进行阻塞性冠状动脉疾病的机器学习预测

Juhwan Lee, Ammar Hoori, Tao Hu, Justin N. Kim, Mohamed H. E. Makhlouf, Michelle C. Williams, David E. Newby, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

发表机构 * Department of Biomedical Engineering, Virginia Commonwealth University(弗吉尼亚联邦大学生物医学工程系) Department of Biomedical Engineering, Case Western Reserve University(凯斯西储大学生物医学工程系) Harrington Heart and Vascular Institute, University Hospitals Cleveland Medical Center(克利夫兰医学中心哈灵顿心脏和血管研究所) BHF Centre for Cardiovascular Science, University of Edinburgh(爱丁堡大学BHF心血管科学中心) Department of Radiology, Case Western Reserve University(凯斯西储大学放射学系)

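AI summary: This study develops a machine learning framework that analyzes coronary calcium and epicardial fat from CT calcium scoring scans to predict obstructive coronary artery disease, showing potential to improve predictive performance and reduce reliance on contrast-enhanced CT or invasive testing.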

Comments 16 pages, 4 figures, 3 tables

详情
AI中文摘要

非对比计算断层扫描钙评分(CTCS)是一种成本效益高的成像模态,广泛用于检测冠状动脉钙化。本研究旨在开发一种先进的机器学习框架,利用CTCS图像中冠状动脉钙化和心外膜脂肪的定量分析来预测阻塞性冠状动脉疾病(CAD)。研究人群包括1,324名接受CTCS和冠状动脉CT血管造影的SCOT-HEART临床试验参与者。我们从CTCS图像中提取并分析了广泛特征,包括24个临床变量、189个钙组学和211个心外膜脂肪组学特征。特征选择使用CatBoost算法结合SHapley Additive exPlanation(SHAP)值进行。预测建模利用CatBoost梯度提升方法,专注于最有信息量的特征。从初始的424个候选特征中,通过CatBoost-SHAP方法确定了14个最具有预测性的特征。前两个预测特征来自脂肪组学,其余12个特征来自钙组学。优化后的模型表现出稳健的预测能力,显示出灵敏度为83.1±4.6%、特异性为93.8±1.7%、准确度为85.3±2.0%、F1分数为73.9±3.3%。包括钙组学和脂肪组学数据显著提高了预测性能。值得注意的是,该模型在具有不同冠状动脉钙化评分的患者中也表现出可靠的预测准确性,包括在零钙化评分的情况下仍存在阻塞性CAD的病例。这种创新方法有潜力改善临床决策,并可能减少对增强CT或侵入性诊断程序的依赖,特别是在低至中等风险患者群体中。

英文摘要

Non-contrast computed tomography calcium scoring (CTCS) is a cost-effective imaging modality widely used to detect coronary artery calcifications. This study aimed to develop an advanced machine learning framework that utilizes quantitative analyses of coronary calcium and epicardial fat from CTCS images to predict obstructive coronary artery disease (CAD). The study population consisted of 1,324 patients from the SCOT-HEART clinical trial who underwent both CTCS and coronary CT angiography. We extracted and analyzed a broad range of features, including 24 clinical variables, 189 calcium-omics, and 211 epicardial fat-omics features from the CTCS images. Feature selection was conducted using the CatBoost algorithm combined with SHapley Additive exPlanation (SHAP) values. Predictive modeling utilized the CatBoost gradient boosting method, focusing on the most informative features. From an initial set of 424 candidate features, 14 were identified as most predictive through the CatBoost-SHAP method. The top two predictive features originated from fat-omics, with the remaining 12 features derived from calcium-omics. The optimized model achieved robust predictive capabilities, demonstrating a sensitivity of 83.1+/-4.6%, specificity of 93.8+/-1.7%, accuracy of 85.3+/-2.0%, and an F1 score of 73.9+/-3.3%. Inclusion of calcium-omics and fat-omics data significantly improved predictive performance. Notably, the model also showed reliable predictive accuracy in patients with diverse coronary calcium scores, including cases with obstructive CAD despite a zero-calcium score. This innovative approach holds promise for improving clinical decision-making and potentially reducing dependence on contrast-enhanced or invasive diagnostic procedures, particularly within low- to intermediate-risk patient groups.

2605.21752 2026-05-22 cs.LG cs.AI 版本更新

PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

PEARL:通过对比学习实现工业级直播推荐的无偏百分位估计

Blake Gella, Wei Wu, Yuhao Yin, Zexi Huang, Zikai Wang, Emily Liu, Junlin Zhang, Wentao Guo, Qinglei Wang

发表机构 * TikTok ByteDance(字节跳动)

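AI summary: This paper proposes PEARL, a contrastive learning framework that addresses imbalance in user behavior by modeling relative preference signals, improving the performance and robustness of recommender systems.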

详情
AI中文摘要

训练于用户交互数据的推荐系统容易受到行为强度不平衡的影响——这种系统性扭曲源于用户间异质的参与模式。这种不平衡会使反馈信号失真,使得观察到的互动不再真实反映真实的偏好,导致模型过度放大高活跃用户信号而低估其他人,最终在大规模情况下降低推荐质量与鲁棒性。为了解决这个问题,我们提出了一种非参数对比百分位近似框架PEARL,该框架建模相对偏好信号而非绝对参与程度。基于相对优势去偏,PEARL利用真实的对比交互样本直接近似百分位关系,而无需依赖辅助分布估计模型。我们提供了理论证明,表明这种成对比较能产生无偏的基于百分位的偏好信号估计。为了更广泛的应用,我们引入了基于预测的重采样机制用于百分位平滑以处理稀疏和离散的反馈,以及通用的价值加权形式和共训练策略以增强建模灵活性和表示学习。大量离线实验表明,PEARL有效减轻了行为偏差,并在多个排序目标上一致提高了推荐性能。在拥有数十亿用户的大规模直播平台部署后,在线A/B测试确认了实际收益:观看时长增加2.10%,消费金额增加0.80%,互动率增加1.49%,举报率降低6.91%。

英文摘要

Recommender systems trained on user interaction data are susceptible to behavioral intensity imbalance--a systematic distortion arising from heterogeneous engagement patterns across users. This imbalance skews feedback signals such that observed interactions no longer faithfully reflect true preferences, causing models to disproportionately amplify signals from highly active users while underrepresenting others, which ultimately degrades recommendation quality and robustness at scale. To address this issue, we propose a nonparametric contrastive percentile approximation framework, PEARL, that models relative preference signals instead of absolute engagement magnitudes. Building upon relative advantage debiasing, PEARL leverages real contrastive interaction samples to approximate percentile relationships directly, without relying on auxiliary distribution estimation models. We provide theoretical justification demonstrating that such pairwise comparisons yield unbiased estimates of percentile-based preference signals. For broader applicability, we introduce a prediction-based bootstrapping mechanism for percentile smoothing to handle sparse and discrete feedback, alongside a generalized value-weighted formulation and a co-training strategy to enhance both modeling flexibility and representation learning. Extensive offline experiments demonstrate that PEARL effectively mitigates behavioral bias and consistently improves recommendation performance across multiple ranking targets. Deployed in a production livestream platform with a combined user base of billions, online A/B testing confirms substantial real-world gains: +2.10% Watch Duration, +0.80% Consumption Amount, +1.49% Interaction Rate, and -6.91% Report Rate.
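
The idea of replacing absolute engagement with a relative, percentile-style signal can be sketched with plain pairwise comparisons inside one user's own history. This is a simplified illustration under assumed data (watch durations in seconds) and hypothetical names; it does not reproduce PEARL's contrastive training objective.

```python
from typing import Sequence


def percentile_by_comparison(value: float, history: Sequence[float]) -> float:
    """Share of the user's own past engagements that `value` beats; ties count half."""
    wins = sum(1.0 if value > h else 0.5 if value == h else 0.0 for h in history)
    return wins / len(history)


# Hypothetical watch durations: a light user and a heavy user whose
# engagement is ten times larger across the board.
light_user = [10.0, 20.0, 30.0, 40.0]
heavy_user = [100.0, 200.0, 300.0, 400.0]

light_signal = percentile_by_comparison(30.0, light_user)
heavy_signal = percentile_by_comparison(300.0, heavy_user)
```

Both users produce the same signal for an interaction that sits at the same rank in their own history, so the heavy user's larger raw numbers no longer dominate.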

2605.21751 2026-05-22 cs.LG 版本更新

Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

模型可以建模,但无法绑定:文本到优化中的结构化 grounding

Zhiqi Gao, Albert Ge, Alexander Berenbeim, Nathaniel D. Bastian, Frederic Sala

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) United States Military Academy(美国军事学院)

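AI summary: This paper shows that text-to-optimization relies on two separable capabilities, modeling and binding, and finds that model accuracy drops as instance data grows. It proposes BIND, which externalizes data to structured files to improve binding, and confirms the advantage of binding-specialist models across different optimization categories.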

详情
AI中文摘要

文本到优化需要两种可分离的能力:建模——选择正确的优化结构——和绑定——将每个系数、索引和参数在具体问题数据中具体化。我们通过Text2Opt-Bench,一个涵盖12类问题的可扩展基准,研究了这一问题,该基准包含从教科书线性规划到具有数千变量的随机和多目标形式的求解器验证优化问题。在10多个模型上,我们发现当实例数据增长时,准确性下降,即使优化形式本身简单。我们称此为有效绑定限制。我们通过一种简单的推理时间方法BIND来解决,该方法将数值数据外部化到结构化文件中,使模型能够程序化地绑定数据,而不是从提示中转录。BIND将GPT-5-Nano的准确性从59.1%提升到82.4%,在低于pass@1的token成本下达到pass@5(82.0%)的水平,并将GPT-5的准确性从86.2%提升到95.8%。此外,我们通过仅在绑定上微调模型验证了我们的假设,证明在三个结构上不同的优化类别中,绑定专精模型在端到端SFT和RL中表现更优,1.5B绑定专精模型单独即可达到7B端到端基线的水平。

英文摘要

Text-to-optimization requires two separable capabilities: modeling -- choosing the right optimization structure -- and binding -- grounding every coefficient, index, and parameter in the concrete problem data. We study this via Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems spanning 12 categories, from textbook linear programs to stochastic and multi-objective formulations with up to thousands of variables. Across 10+ models, we find that accuracy collapses as instance data grows, even when the formulation itself is simple. We call this the effective binding limit. We address this via a simple inference-time approach, BIND, which externalizes numeric data to structured files so the model binds data programmatically rather than transcribing from the prompt. BIND improves GPT-5-Nano from 59.1% to 82.4% accuracy, matching pass@5 (82.0%) at lower token cost than pass@1, and GPT-5 from 86.2% to 95.8%. Furthermore, we validate our hypothesis by finetuning a model exclusively on binding and show that it outperforms end-to-end SFT and RL across three structurally distinct optimization categories, with a 1.5B binding specialist alone matching a 7B end-to-end baseline.
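
The contrast between transcribing numbers from a prompt and binding them programmatically can be illustrated with a toy knapsack instance. In this sketch the instance data lives in a JSON document (standing in for the structured file), and the solver code refers to it only by key. The file layout, the field names, and the brute-force solver are hypothetical choices for illustration, not the paper's BIND pipeline.

```python
import json
from itertools import combinations

# Stand-in for the contents of a structured data file (hypothetical layout).
DATA_FILE_CONTENT = """
{"capacity": 5,
 "items": [{"name": "a", "value": 6, "weight": 2},
           {"name": "b", "value": 10, "weight": 3},
           {"name": "c", "value": 12, "weight": 4}]}
"""


def solve_knapsack(data: dict):
    """Brute-force 0/1 knapsack; every coefficient is read from `data` by key."""
    items = data["items"]
    best_value, best_names = 0, []
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            weight = sum(item["weight"] for item in subset)
            value = sum(item["value"] for item in subset)
            if weight <= data["capacity"] and value > best_value:
                best_value, best_names = value, [item["name"] for item in subset]
    return best_value, best_names


data = json.loads(DATA_FILE_CONTENT)
best_value, best_names = solve_knapsack(data)
```

The formulation never contains a literal coefficient, so growing the instance changes only the data file, which is the property the abstract attributes to programmatic binding.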

2605.21745 2026-05-22 cs.LG 版本更新

Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring

基于非增强CT钙化评分的冠状动脉钙化定量分析用于预测心肌缺血

Juhwan Lee, Sadeer Al-Kindi, Ammar Hoori, Tao Hu, Hao Wu, Justin N. Kim, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

发表机构 * Virginia Commonwealth University(弗吉尼亚联邦大学) Case Western Reserve University(凯斯西储大学) Houston Methodist Hospital(休斯敦卫理公会医院) University Hospitals Cleveland Medical Center(大学医院克利夫兰医学中心) Department of Radiology, Case Western Reserve University(凯斯西储大学放射科)

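AI summary: This paper proposes a machine learning framework that predicts myocardial ischemia from quantitative coronary calcium assessment on non-contrast CT calcium scoring scans. Relevant features are identified with XGBoost and SHAP, models are trained and evaluated with 5-fold cross-validation, and calcium-omics features significantly improve predictive performance.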

Comments 15 pages, 4 figures, 3 tables

详情
AI中文摘要

非增强计算机断层扫描钙化评分(CTCS)被广泛认可为心血管风险分层的有效工具。本研究旨在开发一种新的机器学习框架,利用常规非增强CTCS扫描进行定量冠状动脉钙化评估,以预测心肌缺血。本研究分析了1,375名患者,这些患者在一年内于University Hospitals Cleveland Medical Center同时接受了非增强CTCS和瑞加诺生(regadenoson)负荷心脏正电子发射断层扫描心肌灌注显像。总共评估了74个变量,包括临床变量、Agatston评分和钙化组学特征。通过XGBoost结合SHAP(Shapley Additive exPlanations)确定相关特征。使用5折交叉验证训练和评估预测模型。在987名患者中,89名(9%)心肌缺血呈阳性。最终模型整合了Agatston评分、八个钙化组学特征和年龄。所提出的模型实现了98.9±3.0%的精确率、79.2±8.4%的灵敏度和87.7±5.3%的F1分数。与仅使用临床变量或临床变量加Agatston评分的模型相比,添加钙化组学特征显著提高了预测性能(p<0.05)。有趣的是,尽管基于SHAP分析,钙化动脉的数量是排名最低的特征,但在逻辑回归分析中,它与心肌缺血的关联最强(比值比:3.63,95%置信区间:2.80-4.77,p<0.00001)。我们开发了一种机器学习方法,用于使用常规获取的非增强CTCS扫描预测心肌缺血。钙化组学特征在传统风险因素和Agatston评分之外提供了额外的预测价值,并可能支持更可及的心血管风险分层。

英文摘要

Non-contrast computed tomography calcium scoring (CTCS) is widely recognized as an effective tool for cardiovascular risk stratification. This study aimed to develop a novel machine learning framework for predicting myocardial ischemia from routine non-contrast CTCS scans using quantitative coronary calcium assessment. This study analyzed 1,375 patients who underwent both non-contrast CTCS and regadenoson stress cardiac positron emission tomography myocardial perfusion imaging within one year at University Hospitals Cleveland Medical Center. A total of 74 variables, including clinical variables, Agatston score, and calcium-omics features, were evaluated. Relevant features were identified using XGBoost with Shapley Additive exPlanations (SHAP). Predictive models were trained and evaluated using 5-fold cross-validation. Among 987 patients, 89 (9%) were positive for myocardial ischemia. The final model incorporated the Agatston score, eight calcium-omics features, and age. The proposed model achieved a precision of 98.9+/-3.0%, sensitivity of 79.2+/-8.4%, and F1 score of 87.7+/-5.3%. The addition of calcium-omics features significantly improved predictive performance compared with models using clinical variables alone or clinical variables with the Agatston score (p<0.05). Interestingly, the number of calcified arteries, despite being the lowest-ranked feature based on SHAP analysis, showed the strongest association with myocardial ischemia in logistic regression analysis (odds ratio: 3.63, 95% confidence interval: 2.80-4.77, p<0.00001). We developed a machine learning approach for predicting myocardial ischemia using routinely acquired non-contrast CTCS scans. Calcium-omics features provided incremental predictive value beyond conventional risk factors and Agatston scoring and may support more accessible cardiovascular risk stratification.

2605.21742 2026-05-22 cs.LG cs.IT math.IT 版本更新

Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

修正先验数据拟合网络在表格分类中的类别不平衡

Samuel McDowell, Nathan Stromberg, Lalitha Sankar

发表机构 * School of Electrical, Computer and Energy Engineering(电气、计算机与能源工程学院)

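AI summary: This paper studies how to correct the performance loss that class imbalance causes in prior-data fitted networks (PFNs) for tabular classification. Analyzing existing techniques, it finds that thresholding performs exceptionally well because PFNs are well calibrated, and that downsampling performs comparably because PFNs work well with limited data, with the added benefit of lower inference cost.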

Comments 5 pages, 6 figures, Information Theory Workshop (ITW)

详情
英文摘要

Prior-data fitted networks (PFNs) have achieved exceptional performance on tabular classification tasks. However, like other classifiers, their performance can suffer under the effect of class imbalance, resulting in poor performance for rare classes. Several techniques exist which attempt to mitigate the deleterious effect of class imbalance on classification performance, but the in-context learning (ICL) dynamic of PFNs means that loss-based strategies are impossible, and other techniques are unproven. We have adapted several classical techniques addressing class imbalance and analyzed their performance on PFN classification. We observe that thresholding performs exceptionally well because of the calibration characteristics of PFNs, and downsampling performs comparably because of PFNs' exceptional limited-data performance, with the additional benefit of reduced computation cost for inference.
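
The two techniques singled out in the abstract can be sketched without any model code, since both operate outside the network: thresholding acts on predicted probabilities, and downsampling acts on the labeled context passed to the PFN. The threshold value, the balancing rule, and all names below are illustrative assumptions rather than the paper's exact procedure.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def threshold_predict(minority_probs: Sequence[float], threshold: float) -> List[int]:
    """Predict the minority class (1) whenever its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in minority_probs]


def downsample_majority(rows: Sequence, labels: Sequence[int],
                        seed: int = 0) -> Tuple[list, List[int]]:
    """Keep an equal number of examples per class by sampling without replacement."""
    counts = Counter(labels)
    target = min(counts.values())
    rng = random.Random(seed)
    keep: List[int] = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(indices, target))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical calibrated probabilities for the rare class.
predictions = threshold_predict([0.1, 0.3, 0.6], threshold=0.25)

# Hypothetical imbalanced context set: 8 majority rows, 2 minority rows.
context_rows = list(range(10))
context_labels = [0] * 8 + [1] * 2
small_rows, small_labels = downsample_majority(context_rows, context_labels)
```

Downsampling also shrinks the context, which is where the reduced inference cost mentioned in the abstract would come from.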

2605.21736 2026-05-22 stat.ML cs.AI cs.LG 版本更新

Support-aware offline policy selection for advertising marketplaces

面向广告市场的支持感知离线策略选择

Prashant Shekhar, Caroline Howard

发表机构 * Department of Mathematics(数学系) Embry-Riddle Aeronautical University(埃姆伯里-瑞德尔航空大学)

AI总结 本文提出了一种支持感知的离线决策框架,用于广告市场的保留策略选择,通过将记录证据转化为保守决策对象,以确保验证的可靠性,而非仅依赖点估计排名。

详情
AI中文摘要

记录下来的广告拍卖日志使离线保留价评估颇具吸引力,但也存在风险。回放表可以识别出表面上收益提升很大的策略,但也可能掩盖阈值支持不足、多重比较效应、子群体受损以及竞价者响应的不确定性。现有的回放和离策略评估方法能够估计策略价值或对其排序,但不能直接回答一个操作层面的问题:现有证据是否足够强,足以支持进入验证。本文为保留价策略选择开发了一种支持感知的离线决策框架。该框架不输出单一的点估计胜出者,而是将日志证据转化为保守的决策对象,其中包括已认证的策略、在统计上被支配的备选方案,以及需要进一步验证的未决候选策略。主要理论结果给出了一个统一的有限目录保证:在同时不确定性控制和保守支持门控下,该框架保留通过门控的最佳策略,并且只剔除具有已认证遗憾的策略。辅助结果刻画了支持局部化的回放泛化,建立了信息论意义上的阈值分辨极限,并量化了异质的竞价者响应在何种情况下会推翻局部化的回放排名。在iPinYou实时竞价日志上的实验显示,领先的保留价规则在第二季取得了47.66%的回放提升和40.71%的同时下界提升,并在第三季取得了43.87%的冻结跨期(out-of-time)回放提升。该框架将包含19个策略的目录缩减为仅含两个策略的验证候选名单,同时在44个广告主、交易平台和地区细分中认证了无损害。结果支持本文的核心主张:离线保留价策略评估应当产出经过认证的验证决策,而非仅给出点估计排名。

英文摘要

Logged advertising auctions make offline reserve-price evaluation attractive but risky. Replay tables can identify policies with large apparent yield gains, yet they can also hide weak threshold support, multiple-comparison effects, subgroup harm, and bidder-response uncertainty. Existing replay and off-policy evaluation methods estimate or rank policy values, but they do not directly answer the operational question of whether the available evidence is strong enough to justify validation. This paper develops a support-aware offline decision framework for reserve-policy selection. Rather than outputting a single point-estimate winner, the framework converts logged evidence into a conservative decision object consisting of certified policies, statistically dominated alternatives, and unresolved candidates requiring further validation. The main theoretical result gives a unified finite-catalog guarantee showing that, under simultaneous uncertainty control and conservative support gates, the framework preserves the best gate-passing policy while eliminating only policies with certified regret. Supporting results characterize support-localized replay generalization, establish information-theoretic threshold-resolution limits, and quantify when heterogeneous bidder response can overturn localized replay rankings. Experiments on iPinYou real-time-bidding logs show that the leading reserve rule achieves a 47.66% replay lift in season two, a 40.71% simultaneous lower-bound lift, and a 43.87% frozen out-of-time replay lift in season three. The framework reduces a 19-policy catalog to a two-policy validation shortlist while certifying non-harm across 44 advertiser, exchange, and region segments. The results support the central claim that offline reserve-policy evaluation should produce certified validation decisions rather than point-estimate rankings alone.
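To make the idea of a "conservative decision object" more concrete, here is a minimal sketch that sorts a finite policy catalog into certified, dominated, unresolved, and gated-out groups. It is not the paper's procedure: the Bonferroni-corrected normal intervals, the `min_support` gate, the dictionary keys, and the toy numbers are all stand-in assumptions chosen to keep the example short.

```python
from statistics import NormalDist

def classify_policies(policies, min_support=100, alpha=0.05):
    """Split a catalog into certified / dominated / unresolved / gated-out policies.

    Each policy is a dict with hypothetical keys: name, lift (estimated gain over
    the logged baseline), se (standard error), support (number of logged auctions
    informing the estimate).
    """
    result = {"certified": [], "dominated": [], "unresolved": [], "gated_out": []}
    passing = []
    for p in policies:
        # Support gate: refuse to rank policies backed by too little logged data.
        (passing if p["support"] >= min_support else result["gated_out"]).append(p)
    if not passing:
        return result
    # Bonferroni correction so that all intervals hold simultaneously.
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(passing)))
    bounds = {p["name"]: (p["lift"] - z * p["se"], p["lift"] + z * p["se"]) for p in passing}
    best_lower = max(lo for lo, _ in bounds.values())
    for name, (lo, hi) in bounds.items():
        if hi < best_lower:
            result["dominated"].append(name)    # provably worse than some other policy
        elif lo > 0:
            result["certified"].append(name)    # lower bound beats the baseline
        else:
            result["unresolved"].append(name)   # needs further validation
    return result

catalog = [
    {"name": "A", "lift": 0.40, "se": 0.02, "support": 5000},
    {"name": "B", "lift": 0.05, "se": 0.02, "support": 4000},
    {"name": "C", "lift": 0.38, "se": 0.30, "support": 150},
    {"name": "D", "lift": 0.90, "se": 0.05, "support": 12},
]
decision = classify_policies(catalog)
```

In this toy catalog, policy D has the largest point estimate but too little support to be ranked at all, which is the kind of case a point-estimate leaderboard would get wrong.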

2605.21728 2026-05-22 cs.CV cs.CL cs.LG 版本更新

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

BEiTScore: 一种基于高效交叉编码器的无参考图像描述评估方法

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

发表机构 * Instituto Superior Técnico(里斯本大学理工学院) INESC-ID Instituto de Telecomunicações(电信机构)

AI总结 本文提出了一种无参考图像描述评估方法BEiTScore,通过高效的交叉编码器模型解决传统评估方法在计算成本和敏感性方面的不足,提出了一种新的评估指标,并在多种场景下验证了其优越的性能。

详情
AI中文摘要

随着视觉-语言模型朝着生成长篇、上下文丰富的描述等更具挑战性的能力发展,图像描述评估仍然是一项重大挑战。最先进的评估指标要么因使用大型语言模型(LLMs)作为评判者而带来高昂的计算成本,要么受限于标准的基于CLIP的编码器,例如严格的词元数量限制、缺乏细粒度敏感性,或因将描述视为“词袋”而缺乏组合泛化能力。我们提出了一种新的学习型指标来应对上述挑战,它基于一个轻量级交叉编码器,该编码器由视觉问答模型的检查点初始化,在强权重初始化与计算效率之间取得平衡。我们的训练方案使用精心组合的数据混合进行监督学习,并采用基于LLM的对抗性数据增强,以提高模型对细粒度视觉-语言错误的敏感性。我们还引入了一个新的基准,用于在多种场景下检验对详细描述的评估能力。实验结果表明,所提出的指标达到了最先进的性能,同时保持了大规模基准测试、质量感知解码或奖励引导所需的效率。

英文摘要

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21724 2026-05-22 cs.LG cs.AI 版本更新

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

TBP-mHC: 通过运输多面体实现流形约束超连接的完全表达能力

Anton Lyubinin

AI总结 本文提出 TBP-mHC,通过运输多面体参数化实现流形约束超连接的完全表达能力,解决了超连接中无约束混合导致的训练不稳定问题,并在语言模型预训练中展示了有竞争力的性能以及更好的稳定性和可扩展性。

详情
AI中文摘要

超连接(HC)通过在多个残差流之间引入可学习的混合来改进残差网络,但无约束的混合导致训练不稳定。Manifold-Constrained Hyper-Connections(mHC)通过Sinkhorn归一化强制近似双随机性,而mHC-lite则通过置换矩阵的凸组合确保精确约束,但以阶乘复杂度为代价。KromHC通过Kronecker积参数化减少此成本,但限制混合矩阵为Birkhoff多面体的结构子流形。我们提出运输Birkhoff多面体(TBP)参数化及其递归变体(RTBP),通过(n-1)^2自由度构造精确的双随机混合矩阵。我们的方法避免了迭代归一化和组合爆炸,同时保持Birkhoff多面体的完整表达性。在语言模型预训练中的实验证明了竞争性性能,同时具有改进的稳定性和可扩展性。

英文摘要

Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training demonstrate competitive performance with improved stability and scalability.
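For context, the sketch below shows the kind of iterative Sinkhorn normalization that the abstract attributes to mHC: rows and columns of a positive matrix are rescaled in turn until the matrix is approximately doubly stochastic. It illustrates the baseline only and does not reproduce the TBP or RTBP constructions; the function name, iteration count, and example logits are assumptions. The $(n-1)^2$ figure in the abstract matches the dimension of the Birkhoff polytope, since an n-by-n matrix has n^2 entries and the row and column sum constraints remove 2n - 1 independent degrees of freedom.

```python
import math

def sinkhorn(logits, n_iters=200):
    """Alternately normalise rows and columns of exp(logits) for a square matrix."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1 (this slightly disturbs the rows).
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

# Hypothetical unconstrained mixing logits for n = 3 residual streams.
mixing = sinkhorn([[0.5, -1.0, 2.0], [1.0, 0.0, -0.5], [0.3, 0.8, -1.2]])
row_sums = [sum(row) for row in mixing]
col_sums = [sum(mixing[i][j] for i in range(3)) for j in range(3)]
```

Because the loop ends on a column step, the columns sum to one up to rounding while the rows are only approximately normalised, which is the "approximate double stochasticity" the abstract contrasts with exact parameterizations.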

2605.21722 2026-05-22 cond-mat.stat-mech cond-mat.mtrl-sci cs.LG 版本更新

MetaDNS: Enhancing Exploration in Discrete Neural Samplers via Well-Tempered Metadynamics

MetaDNS: 通过良好温控元动力学增强离散神经采样器的探索

Xiaochen Du, Juno Nam, Jaemoo Choi, Wei Guo, Sathya Edamadaka, Junyi Sha, Elton Pan, Yongxin Chen, Molei Tao, Rafael Gómez-Bombarelli

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出MetaDNS,一种将良好温控元动力学整合到离散扩散或自回归采样器中的通用框架,以解决多模式和能量屏障离散分布采样中的模式崩溃问题,并实现自由能重建。

Comments Accepted at ICML 2026

详情
AI中文摘要

从具有多个模式和能量屏障的离散分布进行采样对于机器学习和计算物理都是基础性的。最近的离散神经采样器如MDNS在模式崩溃和无法采样模式之间高能屏障区域方面存在缺陷,这对自由能估计和相变理解至关重要。我们提出了元动力学离散神经采样器(MetaDNS),一种将良好温控元动力学整合到离散扩散或自回归采样器中的通用框架。通过在选定的低维坐标上维持一个适应性、历史依赖性的偏置势能,MetaDNS强迫探索之前不可达的区域,使自由能重建成为可能,这在标准神经采样器中由于缺乏高能样本而不可行。在具有挑战性的低温基准上,包括Ising、Potts和铜-金二元合金,MetaDNS再现了热力学分布。与基于MCMC的元动力学相比,MetaDNS也实现了相当的探索,所需偏置沉积步骤更少。

英文摘要

Sampling from discrete distributions with multiple modes and energy barriers is fundamental to machine learning and computational physics. Recent discrete neural samplers like MDNS suffer from mode collapse and fail to sample high-energy barrier regions between modes, which is critical for free energy estimation and understanding phase transitions. We propose Metadynamics Discrete Neural Sampler (MetaDNS), a general framework integrating well-tempered metadynamics into discrete diffusion or autoregressive samplers. By maintaining an adaptive, history-dependent bias potential along selected low-dimensional coordinates, MetaDNS forces exploration of previously inaccessible regions, enabling free energy reconstruction infeasible with standard neural samplers due to a lack of high-energy samples. On challenging low-temperature benchmarks including Ising, Potts, and the copper-gold binary alloy, MetaDNS reproduces the thermodynamic distribution. Compared to MCMC-based metadynamics, MetaDNS also achieves comparable exploration requiring fewer bias deposition steps.
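For readers unfamiliar with well-tempered metadynamics, the sketch below shows the standard bias-deposition rule on a discrete collective variable: each visit adds a bump whose height shrinks as exp(-V(s) / (k_B ΔT)), where V(s) is the bias already accumulated at that value. This is a generic illustration of the tempering rule, not the MetaDNS training loop; the function name `deposit_bias`, the dictionary-based bias table, and the parameter values are assumptions.

```python
import math

def deposit_bias(bias, s, w0=1.0, kb_delta_t=2.0):
    """Add one well-tempered deposit at collective-variable value s.

    bias: dict mapping CV values to accumulated bias V(s); mutated in place.
    w0: initial deposit height.
    kb_delta_t: k_B * ΔT, which controls how quickly deposits shrink.
    Returns the height that was deposited.
    """
    current = bias.get(s, 0.0)
    height = w0 * math.exp(-current / kb_delta_t)
    bias[s] = current + height
    return height

# Repeatedly visiting the same CV value yields ever smaller deposits.
bias = {}
heights = [deposit_bias(bias, s=3) for _ in range(5)]
```

The shrinking deposits are what distinguish the well-tempered variant from plain metadynamics, where every deposit has the same height and the bias keeps growing without bound.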

2605.21707 2026-05-22 cs.CE cs.LG 版本更新

Zero-shot adaptation to order book dynamics

面向订单簿动态的零样本适应

Arip Asadulaev

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出了一种自适应做市架构,保留了Avellaneda-Stoikov框架的解析结构,同时引入了后继度量(successor measure)式的适应机制,通过将市场动态与交易目标分离,实现对不断变化的市场状态和交易目标的适应。

详情
AI中文摘要

我们描述了一种自适应做市架构,它保留了Avellaneda-Stoikov框架的解析结构,同时引入了后继度量(successor measure)式的适应机制。在本文中,我们保留Avellaneda-Stoikov快速的Hamilton-Jacobi-Bellman结构,并使其能够适应不断变化的市场状态和交易目标。核心思想是将市场动态与交易目标分离。市场状态决定一组低维的Avellaneda-Stoikov参数,而近期已实现的奖励决定一个低维的目标向量。随后,HJB前向映射通过对未来奖励特征的标量化,将该目标转换为最优的买入和卖出报价。

英文摘要

We describe an adaptive market-making architecture that preserves the analytical structure of the Avellaneda–Stoikov framework while introducing a successor-measure-style adaptation mechanism. We keep the fast Hamilton–Jacobi–Bellman (HJB) structure of Avellaneda–Stoikov and make it adaptive to changing market regimes and trading objectives. The central idea is to separate market dynamics from the trading objective. The market state determines a low-dimensional set of Avellaneda–Stoikov parameters, while recent realized rewards determine a low-dimensional objective vector. The HJB forward map then converts this objective into optimal bid and ask quotes through a scalarization of future reward features.
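As background on the analytical structure being preserved, the sketch below evaluates the classical closed-form Avellaneda–Stoikov approximation: a reservation price shifted against inventory, and a symmetric spread around it. It covers only the textbook formulas; the learned parameter map, the objective vector, and the successor-measure mechanism from the abstract are not reproduced, and the function name and parameter values are assumptions.

```python
import math

def avellaneda_stoikov_quotes(mid, inventory, gamma, sigma, k, time_left):
    """Return (bid, ask) from the closed-form Avellaneda-Stoikov approximation.

    mid: mid-price s; inventory: signed position q; gamma: risk aversion;
    sigma: volatility; k: decay rate of order-arrival intensity; time_left: T - t.
    """
    # Reservation price: r = s - q * gamma * sigma^2 * (T - t)
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    # Total spread: gamma * sigma^2 * (T - t) + (2 / gamma) * ln(1 + gamma / k)
    spread = gamma * sigma ** 2 * time_left + (2.0 / gamma) * math.log(1.0 + gamma / k)
    return reservation - spread / 2.0, reservation + spread / 2.0

# With zero inventory the quotes are centred on the mid-price;
# a long position shifts both quotes down to encourage selling.
flat_bid, flat_ask = avellaneda_stoikov_quotes(100.0, 0, 0.1, 2.0, 1.5, 1.0)
long_bid, long_ask = avellaneda_stoikov_quotes(100.0, 5, 0.1, 2.0, 1.5, 1.0)
```

In an adaptive variant of the kind the abstract describes, quantities such as `gamma`, `sigma`, and `k` would presumably be outputs of a model conditioned on the market state rather than fixed constants.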

2605.21699 2026-05-22 cs.LG cs.CL 版本更新

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token: 通过投影引导的跨分词器知识蒸馏

Sharath Turuvekere Sreenivas, Adithyakrishna Venkatesh Hanasoge, Mingyu Yang, Ali Taghibakhshi, Saurav Muralidharan, Ashwath Aithal, Pavlo Molchanov

发表机构 * NVIDIA

AI总结 本文提出X-Token,一种通过投影引导的跨分词器知识蒸馏方法,解决传统方法在处理不同分词器间知识迁移时的不足,通过两个互补的损失函数改进知识蒸馏效果。

详情
AI中文摘要

跨分词器知识蒸馏使学生模型能够向词汇表不兼容的教师模型学习。已有工作基于隐藏状态或logits;后者更受青睐,因为它可以即插即用,无需辅助组件。基于logits的方法要么只使用正确词元的概率,从而丢失教师分布中完整的“暗知识”,要么作用于完整的输出分布,但依赖严格的词元划分和/或缺乏原则的启发式排序。我们指出了基于完整分布的logits方法的两个关键缺陷:(i) 罕见词元失效,即关键词元落入未匹配子集(例如,在按位拆分数字的Qwen教师监督下,Llama的1100个多位数字词元)并在训练中被抑制,使GSM8k相比于来自较弱教师的同分词器KD从12.89降至2.56;(ii) 过于保守的匹配,即严格的一对一匹配排除了不同表面形式之间近似等价的词元。这些失效需要不同的补救办法:当关键词元未对齐时取消划分,当对齐可靠时对划分进行细化。我们提出X-Token,它包含两种针对上述问题的互补损失形式。P-KL取消划分,并通过稀疏投影矩阵W(由分词器层面的字符串规则初始化)将学生分布与教师分布对齐,以解决罕见词元失效。H-KL保留混合形式,同时放宽匹配,使每个学生词元与其在W下排名最高的教师映射对齐。两个目标共享W,并可自然扩展到多个教师。实验上,在Llama-3.2-1B上,X-Token在Qwen3-4B教师下比当前最先进的GOLD平均高出3.82分,在Phi-4-Mini教师下高出0.5分。此外,双教师设置(Phi-4-mini + Llama-3B)比单教师蒸馏提高了1.3分。

英文摘要

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
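One way to picture a projection-based loss is sketched below: a sparse matrix W redistributes each teacher token's probability over student tokens, and the student is compared against the projected distribution with a KL divergence. This is only a toy reading of the abstract. The direction of the projection (teacher onto student vocabulary), the row-stochastic form of W, the 0.5/0.5 split, and the three-token vocabularies are all assumptions, and the actual P-KL formulation may differ.

```python
import math

def project_teacher(teacher_probs, W):
    """Map a teacher distribution onto the student vocabulary.

    teacher_probs: dict teacher_token -> probability.
    W: dict teacher_token -> {student_token: weight}, each row summing to 1
       (a sparse projection stored as nested dicts).
    """
    projected = {}
    for t_tok, p in teacher_probs.items():
        for s_tok, w in W[t_tok].items():
            projected[s_tok] = projected.get(s_tok, 0.0) + p * w
    return projected

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the keys of p; q is floored at eps to avoid log(0)."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)

# Hypothetical setup: the teacher splits digits, the student also has a "12" token,
# so the teacher's "1" is spread over the student tokens that start with "1".
W = {"1": {"1": 0.5, "12": 0.5}, "cat": {"cat": 1.0}}
teacher = {"1": 0.6, "cat": 0.4}
target = project_teacher(teacher, W)

student = {"1": 0.5, "12": 0.1, "cat": 0.4}
loss = kl_divergence(target, student)
```

With such a projection every student token can receive a training signal, including multi-digit tokens like "12" that a strict one-to-one matching would leave unmatched.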

2605.21692 2026-05-22 cs.LG stat.ML 版本更新

Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

表示差距:从几何视角解释神经网络的不合理有效性

David Perera, Victor Moura, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Flavio Figueiredo

发表机构 * Universidade Federal de Minas Gerais(米纳斯吉拉斯联邦大学) Queen Mary University of London(伦敦玛丽女王大学)

AI总结 本文从几何视角出发,研究神经网络的表示差距,提出一个与泛化误差密切相关的度量标准,并展示其在更广泛任务和训练算法中的适用性,通过实验证明该理论在合成数据和现实数据中的准确性。

详情
AI中文摘要

精确地用可以高效估计的参数来表征神经网络的渐近泛化误差是机器学习中的关键问题,这严重依赖于启发法和实践者的直觉来做出关键设计选择。为了缓解这一问题,我们引入了表示差距,这是一个与泛化误差密切相关的度量标准,但具有更好的渐近动态特性。我们专注于等变扩散模型,并利用最优量化和点过程理论的结果,推导出表示差距的精确渐近等价,并证明其由单个参数,即任务的内在维度所支配,该参数易于解释、高效估计,并可与常见神经网络架构的等变性相关联。我们展示了这种渐近动态也适用于更广泛的任务和训练算法。最后,我们通过实验证明,我们的渐近定律和内在维度估计在广泛的合成数据集上准确,这些数据集中的这些量是已知的,以及在更现实的数据集上,我们得到的结果与相关文献一致。

英文摘要

Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

2605.21661 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Hierarchical Variational Policies for Reward-Guided Diffusion

分层变分策略用于奖励引导的扩散

Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

发表机构 * Department of Computer Science(计算机科学系) University of California Irvine(加州大学尔湾分校)

AI总结 本文提出了一种分层变分模型框架,通过将控制摊销(amortize)到一个轻量级但表达能力强的随机策略中,在大幅降低推理成本的同时生成高质量的奖励对齐样本;在4倍超分辨率任务上,该方法的推理速度比表现最佳的基线快5倍以上,且感知质量更好。

详情
AI中文摘要

将预训练扩散模型适配到逆问题等下游目标,通常需要代价高昂的测试时引导或优化。我们提出了一个有原则的框架,能够以大幅降低的推理成本生成高质量的奖励对齐样本。我们的方法将测试时适配表述为一个分层变分模型,其中控制被摊销(amortized)到一个轻量级但表达能力强的随机策略中。这种表述自然支持少步扩散采样:大步长带来快速推理,而学习到的策略通过提供结构化的逐步控制来保持样本质量。由此得到的完全摊销采样器实现了出色的质量-速度权衡,达到或超过近期的测试时扩展基线,同时所需计算量显著更少。例如,在4倍超分辨率任务上,与表现最佳的基线相比,我们的方法以快5倍以上的推理速度取得了更好的感知质量。我们进一步将该方法扩展到半摊销(semi-amortized)模式,将廉价的摊销提议与有限的测试时优化相结合,在多个具有挑战性的逆问题上取得了最先进的感知质量。

英文摘要

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21654 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Value-Gradient Hypothesis of RL for LLMs

强化学习中大语言模型的价值-梯度假说

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出了一种价值梯度视角,用于解释无评论家(critic-free)强化学习方法在大语言模型后训练中的有效性;通过分析actor更新以及经由注意力机制的自动微分,将强化学习的影响分解为价值梯度信号和可达的奖励提升空间。

详情
AI中文摘要

强化学习显著提升了预训练语言模型,但PPO和GRPO等无评论家(critic-free)方法为何如此有效、又应在何时带来最大收益,仍缺乏充分研究。我们为大语言模型后训练中的无评论家强化学习提出了一种价值梯度视角。首先,在可微展开(rollout)和加性噪声参数化下,我们证明actor更新在期望意义上类似于价值梯度:反向传播所传递的协态(costate)的条件期望等于价值梯度。其次,对于离散的Transformer策略,我们证明经由注意力机制的自动微分会产生近似该价值信号的经验协态,其误差由采样间隙和策略熵控制。这些结果促使我们将强化学习的影响分解为价值梯度信号和可达的奖励提升空间,从而得到一个判据,用于判断强化学习在预训练轨迹上何时最为有效。

英文摘要

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

2605.21653 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

放大而非学习:微调的AI文本检测器放大了预训练的方向

Alexander Smirnov

发表机构 * University College London(伦敦大学学院)

AI总结 该研究探讨了通过微调AI文本检测器来放大预训练方向而非学习AI与人类边界的问题,发现微调在某些情况下会降低辨别能力,但在非母语写作中表现不同,并展示了闭合形式雅可比预测器在不同架构中的有效性。

详情
AI中文摘要

AI文本检测器放大的是一条预训练得到的典型性(typicality)轴;它们并没有构建AI与人类之间的边界。在未经任何任务监督的原始编码器上,沿centroid(AI)-centroid(HC3)方向投影,在三种架构上取得的NYT-vs-HC3 AUROC分别为0.806/0.944/0.834(为微调后判别上限的86-106%:在RoBERTa-base上,原始投影超过了微调);在RoBERTa-base上,完全微调使所测试的两个流畅正式文本群体上的判别能力均低于原始编码器。同一条轴在非母语ESL写作上发生反转(AUROC 0.06-0.20),这是典型性解读所独有的可证伪预测。一个仅用24个样本的冻结探针与完全微调相当(0.900 vs 0.895)。一个闭式雅可比预测器对操纵该轴的干预进行参数化,普遍达到R^2 = 1.000,在FPR = 1%下将ELECTRA-CE的部署TPR从0.000提升到0.904,并以16/16的oracle等价性迁移到三个独立训练的第三方RoBERTa检测器上(在OpenAI检测器上NYT-FPR降低57%)。适用范围:编码器家族;机制幅度以HC3为锚;在群体层面共享同一条轴,而逐文本机制因架构而异。三种操作上不同的探针(文本表面caps_rate残差化、几何有符号epsilon消融、闭式文本对预测器)在三种架构上以cos 0.74/0.81/1.00相互一致,证实了观察者不变性。在匹配TPR为0.90的评估下,已发表的各类干预方法(CC、dealign-f2c)在27个单元格上校准等价(|Delta AUROC| <= 0.0081),并且ELECTRA上LoRA->full-FT偏差差距的>= 97%来自校准偏移,而非学习到的表示,核心主张的预测由此得到确认。

英文摘要

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.

2605.21649 2026-05-22 cs.LG cs.CL 版本更新

EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV: 面向Entmax注意力的支持集感知解码

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

发表机构 * Instituto Superior Técnico, Universidade de Lisboa(里斯本大学理工学院) ELLIS Unit Lisbon(里斯本ELLIS单位) INESC-ID Instituto de Telecomunicações(电信研究所)

AI总结 本文提出EntmaxKV,一种支持集感知的解码框架,利用entmax注意力的稀疏性,在加载KV页面之前进行稀疏解码;通过查询感知的页面评分、支持集感知的候选选择和稀疏entmax注意力,减少被丢弃的概率质量,提高长上下文语言模型的解码效率。

详情
AI中文摘要

长上下文解码越来越受到KV缓存内存流量的限制,因为每个生成的标记都需在缓存上进行注意力运算,而缓存大小与上下文长度成线性增长。现有稀疏解码方法通过选择部分标记或页面来减少成本,但这些方法是为softmax注意力设计的,其密集尾部使得任何截断都会丢弃非零的概率质量。相比之下,α-entmax产生精确的零,将稀疏解码从密集尾部近似转变为支持恢复:如果所选候选包含entmax支持,稀疏解码仍保持精确。虽然最近的entmax内核实现了高效的训练,但它们并未解决自回归解码瓶颈,即密集推理仍需在稀疏性确定之前流式传输完整的KV缓存。在本文中,我们引入了EntmaxKV,一种基于entmax的稀疏解码框架,它在KV页面加载前利用稀疏性。EntmaxKV结合了查询感知的页面评分、支持感知的候选选择和稀疏entmax注意力。我们通过分析截断误差中的丢弃概率质量δ,证明输出误差由δ控制,并在恢复entmax支持时消失。我们进一步引入了一种高斯感知的entmax选择器,从轻量级页面统计中估计entmax阈值,使所选预算适应于分数分布。实验证明,EntmaxKV比基于softmax的稀疏解码在相同KV预算下丢弃更少的概率质量,保留更多支持标记,并实现更低的输出误差。在长上下文和语言建模基准上,它接近完整的缓存entmax,但使用KV缓存的少量比例,达到100万上下文长度时,比完整的注意力基线快3.36倍(softmax)和5.43倍(entmax)。代码可在:https://github.com/deep-spin/entmaxkv获取。

英文摘要

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $α$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $δ$, showing that output error is controlled by $δ$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
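The support-recovery argument can be checked on a toy example using sparsemax, which is the α = 2 case of α-entmax. The sketch below compares attention weights computed over all scores with weights computed over a candidate subset that contains the support. The scores and the candidate set are made up, and the paper's page scoring and Gaussian-aware selector are not modelled here.

```python
import math

def softmax(scores):
    """Dense baseline: every entry receives strictly positive probability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): project scores onto the simplex."""
    tau, cumsum = 0.0, 0.0
    for k, z_k in enumerate(sorted(scores, reverse=True), start=1):
        cumsum += z_k
        # The support is the largest k with 1 + k * z_(k) > sum of the top-k scores.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

scores = [2.0, 1.5, 0.1, -1.0]
full = sparsemax(scores)                          # attends over every cached score
support = [i for i, p in enumerate(full) if p > 0]

candidates = [0, 1, 2]                            # hypothetical selection containing the support
partial = sparsemax([scores[i] for i in candidates])

dense = softmax(scores)
kept_softmax_mass = sum(dense[i] for i in candidates)
```

Here `partial` reproduces the nonzero entries of `full` exactly, whereas the same truncation under softmax keeps less than the full probability mass, which is the dense-tail loss the abstract describes.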

2605.21646 2026-05-22 cs.LG 版本更新

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

相似部分:一种基于特征的局部和全局原型解释方法

Jacek Karolczak, Jerzy Stefanowski

发表机构 * Institute of Computing Science(计算科学研究所)

AI总结 本文提出了一种基于特征的局部和全局原型解释方法,通过整合特征重要性来提高解释的粒度,实验表明该方法在保持模型预测精度的同时增强了特征多样性。

Comments Accepted for publication in International Journal of Applied Mathematics and Computer Science (IJAMCS)

详情
AI中文摘要

基于原型的解释提供了一种直观的、基于实例的方法来支持机器学习黑箱分类器的可解释性,但通常缺乏特征层面的细粒度。我们介绍了一个框架,该框架在两个层次上整合特征重要性以解决这一差距。首先,对于局部解释,我们提出"相似部分":一种利用特征重要性评分来突出分类实例与其最近原型之间最相关、共享的特征子集的方法,以引导用户关注。其次,我们通过在全局原型选择目标函数中加入特征重要性项,积极促进所选原型的特征属性的多样性。在六个基准数据集上的实验表明,这种增强的选取过程保持或在某些情况下提高了替代模型的预测保真度,表明特征多样性并不影响模型保真度。

英文摘要

Prototype-based explanations offer an intuitive, example-based approach to support the interpretability of machine learning black box classifiers but often lack feature-level granularity. We introduce a framework that integrates feature importance at two levels to address this gap. First, for local explanations, we propose alike parts: a method that uses feature importance scores to highlight the most relevant, shared feature subsets between a classified instance and its nearest prototype, guiding user attention. Second, we augment the global prototype selection objective function with a feature importance term to actively promote diversity in the feature attributions of the selected prototypes. Experiments on six benchmark datasets show that this augmented selection process maintains or, in some cases, increases the prediction fidelity of the surrogate model, suggesting that feature diversity does not compromise model fidelity.

2605.21615 2026-05-22 cs.CR cs.LG cs.SE 版本更新

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

ASSEMBLAGE-DEEPHISTORY: 一个具有时间覆盖的跨构建二进制数据集

Chang Liu, Noah Fleischmann, Nicolò Altamura, Edward Raff, James Holt, Kristopher Micinski

发表机构 * Syracuse University(雪城大学) Booz Allen Hamilton Independent Researcher(独立研究员) CrowdStrike

AI总结 本文提出ASSEMBLAGE-DEEPHISTORY数据集,整合了跨构建多样性、跨版本历史和CVE标签,为二进制分析提供统一框架,通过三个分析验证了其在LLM漏洞检测、版本聚类和二进制相似性分解中的价值。

详情
AI中文摘要

现有的二进制数据集通常只能捕捉一个或两个二进制变化轴:它们要么提供无时间轴的跨编译器构建,要么为单构建二进制提供CVE标签。没有一个结合跨构建多样性、跨版本历史和CVE标签到可查询的结构中。我们提出了ASSEMBLAGE-DEEPHISTORY,将这些维度整合到统一的框架中,其中每个二进制的编译上下文、源代码、易受攻击函数和包版本都作为一等元数据存储。ASSEMBLAGE-DEEPHISTORY包含73,610个二进制文件,涵盖248个开源项目,这些二进制文件在Linux和Windows上使用GCC、Clang和MSVC在多个优化级别下编译,具有多年的历史构建。每个二进制文件都被索引到一个数据库中,该数据库将其链接到其源代码、函数、调试信息、变体构建、历史版本和易受攻击函数。三种分析展示了该结构的价值:(1)一个三阶段LLM基准测试(识别、策略引导检测和跨构建转移)以测试LLM是否在二进制漏洞上进行推理或在构建特定的artifact上进行模式匹配;(2)MalConv嵌入、jTrans函数嵌入和TLSH模糊哈希的比较,量化了同一包版本在每个空间中的聚类情况;(3)贝叶斯回归分解二进制相似性为时间距离、文件更改和提交的贡献。

英文摘要

Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cross-version history, and CVE labels into a queryable structure. We present ASSEMBLAGE-DEEPHISTORY, which consolidates these dimensions into a unified framework where every binary's compilation context, source code, vulnerable functions, and package version are stored as first-class metadata. ASSEMBLAGE-DEEPHISTORY comprises 73,610 binaries spanning 248 open-source projects, compiled across GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows, with multi-year historical builds. Each binary is indexed in a database that links it to its source code, functions, debug info, variant builds, historical versions, and vulnerable functions. Three analyses demonstrate this structure's value: (1) a three-stage LLM benchmark (recognition, strategy-guided detection, and cross-build transfer) to test whether LLMs reason about binary vulnerabilities or pattern-match on build-specific artifacts; (2) a comparison of MalConv embeddings, jTrans function embeddings, and TLSH fuzzy hashes quantifying how same-package versions cluster in each space; and (3) a Bayesian regression decomposing binary similarity into contributions from temporal distance, file changes, and commits.

2605.21614 2026-05-22 cs.HC cs.LG 版本更新

Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

探索使用LLMs对编程教育中学生自解释进行自动评估的有效性

Arun-Balajiee Lekshmi-Narayanan, Mohammad Hassany, Peter Brusilovsky

发表机构 * University of Pittsburgh(匹兹堡大学)

AI总结 本文研究了在编程教育中使用LLMs自动评估学生自解释的有效性,通过比较LLMs与语义相似性方法在二元分类任务中的表现,探讨了自动评分技术的优劣。

详情
AI中文摘要

worked examples是特定领域的逐步问题解决示例,提供给学生以获得领域特定的问题解决技能。通过将worked examples与self-explanations结合,可以增强其有效性,因为self-explanations要求学生解释而不是被动学习每个问题解决步骤。主要挑战是评估学生解释的正确性。在现有方法中,学生解释通过其语义相似性与教师或领域专家的解释进行判断。鉴于近期LLM基于的自动评分进展,仍不清楚语义相似性方法是否仍然是自动评分文本学生响应(如文章或代码解释)最有效的方法。比较这些方法需要高质量的数据集,提供如平衡的类别分布和领域特定的标注数据等特征。在本文中,我们提出了一个严格比较LLMs与语义相似性用于自动评分的比较,框架为二元分类任务。

英文摘要

Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.

2605.21611 2026-05-22 cs.CV cs.LG 版本更新

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

UniVL:统一的视觉-语言嵌入用于空间接地的上下文图像生成

Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

发表机构 * Center for Advanced AI(先进人工智能中心)

AI总结 本文提出了一种统一的视觉-语言嵌入方法,通过单一的视觉输入直接将语义绑定到空间位置,从而减少计算并提高图像生成质量。

详情
AI中文摘要

我们引入了空间接地的上下文图像生成任务,这是一种可控的图像生成任务,重新定义了条件生成范式。与通过两个独立编码器分别提供参考图像和全局文本提示不同,UniVL被训练以从单一统一的视觉输入中直接绑定语义到空间位置,其中文本指令被渲染到空间掩码上。这消除了推理过程中对独立文本编码器的需求。所得到的模型通过遵循用户指定的指令来支持上下文图像生成,即在指定位置生成什么内容,同时显著减少了计算量。为了解决这一任务,我们提出了一种框架,其中从光学字符识别预训练的backbone中适应的UniVL编码器读取统一的条件,并生成一个融合视觉和语义意图以及空间位置的UniVL嵌入fVIL。一个两阶段流程首先对齐UniVL与VAE嵌入空间,然后将预训练的扩散backbone完全基于UniVL嵌入进行条件生成,消除了如T5等独立文本编码器。尽管这种重新定义使用了刻意最小化的文本接口,但仍然取得了显著的实证收益。在UniVL-ImgGen上,一个包含477,000个掩码标注图像的基准数据集上,UniVL在文本提示基线之上提高了图像质量,将FID从14降低到11,并将PSNR从16提高到20。它还完全消除了文本编码器,将推理TFLOPs减少高达52%,将运行时间减少高达44%。此外的消融研究验证了所提出组件的贡献,为具有统一条件范式的高效、空间接地图像生成铺平了道路。

英文摘要

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

2605.21610 2026-05-22 cs.LG 版本更新

AgForce Enables Antigen-conditioned Generative Antibody Design

AgForce 使生成抗体设计具备抗原条件

Mansoor Ahmed, Murray Patterson

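Affiliations * Georgia State University; Georgia Institute of Technology

AI summary: This paper proposes AgForce, which pairs a graph neural network encoder with redesigned decoders to counter the tendency of existing antibody design methods to ignore the antigen input, improving both binding quality and sequence recovery.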

详情
AI中文摘要

抗体设计方法通常基于抗原结构生成互补决定区(CDR),但基线方法的系统评估表明,它们大多忽略了抗原输入。我们识别出三种导致这种行为的失败模式。抗原盲性是因为模型从抗体框架上下文推断预测,而非抗原信息,从而产生几乎相同的CDR,无论目标如何。词汇坍塌将预测的氨基酸减少到每个位置3到5种,远低于天然序列的真实分布。此外,任何使用标准位置交叉熵训练的模型都会收敛到位置边际分布,这使得它无法产生抗原特异性序列预测。我们提出了一种名为AgForce的新型编码器-解码器架构,它使用图神经网络(GNN)作为编码器,并针对序列-结构协同设计设计了专用解码器。具体而言,我们应用了框架dropout、门控瓶颈和双曲交叉注意力,以防止抗体的捷径路径。在解码器中,一个具有Potts-like成对耦合和退火的多选学习(aMCL)的混合密度网络(MDN)序列头取代了交叉熵目标,用一个多组件分布替代了位置边际分布的最优解。一个抗原循环一致性头将梯度路由通过序列解码器,迫使预测分布编码抗原身份。AgForce在CHIMERA-Bench数据集上同时实现了最佳的结合质量和序列恢复能力,比最强的序列基线提高了8%的氨基酸恢复率,且在所有界面指标上均优于基线,几乎将GNN方法的有效词汇量翻倍。源代码可在:https://github.com/mansoor181/ag-force.git

英文摘要

Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDRs), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture, AgForce, that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross-attention to block the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA-Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: https://github.com/mansoor181/ag-force.git

2605.21606 2026-05-22 cs.LG cs.AI 版本更新

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

何时教师标记可靠?用于推理的基于位置加权的在线自我蒸馏

Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

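Affiliations * Johns Hopkins University; University of Wisconsin–Madison; Nanyang Technological University

AI summary: This paper proposes Position-Weighted On-Policy Self-Distillation (PW-OPSD) for reasoning. A branch-viability diagnostic identifies where teacher tokens are reliable, and the resulting position weighting is validated on several models.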

Comments Pre-print. Code is available at https://github.com/SaFo-Lab/PW-OPSD

详情
AI中文摘要

在线自我蒸馏(OPSD)通过一个特权教师训练学生,但其标准目标对所有生成的标记同等重视,隐含地将特权教师目标视为在每个学生访问的前缀中同样可靠。现有的基于熵的OPD方法通过调节令牌级监督来放松这种均匀性,但推理中高教师熵的可靠性含义具有歧义:它可以反映非可行的不确定性或良性的解决方案多样性。为识别这一现象,我们引入了分支可行性诊断。具体来说,我们记录特权答案教师提示中的下一个标记替代方案,强制每个替代方案在学生提示及其在线脊柱前缀之后,并测试由此产生的学生模板延续是否能恢复正确答案。在Qwen3-4B上,我们发现一个导向的序列内位置分数是测试中最强的教师标记可靠性预测因子,达到曲线下面积(AUROC)为0.83;局部不确定性分数最多为0.57。受此轨迹结构的启发,我们提出了基于位置加权的在线自我蒸馏(PW-OPSD),其在保持相同的学生滚动生成、特权教师传递和截断的前向KL目标的同时,应用递增的位置权重。在不同随机种子的全面评估中,诊断衍生的PW-OPSD在AIME 2024和AIME 2025 Avg@12上分别提高了+1.0和+1.1分,并在两个更大规模的模型上也展示了一致的Avg@12改进。这些结果表明,推理蒸馏中的教师标记可靠性具有轨迹结构,并且可以在不增加教师计算的情况下利用。

英文摘要

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
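The position-weighted objective described in the abstract can be illustrated with a minimal sketch. The abstract specifies only an "increasing position weight" and a "clipped forward-KL target"; the linear ramp, the `w_min` floor, and the per-token clip value below are assumptions chosen for illustration, not the authors' settings.

```python
import math

def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs) if p > 0.0)

def position_weights(seq_len, w_min=0.5):
    """Hypothetical linear ramp: w_min at the first token, 1.0 at the last."""
    if seq_len == 1:
        return [1.0]
    return [w_min + (1.0 - w_min) * t / (seq_len - 1) for t in range(seq_len)]

def pw_distill_loss(teacher_seq, student_seq, clip=10.0, w_min=0.5):
    """Weighted mean of clipped per-token forward KL (clip value is assumed)."""
    weights = position_weights(len(teacher_seq), w_min)
    per_token = [min(forward_kl(p, q), clip)
                 for p, q in zip(teacher_seq, student_seq)]
    return sum(w * k for w, k in zip(weights, per_token)) / sum(weights)
```

With `w_min=1.0` the sketch reduces to the uniform token weighting that the abstract attributes to standard OPSD; lowering `w_min` shifts supervision toward later positions.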

2605.21600 2026-05-22 cs.LG 版本更新

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

ConTact: 通过显式界面推理进行接触优先的抗体CDR设计

Mansoor Ahmed, Spencer VonBank, Nadeem Taj, Sujin Lee, Naila Jan, Murray Patterson

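Affiliations * Georgia State University, Atlanta, USA; Georgia Institute of Technology, Atlanta, USA; DePauw University, Indiana, USA; University of Engineering

AI summary: This paper proposes ConTact, an antibody CDR design method built on explicit interface reasoning. It decomposes CDR design into three stages (learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features), improving structural quality and epitope awareness.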

详情
AI中文摘要

计算抗体CDR设计方法基于抗原结构生成结合环,但现有架构将两个根本不同的子问题混为一谈:确定哪些CDR位置会接触抗原,以及在这些位置选择氨基酸。这种混合同一迫使模型通过统一的消息传递隐式学习接触推理,稀释抗原信号在所有位置中均等。我们引入ConTact,一种接触然后作用的架构,将CDR设计显式分解为三个连续阶段:学习表面互补性指纹、预测CDR-抗原接触以及将接触门控抗原特征注入序列头。距离偏倚的交叉注意力模块编码几何先验,倾向于空间邻居,而接触加权的交叉熵损失将梯度信号集中于结合关键位置。在CHIMERA-Bench数据集上,ConTact在结构质量(比次优基线提高7% RMSD)、表位意识(比GNN基线提高10% F1分数)以及序列恢复(AAR 0.38)方面均表现最佳。

英文摘要

Computational antibody CDR design methods condition on antigen structure to generate binding loops, yet existing architectures conflate two fundamentally distinct sub-problems: identifying which CDR positions will contact the antigen, and selecting amino acids at those positions. This conflation forces models to learn contact reasoning implicitly through uniform message passing, diluting antigen signal across all positions equally. We introduce ConTact, a contact-then-act architecture that explicitly decomposes CDR design into three cascaded stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence head. A distance-biased cross-attention module encodes geometric priors favoring spatial neighbors, while a contact-weighted cross-entropy loss concentrates gradient signal on binding-critical positions. On the CHIMERA-Bench dataset, ConTact achieves the best structural quality (7% RMSD improvement over the next-best baseline), the best epitope awareness (10% F1 improvement over GNN baselines), and competitive sequence recovery (AAR 0.38) among several CDR-H3 design baselines.
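As a rough illustration of how a contact-weighted cross-entropy can concentrate the loss on likely binding positions, here is a minimal sketch. The weighting rule `1 + alpha * contact_prob` and the `alpha` parameter are assumptions made for this example; the abstract does not state the exact form used in ConTact.

```python
import math

def contact_weighted_ce(pred_probs, targets, contact_probs, alpha=2.0):
    """Cross-entropy averaged with per-position weights.

    pred_probs:    per-position distributions over amino acids
    targets:       index of the true amino acid at each position
    contact_probs: predicted probability that each position contacts the antigen
    alpha:         hypothetical strength of the contact up-weighting
    """
    weights = [1.0 + alpha * c for c in contact_probs]
    losses = [-math.log(max(p[t], 1e-12)) for p, t in zip(pred_probs, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

Setting `alpha=0` recovers the plain per-position average, which makes the effect of the contact term easy to compare.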

2605.21568 2026-05-22 cs.LG 版本更新

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

扩散 Fitzhugh-Nagumo 模型中的平衡传播与哈密顿推断

Jack Kendall

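Affiliations * Rain Neuromorphics

AI summary: This paper extends the Equilibrium Propagation framework to skew-gradient systems and shows an equivalence between deep energy-based models and Hamiltonian neural networks. Using networks of diffusively coupled Fitzhugh-Nagumo neurons as the prototypical example, it shows that because stationary solutions are described by self-adjoint operators, equilibrium propagation can be used for credit assignment. For Fitzhugh-Nagumo networks with a deep residual topology, the steady-state solutions admit a (spatial) Hamiltonian, so Hamiltonian Echo Backpropagation applies. Finally, it derives an explicit layer-wise Hamiltonian recurrence relation governing inference for stationary solutions of both deep Fitzhugh-Nagumo networks and deep energy-based models.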

详情
AI中文摘要

在本文中,我们将平衡传播框架扩展到偏斜梯度系统,并展示了深度能量模型与哈密顿神经网络之间的等价性。我们重点研究了扩散耦合的 Fitzhugh-Nagumo 神经网络作为典型示例。我们证明了由于 Fitzhugh-Nagumo 模型的稳态解由自共轭算子描述,因此可以应用平衡传播方法进行信用分配。此外,对于具有深度残差网络拓扑的 Fitzhugh-Nagumo 网络,我们证明稳态解具有(空间)哈密顿量,因此可以应用哈密顿回传方法。最后,我们推导出一个显式的层间哈密顿递推关系,用于指导深度 Fitzhugh-Nagumo 网络和深度能量模型的稳态解推断。

英文摘要

In this work, we extend the Equilibrium Propagation framework to skew-gradient systems and show an equivalence between deep Energy-Based Models and Hamiltonian neural networks. We focus on networks of diffusively coupled Fitzhugh-Nagumo neurons as a prototypical example. We show that since stationary solutions of the Fitzhugh-Nagumo model are described by self-adjoint operators, the methods of equilibrium propagation for performing credit assignment can be applied. Furthermore, for Fitzhugh-Nagumo networks with the topology of a deep residual network, we show that the steady state solutions admit a (spatial) Hamiltonian, and thus the methods of Hamiltonian Echo Backpropagation can be applied. We end by deriving an explicit layer-wise Hamiltonian recurrence relation governing inference for stationary solutions of both deep Fitzhugh-Nagumo networks and deep Energy-Based Models.
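For readers unfamiliar with the model, the textbook form of the diffusively coupled Fitzhugh-Nagumo system is written out below. The symbols ($v$, $w$, $D$, $\varepsilon$, $a$, $b$, $I$) follow a common convention and are an assumption here; the paper's own parametrization and coupling may differ.

```latex
\begin{aligned}
\frac{\partial v}{\partial t} &= D\,\nabla^{2} v + v - \frac{v^{3}}{3} - w + I, \\
\frac{\partial w}{\partial t} &= \varepsilon\,\bigl(v + a - b\,w\bigr).
\end{aligned}
```

Stationary solutions set both time derivatives to zero. On a network of neurons, the Laplacian term is usually replaced by a sum over neighbours, $D \sum_{j}(v_j - v_i)$, which is the "diffusive coupling" the abstract refers to.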

2605.21566 2026-05-22 cs.LG 版本更新

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

CKD风险预测中的校准、不确定性沟通与部署准备性:一个框架评估研究

Michael O. Eniolade

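Affiliations * University of the Cumberlands

AI summary: This paper evaluates calibration, uncertainty quantification, and deployment readiness in chronic kidney disease risk prediction. Five classifiers trained on the UCI CKD dataset perform near-perfectly internally but transfer poorly to external data, underscoring the need to check calibration stability and to validate on external data.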

Comments 27 pages, 6 figures, 4 tables. Supplementary materials (S1-S4) included as ancillary file

详情
AI中文摘要

用于慢性肾病(CKD)风险预测的机器学习模型在内部测试集上通常表现出很强的判别能力。然而,校准和不确定性量化往往受到忽视,导致临床医生无法获得关于概率输出是否准确的可靠信息。我们训练了五个分类器在UCI CKD数据集(400名患者,62.5%的CKD患病率)上:逻辑回归、随机森林、XGBoost、带有Platt缩放的SVM以及高斯朴素贝叶斯。我们评估了每个模型在校准质量、符合性预测覆盖率以及一个八项部署准备性框架上的表现。分布压力测试将每个模型的最佳校准变体应用于公开的MIMIC-IV演示队列(97名患者,23.7%的CKD患病率)以评估在患病率变化和特征缺失情况下的行为。我们使用期望校准误差和Brier分数测量校准在Platt缩放和等距回归前后的变化,并通过分割符合性预测来量化不确定性,目标为90%的边际覆盖率。所有五个模型在UCI测试集上达到了AUROC 1.00。等距重校准将内部ECE降低到0.000-0.022。在MIMIC-IV上,AUROC降至0.48-0.58,ECE升至0.68-0.76,符合性覆盖率从0.80-0.98降至0.21-0.25,目标为90%。没有模型在部署准备性检查表上得分超过16分中的4分。近完美的内部性能并未转移。在任何临床预测模型部署之前,校准稳定性和符合性覆盖率应在外部数据上进行评估。

英文摘要

Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.
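The Expected Calibration Error reported above can be computed with a short binned estimate. This is a minimal sketch for binary predictions using equal-width bins; the bin count and binning scheme are assumptions, since the abstract does not say which ECE variant the study used.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    probs:  predicted probability of the positive class
    labels: observed outcomes (0 or 1)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(mean_conf - pos_rate)
    return ece
```

A model that predicts 0.9 for a group in which only a quarter of patients are positive scores an ECE of 0.65 on that group, which is the kind of gap that appears when prevalence shifts between training and deployment cohorts.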

2605.21565 2026-05-22 cs.LG 版本更新

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

通过自适应课程学习提升多模态对话情感识别的模态平衡

Phuong-Anh Nguyen, The-Son Le, Duc-Trong Le, Cam-Van Thi Nguyen

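Affiliations * VNU University of Engineering and Technology

AI summary: This paper proposes a framework based on Self-Paced Curriculum Learning (SPCL) that uses a dual-level Difficulty Measurer to address modality imbalance in multimodal conversational emotion recognition. Experiments on the IEMOCAP and MELD datasets show consistent performance gains.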

Comments Accepted at Neural Computing and Applications (Springer), 2026

详情
AI中文摘要

多模态情感识别在对话中是一项关键任务,其中融合语言、面部表情和语音语调的多模态方法已取得显著进展。然而,模态不匹配和学习不平衡仍然是主要挑战,限制了多模态信息的有效利用。为了解决这个问题,我们提出了一种基于自适应课程学习(SPCL)的插件式框架用于MERC。我们引入了双层难度评估器,捕捉语句级和对话级的挑战。语句级分数模型细粒度地捕捉模态特定的难度,而对话级分数捕捉更广泛的对话结构,包括情感依赖性和模态一致性。基于这些分数,学习调度器动态地指导从简单到困难的实例训练。通过将SPCL整合到现有的MERC架构中,我们的方法缓解了模态不平衡并提高了模型鲁棒性。在IEMOCAP和MELD数据集上的大量实验显示,不同架构和模态设置下均取得一致的改进。在IEMOCAP上,SPCL在基线模型上将加权F1分数提高约+1.2%至+6.6%,而在MELD上,提升达到+10.4%。这些结果突显了SPCL作为轻量级插件模块在多模态情感识别中的有效性与通用性。

英文摘要

Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.

2605.21563 2026-05-22 cs.LG 版本更新

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

基于运行时治理的嵌入式联邦学习用于缺铁预测

Fan Zhang, Simon Deltadahl, Majid Lotfian Delouee, Daniel Kreuter, Joseph Taylor, Allerdien Visser, BloodCounts Consortium, James H. F. Rudd, Nicholas S. Gleadall, Suthesh Sivapalaratnam, Folkert Asselbergs, Martijn C. Schut, Michael Roberts

Affiliations * Theoretical Physics, University of Cambridge, Cambridge, UK; Translational AI Laboratory, Dept. of Laboratory Medicine, Amsterdam UMC, Amsterdam, The Netherlands; Precision Health University Research Institute, Queen Mary Univ. of London, London, UK; Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK; Transplant, Cambridge Biomedical Campus, Cambridge, UK; Dept. of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam UMC, Amsterdam, The Netherlands

AI summary: This paper presents an embedding-based federated learning framework for predicting iron deficiency from routine full blood count data, deployed across two clinical environments. It shows that personalised aggregation outperforms standard aggregation when client sample size and task relevance diverge.

详情
AI中文摘要

近期的综述发现,发表的大多数医疗联邦学习(FL)研究从未达到实际应用。我们开发了一种基于嵌入的FL管道,用于从常规全血计数(FBC)数据中预测缺铁,并在阿姆斯特丹大学医学中心(AUMC)和英国国家血库和移植(NHSBT)两个临床环境中部署。这两个临床数据集在结构上不独立和相同分布(非IID),异质性源于不同的群体差异而非采样误差。运行时治理由FLA$^3$强制执行,这是一个面向医疗的FL平台,提供研究范围的执行、基于策略的授权和带签名的审计日志。标准样本量加权聚合(FedAvg)在两个站点相对于仅本地训练降低了受试者工作特征曲线(ROC-AUC)的面积,因为全局更新偏向于较大的AUMC分布。FedMAP,一种个性化聚合方法,将AUMC的ROC-AUC从0.9470提升到0.9594,将NHSBT的ROC-AUC从0.8558提升到0.8671,实现了最高的宏ROC-AUC为0.9133和最佳的宏平衡精度。这些结果支持在临床联邦中使用个性化聚合,特别是在客户端样本量和任务相关性差异显著时。

英文摘要

Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA$^3$, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
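To see why sample-size-weighted aggregation can pull the global model toward the larger site, consider this minimal FedAvg sketch. Parameters are plain lists of floats and the client sizes in the usage note are made-up numbers for illustration; they are not the AUMC or NHSBT cohort sizes.

```python
def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```

For two clients with parameters `[1.0, 0.0]` and `[0.0, 1.0]` and hypothetical sizes 900 and 100, the aggregate is `[0.9, 0.1]`: the smaller client contributes only a tenth of the update, which is the bias the abstract describes.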

2605.21561 2026-05-22 cs.LG 版本更新

Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

目标诱导偏差与多目标无监督特征选择中的搜索动态

Mathieu Cherpitel, Thomas Bäck, Martijn R. Tannemaat, Anna V. Kononova

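Affiliations * LIACS, Leiden University; LUMC, Leiden University

AI summary: This study examines how the choice of evaluation objective affects search dynamics and Pareto front quality in multiobjective unsupervised feature selection. Silhouette-based objectives tend to produce trivial low-cardinality solutions, whereas the proposed PCA loss objective yields compact subsets with test accuracy comparable to supervised optimisation.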

详情
AI中文摘要

无监督特征选择通常被建模为一个多目标优化问题,联合优化子集质量和子集大小。然而,这种形式的行为严重依赖于评估目标的选择、子集大小正则化的方向以及初始化策略。我们通过一个具有已知信息性、冗余性和无关特征类型的合成数据集,在受控环境下研究这些因素。通过结合三个评估目标:准确率、轮廓系数和PCA重建损失,并结合子集大小最小化或最大化,比较了六种形式。结果表明,形式对搜索动态和最终Pareto前沿的质量都有显著影响。基于轮廓系数的形成表现出对平凡低基数解的强烈偏见,并且仍然是预测性能的弱代理。相比之下,所提出的PCA损失目标产生具有与直接优化监督准确率获得的子集相似测试准确度的紧凑子集。总体而言,该研究表明,目标设计是有效多目标无监督特征选择的关键。

英文摘要

Unsupervised feature selection is commonly formulated as a multiobjective optimisation problem that jointly optimises subset quality and subset size. Yet the behaviour of this formulation depends critically on the choice of evaluation objective, the direction of subset-size regularisation, and the initialisation strategy. We study these factors in a controlled setting using a synthetic dataset with known informative, redundant, and irrelevant feature types. Six formulations are compared by combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss with subset-size minimisation or maximisation. The results show that formulation strongly affects both search dynamics and the quality of the resulting Pareto front. Silhouette-based formulations exhibit a strong bias toward trivial low-cardinality solutions and remain weak proxies for predictive performance. In contrast, the proposed PCA loss objective produces compact subsets with test accuracy comparable to subsets obtained by directly optimising supervised accuracy. Overall, the study shows that objective design is central to effective multiobjective unsupervised feature selection.
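The Pareto front mentioned above is the set of feature subsets that no other subset beats on both objectives at once. A minimal sketch, assuming each candidate is summarised as a `(loss, subset_size)` pair with both values minimised (the study also considers size maximisation, which this example does not cover):

```python
def pareto_front(points):
    """Return the non-dominated (loss, subset_size) pairs, both minimised."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

A subset with a single feature and mediocre loss stays on the front because nothing is smaller, which shows how a size-minimisation objective can keep trivial low-cardinality solutions alive during the search.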

2605.21560 2026-05-22 cs.LG 版本更新

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

AutoMCU: 通过基于LLM的多智能体系统实现面向MCU的神经网络定制化

Penglin Dai, Zijie Zhou, Xincao Xu, Junhua Wang, Xiao Wu, Lixin Duan

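Affiliations * School of Computing and Artificial Intelligence, Southwest Jiaotong University; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; School of Computer Science and Engineering, Northeastern University

AI summary: This paper presents AutoMCU, an LLM-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters out infeasible designs using vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability.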

详情
AI中文摘要

在微控制器单元(MCU)上部署神经网络对于边缘智能至关重要,但受限于内存、存储和计算约束仍具挑战性。现有方法,如模型压缩和硬件感知神经架构搜索(HW-NAS),通常依赖代理指标,导致搜索成本高,且未能充分弥合架构设计与验证部署之间的差距。本文提出AutoMCU,一种以可行性为导向的基于大型语言模型(LLM)的多智能体系统,用于在MCU约束下实现神经网络的自动化定制化。给定自然语言任务要求和硬件规格,AutoMCU迭代生成结构化架构候选方案,在训练前通过供应商工具链反馈过滤不可行设计,评估可行模型在受控协议下的性能,并通过后端基础部署分析验证部署可行性。AutoMCU包含两个关键机制:1)基于硬件的架构生成,用于在RAM和Flash约束下提前淘汰不可部署的候选方案;2)状态隔离的多智能体调度,用于稳定协调提案、训练、评估和部署阶段。在严格MCU约束下对CIFAR-10和CIFAR-100的实验表明,AutoMCU在减少定制时间至约1-2小时的同时实现了具有竞争力的精度,相比代表性的MCU导向HW-NAS基线方法所需的数百小时GPU时间。与ColabNAS和基于LLM的NAS方法GENIUS在NAS-Bench-201上的比较进一步证明了AutoMCU的有效性和稳定性。在多个STM32微控制器上的实际设备部署验证了其在MCU规模边缘智能中的实际适用性。

英文摘要

Deploying neural networks on microcontroller units (MCUs) is critical for edge intelligence but remains challenging due to tight memory, storage, and computation constraints. Existing approaches, such as model compression and hardware-aware neural architecture search (HW-NAS), often depend on proxy metrics, incur high search cost, and do not fully bridge the gap between architecture design and verified deployment. This paper presents AutoMCU, a feasibility-first large language model (LLM)-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters infeasible designs through vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability through backend-grounded deployment analysis. AutoMCU includes two key mechanisms: 1) hardware-in-the-loop architecture generation for early elimination of undeployable candidates under RAM and Flash constraints, and 2) state-isolated multi-agent scheduling for stable coordination of proposal, training, evaluation, and deployment stages. Experiments on CIFAR-10 and CIFAR-100 under strict MCU constraints show that AutoMCU achieves competitive accuracy while reducing customization time to about 1-2 hours, compared with hundreds of GPU hours for representative MCU-oriented HW-NAS baselines. Comparisons with ColabNAS and the LLM-based NAS method GENIUS on NAS-Bench-201 further demonstrate the effectiveness and stability of AutoMCU. Real-device deployments on multiple STM32 microcontrollers validate its practical applicability to MCU-scale edge intelligence.
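The feasibility-first idea, rejecting candidates that cannot fit the device before any training is spent on them, can be sketched as a simple filter. The `Candidate` class, its fields, and the budget numbers are hypothetical; in AutoMCU the memory figures come from vendor toolchain feedback, which this example replaces with hard-coded estimates.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    ram_kb: float    # estimated peak RAM use (assumed to be supplied externally)
    flash_kb: float  # estimated Flash footprint (assumed to be supplied externally)

def feasible(candidates, ram_budget_kb, flash_budget_kb):
    """Keep only candidates that fit both memory budgets."""
    return [
        c for c in candidates
        if c.ram_kb <= ram_budget_kb and c.flash_kb <= flash_budget_kb
    ]
```

Only the survivors of this filter would go on to training and evaluation, so no training time is spent on architectures that cannot be deployed.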

2605.21558 2026-05-22 cs.LG cs.CL 版本更新

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

从参数到数据:一种任务参数引导的微调流水线用于高效的LLM对齐

Hao Chen, Qi Zhang, Liyao Li, Zhanming Shen, Wentao Ye, Lirong Gao, Ningtao Wang, Xing Fu, Xiaoyu Shen, Junbo Zhao

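Affiliations * Zhejiang University; Eastern Institute of Technology

AI summary: This study proposes a task-parameter-guided fine-tuning pipeline that uses task-sensitive attention heads as a dual compass for both sample mining and structural pruning, improving the efficiency of LLM alignment.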

Comments Accepted@ICML26, 28 pages, 11 figures, 26 tables

详情
AI中文摘要

适应大型语言模型(LLM)到专业领域通常会带来高数据和计算开销。尽管先前的效率努力大多将数据选择和参数高效微调视为孤立过程,我们的实证分析表明它们可能本质上是耦合的。我们提出了强映射假说:稀疏的注意力头子集在任务特定适应中起主导作用,作为解锁特定数据模式的钥匙。基于这一观察,我们提出了从参数到数据(P2D)统一框架,利用这些任务敏感的注意力头作为双指南,用于样本挖掘和结构剪枝。为了严格量化整个流程的成本,我们引入了对齐效率比率(AER)指标,用于衡量选择延迟和训练时间。机理上,P2D通过轻量级代理识别关键头,并利用它们作为功能性过滤器来精选高亲和力数据,建立协同流程。经验上,通过更新仅10%的注意力头在10%的数据上,P2D在强基线基础上实现了8.3个百分点的性能提升,并提供了7.0倍的端到端时间加速。这些结果验证了精确的参数-数据同步消除了冗余,提供了一种新的高效对齐范式。

英文摘要

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

2605.21556 2026-05-22 cs.LG 版本更新

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

超越单一广告位:多广告位保障型显示广告的联合优化

Zhaoqi Zhang, Jiaming Deng, Miao Xie, Linyou Cai, Qianlong Xie, Xingxing Wang, Siqiang Luo, Gao Cong

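Affiliations * Nanyang Technological University; China Agricultural University

AI summary: This paper proposes a joint optimization framework for multi-slot guaranteed display advertising that addresses slot-level redundancy, contract imbalance, and exposure concentration. It formulates allocation as an offline bipartite matching problem with a contract roulette mechanism, improving merchant ROI, platform revenue efficiency, and contract fulfillment robustness.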

Comments Accepted at SIGIR Industry Track 2026

详情
AI中文摘要

保障型显示广告对于平台变现至关重要,但现有方法通常基于单一广告位假设,限制了其在多广告位页面浏览中的优化能力。本文提出了一种新颖的多广告位保障型显示广告联合优化框架,解决了广告位冗余、合同不平衡和曝光集中等关键挑战。我们的方法将分配建模为一个离线 bipartite 匹配问题,采用合同轮盘机制实现广告位独占性,并通过页面浏览约束实现印象控制,同时结合可扩展的分配优化算法以实现高效的大规模部署。在美团广告平台的大量在线测试中,我们的方法显著提高了广告商 ROI、平台收入效率和合同履行的鲁棒性。具体而言,在线 A/B 测试显示在 70% 的流量下,平均收入每用户增加了 28.99%,DID 分析进一步表明合同稳定性得到改善,证明了我们的框架在现实广告部署中的强大适用性和有效性。

英文摘要

Guaranteed display (GD) advertising is crucial for platform monetization, yet existing methods often operate under a single-slot assumption, limiting their ability to optimize allocation across multi-slot page views. In this paper, we propose a novel joint optimization framework for multi-slot GD allocation, addressing key challenges such as slot-level redundancy, contract imbalance, and exposure concentration. Our approach formulates the allocation as an offline bipartite matching problem with a contract roulette mechanism for slot exclusivity and Page View constraints for impression control, and incorporates a scalable allocation optimization algorithm for efficient large-scale deployment. Extensive online tests on the Meituan advertising platform demonstrate that our method significantly improves merchant ROI, platform revenue efficiency, and contract fulfillment robustness. Specifically, online A/B tests show a 28.99% increase in Average Revenue Per User under 70% traffic, and difference-in-differences (DID) analysis further indicates improved contract stability, demonstrating the strong applicability and effectiveness of our framework in real-world advertising deployments.

2605.21553 2026-05-22 cs.LG cs.IT eess.IV math.IT 版本更新

TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems

TONIC:面向任务的无线系统中的基于标记的语义通信

Sige Liu, Kezhi Wang

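Affiliations * Department of Computer Science, Brunel University London

AI summary: This paper proposes TONIC, a token-centric semantic communication framework for task-oriented wireless systems that combines semantic-aware protection at the transmitter with confidence-aware gating at the receiver, outperforming conventional approaches.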

Comments 15 pages, 10 figures

详情
AI中文摘要

标记正成为基础模型表示和处理信息的基本单元,用于理解和推理。然而,传统的位级忠实无线通信在可靠传输的内容与下游模型实际消耗的内容之间存在不匹配。这种不匹配要求一种通信设计,能够直接考虑标记层面的任务相关性和下游模型需求,而不是将所有传输位视为同等重要。在本文中,我们提出了TONIC,一种面向任务的无线系统中的基于标记的语义通信框架。发送端将每个源样本转换为标记序列,估计标记层面的任务相关性,并在固定信道使用预算下通过效用感知的非均等错误保护分配保护。在接收端,使用标记层面的置信度来门控不可靠的决策,将有害的替代转换为可恢复的擦除,在基于Transformer的完成模型恢复被遮蔽的标记以进行最终任务推理之前。我们的框架在模块化且可解释的架构中结合了发送端的语义感知保护和接收端的置信度感知门控,而不是仅依赖于完全黑盒端到端学习。我们进一步建立了接收端门控规则的效用感知贝叶斯风险解释,并研究其与非均等保护和完成的相互作用。在图像分类实验中,结果表明TONIC在匹配的通信预算下,无论是在AWGN、瑞利和莱斯信道上,都优于分离式方案、像素域DeepJSCC基线和标记域基线。

英文摘要

Tokens are becoming the basic units through which foundation models represent and process information for understanding and inference. However, traditional wireless communication, centered on bit-level fidelity, faces a mismatch between what is transmitted reliably and what downstream models actually consume. This mismatch calls for a communication design that directly accounts for token-level task relevance and downstream model requirements, rather than treating all transmitted bits as equally important. In this paper, we propose TONIC, a token-centric semantic communication framework for task-oriented wireless systems. The transmitter converts each source sample into a sequence of tokens, estimates token-level task relevance, and allocates protection through utility-aware unequal error protection under a fixed channel-use budget. At the receiver, token-level confidence is used to gate unreliable decisions, turning harmful substitutions into recoverable erasures before a Transformer-based completion model restores the masked tokens for final task inference. Our framework combines transmitter-side semantic-aware protection with receiver-side confidence-aware gating in a modular and interpretable architecture, rather than relying solely on fully black-box end-to-end learning. We further establish a utility-aware Bayes-risk interpretation for the receiver-side gating rule and study its interaction with unequal protection and completion. Experimental results on image classification show that TONIC consistently outperforms separation-based schemes, the pixel-domain DeepJSCC baseline, and token-domain baselines under matched communication budgets over AWGN, Rayleigh, and Rician channels.
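The receiver-side gating step, turning unreliable token decisions into erasures so that a completion model can fill them in, can be sketched in a few lines. The `MASK` sentinel and the fixed `threshold` are assumptions for illustration; the abstract describes a utility-aware Bayes-risk rule whose exact form is not given there.

```python
MASK = -1  # hypothetical sentinel marking an erased token

def gate_tokens(decoded_tokens, confidences, threshold=0.6):
    """Replace low-confidence token decisions with MASK (an erasure)."""
    return [
        tok if conf >= threshold else MASK
        for tok, conf in zip(decoded_tokens, confidences)
    ]
```

The design rationale is that a wrong token silently misleads the downstream model, whereas a masked position is a known gap that the completion model is trained to restore.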

2605.21552 2026-05-22 cs.LG stat.ML 版本更新

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

期望一致性损失:在协变量偏移下重新思考置信度校准

Jinzong Dong, Zhaohui Jiang, Bo Yang

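Affiliations * School of Automation, Central South University, Changsha, China

AI summary: This paper addresses confidence calibration under covariate shift and proposes an unsupervised domain adaptation loss, Expectation Consistency Loss (ECL), which is theoretically grounded and shown empirically to calibrate confidence on the target domain.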

Comments Accepted by ICML 2026

详情
AI中文摘要

置信度校准对于分类模型在安全关键决策场景中的应用至关重要,并已受到广泛关注。通用的置信度校准方法假设训练和测试数据是独立同分布的,这在存在协变量偏移时限制了其有效性。在协变量偏移下的先前校准方法在类内或标准校准方面存在困难,且通常依赖于当密度比较大或无界时不稳定的重要性加权。鉴于上述限制,本文重新思考了协变量偏移下的置信度校准。首先,我们推导出协变量偏移下的置信度校准的必要且充分条件,称为期望一致性条件,该条件揭示协变量偏移并不必然导致未校准的置信度,并提供了比全局协变量分布对齐更弱的置信度校准条件。然后,利用期望一致性条件,本文提出了一种无监督域适应损失来校准目标域的置信度,称为期望一致性损失(ECL),该方法兼容标准校准、类内校准和顶部标签校准。第三,我们证明计算ECL损失的样本复杂度与预期校准误差(ECE)相同,并提供了一种理论支持的mini-batch可训练方案。最后,我们在模拟和现实世界协变量偏移数据集上验证了本文方法的有效性。

英文摘要

Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume that training and test data are independent and identically distributed, which limits their effectiveness under covariate shift. Previous calibration methods under covariate shift struggle with class-wise or canonical calibration and often rely on importance weighting, which is unstable when density ratios are large or unbounded. Given these limitations, this paper rethinks confidence calibration under covariate shift. First, we derive a necessary and sufficient condition for confidence calibration under covariate shift, named the Expectation Consistency condition, which reveals that covariate shift does not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, using the Expectation Consistency condition, we propose an unsupervised domain adaptation loss to calibrate confidence on the target domain, named Expectation Consistency Loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.

2605.21550 2026-05-22 cs.LG Version update

PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

Wangzhi Yu, Peng Zhu, Qing Zhao, Yiwen Jiang, Dawei Cheng

Affiliations: School of Computer Science and Technology, Tongji University; Big Data Center, State Grid Corporation of China

AI summary: This paper proposes PeakFocus, a unified multi-scale framework that couples peak localization with intensity regression for electricity load peak forecasting. It targets multi-scale representation conflict and intensity smoothing to improve forecasting accuracy.

Details
Abstract

Electricity load peak forecasting (ELPF), simultaneously predicting peak timing and intensity, is a prerequisite for effective grid scheduling and risk management. However, existing methods face three limitations. First, they adopt a two-stage predict-then-locate paradigm, which severs the link between temporal localization and intensity regression. Second, they still struggle with the multi-scale representation conflict, leading to peak misjudgment and timing misalignment. Third, the lack of explicit peak timing context during intensity regression causes intensity smoothing because predictions are dominated by global smoothing trends. To address these limitations, we propose PeakFocus, a unified framework for ELPF. (i) A Unified Peak-Aware Pipeline (UPAP) utilizes a triple hybrid loss to jointly supervise temporal localization and intensity regression, alongside a tolerance-based evaluation protocol. (ii) A Multi-Scale Mixing Peak Locator (MSM-PL) exploits coarse-grained features to mitigate peak misjudgment caused by local fluctuations, and injects them into fine-grained features via a cascade mechanism to resolve timing misalignment. (iii) A Location-Aware Decoder (LAD) injects peak timing context into the intensity regression process, providing explicit guidance to counteract intensity smoothing and improve peak intensity estimation. Extensive experiments on the public Electricity (ELC) dataset and our industrial-scale World Large-scale Electricity Load (WLEL) dataset show that PeakFocus outperforms baselines in both timing precision and intensity estimation.

2605.21548 2026-05-22 stat.ML cs.AI cs.LG Version update

Local Covariate Selection for Average Causal Effect Estimation without Pretreatment and Causal Sufficiency Assumptions

Zeyu Liu, Zheng Li, Feng Xie, Yan Zeng, Hao Zhang, Kun Zhang

Affiliations: Department of Applied Statistics, Beijing Technology and Business University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; College of Computer Science and Artificial Intelligence, Fudan University; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: This paper proposes a local learning method for covariate selection in nonparametric causal effect estimation. The method avoids the pretreatment and causal sufficiency assumptions and improves computational efficiency while keeping effect estimates accurate.

Details
Abstract

We study the problem of selecting covariates for unbiased estimation of the total causal effect. Existing approaches typically rely on global causal structure learning over all variables, or on strong assumptions such as causal sufficiency (observed variables share no latent confounders) or the pretreatment assumption, which limits covariates to those unaffected by the treatment or outcome. These requirements are often unrealistic in practice, and global learning becomes computationally prohibitive in high-dimensional settings. To address these challenges, we propose a novel local learning method for covariate selection in nonparametric causal effect estimation that avoids both the pretreatment and causal sufficiency assumptions. We first characterize a local boundary that contains at least one valid adjustment set whenever one exists for identifying the causal effect, and then develop local identification procedures to efficiently search within this boundary. We prove that the proposed method is sound and complete. Experiments on multiple synthetic datasets and two real-world datasets show that our approach achieves accurate causal effect estimation while substantially improving computational efficiency.

2605.21544 2026-05-22 cs.LG eess.SP Version update

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Robin Reiter, Denis Cornet, Fabien Michel, Lauriane Rouan, Gregory Beurier

Affiliations: CIRAD; UMR AGAP Institut; Université de Montpellier; INRAE; Institut Agro; LIRMM

AI summary: This paper studies tabular foundation models for calibrating near-infrared chemical sensing data. Comparing models on regression and classification tasks, it finds that preprocessing-optimized TabPFN ranks best for regression, that TabPFN applied directly to raw spectra ranks best for classification, and that classical chemometric models remain competitive on spectral outliers and extrapolated samples.

Comments 56 pages, 17 figures, including supplementary material

Details
Abstract

Near-infrared spectroscopy is increasingly used as a rapid, non-destructive chemical sensing technology for the analysis of food, pharmaceutical, biological, and environmental samples. However, the practical deployment of NIR sensors still depends on calibration models able to handle high-dimensional, collinear spectra, limited sample sizes, preprocessing dependence, spectral outliers, and extrapolation beyond the calibration domain. Here, we evaluate whether tabular foundation models can provide a new calibration strategy for NIR chemical sensing. We benchmark TabPFN on 66 NIR datasets covering 54 regression and 12 classification tasks, and compare direct inference on raw spectra with preprocessing-optimized inference against PLS/PLS-DA, Ridge, CatBoost, and one-dimensional convolutional neural networks. The study uses a unified validation framework in which preprocessing and model selection are performed exclusively on calibration data before external test evaluation. In regression, preprocessing-optimized TabPFN achieves the best overall average rank and significantly outperforms PLS, CatBoost, TabPFN on raw spectra, and CNN-1D, while remaining statistically comparable to Ridge. In classification, TabPFN applied directly to raw spectra provides the best average rank, with performance close to the optimized variant. Robustness analyses show that TabPFN provides strong average predictive performance but that its advantage decreases on spectral outliers and extrapolated samples, where classical chemometric models remain competitive. These results suggest that tabular foundation models can complement established chemometric workflows for NIR chemical sensing, especially in small- to medium-sized calibration settings, while highlighting the need for spectroscopy-specific priors and uncertainty-aware deployment strategies.

2605.21543 2026-05-22 cs.LG Version update

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

Zhenlong Liu, Hao Zeng, Hongxin Wei

Affiliations: Department of Statistics and Data Science, Southern University of Science and Technology; Shanghai Innovation Institute

AI summary: This paper proposes a provable decontamination method for benchmarking multiple large language models. A joint selection procedure controls the global contamination rate, which makes cross-model comparisons more reliable.

Details
Abstract

Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
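The first two JECS steps described above (per-model conformal p-values, then the per-item maximum) can be sketched as follows, with a plain Benjamini-Hochberg step-up standing in for the paper's envelope rescaling and adaptive BH, which the abstract does not detail. The score convention (a larger score is stronger evidence against the null), the function names, and the toy numbers are all assumptions made for illustration.

```python
def conformal_p_value(test_score, null_scores):
    """p = (1 + #{null scores >= test score}) / (n + 1)."""
    n_at_least = sum(1 for s in null_scores if s >= test_score)
    return (1 + n_at_least) / (len(null_scores) + 1)


def max_p_aggregate(p_values_per_model):
    """Combine one item's per-model p-values by taking their maximum."""
    return max(p_values_per_model)


def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices rejected by the (non-adaptive) BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is at or below its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical calibration scores drawn under the null, one list per audited model.
null_scores = {
    "model_a": [0.1 * k for k in range(1, 10)],
    "model_b": [0.1 * k for k in range(1, 10)],
}
# Hypothetical per-model scores for three benchmark items.
item_scores = [
    {"model_a": 0.95, "model_b": 0.95},
    {"model_a": 0.95, "model_b": 0.05},
    {"model_a": 0.99, "model_b": 0.99},
]
max_p = [
    max_p_aggregate([conformal_p_value(scores[m], null_scores[m]) for m in scores])
    for scores in item_scores
]
selected = benjamini_hochberg(max_p, alpha=0.2)
print(max_p, selected)
```

Taking the maximum means an item is selected only if every model's p-value is small, so a single shared benchmark results instead of one benchmark per model.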

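2605.21542 2026-05-22 cs.LG Version update

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

Andi Xu

Affiliations: School of Engineering, Jönköping University

AI summary: This paper proposes AC-GATE, a lag-gated neural audit framework for panel time series that addresses how different entities respond to historical signals over different time horizons. An adaptive-conditioning encoder with a scale-invariant lag gate discovers lag heterogeneity and exposes effective lags as structural outputs of the model.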

Comments Preprint/technical paper. An interpretable neural audit framework for entity-conditioned lag discovery in panel time series. 10 pages, 5 figures, 16 tables. Code available at the GitHub repository

Details
Abstract

Country-level temporal panels are widely used in empirical analysis. Researchers often need to audit how different entities respond to historical signals over different time horizons. Current approaches typically do not provide directly auditable entity-specific lag summaries. We formulate entity-conditioned heterogeneous lag discovery as a temporal panel mining task and propose AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate. It instantiates conditional Moderated Distributed Lag by using observable entity-level proxies to condition lag-weight distributions over historical observations, thereby making effective lags structural outputs of the model rather than post-hoc explanations. The evaluation is based on a layered audit protocol that separates predictive calibration from lag discovery. A synthetic panel with known ground-truth lags is used for mechanism recovery testing, and two real-world country-level panels are used for external audit and stress testing. The results show that AC-GATE can recover heterogeneous lag structure in synthetic data, and generates non-degenerate, externally structured effective lags in real data.

2605.21541 2026-05-22 cs.CR cs.AI cs.LG stat.ML Version update

Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

Leitao Yuan, Qinghua Mao, Daizong Liu, Kun Wang, Wenjie Wang, Yan Teng, Jing Shao, Dongrui Liu

Affiliations: Shanghai Artificial Intelligence Laboratory; Zhejiang University; Shanghai Jiao Tong University; Wuhan University; Nanyang Technological University; University of Science and Technology of China

AI summary: This paper proposes FRA-Attack, which addresses adversarial transferability against closed-source multimodal large language models through frequency-domain regularization. A high-pass DCT objective and frequency-domain gradient regularization improve cross-model transfer.

Details
Abstract

Multimodal large language models (MLLMs) remain vulnerable to transfer-based targeted attacks, where perturbations optimized on open-source surrogate encoders can generalize to closed-source MLLMs. A key challenge for improving adversarial transferability is to effectively capture the intrinsic visual focus shared across different models, such that perturbations align with transferable semantic cues rather than surrogate-specific behaviors. However, existing methods suffer from spatial-domain feature redundancy and surrogate-specific gradient signals, thereby hindering cross-model transferability. In this paper, we propose FRA-Attack, which addresses both challenges from a unified frequency-domain regularization perspective. For feature alignment, a high-pass DCT objective on patch features suppresses redundant global structures and concentrates the loss on the high-frequency band that carries the MLLMs' intrinsic visual focus. For gradient optimization, we introduce Frequency-domain Gradient Regularization (FGR), a model-agnostic low-pass regularizer that modulates the surrogate gradient using only the geometric frequency coordinate, i.e., no surrogate-derived statistic is involved, so that FGR is model-agnostic by construction, removing surrogate-specific high-frequency artifacts while preserving transferable low-frequency directions. Together, the two components form a unified frequency-domain treatment of transferability. Extensive experiments on 15 flagship MLLMs across 7 vendors show that FRA-Attack achieves superior cross-model transferability, particularly with state-of-the-art performance on GPT-5.4, Claude-Opus-4.6 and Gemini-3-flash.

2605.21539 2026-05-22 cs.LG Version update

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

Xuyang Zhong, Qizhang Li, Yiwen Guo, Chen Liu

Affiliations: Department of Computer Science, City University of Hong Kong; Independent Researcher

AI summary: This paper proposes DualOptim+, an optimization framework for improving machine unlearning in large language models. A base state and delta states balance the forgetting and retaining objectives, and an 8-bit quantized variant reduces memory overhead. Experiments report strong results across several tasks.

Comments Accepted by ICML 2026

Details
Abstract

We propose DualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by the forgetting and retaining objectives, and delta states to preserve objective-specific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious and real-world unlearning, safety alignment, and multi-task learning tasks demonstrate that DualOptim+ consistently achieves a superior trade-off between different objectives. Code is available at https://github.com/CityU-MLO/DualOptimPlus.

2605.21534 2026-05-22 stat.ML cs.LG Version update

Adaptive RBF-KAN: A Comparative Evaluation of Dynamic Shape Parameters in Kolmogorov-Arnold Networks

Roberto Cavoretto, Alessandra De Rossi, Adeeba Haider, Amir Noorizadegan

Affiliations: Member of the INdAM Research Group GNCS; Department of Mathematics, Hong Kong Baptist University

AI summary: This paper studies the choice of dynamic shape parameters in Kolmogorov-Arnold Networks. It extends RBF-KAN with a broader family of radial basis kernels and LOOCV-based kernel scale estimation, improving adaptability to different function types.

Details
Abstract

Kolmogorov-Arnold Networks (KANs) approximate multivariate functions using learnable univariate edge functions, typically parameterized by B-spline bases. Although effective, spline-based implementations can be computationally expensive. A modified version of KANs, called FastKAN, improves efficiency by replacing splines with Gaussian radial basis functions (RBFs), but it relies on a fixed kernel and shape parameter. In this work, we extend the RBF-based KAN framework by introducing a broader family of radial basis kernels and by initializing the kernel shape parameter using leave-one-out cross-validation (LOOCV). To the best of our knowledge, this is the first study that integrates LOOCV-based kernel scale estimation with deep KAN training. We also introduce Matérn and Wendland kernels into the KAN framework for the first time, enabling more flexible basis representations beyond the Gaussian kernel used in FastKAN. The LOOCV estimate provides a data-driven initialization of the kernel scale, which is subsequently refined during network training. The proposed adaptive RBF-KAN is evaluated on several two-dimensional benchmark functions. The results highlight the importance of kernel selection and adaptive shape parameters, with different kernels showing advantages for smooth functions, discontinuities, and oscillatory patterns. Overall, combining LOOCV-based initialization with adaptive kernel learning provides a practical strategy for improving RBF-based KAN models.
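To make the idea of LOOCV-based shape parameter selection concrete, here is a minimal pure-Python sketch for plain 1-D RBF interpolation, not for a KAN. The kernel forms (Gaussian, a C2 Matérn, a C2 Wendland function), the brute-force refit per left-out point, the candidate grid, and all names are assumptions chosen for illustration; the paper's actual parameterization and LOOCV procedure are not given in the abstract.

```python
import math


def gaussian(r, eps):
    return math.exp(-(eps * r) ** 2)


def matern_c2(r, eps):
    return (1.0 + eps * r) * math.exp(-eps * r)


def wendland_c2(r, eps):
    s = eps * r
    # Compactly supported: identically zero once eps * r reaches 1.
    return (1.0 - s) ** 4 * (4.0 * s + 1.0) if s < 1.0 else 0.0


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def loocv_error(xs, ys, kernel, eps):
    """Root-mean-square leave-one-out error: refit without point k, predict at point k."""
    total = 0.0
    for k in range(len(xs)):
        tx = xs[:k] + xs[k + 1:]
        ty = ys[:k] + ys[k + 1:]
        gram = [[kernel(abs(a - b), eps) for b in tx] for a in tx]
        coef = solve(gram, ty)
        pred = sum(c * kernel(abs(xs[k] - b), eps) for c, b in zip(coef, tx))
        total += (pred - ys[k]) ** 2
    return math.sqrt(total / len(xs))


# Toy data: eight samples of one sine period; candidate shape parameters are arbitrary.
xs = [i / 7 for i in range(8)]
ys = [math.sin(2 * math.pi * x) for x in xs]
candidates = [1.0, 2.0, 4.0, 8.0]
for name, kernel in [("gaussian", gaussian), ("matern_c2", matern_c2), ("wendland_c2", wendland_c2)]:
    best = min(candidates, key=lambda e: loocv_error(xs, ys, kernel, e))
    print(name, best)
```

In the setting the abstract describes, a value chosen this way would serve only as the starting point for the shape parameter, which is then refined by gradient-based training.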

2605.21527 2026-05-22 eess.IV cs.CV cs.LG Version update

CryoNet: A Deep Learning Framework for Multi-Modal Debris-Covered Glacier Mapping. A Case Study of the Poiqu Basin, Central Himalaya

Farzaneh Barzegar, Tobias Bolch, Norbert Kuehtreiber, Silvia L. Ullo

Affiliations: University of Sannio; Graz University of Technology

AI summary: This study proposes CryoNet, a deep learning framework that uses a multi-modal dataset to distinguish clean-ice glaciers, debris-covered glaciers, and glacial lakes. A case study in the Poiqu Basin, central Himalaya, demonstrates its effectiveness in complex high-mountain environments.

Comments 15 pages, 10 figures, 5 tables. Preprint submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS); currently under review

Details
Abstract

Glaciers play a critical role as freshwater reserves and indicators of climate change, yet their automatic delineation, especially for debris-covered glaciers, remains challenging due to spectral similarity with surrounding terrain. This study introduces CryoNet, a deep learning framework that leverages a rich multi-modal dataset combining Sentinel-2 optical imagery, DEM-derived topographic variables, spectral indices, Principal Component Analysis (PCA), InSAR coherence and phase, tasseled-cap features, and GLCM texture to discriminate clean-ice glaciers, debris-covered glaciers, and glacial lakes. CryoNet is an encoder-decoder CNN with nested skip connections and spatial-channel Squeeze-and-Excitation (scSE) attention, built upon a ResNet101 encoder to capture hierarchical contextual and spatial features. The study is conducted in the Poiqu Basin in the central Himalaya, and transferability is evaluated by applying the trained model to the Mont Blanc Massif in the Alps. We additionally analyse the importance of each data layer in improving glacier mapping performance. The proposed model achieves an overall IoU of 90.52%, mean Recall of 98.08%, and mean Precision of 92.26%. For debris-covered glaciers specifically, CryoNet obtains an IoU of 90.46%, a recall of 95.79%, and a precision of 94.21%. Across both per-class and overall metrics, CryoNet surpasses DeepLabV3+, SegFormer, and U-Net, taken as state-of-the-art (SOTA) references, demonstrating its effectiveness for robust glacier mapping in complex high-mountain environments.
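The per-class IoU, recall, and precision figures quoted above follow the usual confusion-count definitions. A minimal sketch, assuming flattened lists of per-pixel class labels; the function name, class names, and toy labels are hypothetical and the numbers are unrelated to the paper's results.

```python
def per_class_metrics(y_true, y_pred, cls):
    """IoU, precision, and recall for one class from paired label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)

    def ratio(num, den):
        # Report 0.0 when a class is absent from both truth and prediction.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy labels for six pixels (made up for illustration).
y_true = ["debris", "debris", "clean", "lake", "debris", "clean"]
y_pred = ["debris", "clean", "clean", "lake", "debris", "debris"]
print(per_class_metrics(y_true, y_pred, "debris"))
```

IoU penalizes both false positives and false negatives in one number, so it can never exceed the smaller of precision and recall for the same class.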

2605.21522 2026-05-22 q-bio.QM cs.AI cs.CE cs.LG stat.ML Version update

Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery

Kingsley Yeon, Xuefeng Liu, Promit Ghosal

Affiliations: Department of Statistics and CCAM, University of Chicago; School of Medicine, Stanford University; Department of Statistics, University of Chicago

AI summary: This paper proposes an interpretable framework for protein-protein interaction (PPI) discovery that recasts the task as an interpretable search problem with explicit reasoning, using embedding-space flow matching and Tree-of-Thoughts search to improve prediction accuracy and interpretability.

Details
Abstract

Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying candidates as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves 91.08 ± 0.19 Micro-F1, outperforming existing PPI methods on the same dataset.
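The exploitation versus exploration trade-off of a Boltzmann policy can be illustrated with a generic temperature-scaled softmax. This is only a sketch of the general technique: how the paper conditions the policy on language-model directives is not described in the abstract, and the function names and candidate values below are assumptions.

```python
import math


def boltzmann_policy(values, temperature):
    """Selection probabilities proportional to exp(value / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical value estimates for three candidate binding partners.
values = [0.9, 0.5, 0.1]
for temperature in (0.05, 0.5, 5.0):
    probs = boltzmann_policy(values, temperature)
    print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Low temperatures concentrate probability on the highest-valued candidate (exploitation), while high temperatures flatten the distribution and raise its entropy (exploration).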

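2605.21519 2026-05-22 cs.SI cs.LG Version update

Neural Acceleration for Graph Partitioning

Joshua Dennis Booth, Vishvam Patel

Affiliations: Department of Computer Science, University of Alabama in Huntsville

AI summary: This paper replaces the traditional eigenvalue computation with a neural network model to accelerate spectral graph partitioning, substantially reducing computational overhead while maintaining partition quality and improving scalability and efficiency on large-scale problems.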

Details
Abstract

Graph partitioning is a critical problem in numerous scientific and engineering domains, including social network analysis and VLSI design. Spectral methods are known to produce quality partitions while minimizing edge cuts for a wide range of problems. However, computing the Fiedler vector, an eigenvector associated with the second smallest eigenvalue of the graph Laplacian, remains a significant bottleneck in both memory and computation. In this paper, we present an accelerated approach to spectral bisection partitioning by replacing the traditional eigenvalue calculation with a simple artificial neural network model to approximate the Fiedler vector. We demonstrate that our approach achieves partitioning quality comparable to spectral bisection while significantly reducing the computational overhead, making it more scalable and efficient for large-scale problems.
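As context for what the neural model approximates, here is a minimal pure-Python sketch of classical spectral bisection on a small connected graph. It obtains the Fiedler vector by shifted power iteration with the constant vector projected out, which is one simple option for tiny graphs; it is not the paper's neural approach, and the function name, iteration count, and toy graph are assumptions.

```python
def fiedler_vector(adjacency, iterations=500):
    """Approximate the Fiedler vector of a connected, undirected, unweighted graph."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # Every Laplacian eigenvalue is at most 2 * max degree, so shift * I - L is
    # positive semidefinite and its top eigenvector (orthogonal to the constant
    # vector) belongs to the second smallest Laplacian eigenvalue.
    shift = 2.0 * max(degree)
    v = [float(i) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        # w = (shift * I - L) v with L = D - A.
        w = [
            (shift - degree[i]) * v[i] + sum(adjacency[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
barbell = [[0] * 6 for _ in range(6)]
for a, b in edges:
    barbell[a][b] = barbell[b][a] = 1

vec = fiedler_vector(barbell)
part_a = [i for i, x in enumerate(vec) if x < 0]
part_b = [i for i, x in enumerate(vec) if x >= 0]
print(part_a, part_b)
```

Splitting vertices by the sign of their Fiedler entry separates the two triangles and cuts only the bridging edge. Each iteration here costs O(n^2) on a dense matrix, which hints at why eigenvector computation becomes the bottleneck on large graphs.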

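2605.21516 2026-05-22 cs.LG cs.AI Version update

Harnesses for Inference-Time Alignment over Execution Trajectories

Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong

Affiliations: GitHub

AI summary: This paper studies harness design for LLM agents as inference-time alignment over execution trajectories, where harnesses aim to improve long-term performance through task decomposition and guided execution. It finds that more elaborate decomposition and guidance do not always give better results, separates harnesses into these two mechanisms, and validates the findings with synthetic experiments and real terminal-agent benchmarks.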

Details
Abstract

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

2605.21515 2026-05-22 cs.LG cs.AI Version update

Predicting Performance of Symbolic and Prompt Programs with Examples

Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu

Affiliations: Nanyang Technological University, Singapore; Massachusetts Institute of Technology, USA; Cornell University, USA

AI summary: This paper studies how to predict a program's performance from a few examples. It proposes an approach based on a simple coin-flip model that combines observed execution outcomes with a prior over performance, and develops RAP, which builds a proxy prior to improve prediction.

Details
Abstract

LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g., Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the prediction depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show RAP achieves solid performance.
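The coin-flip model can be worked through with a discrete prior over the success probability. The sketch below assumes two stylized priors (an all-or-nothing prior for symbolic programs and a uniform grid as a stand-in for a diffuse prior); both priors and the function name are illustrative assumptions, not the empirical priors or the RAP retrieval step from the paper.

```python
def posterior_mean(prior, passes, trials):
    """Posterior mean of the success probability under a discrete prior.

    prior maps each candidate success probability theta to its prior weight;
    each execution is treated as an independent Bernoulli(theta) draw.
    """
    fails = trials - passes
    weights = {
        theta: w * theta ** passes * (1.0 - theta) ** fails
        for theta, w in prior.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observed outcomes have zero probability under this prior")
    return sum(theta * w for theta, w in weights.items()) / total


# Stylized priors (assumptions): symbolic programs either always or never pass;
# prompt programs can sit anywhere on a grid of success rates.
all_or_nothing = {0.0: 0.5, 1.0: 0.5}
diffuse = {k / 10: 1 / 11 for k in range(11)}

print(posterior_mean(all_or_nothing, passes=3, trials=3))
print(posterior_mean(diffuse, passes=3, trials=3))
```

Under the all-or-nothing prior, a single passing test already rules out the failing hypothesis, whereas under the diffuse prior three passes still leave substantial mass on success rates below one. This mirrors the abstract's explanation of why a few passing tests certify symbolic programs but not prompt programs.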

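2605.21514 2026-05-22 cs.SI cond-mat.stat-mech cs.IT cs.LG math.IT physics.data-an Version update

Conditional Entropy of Heat Diffusion on Temporal Networks

Samuel Koovely, Alexandre Bovet

Affiliations: Department of Mathematical Modeling and Machine Learning

AI summary: This paper extends the conditional entropy of heat diffusion to temporal networks, uses it to detect the change points that separate structural phases, and shows that it yields an information-theoretic analog of the second law of thermodynamics.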

Details
Abstract

Many complex systems can be modeled by temporal networks, whose organization often evolves through distinct structural phases. Detecting the change points that delimit these phases is both important and challenging. In this work, we extend the conditional entropy of heat diffusion from static graphs to temporal networks and study its properties. We provide an upper bound and explain how discrepancies from it arise from the presence of asymmetric temporal paths. Moreover, we show that this quantity is monotone in time, yielding an information-theoretic analog of the second law of thermodynamics for inhomogeneous diffusion on temporal networks. We then introduce a local version of conditional entropy, designed to probe diffusion over finite temporal windows, and show that it provides an informative signal for change-point detection in continuous-time temporal networks. We evaluate the proposed methodology on synthetic benchmarks, including comparative experiments with existing nonparametric baselines in the snapshot setting, and then apply it to a real-world temporal contact network from a French primary school. Finally, we show how to use detected change points to perform community detection on targeted sub-intervals, improving the quality and interpretability of the clustering results.

2605.21510 2026-05-22 cs.SI cs.LG 版本更新

Community-Aware Vertex Ordering for Reference-Based Graph Compression: A Cross-Encoder Empirical Study

面向社区的顶点排序用于基于参考的图压缩:一种交叉编码实证研究

Jimmy Dubuisson

Affiliations * Vantino, Geneva, Switzerland

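AI summary: This paper proposes a two-stage Leiden+LLP vertex ordering and studies how it interacts with reference-based compression. On graphs with poor initial vertex order, reordering saves a substantial number of bits per edge, and the size of the gain is largely consistent across encoders.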

Comments 26 pages, 7 figures, 9 tables. Full reproducibility package at https://github.com/jimbotonic/Adjacently.jl. Preprint; comments welcome

详情
AI中文摘要

基于参考的图压缩通过将每个顶点的邻接列表相对于最近的顶点进行编码,利用局部性来压缩大规模有向图。主流工具WebGraph的BVGraph固定单一编码流程,并依赖于单独选择的顶点排序--通常为URL字典序或分层标签传播(LLP)。排序与编码器之间的相互作用很少被测量。我们提出了一种两阶段的Leiden+LLP顶点排序--全局LLP用于种子标签,Leiden社区检测,然后在每个诱导子图上进行每簇LLP--并研究其与基于参考的压缩的交互。在初始顶点排序较差的图中,重新排序在每组数据集和编码器上节省了0.3到5.4比特每边。该收益的大小对编码器的敏感性较小:在四个五弱排序数据集中,四个独立参数化的编码器在Leiden+LLP与纯LLP之间的收益在大约±0.04 bpe内一致。在URL排序的网络爬虫中,其中分布式排序已经编码了局部性,自适应编码器仍然受益于重新排序,但经过URL诱导残差结构(BV-HC,CG at K>1)调优的编码器会受到轻微损害。为了量化在排序固定后编码器选择的重要性,我们贡献了三个基于参考的编码器--BG、CS和CG--它们能够从最多28个候选分解中进行每顶点成本最优的选择。每个在自己最佳测试排序下运行。这三个中的最佳在每个测试数据集上都优于BVGraph高压缩性能,编码器层面的收益在弱排序数据集中始终小于排序层面的收益。编码器框架还产生了一个自限定的位流,支持低开销随机访问。

英文摘要

Reference-based graph compression encodes each vertex's neighbor list relative to a recent vertex, exploiting locality to compress large directed graphs. The dominant tool, WebGraph's BVGraph, fixes a single encoding pipeline and relies on a separately chosen vertex ordering -- typically URL-lexicographic or Layered Label Propagation (LLP). The interaction between ordering and encoder is rarely measured. We propose a two-stage Leiden+LLP vertex ordering -- global LLP to seed labels, Leiden community detection, then per-cluster LLP on each induced subgraph -- and study how it interacts with reference-based compression. On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge on every dataset and encoder we measured. The size of that gain is largely insensitive to the encoder: on four of five weakly ordered datasets, four independently parameterised encoders agree on the Leiden+LLP-vs-plain-LLP gain within roughly +/- 0.04 bpe. On URL-ordered web crawls, where the distributed ordering already encodes locality, adaptive encoders still benefit from reordering, but encoders tuned to URL-induced residual structure (BV-HC, CG at K>1) are mildly hurt by it. To quantify how much encoder choice matters once ordering is fixed, we contribute three reference-based encoders -- BG, CS, and CG -- that perform per-vertex cost-optimal selection from up to 28 candidate decompositions. Each is run under its own best-tested ordering. The best of the three improves over BVGraph high-compression by 2-9% on every dataset tested, with the encoder-level gain consistently smaller than the ordering-level gain on weakly ordered datasets. The encoder framework also yields a self-delimiting bitstream that supports low-overhead random access.
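To make the interaction between ordering and reference-based encoding concrete, here is a toy sketch. It is not BVGraph's bitstream format and not the BG/CS/CG encoders from the paper: it only counts how many neighbours must be stored explicitly ("residuals") when each list may copy from one of the previous `window` vertices. All function names and the example graph are assumptions made for illustration.

```python
def reorder(adjacency, order):
    """Relabel a graph so that old vertex order[i] becomes new vertex i."""
    new_id = {old: new for new, old in enumerate(order)}
    return [sorted(new_id[u] for u in adjacency[old]) for old in order]


def encode(adjacency, window=1):
    """Encode each neighbour list as (reference offset, copy mask, residuals)."""
    lists = [sorted(set(neighbours)) for neighbours in adjacency]
    encoded = []
    for v, neighbours in enumerate(lists):
        wanted = set(neighbours)
        best = (0, [], neighbours)  # offset 0 means "no reference"
        for offset in range(1, min(window, v) + 1):
            reference = lists[v - offset]
            mask = [u in wanted for u in reference]
            copied = {u for u, keep in zip(reference, mask) if keep}
            residuals = sorted(wanted - copied)
            if len(residuals) < len(best[2]):
                best = (offset, mask, residuals)
        encoded.append(best)
    return encoded


def decode(encoded):
    """Rebuild the neighbour lists from the (offset, mask, residuals) triples."""
    lists = []
    for v, (offset, mask, residuals) in enumerate(encoded):
        copied = []
        if offset:
            copied = [u for u, keep in zip(lists[v - offset], mask) if keep]
        lists.append(sorted(copied + residuals))
    return lists


def residual_count(encoded):
    return sum(len(residuals) for _, _, residuals in encoded)


# Hypothetical graph: vertices 0, 2, 4 link to {6, 7}; vertices 1, 3, 5 link to {8, 9}.
graph = [[6, 7], [8, 9], [6, 7], [8, 9], [6, 7], [8, 9], [], [], [], []]
clustered = reorder(graph, [0, 2, 4, 1, 3, 5, 6, 7, 8, 9])

interleaved_cost = residual_count(encode(graph))
clustered_cost = residual_count(encode(clustered))
```

In this toy setting the same encoder stores fewer explicit neighbours once similar vertices are placed next to each other, which is the kind of ordering-level gain the abstract measures in bits per edge.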

2605.21507 2026-05-22 physics.ao-ph cs.AI cs.CE cs.LG 版本更新

Visibility nowcasting in South Korea: a machine learning approach to class imbalance and distribution shift

韩国可见度现在预测:一种处理数据不平衡和分布偏移的机器学习方法

Bong Gyun Shin, Chan Sik Lee, Hyesun Suh

Affiliations * Department of AI Big Data; Daejin University; Department of Statistics and Actuarial Science; Soongsil University; College of Artificial Intelligence Convergence

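AI summary: This paper presents a machine learning framework for nowcasting atmospheric visibility in six major South Korean cities. It handles class imbalance with SMOTENC and CTGAN, evaluates an ensemble of machine learning and deep learning models, and finds that a distribution shift between the training and test periods degrades predictive performance, underscoring the need to account for evolving external environmental factors when deploying nowcasting models on time-series data.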

Comments Published in Theoretical and Applied Climatology

详情
Journal ref
Theoretical and Applied Climatology, vol. 157, art. no. 283, 2026
AI中文摘要

大气可见度是交通安全和空气质量管理的关键变量,然而,由于气象条件和空气污染物之间的复杂相互作用以及低可见度事件的稀有性,准确预测仍然具有挑战性。本研究引入了一种机器学习框架,用于预测韩国六个主要城市的可见度。为了处理2018-2020年训练数据中的不平衡问题,我们应用了合成少数类过采样技术(SMOTENC)和条件表格生成对抗网络(CTGAN)。然后,使用结合机器学习和深度学习模型的集成方法,并在2021年测试数据集上进行评估。结果表明,测试集的预测性能相比交叉验证阶段明显下降。这种退化归因于训练和测试期间的分布偏移,通过测量SHAP分析确定的最显著特征的Wasserstein距离得到了定量确认。总体而言,本研究提出了一种旨在同时解决数据不平衡和时间分布偏移双重挑战的方法,并强调在时间序列数据上实施现在预测模型时考虑不断变化的外部环境因素的必要性。

英文摘要

Atmospheric visibility is a critical variable for transportation safety and air quality management; however, accurate prediction remains challenging due to the complex interactions between meteorological conditions and air pollutants, as well as the rarity of low-visibility events. This study introduces a machine learning framework to nowcast visibility in six major South Korean cities. To handle the imbalance in the 2018-2020 training data, we applied the Synthetic Minority Over-sampling Technique with Nominal and Continuous (SMOTENC) and Conditional Tabular Generative Adversarial Network (CTGAN). An ensemble approach combining machine learning and deep learning models was then used and evaluated on a 2021 test dataset. The results revealed a marked decline in predictive performance in the test set compared to the cross-validation phase. This degradation was attributed to a distributional shift between training and testing periods, which was quantitatively confirmed by measuring the Wasserstein distance of the most influential feature identified by SHAP analysis. In general, this study presents a methodology that aims to simultaneously address the dual challenges of data imbalance and temporal distributional shifts, and emphasizes the necessity of accounting for evolving external environmental factors when implementing nowcasting models on time-series data.
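The shift check described in this abstract, a Wasserstein distance between the training-period and test-period values of one feature, can be sketched in a few lines. This is an illustrative assumption rather than the authors' pipeline: it handles only two one-dimensional samples of equal size, and the feature values are made up.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size one-dimensional samples.

    For equal-size samples the optimal transport plan pairs the sorted values,
    so the distance is the mean absolute difference between order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical values of a single feature in the training and test periods.
train_feature = [10.0, 12.0, 11.0, 13.0]
test_feature = [14.0, 16.0, 15.0, 17.0]

shift = wasserstein_1d(train_feature, test_feature)
no_shift = wasserstein_1d(train_feature, train_feature)
```

A distance near zero suggests the feature is distributed similarly in both periods; a large value relative to the feature's scale flags a shift worth investigating before trusting cross-validation scores.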

2605.21502 2026-05-22 q-bio.MN cs.AI cs.LG 版本更新

Graph neural network explanations reveal a topological signature of disease-associated hubs in biological networks

图神经网络解释揭示了生物网络中与疾病相关的枢纽的拓扑特征

Kyle Higgins, Ivan Laponogov, Dennis Veselkov, Kirill Veselkov

发表机构 * Division of Cancer, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London(癌症部、外科与癌症部门、医学学院、伦敦帝国学院) Department of Computing, Imperial College London(计算部门、伦敦帝国学院) Department of Environmental Health Sciences, Yale University(环境健康科学部门、耶鲁大学)

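AI summary: This paper evaluates how well GNN explanation methods identify disease-relevant structure in biological networks. Different explainers favor different signals (sparse single-node drivers versus distributed pathway-like signals), and a framework that combines a shell-based hub score with consensus ranking across explainers improves the prioritization of cancer genes and the recovery of biologically relevant molecules.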

Comments 25 pages (excluding supplement), 7 figures, 7 supplementary tables

详情
AI中文摘要

图神经网络(GNNs)越来越多地用于建模生物系统,但后验解释方法恢复有意义的分子机制的可靠性仍不清楚。本文系统评估了四种广泛使用的解释方法:显著性归因(SA)、集成梯度(IG)、GNNExplainer 和层间相关传播(LRP),以识别乳腺癌RNA-seq数据在蛋白质-蛋白质相互作用网络上的疾病相关结构。通过合成基准测试,我们发现解释方法恢复了不同的信号组织:SA在稀疏单节点驱动方面表现最佳,而IG和LRP更倾向于恢复分布式的路径样和级联样信号。在TCGA BRCA数据中,我们识别出一种一致的拓扑特征,即疾病相关枢纽的归因在最近的1跳邻居中达到峰值,并在后续网络壳层中衰减,这种模式在IG和LRP中最为显著,并与已知癌症枢纽的强富集相关。我们进一步观察到局部枢纽富集与全局基因排名性能之间的权衡,IG优化局部富集,而SA在全局区分方面表现更优。受这些互补行为的启发,我们提出了一种结合基于壳层的枢纽评分和解释器共识排名的框架。共识评分提高了对经典癌症基因(TP53、BRCA1、ESR1、MYC)的优先级排序,减少了对节点度数的依赖,并且在调优时优于单独的方法。通路富集进一步揭示了对生物上一致的癌症程序的改进恢复,包括ERBB2、RTK、MAPK、免疫和细胞因子信号。这些结果表明,拓扑感知的图解释整合可以提高生物可解释性和生物相关分子的恢复能力。

英文摘要

Graph neural networks (GNNs) are increasingly used to model biological systems, yet the reliability of post-hoc explanation methods for recovering meaningful molecular mechanisms remains unclear. Here, we systematically evaluate four widely used approaches: Saliency Attribution (SA), Integrated Gradients (IG), GNNExplainer, and Layer-wise Relevance Propagation (LRP) for identifying disease-relevant structure in breast cancer RNA-seq data projected onto a protein-protein interaction network. Using synthetic benchmarks with known ground-truth motifs, we show that explanation methods recover distinct signal organizations: SA performs best for sparse single-node drivers, whereas IG and LRP preferentially recover distributed pathway-like and cascade-like signals. In TCGA BRCA data, we identify a consistent topological signature of disease-associated hubs in which attribution peaks in the immediate 1-hop neighborhood and decays across successive network shells, a pattern most pronounced for IG and LRP and associated with strong enrichment of known cancer hubs. We further observe a trade-off between local hub enrichment and global gene ranking performance, with IG optimizing local enrichment and SA achieving superior global discrimination. Motivated by these complementary behaviors, we introduce a framework combining a shell-based hub score with consensus ranking across explainers. Consensus scores improve prioritization of canonical cancer genes (TP53, BRCA1, ESR1, MYC), reduce dependence on node degree, and, especially when tuned, outperform individual methods. Pathway enrichment further reveals improved recovery of biologically coherent cancer programs, including ERBB2, RTK, MAPK, immune, and cytokine signaling. Together, these results demonstrate that topology-aware integration of graph explanations can improve biological interpretability and biologically relevant molecular recovery.

2605.21499 2026-05-22 physics.flu-dyn cs.LG 版本更新

Conditional Neural Field based Reduced Order Model for Dynamic Ditching Load Prediction

基于条件神经场的降阶模型用于动态倾倒载荷预测

Henning Schwarz, Pyei Phyo Lin, Jens-Peter M. Zemke, Thomas Rung

发表机构 * Institute for Fluid Dynamics and Ship Theory, Hamburg University of Technology, Am Schwarzenberg-Campus 4, D-21073 Hamburg, Germany(流体动力学与船舶理论研究所,汉堡技术大学,Schwarzenberg Campus 4,德国汉堡,D-21073) Institute of Mathematics, Hamburg University of Technology, Am Schwarzenberg-Campus 3, D-21073 Hamburg, Germany(数学研究所,汉堡技术大学,Schwarzenberg Campus 3,德国汉堡,D-21073)

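AI summary: This paper proposes a reduced-order model based on conditional neural fields for predicting aircraft ditching loads. The model does not depend on the spatial discretization; paired with an LSTM network, it gives spatio-temporal predictions close in accuracy to grid-dependent convolutional autoencoder models, and it reconstructs loads well across different spatial discretizations.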

详情
AI中文摘要

基于网格的神经网络,如卷积自编码器,在计算流体力学中广泛用于基于维度缩减的替代模型。近年来,基于坐标的方案,如条件神经场的使用逐渐兴起。其不依赖空间离散化的特性为计算流体力学中的各种应用提供了有益的特性。本文讨论了使用条件神经场方法对飞机倾倒载荷进行时空预测。模型使用两个数据集进行评估,一个与单个固定空间离散化相关,另一个包含不同离散化数据的数据。当与潜在空间中的长短期记忆(LSTM)网络结合时,基于神经场的模型在第一个数据集上实现了与网格依赖的卷积自编码器模型相当的时空预测精度,但参数显著更少。第二个数据集的结果展示了基于神经场的方法在异质空间离散化条件下准确重建倾倒载荷的能力。这允许灵活地使用为不同几何形状和/或离散化生成的训练数据集,以及使用替代模型预测不同配置的载荷。

英文摘要

Grid-based neural networks such as convolutional autoencoders are widely used in dimension reduction-based surrogate models for computational fluid dynamics. In recent years, coordinate-based approaches such as conditional neural fields have emerged. Their independence from the spatial discretization is a beneficial feature for various applications in computational fluid dynamics. This paper discusses the spatio-temporal prediction of aircraft ditching loads using a conditional neural field approach. The model is evaluated using two datasets for the dynamic loads on the fuselage of a DLR-D150 aircraft, one of which relates to a single fixed spatial discretization while the other includes data from different discretizations. When paired with a long short-term memory (LSTM) network in the latent space, the neural field-based model achieves a spatio-temporal prediction accuracy for the first dataset that is close to that of grid-dependent convolutional autoencoder-based models, with significantly fewer parameters. Results for the second dataset demonstrate the ability of the neural field-based approach to reconstruct ditching loads accurately for heterogeneous spatial discretizations. This allows for flexible use of training datasets generated for different geometries and/or discretizations, as well as the use of the surrogate model to predict loads for different configurations.

2605.21496 2026-05-22 cs.LG cs.AI cs.CL 版本更新

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

HealthCraft: 一种用于急救医学的强化学习安全环境

Brandon Dent

发表机构 * GOATnote Inc.(GOATnote公司)

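AI summary: This paper presents HealthCraft, described as the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. Using a FHIR R4 world state, 24 MCP tools, and a dual-layer rubric, it evaluates model safety and performance on emergency-medicine tasks and exposes safety failures in multi-step workflows.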

Comments 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft

详情
AI中文摘要

前沿语言模型被部署到临床工作流程的速度超过了评估它们安全性的基础设施。静态医学问答基准测试忽略了急救医学中至关重要的失败模式:轨迹级安全崩溃、工具误用和在持续临床压力下的屈从。我们提出了HealthCraft,首个公开的强化学习环境,该环境在真实急救医学条件下奖励轨迹级安全,源自Corecraft。它基于FHIR R4世界状态,包含14个实体类型和3,987个种子实体,暴露24个MCP工具,并定义了双层评估标准,只要任何安全关键性标准被违反,就会将奖励设为零。我们发布了195个任务,涵盖六个类别,根据2,255个二元标准(其中515个为安全关键性标准)进行评分;一个事后10任务负类列表将此扩展到205个任务和2,337个标准。在两个前沿模型上的V8结果表明,Claude Opus 4.6在Pass@1达到24.8% [21.5-28.4],GPT-5.4为12.6% [10.2-15.6],安全失败率为27.5%和34.0%。在多步骤工作流——最接近真实急救护理的代理——中,性能降至接近零(Claude 1.0%,GPT-5.4 0.0%),尽管在单个步骤上部分具备能力。在试点v2和v8之间修复了六个基础设施错误,重新排列了哪些模型“看起来更强”,这表明基础设施的保真度是测量的一部分。一个确定性的LLM-判断器叠加限制了评估者的噪声,并且一个60次负类烟雾试点显示奖励信号不是可直接用于训练的安全:限制标准通过率为0.929的患病率,这在评估工具可以容忍但训练奖励不能。我们搭建了与Corecraft第5.2节中的Megatron+SGLang+GRPO循环的耦合,并将训练奖励的消融作为未来的工作。环境、任务、评估标准和工具均在Apache 2.0下发布。

英文摘要

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
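The dual-layer rubric can be pictured with a small scoring function. Only the zeroing rule comes from the abstract; the base reward used here (the fraction of criteria passed), the `Criterion` type, and the example criteria are assumptions for illustration and are not taken from the HealthCraft code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    passed: bool
    safety_critical: bool = False


def trajectory_reward(criteria):
    """Fraction of criteria passed, zeroed if any safety-critical criterion fails."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    # Safety layer: a single violated safety-critical criterion zeroes the reward.
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    # Quality layer (assumed): share of binary criteria that passed.
    return sum(c.passed for c in criteria) / len(criteria)


# Hypothetical criteria for one task.
safe_run = [
    Criterion("orders ECG", True),
    Criterion("documents allergies", False),
    Criterion("checks contraindications", True, safety_critical=True),
    Criterion("escalates to attending", True),
]
unsafe_run = [
    Criterion("orders ECG", True),
    Criterion("documents allergies", True),
    Criterion("checks contraindications", False, safety_critical=True),
    Criterion("escalates to attending", True),
]

safe_reward = trajectory_reward(safe_run)
unsafe_reward = trajectory_reward(unsafe_run)
```

The unsafe run passes the same number of criteria as the safe run, yet it scores zero because the one criterion it misses is safety-critical.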

2605.21494 2026-05-22 cs.LG 版本更新

Double descent for least-squares interpolation on contaminated data: A simulation study

过拟合模型的最小二乘插值在受污染数据中的双下降现象:一项模拟研究

Tino Werner

发表机构 * Institute for Mathematics, Carl von Ossietzky University Oldenburg(奥尔登堡卡尔·冯·奥西特齐克大学数学研究所)

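AI summary: This paper asks whether double descent occurs in linear regression on contaminated data. It compares the least-squares interpolation estimator with several robust alternatives and finds that large overparametrization does produce double descent, with the least-squares interpolator generalizing better than the robust alternatives.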

详情
AI中文摘要

过参数化模型尽管根据经典统计理论应容易过拟合,但能表现出出色的泛化性能。双下降现象的发现,即在达到一定模型复杂度后泛化误差减小,开辟了新的研究方向。稳健统计考虑在受污染数据上的统计估计,由于现实数据不满足假设,导致数据点相对于假设的“理想”分布出现异常值,可能严重扭曲任何经典估计器。本文探讨在受污染训练数据的线性回归设置中是否会出现双下降现象。比较了高度非鲁棒的最小二乘插值估计器与几种鲁棒替代方法的性能。结果表明,大规模过参数化确实导致双下降现象,使最小二乘插值器的泛化性能非常优异,优于鲁棒替代方法。

英文摘要

Overparametrized models can exhibit excellent generalization performance, although they should be prone to overfitting according to classical statistical theory. The discovery of "double descent", in which the generalization error decreases again after a certain model complexity has been reached, opened a new line of research. Robust statistics considers statistical estimation on contaminated data, where assumptions that do not hold on real data make some data points appear as outliers with respect to the assumed "ideal" distribution, potentially severely distorting any classical estimator. We address the question of whether a double descent phenomenon can be observed in a linear regression setting with contaminated training data. We compare the performance of the highly non-robust least-squares interpolation estimator with several robust alternatives. It turns out that large overparametrization indeed allows for a double descent phenomenon, resulting in very good generalization performance of the least-squares interpolator, surpassing that of the robust alternatives.

2605.21493 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

不要压缩你的特征:为什么CenterLoss伤害OOD检测和多尺度Mahalanobis获胜

Rahul D Ray

发表机构 * Department of Electronics and Electrical Engineering(电子与电气工程系)

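AI summary: This paper proposes GOEN, which combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head to improve OOD detection. It finds that CenterLoss degrades OOD detection, and that GOEN-NoCenterLoss outperforms the baselines on CIFAR-10 benchmarks.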

详情
AI中文摘要

检测分布外(OOD)输入的能力是安全部署机器学习系统的基础。然而,当前方法往往依赖于仅优化分类准确性的特征表示,忽略了epistemic不确定性的要求。我们引入GOEN(几何优化的epistemic网络),一种结合多尺度特征、L2归一化、Mahalanobis距离和使用真实硬OOD示例训练的校准头的简单流程。通过系统消融,我们发现一个反直觉的发现:CenterLoss,一种用于特征紧凑性的流行正则化器,显著降低了OOD检测性能,尽管提高了分类准确性。最佳变体GOEN-NoCenterLoss在CIFAR-10基准上实现了0.9483的平均OOD AUROC,超过了包括深度集成(0.8827)、KNN(0.8967)和ODIN(0.8870)在内的所有基线方法,同时保持了有竞争力的分布内准确性。我们的结果挑战了普遍认为更好的分类几何自动导致更好的epistemic不确定性假设。相反,我们展示了过于紧致的特征簇会压缩类间边缘并扭曲所需的有效OOD检测的协方差结构。GOEN是高效的,在单个GPU上训练不到20分钟,并提供了一种构建可靠识别自身局限的AI系统的实用蓝图。

英文摘要

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.

2605.21492 2026-05-22 cs.LG cs.AI cs.LO stat.ML 版本更新

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

特征归因不可能性:在共线性下,没有任何特征排名是忠实、稳定和完整的

Drake Caraker, Bryan Arnold, David Rhoads

发表机构 * Independent Researchers(独立研究人员)

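AI summary: This paper proves that under collinearity no feature ranking can be simultaneously faithful, stable, and complete, proposes ensemble averaging (DASH) as a resolution, and machine-verifies the results in Lean 4, alongside practical diagnostics and implications for fairness auditing.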

Comments 66 pages, 12 figures, 305 Lean 4 theorems. Code at https://github.com/DrakeCaraker/dash-impossibility-lean

详情
AI中文摘要

在共线性情况下,没有任何特征排名可以同时忠实、稳定和完整。对于共线性对,排名本质上等同于抛硬币。我们证明了这一不可能性,针对四种模型类别进行了量化分析,通过集成平均(DASH)方法解决该问题,并利用305个Lean 4定理进行机验证。我们刻画了完整的归因设计空间:恰好存在两种方法家族——忠实-完整方法(不稳定,排名可能翻转多达50%的时间)和集成方法如DASH(稳定,对称特征报告平局)。归因比在梯度提升中发散为1/(1-rho^2),在Lasso中为无穷大,在随机森林中收敛。DASH(Diversified Aggregation of SHAP)在无偏聚合中被证明是帕累托最优的,达到Cramer-Rao方差下界并具有紧的集成大小公式。在77个公共数据集中,68%表现出归因不稳定性。在特征具有相等因果效应时,切换到条件SHAP无法逃脱这一不可能性。该框架包括实用的诊断工具——Z检验工作流程和单模型筛查工具——并直接影响公平性审计:基于SHAP的代理歧视审计在共线性下被证明不可靠。设计空间定理、诊断和不可能性均在Lean 4中形式化验证(305个定理从16个公理,0 sorry)——据我们所知,这是可解释AI领域首个形式化验证的不可能性。

英文摘要

No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist -- faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) -- and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics -- a Z-test workflow and single-model screening tool -- and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) -- to our knowledge, the first formally verified impossibility in explainable AI.
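As a rough feel for the quantitative claim, the scaling 1/(1-rho^2) quoted for gradient boosting can be tabulated directly. The snippet below only evaluates that expression; it is not taken from the paper's code, it ignores any constants the full result may carry, and `attribution_ratio` is a name introduced here for illustration.

```python
def attribution_ratio(rho):
    """Evaluate the 1 / (1 - rho^2) scaling for a feature correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)


# The ratio stays near 1 for weak correlation and grows without bound as |rho| -> 1.
ratios = {rho: attribution_ratio(rho) for rho in (0.0, 0.5, 0.9, 0.99)}
```

With uncorrelated features the expression equals 1, while at a correlation of 0.99 it already exceeds 50, which illustrates why near-collinear pairs are the problematic regime.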

2605.21491 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

通过比较想法评估教授语言模型预测研究成功的技巧

Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

Affiliations * IISER Pune (Indian Institute of Science Education and Research, Pune)

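AI summary: This study asks whether language models can forecast the empirical success of research ideas before any experiments are run. Using a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode, it reports 71.35% accuracy after reinforcement learning and argues that small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.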

Comments ACL 2026 Findings

详情
AI中文摘要

随着语言模型通过自动化假设生成和实现加速科学研究,出现了一个新的瓶颈:在没有彻底实验的情况下评估和过滤数百个AI生成的想法。我们问语言模型是否能学会在任何实验运行之前预测研究想法的实证成功。我们研究了比较实证预测:给定一个基准特定的研究目标和两个候选想法,预测哪个将实现更好的基准性能。我们构建了一个基于PapersWithCode客观结果的11,488对想法数据集。尽管现成的8B参数模型表现不佳(30%准确率),SFT显著提升了性能至77.1%,优于GPT-5(61.1%)。通过将评估框架为推理任务,通过可验证奖励的强化学习(RLVR),我们训练模型发现潜在的推理路径,实现71.35%的准确率,并具有可解释的依据。通过额外的消融和分布外测试,我们展示了对表面启发式的鲁棒性,并转移到了跨领域时间拆分测试集和独立构建的测试集。我们的结果表明,计算高效的轻量级语言模型可以作为有效的、客观的验证器,为自主科学发现提供可扩展的路径。

英文摘要

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

2605.21490 2026-05-22 cs.LG cs.CR 版本更新

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

基于时间对比的变压器用于金融犯罪检测:通过预测对比编码实现自监督序列嵌入

Danny Butvinik, Yonit Marcus, Nitzan Tal, Gabrielle Azoulay

发表机构 * NICE Actimize

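AI summary: This paper proposes the Temporal Contrastive Transformer (TCT), a representation learning framework for capturing temporal dynamics in financial transaction sequences. The model is trained with a self-supervised contrastive objective to produce embeddings of behavioral patterns over time for downstream fraud detection. The embeddings alone give meaningful predictive performance (AUC 0.8644), but combining them with domain-engineered features brings no measurable improvement (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions.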

Comments 10 pages, 4 figures, one table

详情
AI中文摘要

我们介绍了一种时间对比变压器(TCT),一种旨在捕捉金融交易序列中上下文时间动态的表示学习框架。该模型通过自监督对比目标进行训练,以生成编码时间行为模式的嵌入,以支持下游的欺诈检测任务。我们通过将学习到的嵌入作为输入特征送入梯度提升分类器,在现实环境中评估TCT。实验结果表明,仅使用嵌入本身就能实现有意义的预测性能(AUC 0.8644),表明模型能够捕捉非平凡的时间结构。然而,当结合领域工程特征时,与基线相比没有可观的提升(AUC 0.9205 vs. 0.9245),表明学习到的表示与现有特征抽象有较大重叠。这些发现将TCT定位为一种有前景的表示学习方法,能够捕捉相关的行为信号,同时凸显了在强领域特征上实现加性价值的挑战。这些结果反映了时间表示学习在金融犯罪检测中的发展中间阶段,并激励进一步研究模型架构、训练目标和整合策略。在这一早期阶段,实现与强特征工程基线相当的性能本身就是一个有意义的结果,表明学习到的表示可以近似于领域特定的特征,而无需手动工程。虽然尚未达到生产就绪状态,但这些结果指出了减少对特征工程依赖的有希望的方向。

英文摘要

We introduce the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture contextual temporal dynamics in sequences of financial transactions. The model is trained using a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time, with the goal of supporting downstream fraud detection tasks. We evaluate TCT in a realistic setting by using the learned embeddings as input features to a gradient boosting classifier. Experimental results show that embeddings alone achieve meaningful predictive performance (AUC 0.8644), indicating that the model captures non-trivial temporal structure. However, when combined with domain-engineered features, no measurable improvement is observed over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the challenges of achieving additive value over strong domain features. The results reflect an intermediate stage in the development of temporal representation learning for financial crime detection and motivate further research on model architecture, training objectives, and integration strategies. At this early stage, achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome, indicating that learned representations approximate domain-specific features without manual engineering. While not yet production-ready, these results point to a promising direction for reducing reliance on feature engineering in financial crime detection.

2605.21282 2026-05-22 cs.LG cs.AI 版本更新

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

随机均值流策略:带有熵镜降的一步生成控制

Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

发表机构 * Laboratory for Big Data and Decision(大数据与决策实验室) National University of Defense Technology(国防科技大学) Samsung AI Center Cambridge(三星AI研究中心) Queen Mary University of London(伦敦玛丽女王大学) Fudan University(复旦大学) ShanghaiTech University(上海科技大学) Shanghai Innovation Institute(上海创新研究院)

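AI summary: This paper proposes Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation, yielding a tractable entropy surrogate and allowing training within off-policy mirror descent for exploratory yet stable improvement.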

详情
AI中文摘要

在线离线策略强化学习(RL)受到两个耦合选择的影响:策略类和更新规则。高斯策略速度快且具有可计算的熵,但难以处理多模态动作分布。生成策略更具表现力,但通常需要迭代采样或缺乏可计算的熵估计。在优化方面,SAC风格的软策略改进和镜降(MD)可以视为最小化不同的KL散度:前者将策略推向价值诱导的玻尔兹曼分布,后者则通过之前的策略正则化每个更新。将熵正则化与MD约束结合因此具有吸引力,因为它支持探索并稳定策略改进;然而,所得到的目标可能是多模态的,且与单峰高斯策略不匹配。我们提出随机均值流策略(SMFP),一种一步生成策略类,通过均值流变换将高斯噪声映射到动作。这种随机重参数化产生了一个可计算的熵替代物,并允许均值流策略在离线策略镜降框架下通过统一的目标进行训练,以实现探索性且稳定的改进。在七个MuJoCo基准测试中,SMFP在高斯和生成基线之上取得了改进,同时保留了单步推断效率。

英文摘要

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

2605.20514 2026-05-22 cs.LG cs.NA math.NA stat.ML 版本更新

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

从稀疏数据快速重建精确的Maxwell动力学

Dan DeGenaro, Xin Li, Obed Amo, Michael Pokojovy, Sarah Adel Bargal, Markus Lange-Hegermann, Bogdan Raiţă

Affiliations * Department of Computer Science, Georgetown University; Department of Mathematics, Georgetown University; Department of Mathematics and Statistics, Old Dominion University; School of Data Science, Old Dominion University; Institute Industrial IT, Department of Computer Science and Automation, OWL University of Applied Sciences and Arts

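AI summary: This paper proposes FLASH-MAX, a neural network architecture that predicts homogeneous electromagnetic fields from sparse pointwise observations. The network satisfies Maxwell's equations symbolically by construction, trains quickly from sparse data with zero PDE residual, and improves the trade-off between precision and optimization speed in scientific machine learning.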

Comments 31 pages, 8 figures

详情
AI中文摘要

我们介绍了FLASH-MAX,一种浅层、精确由构造的神经网络架构,用于从稀疏点观测预测均匀电磁场。每个隐藏神经元代表Maxwell方程的一个独立精确解,因此网络通过构造满足 governing equations,并能从稀疏数据中以秒级时间进行端到端训练。我们证明了一个通用逼近结果,表明这种精确模型类在任意域上保持通用性。FLASH-MAX在约1K稀疏点观测中达到子1%的验证相对误差,同时保持零PDE残差,并在仅100观测采样时仍保持单数字误差。这些结果表明,将 governing structure 从损失转移到假设类可以显著提升科学机器学习中精度与优化速度的平衡。

英文摘要

We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell's equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.

2605.20303 2026-05-22 cs.LG 版本更新

AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

AirfoilGen: 一种用于翼型生成的可构造且性能感知的潜在扩散模型

Zhijie Yang, Min Tang, Peng Du, Qiang Zou

Affiliations * State Key Laboratory of CAD & CG, Zhejiang University, Hangzhou, 310027, China

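AI summary: This paper proposes AirfoilGen, an airfoil generation model that constrains generation with a circle sweeping representation so that outputs respect essential airfoil characteristics, controls aerodynamic performance explicitly by operating in a learned latent space, and comes with a new dataset of over 200,000 airfoils.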

Comments 15 pages

详情
AI中文摘要

翼型形状设计是航空工程中的基本任务,直接影响飞行稳定性与燃油消耗。深度学习最近 emerged 作为一种有前景的工具用于此任务,但现有的深度生成方法在几何有效性与物理可控性方面仍然有限。它们对生成的形状控制很少,导致无效的几何形状,并且通常不有效地对气动性能进行条件化。为了解决这些问题,本文提出了一种名为AirfoilGen的可构造且性能感知的潜在扩散模型用于翼型生成。首先引入了一种新的翼型表示方案,即圆扫表示法,以约束生成过程,使得输出形状尊重基本的翼型特性。然后通过在学习的潜在空间中操作,实现对气动性能(例如升力和阻力系数)的显式控制:一个transformer模型将翼型形状编码为向量嵌入,而一个条件扩散模型将高斯噪声解噪为这些潜在嵌入,同时结合目标气动性能。此外,本文还提出了一组包含超过200,000个翼型的新数据集,该数据集比广泛使用的UIUC翼型数据集(1,650个翼型)大得多,并且更适合训练现代深度生成模型。实验表明,AirfoilGen在几何有效性和气动性能可控性方面比之前实现的要高得多,平均性能条件化精度为98.41%。

英文摘要

Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil generation. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

2605.20302 2026-05-22 cs.LG cs.CV 版本更新

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

按设计实现神经崩溃:在超球面上学习类别原型

Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis

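Affiliations * The Cyprus Institute; University of Athens; Archimedes AI/Athena Research Center; University of Cyprus

AI summary This paper studies Neural Collapse (NC), the theoretical optimum of supervised classification, and points out that the two dominant paradigms, cross entropy (CE) and supervised contrastive learning (SCL), fail to reach it in practice. The authors improve both CE and SCL by contrasting prototypes on the hypersphere, coming closer to NC on several benchmarks.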

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/nc_by_design

详情
AI中文摘要

监督分类有一个理论最优解,即神经崩溃(NC),然而其两种主导范式在实践中都无法达到这一最优。交叉熵(CE)保留了径向自由度,导致收敛到退化几何结构,而监督对比学习(SCL)在预训练阶段驱动特征向NC靠近,但在后续的线性探测阶段丢弃了这一结构。我们证明这两种范式实际上是同一种方法的不同表现,即在单位超球面上对比原型。缩小差距需要在各自失败点进行修正。从CE侧,我们提出NTCE和NONL两种归一化损失,将对比优化缺失的成分引入分类器学习:大有效负样本集和解耦的对齐和均匀性项。从SCL侧,我们证明SCL的目标在训练过程中已经优化了原理分类器,其权重是类别均值嵌入,使线性探测变得冗余且有害。实验表明,在四个基准测试(包括ImageNet-1K)中,NTCE和NONL在准确率上超过了CE,接近NC(≥95%),并在不到7.5%的迭代次数中在4/5个指标上匹配CE的收敛NC,而SCL在固定原型的情况下无需线性探测阶段即可达到。学习的几何结构在迁移学习中带来了+5.5%的平均相对改进,严重类别不平衡下可达+8.7%,并且在ImageNet-C上提高了对损坏的鲁棒性。本文将监督学习重新定义为在超球面上的原型学习,通过设计达到NC。

英文摘要

Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and improved robustness to corruptions on ImageNet-C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.
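The abstract's claim that SCL already yields a classifier "whose weights are the class mean embeddings" can be illustrated with a minimal sketch. The function names below are hypothetical and this is not the authors' code; it only shows the general idea of a cosine-similarity classifier built from class means on the unit hypersphere, under the assumption that prototypes are normalized class sums.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_mean_prototypes(embeddings, labels):
    """Sum the normalized embeddings of each class and re-normalize.

    Normalizing the sum gives the same direction as normalizing the mean.
    """
    sums = {}
    for emb, label in zip(embeddings, labels):
        unit = normalize(emb)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    return {label: normalize(acc) for label, acc in sums.items()}

def predict(embedding, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    unit = normalize(embedding)
    return max(
        prototypes,
        key=lambda c: sum(a * b for a, b in zip(unit, prototypes[c])),
    )

# Toy usage with two classes in two dimensions (illustrative data only).
embeddings = [[2.0, 0.1], [3.0, -0.1], [0.1, 1.0], [-0.1, 4.0]]
labels = ["a", "a", "b", "b"]
prototypes = class_mean_prototypes(embeddings, labels)
```

No separate linear probe is trained here: the prototypes are read directly off the embeddings, which is the property the abstract highlights.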

2605.20246 2026-05-22 cs.LG cs.AI 版本更新

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

GROW: 将GRPO与状态-动作建模对齐以适用于开放世界VLM智能体

Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu

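Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; East China Normal University; Zhejiang Normal University; Shandong Normal University

AI summary This paper proposes GROW, a framework that decomposes collected trajectories into state-action samples and computes advantages between those samples. This addresses the excessively long context and noise that standard GRPO incurs in multi-turn RL because it requires full trajectories. Experiments show state-of-the-art performance on more than 800 Minecraft tasks.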

详情
AI中文摘要

最近,视觉-语言模型(VLM)智能体在开放世界任务中展现出有前景的进步,其中成功的任务完成通常需要多次视觉感知和动作执行的回合。然而,现有方法仍主要依赖于监督微调(SFT)专家演示,而先进的强化学习(RL)算法,特别是分组相对策略优化(GRPO),尚未在这些任务中有效应用于多轮RL,因为标准GRPO需要完整的轨迹作为训练样本,导致上下文过长和噪声。为了解决这个问题,我们提出GROW,一种适用于开放世界VLM智能体的RL框架,将收集的轨迹分解为状态-动作样本,并在这些样本之间计算优势,而不是将完整轨迹视为单一实体。我们进一步提供了一个替代分析,表明尽管分组样本是基于不同的局部状态而不是相同的提示上下文,简化假设下目标可以保留GRPO的核心相对策略优化信号。在超过800个Minecraft任务上的实验表明,我们的方法实现了最先进的性能,证明了我们提出的RL框架在开放世界VLM智能体中的有效性。

英文摘要

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while the advanced reinforcement learning (RL) algorithm, specifically Group Relative Policy Optimization (GRPO), has not been effectively employed for multi-turn RL in these tasks because standard GRPO requires full trajectories as training samples, which leads to excessively long context and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples, and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.

2605.18893 2026-05-22 cs.LG 版本更新

Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence

位置:图压缩需要重新开始——超越全数据集训练和模型依赖

Mridul Gupta, Samyak Jain, Vansh Ramani, Hariprasad Kodamana, Sayan Ranu

Affiliations * Yardi School of Artificial Intelligence, IIT Delhi, India; Department of Computer Science and Engineering, IIT Delhi, India; Department of Chemical Engineering, IIT Delhi, India; Indian Institute of Technology Delhi, Abu Dhabi, Zayed City, Abu Dhabi, UAE

AI summary This paper argues that current graph condensation methods have systemic flaws and calls for a shift to lightweight, architecture-agnostic, and deployable methods that enable efficient, generalizable, and scalable GNN training.

详情
AI中文摘要

图神经网络(GNNs)是学习图结构数据的强大工具,但其可扩展性在推荐系统、欺诈检测和分子生物学等领域的现实图规模下日益受到限制。图压缩——生成保留原始模型性能的更小合成图的任务——已成为有前途的解决方案。然而,主流的梯度匹配方法引入了根本性矛盾:它需要在完整数据集上训练以生成压缩版本,从而削弱了效率目标。更糟糕的是,这些方法存在高计算开销、在不同GNN架构间泛化差以及对特定模型配置的脆弱依赖。同样令人担忧的是社区对误导性评估协议如节点压缩比的依赖,这些协议未能反映真正的资源节约、压缩开销以及对神经架构搜索的虚假应用。这些不足并非偶然——它们是系统性的,并阻碍了有意义的进展。在本文的立场论文中,我们主张图压缩目前需要重新开始。我们呼吁超越全数据集训练和模型依赖,转而倡导轻量、架构无关且可部署的方法。通过识别关键方法论缺陷并概述具体研究方向,我们旨在将领域重新导向能够实现压缩真正承诺的方法:高效、通用和可扩展的图神经网络训练。

英文摘要

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation -- the task of generating a smaller synthetic graph that retains the performance of models trained on the original -- has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental -- they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.

2605.18721 2026-05-22 cs.LG cs.CL 版本更新

General Preference Reinforcement Learning

通用偏好强化学习

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

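Affiliations * Stanford University; The University of Oklahoma

AI summary This paper proposes General Preference Reinforcement Learning (GPRL), which builds on the General Preference Model (GPM) to bridge online RL, whose programmatic verifiers cannot reach open-ended tasks, and preference optimization, which forgoes continuous exploration. Multi-dimensional preference comparison is used to improve model performance.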

详情
AI中文摘要

训练后将大型语言模型(LLM)对齐分解为两个大致分离的轨道。在线强化学习(RL)通过可验证奖励推动数学和代码的涌现推理,但依赖于无法达到开放任务的程序验证器;而偏好优化处理开放生成任务却牺牲了驱动在线RL的连续探索。弥合这一差距需要一个开放性质量验证器,但标量奖励模型不适合此任务。质量是多维的,任何标量分数都是不完整的代理,使在线RL崩溃于分数最敏感的轴。我们转而采用通用偏好模型(GPM),将响应嵌入到k个斜对称子空间中,并将偏好表示为结构化的、具有不传递性的比较。在此基础上,我们提出通用偏好强化学习(GPRL),将k维结构延伸到策略更新中。GPRL计算每维的组相对优势,对每个优势进行归一化以避免任何轴主导,并通过上下文相关的特征值进行聚合。相同的结构推动了一个闭环漂移监视器,能够检测单轴利用并通过重新加权维度和收紧信任区域进行即时纠正。从Llama-3-8B-Instruct开始,GPRL在AlpacaEval~2.0上达到长度控制的胜利率为56.51%,并在Arena-Hard、MT-Bench和WildBench上优于SimPO和SPPO,通过在长时间训练中抵抗奖励黑客。

英文摘要

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
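The per-dimension normalization described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the GPRL implementation: the function names are hypothetical, and the fixed `weights` argument is only a stand-in for the context-dependent eigenvalues, whose computation the abstract does not specify.

```python
import statistics

def per_dimension_advantages(scores):
    """Standardize each preference dimension within the group.

    scores[i][d] is the score of response i on dimension d. Each dimension
    is normalized on its own scale so that no axis dominates.
    """
    n_dims = len(scores[0])
    columns = [[row[d] for row in scores] for d in range(n_dims)]
    normalized = []
    for col in columns:
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        normalized.append([(x - mean) / (std + 1e-8) for x in col])
    # Transpose back to one advantage vector per response.
    return [list(adv) for adv in zip(*normalized)]

def aggregate(advantages, weights):
    """Weighted average of per-dimension advantages (weights are assumed)."""
    total = sum(weights)
    return [
        sum(w * a for w, a in zip(weights, adv)) / total
        for adv in advantages
    ]

# Three responses scored on two dimensions with very different raw scales.
scores = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
adv = per_dimension_advantages(scores)
agg = aggregate(adv, weights=[1.0, 1.0])
```

After standardization, the second dimension no longer outweighs the first despite its raw values being a hundred times larger.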

2605.17659 2026-05-22 cs.LG 版本更新

Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

Bug or Feature²:权重漂移、激活稀疏性与尖峰

Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry, Vladislav Goloshchapov, Evgeny Burnaev

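Affiliations * GitHub

AI summary This paper studies the negative weight drift that arises in modern neural architectures from the interaction between standard losses and positively biased activation functions, analyzes its effect on activation sparsity and model performance, and proposes clipping to resolve the activation-spike problem.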

详情
AI中文摘要

现代神经架构的设计通过逐步经验选择逐渐收敛,但其训练动态的机制仍只部分被理解。我们识别并分析了由标准损失与正偏激活函数相互作用引起的负权重漂移。证明在MSE或交叉熵损失下,正预激活的梯度在初始化时期望非负,驱动下游权重向负值发展。这种漂移是优化固有的,而非数据相关,并在多种架构(MLP、ResNet、ViT、GPT-nano、MP-SENe)和非对称激活函数(ReLU、GELU、SiLU)中持续存在。与ReLU结合,权重漂移产生高达90%的激活稀疏性。我们跨79种配置表征稀疏性-准确率权衡,并识别出稀疏性超过约70%时的准确率断崖。虽然ReLU²在GPT-nano中实现了良好的稀疏性-准确率比,但会病理性放大中间Transformer层的激活尖峰。剪枝可以解决这一问题,同时保留平方的表示优势:剪枝ReLU²优于其未剪枝版本,GELU²在GPT-nano上达到最低验证损失。代码可在https://github.com/On-Point-RND/BugOrFeature获取。

英文摘要

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90\% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above $\sim$70\% activation sparsity. While ReLU$^2$ achieves a good sparsity--accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU$^2$ outperforms its unclipped version, and GELU$^2$ achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
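For readers unfamiliar with the activations named above, here is a minimal scalar sketch. The cap value and the choice to clip after squaring are assumptions made for illustration; the abstract does not state where or at what threshold the authors clip.

```python
def relu_squared(x):
    """ReLU followed by squaring: zero for negative inputs, x**2 otherwise."""
    r = max(x, 0.0)
    return r * r

def clipped_relu_squared(x, cap=6.0):
    """ReLU^2 with its output bounded by `cap` (hypothetical threshold)."""
    return min(relu_squared(x), cap)

def activation_sparsity(pre_activations):
    """Fraction of units whose ReLU-style output is exactly zero."""
    zeros = sum(1 for x in pre_activations if relu_squared(x) == 0.0)
    return zeros / len(pre_activations)
```

Squaring grows large pre-activations quadratically, which is one way to see why an unbounded ReLU² could amplify spikes, and why bounding the output limits them.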

2605.17602 2026-05-22 cs.AI cs.CV cs.LG 版本更新

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I: 一种用于文本到图像对齐的鲁棒基于规则的奖励模型

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

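Affiliations * University of California, Los Angeles

AI summary This paper proposes AutoRubric-T2I, the first rubric learning framework for text-to-image generation, which automatically synthesizes and selects explicit rubrics to guide vision-language model (VLM) judges. It synthesizes candidate rubrics from reasoning traces over preference pairs, then has a VLM judge score paired images under each rubric, producing pairwise rubric-score differences for preference learning. An ℓ1-regularized logistic regression refiner removes noisy and redundant rubrics, yielding high-quality, interpretable reward signals from a small amount of annotated preference data and outperforming existing reward model baselines on several image reward benchmarks.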

Comments 27 pages

详情
AI中文摘要

将文本到图像(T2I)生成模型与人类偏好对齐越来越依赖于图像奖励模型,这些模型根据提示对齐和感知质量对生成图像进行评分或排序。现有的奖励模型通常在大规模人类偏好语料上训练为Bradley-Terry(BT)偏好模型,这使得训练成本高、适应困难且评估标准不透明。同时,视觉语言模型(VLM)法官可以通过文本评分规则提供更细致的评估,但其手动设计或启发式生成的评分规则可能无法可靠地反映人类偏好。在本文中,我们提出AutoRubric-T2I,这是首个用于T2I的规则学习框架,能够自动合成和选择显式规则以指导VLM法官。AutoRubric-T2I首先通过合成偏好对的推理轨迹生成候选规则,然后利用VLM法官在每种规则下对配对图像进行评分,产生配对规则评分差异用于偏好学习。为了去除噪声和冗余规则,我们进一步采用ℓ1正则化逻辑回归精简器,选择Top-N最判别性的规则。广泛评估表明,AutoRubric-T2I在使用不到0.01%的标注偏好数据的情况下,能够生成高质量、可解释的奖励信号,大幅减少了大规模奖励模型训练的需求。在图像奖励基准如MMRB2上,AutoRubric-T2I优于强奖励模型基线。我们进一步验证AutoRubric-T2I作为强化学习奖励在下游T2I任务中的效果,包括TIIF和UniGenBench++,其中它通过流-GRPO管道在扩散模型上提升了生成质量,优于标量奖励模型。

英文摘要

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
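The refiner step can be pictured with a small sketch of ℓ1-regularized logistic regression over pairwise rubric-score differences. Everything here is an assumption made for illustration: the solver (proximal gradient descent with soft-thresholding), the hyperparameters, the function names, and the toy data are not taken from the paper.

```python
import math

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(diffs, prefs, lam=0.01, lr=0.5, steps=200):
    """diffs[i][j]: score difference of pair i under rubric j.

    prefs[i] is 1 if the first image of pair i was preferred, else 0.
    """
    n, d = len(diffs), len(diffs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(diffs, prefs):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        # Gradient step on the logistic loss, then L1 shrinkage.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

def top_rubrics(weights, n):
    """Indices of the n largest-magnitude rubrics with non-zero weight."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [j for j in ranked[:n] if weights[j] != 0.0]

# Rubric 0 tracks the preference label; rubric 1 is uninformative.
diffs = [[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]]
prefs = [1, 1, 0, 0]
weights = fit_l1_logistic(diffs, prefs)
```

In this toy setting the ℓ1 penalty leaves the uninformative rubric at exactly zero weight, so only the discriminative rubric survives selection.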

2605.17156 2026-05-22 quant-ph cs.LG 版本更新

Sparse Mamba Decoder for Quantum Error Correction: Efficient Defect-Centric Processing of Surface Code Syndromes

稀疏Mamba解码器用于量子纠错:高效处理表面码syndrome的缺陷中心处理

Samira Sayedsalehi, Nader Bagherzadeh, Maxim Shcherbakov, Jean-Luc Gaudiot

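Affiliations * Department of Electrical Engineering and Computer Science

AI summary This paper proposes a defect-centric Sparse Mamba decoder that processes only active detection events, improving the efficiency and accuracy of surface code syndrome processing in quantum error correction and showing significant gains on several benchmarks.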

Comments 22 pages, 7 figures, 10 tables. Neural decoder for surface code quantum error correction. Submitted to Quantum

详情
AI中文摘要

量子纠错(QEC)对于构建容错量子计算机至关重要,需要同时准确、快速且可扩展的解码器。大多数最先进的神经解码器在高准确性方面表现优异,但处理整个密集的syndrome数组,其大小为O(d²R),无论实际错误率如何,其中d是编码距离,R是测量轮次的数量。在物理相关错误率(p ~ 0.1%)下,少于5%的syndrome条目包含活跃的检测事件——然而现有解码器处理整个syndrome体积。我们引入了稀疏Mamba解码器(SMD),一种以缺陷为中心的神经解码器,使用每个缺陷13维的特征表示和Mamba状态空间骨干,仅处理k个活跃的检测事件,实现O(k)复杂度。在去极化、均匀电路级、SI1000和Google Sycamore实验基准上,SMD在SI1000噪声下,d ≤ 5时将MWPM逻辑错误率降低了高达49%,比Tesseract近MLD解码器快95-467倍,比Belief Matching快232-463倍,并在均匀电路级噪声下保持几乎恒定的延迟(24-57 us)。在Sycamore实验数据集上,SMD集合匹配或略微超过Varbanov等人密集的Mamba解码器。所有结果均在商用NVIDIA GPU上获得,参数数量为7.5-16M,无需专用加速器。

英文摘要

Quantum error correction (QEC) is essential for building fault-tolerant quantum computers, requiring decoders that are simultaneously accurate, fast, and scalable. Most state-of-the-art neural decoders achieve high accuracy but process the full dense syndrome array of size $O(d^2 R)$ regardless of the actual error rate, where d is the code distance and R is the number of measurement rounds. At physically relevant error rates (p ~ 0.1%), fewer than 5% of syndrome entries contain active detection events -- yet existing decoders process the entire syndrome volume. We introduce the Sparse Mamba Decoder (SMD), a defect-centric neural decoder that processes only the k active detection events using a 13-dimensional feature representation per defect and a Mamba state-space backbone, achieving $O(k)$ complexity. Across depolarizing, uniform circuit-level, SI1000, and Google Sycamore experimental benchmarks, SMD reduces the MWPM logical error rate by up to 49% at $d \le 5$ under SI1000 noise, runs 95-467x faster than the Tesseract near-MLD decoder and 232-463x faster than Belief Matching, and maintains nearly constant latency (24-57 us) across d = 3-9 under uniform circuit-level noise. On the Sycamore experimental dataset, the SMD ensemble matches or slightly surpasses the dense Mamba decoder of Varbanov et al. All results are obtained on commodity NVIDIA GPUs with 7.5-16M parameters, without specialized accelerators.
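The defect-centric idea can be illustrated by turning a dense syndrome array into a list of active detection events. The array layout and function name are assumptions for this sketch, and only raw coordinates are emitted; the 13-dimensional per-defect features mentioned in the abstract are not reproduced because their definition is not given here.

```python
def active_defects(syndrome):
    """Collect coordinates of fired detection events.

    syndrome[r][y][x] is 1 where a detection event fired in round r
    (assumed layout). Returns a list of (round, row, column) tuples.
    """
    defects = []
    for r, layer in enumerate(syndrome):
        for y, row in enumerate(layer):
            for x, bit in enumerate(row):
                if bit:
                    defects.append((r, y, x))
    return defects

# Two measurement rounds on a 2x2 detector grid, with two events in total.
syndrome = [
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
]
defects = active_defects(syndrome)
```

This scan still touches every entry of the dense array; the saving claimed in the abstract comes afterwards, because the sequence model only sees the k extracted events rather than the full syndrome volume.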

2605.16579 2026-05-22 cs.CV cs.LG 版本更新

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

局部关注,线性记忆:线性注意力作为跨帧记忆用于自回归视频扩散

Kunyang Li, Mubarak Shah, Yuzhang Shang

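Affiliations * Institute of Artificial Intelligence, University of Central Florida

AI summary This paper proposes ARL2, a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state, addressing the scalability bottleneck of autoregressive video diffusion models in long video generation. It achieves linear time complexity and constant memory while improving temporal consistency.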

详情
AI中文摘要

自回归(AR)视频扩散是一种强大的视频生成范式,用于流式和交互式视频生成。然而,其依赖于softmax自注意力机制导致序列长度的二次计算复杂度和内存使用,由于键值缓存,限制了其扩展到长视频时间范围的能力。现有的解决方案(例如稀疏注意力和KV缓存压缩)降低了每步成本,但仍依赖于线性增长的缓存或不可逆地丢弃过去上下文,因此无法解决线性内存增长和流式上下文管理问题。为了解决这一可扩展性瓶颈,我们提出了ARL2(局部关注,线性记忆),一种混合注意力模块,通过将二次跨帧注意力替换为固定大小的递归状态。我们将自注意力分解为两个分支:一个用于空间细节和局部依赖的帧内softmax分支,以及一个用于维护固定大小状态以流式管理上下文的帧间门控线性分支。我们的关键见解是softmax注意力捕捉细粒度的局部交互,而递归状态提供可控的长程记忆。这种设计实现了线性时间复杂度和常数内存消耗,同时在全softmax模型上提高了时间一致性。为防止噪声中间状态破坏记忆,我们只在去噪步骤后更新递归状态。为了避免帧内信息不对称,所有token共享相同的预更新状态,而不是按顺序更新。据我们所知,这是首次将预训练的AR视频扩散模型转换为混合线性注意力架构的工作,通过一种高效的两阶段训练方案实现AR视频的训练。在75%的层被替换为混合线性注意力的情况下,模型实现了高达2.26倍的时钟时间加速和54%的内存减少,同时保持与改进的时间一致性相当的质量。

英文摘要

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26x wall-clock speedup and 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
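A fixed-size recurrent state of the kind described above can be sketched with plain lists. The scalar gate, the outer-product update, and the function names are assumptions chosen to show the general linear-attention pattern; the abstract does not give ARL2's exact update rule.

```python
def update_state(state, keys, values, gate):
    """Return gate * S plus the sum of k v^T over one finished frame.

    state is a d_k x d_v matrix; its size never grows with video length.
    """
    d_k, d_v = len(state), len(state[0])
    new_state = [[gate * state[i][j] for j in range(d_v)] for i in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                new_state[i][j] += k[i] * v[j]
    return new_state

def read_state(state, query):
    """Read memory for one token: o = S^T q.

    Every token of the current frame reads the same pre-update state.
    """
    d_k, d_v = len(state), len(state[0])
    return [sum(query[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]

# Write one frame, read it back, then write a second frame with decay 0.5.
s0 = [[0.0, 0.0], [0.0, 0.0]]
s1 = update_state(s0, keys=[[1.0, 0.0]], values=[[2.0, 3.0]], gate=0.5)
out = read_state(s1, [1.0, 0.0])
s2 = update_state(s1, keys=[[0.0, 1.0]], values=[[1.0, 1.0]], gate=0.5)
```

Calling `update_state` once per frame, after that frame has been fully denoised, mirrors the abstract's point that noisy intermediate states should not be written into memory.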

2605.16362 2026-05-22 cs.LG cs.AI 版本更新

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

当秩-1引导廉价时是什么情况?几何学、粒度和预算化搜索

John T. Robertson, Jianing Zhu, Haris Vikalo, Zhangyang Wang

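Affiliations * The University of Texas at Austin

AI summary This paper studies why the effectiveness of rank-1 steering varies across concepts, identifies granularity and geometry as the key factors behind steering cost, and introduces the GRACE framework for optimizing steering efficiently.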

Comments Updated Abstract metadata

详情
AI中文摘要

激活引导提供了一种无需重新训练即可控制大语言模型的轻量方法,但其效果在不同概念上变化显著。先前研究通常将这种变化视为许多概念无法由单一引导方向捕捉的证据。我们主张这种变化更多反映了搜索难度:有用的秩-1干预通常存在,但找到它可能成本高昂。我们正式将秩-1引导定义为在干预层和系数上的预算约束优化。在不同概念和模型家族中,提示边界方向对齐预测有效干预的位置,使几何引导搜索能够以更少的评估达到高效用,平均减少39.8%的试验次数以恢复95%的最佳效用。为解释为何某些概念即使在更好的搜索下仍昂贵,我们引入了粒度,即对比上下文中方向异质性的度量。粒度区分了差异向量共享稳定全局方向的概念,与提示在每个输入中局部一致但最优方向系统性旋转的概念。更高的粒度与更慢的收敛速度和更低的最佳效用相关(相关系数分别为0.44和-0.46,p<0.001)。我们提出了GRACE框架,一个粒度和表征意识的概念工程框架,利用激活几何学来诊断引导难度的主要来源,选择适当的解决方案,并高效分配优化努力。我们的结果将框架从“秩-1何时失败?”转变为“秩-1何时廉价且稳定?”,使激活几何学从描述性工具转变为LLM控制的可操作先验。

英文摘要

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.

2605.15588 2026-05-22 cs.CL cs.LG 版本更新

Calibrating LLMs with Semantic-level Reward

通过语义层面奖励校准大型语言模型

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

Affiliations * Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA; Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California, USA; Department of Statistics, Stanford University, Stanford, California, USA

AI summary This paper proposes CSR, a new calibration framework that calibrates language models directly in semantic space, avoiding the inconsistency that verbalized confidence causes in existing methods. Experiments show that CSR lowers ECE and raises AUROC across multiple datasets.

详情
AI中文摘要

随着大型语言模型(LLMs)被应用于医疗问答和法律推理等关键领域,估计其输出正确性的能力对于安全可靠使用至关重要,要求模型具有良好的校准能力。标准的可验证奖励强化学习(RLVR)通过二元正确性奖励训练模型,但该奖励对置信度不敏感,无法对自信但错误的预测施加惩罚,从而降低校准效果。最近的研究通过训练模型生成带有词汇化置信度的置信分数并奖励与正确性的同意来解决这一问题。然而,词汇化置信度在语义相同但文本变化时表现出不一致性。我们提出Calibration with Semantic Reward(CSR),一种在语义空间中直接校准语言模型的框架,无需词汇化置信度接口。CSR结合了正确性奖励和一种新的语义校准奖励,通过促进正确路径中的语义一致性和不正确路径中的探索来鼓励利用和探索。在HotpotQA(在分布)和TriviaQA、MSMARCO、NQ-Open(不在分布)三个模型家族上的实验表明,CSR在几乎所有设置中都比词汇化置信度基线实现了更低的ECE和更高的AUROC,ECE减少高达40%,AUROC提高高达31%,校准行为在所有四个评估设置中均表现出良好的鲁棒性。

英文摘要

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\%$ and improving AUROC by up to $31\%$, with calibration behavior generalizing robustly across all four evaluation settings.
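Since the results above are reported in terms of ECE, a minimal sketch of the standard equal-width-bin Expected Calibration Error may help. The bin count and function name are illustrative choices, and how CSR derives its confidence values is not shown because the abstract does not specify it.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences are probabilities in [0, 1]; correct holds 1 for a right
    answer and 0 for a wrong one.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / n * abs(avg_conf - accuracy)
    return ece

# Overconfident model: 90% confidence, but only 3 of 4 answers are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Well-calibrated model: 80% confidence, 4 of 5 answers are right.
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
```

A binary correctness reward alone would treat both models' right answers identically, which is the indifference to confidence the abstract criticizes.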

2605.15505 2026-05-22 cs.AI cs.IR cs.LG 版本更新

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

X-SYNTH:超越检索——从观察到的数字人类注意力中提取企业上下文

Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

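Affiliations * Workfabric AI

AI summary This paper proposes X-SYNTH, a framework that tackles enterprise context synthesis by analyzing behavioral patterns in digital human attention. Its core approach synthesizes context from behavioral patterns rather than conventional retrieval, substantially raising the True Lead Rate and lowering the False Lead Rate.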

Comments 11 pages, 7 figures, 5 tables

详情
AI中文摘要

在企业运营中,AI代理任务所需上下文分散在记录系统、静态信息存储和通信渠道中。所存储的是系统状态,这是工作实际发生情况的损失性表示。现有的方法通过匹配请求内容来检索存储的信息;对于狭窄请求,这种方法效果良好。但合成质量依赖于了解应展示什么以及如何解释它:这涉及每个组织、团队和个人特有的知识,存在于行为模式中,而不在任何检索索引中。对于提出对企业有价值的线索给销售员的代理任务,这种方法失效:真正的线索率低,假线索率高,且模型没有改进机制。我们提出了X-SYNTH,一个基于数字人类注意力的框架,这种注意力是每个工人的可数字化交互特征,编码了他们做了什么、按什么顺序做,以及隐含的奖励信号。在没有外部标签的情况下,可以区分出导致积极结果的先前行为轨迹与未导致积极结果的轨迹。X-SYNTH将每个个体的行为基线建模为数字双胞胎签名(DTS),并根据个体和查询选择七种注意力过滤器:比例、反比、微分、递归、比较、顺序和集体,以识别因果相关的活动签名。一个四阶段的管道将基于行为模式的排名上下文组装起来,而不是查询嵌入。一个前沿模型在无辅助的情况下实现了9.5%的真实线索率(TLR)和90.5%的假线索率(FLR)。在加入X-SYNTH后,TLR上升到61.9%(6.5倍),而FLR下降到18.8%。企业上下文合成不是检索问题,而是相关性问题,而数字人类注意力是其最可靠的地面真实值。

英文摘要

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise-valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

2605.12836 2026-05-22 cs.LG 版本更新

Discrete Stochastic Localization for Non-autoregressive Generation

非自回归生成的离散随机定位

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

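AI summary This paper proposes a continuous-state framework that performs discrete stochastic localization with unit-sphere token embeddings to improve distributional faithfulness in discrete sequence generation, and shows improved MAUVE scores on OpenWebText.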

Comments This work was intended as a replacement of arXiv:2602.16169 and any subsequent updates will appear there

Abstract

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps, without distillation or retraining.
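
One way to see why unit-norm embeddings can make the Bayes-optimal denoiser insensitive to the noise level is the sketch below. It assumes the standard stochastic-localization observation $y_t = t\,x + \sqrt{t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, a prior $\pi_i$ over token embeddings $e_i$, and $\|e_i\| = 1$. The symbols and the exact channel form are assumptions made for illustration and are not taken from the abstract.

```latex
\begin{aligned}
p(x = e_i \mid y_t)
  &\propto \pi_i \exp\!\left(-\frac{\|y_t - t\,e_i\|^2}{2t}\right) \\
  &= \pi_i \exp\!\left(-\frac{\|y_t\|^2}{2t} + \langle y_t, e_i\rangle - \frac{t\,\|e_i\|^2}{2}\right) \\
  &\propto \pi_i \exp\!\big(\langle y_t, e_i\rangle\big)
  \qquad \text{since } \|e_i\| = 1 .
\end{aligned}
```

Under these assumptions the posterior over tokens is a softmax of the inner products with the observation and has no explicit dependence on $t$, which is one plausible reading of the invariance claim.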

2605.12623 2026-05-22 cs.CL cs.CV cs.LG Version update

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

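Affiliations * MBZUAI; IBM Research

AI summary This paper presents DocAtlas, a framework that builds high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Its dual pipelines produce precise structural annotations, and the paper shows that Direct Preference Optimization is effective for multilingual adaptation, improving both in-domain and out-of-domain accuracy.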

Comments Under submission

Abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.
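
For readers unfamiliar with DPO, the standard per-pair loss can be sketched as follows. This is the generic DPO objective, not the DocAtlas training code. Pairing the rendering-derived ground truth as the "chosen" output with some other output as the "rejected" one is an assumption about how the positive signal is used, and the function name, arguments, and `beta` value are hypothetical.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen output than the reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), computed stably as softplus(-beta * margin).
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree the loss equals log 2, and it falls as the policy shifts probability toward the chosen output.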

2605.10067 2026-05-22 cs.LG cs.AI Version update

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang, Xuelong Li

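Affiliations * University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI), China Telecom

AI summary This paper proposes Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial partially observable Markov decision process to make adversarial testing more efficient and effective, while structured feedback and transparent reasoning traces improve interpretability. Experiments show that Metis achieves a higher attack success rate at lower token cost across a range of models.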

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

2605.09273 2026-05-22 cs.LG Version update

Instance-Adaptive Online Multicalibration

Zhiming Huang, Jamie Morgenstern, Aaron Roth, Claire Jie Zhang

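Affiliations * Paul G. Allen School of Computer Science and Engineering, University of Washington; Department of Computer and Information Sciences, University of Pennsylvania

AI summary This paper proposes an efficient instance-adaptive algorithm for online multicalibration that adaptively refines a dyadic grid of prediction values to interpolate between benign and worst-case sequences, recovering the worst-case-optimal rate while adapting to easier instances.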

Comments We tightened the analysis and added a comparison to the concurrent work of Liu et al. (arXiv:2605.11490)

Abstract

We study online multicalibration beyond the worst-case. We give a single, efficient algorithm which dynamically interpolates between benign and worst-case sequences by adaptively refining a dyadic grid of prediction values. Its error is controlled by the number of leaves in the refinement tree. Our analysis recovers the known $\widetilde O(T^{2/3})$ worst-case-optimal rate for online multicalibration, while simultaneously automatically adapting to easier instances: in the marginal stochastic setting it obtains a rate of $\widetilde O(\sqrt T)$, and for piecewise-stationary means with $J$ segments its rate is $\widetilde O(\sqrt{JT})$. More generally, the rate depends on a threshold-complexity measure of the predictable mean process relative to the group family. We show that this dependence is tight up to logarithmic factors.

2605.01466 2026-05-22 cs.CV cs.LG Version update

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

Zhaoyang Li, Zhichao You, Tianrui Li

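Affiliations * School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China

AI summary This paper proposes SplAttN, which uses Gaussian soft splatting and attention to connect the 2D and 3D modalities in point cloud completion. It addresses the Cross-Modal Entropy Collapse caused by conventional hard projection and makes the cross-modal connection easier to learn.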

Comments Accepted as a Spotlight paper at ICML 2026; camera-ready version

Abstract

Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.
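
The contrast between hard projection and soft splatting can be illustrated with a toy example. The sketch below is not the authors' implementation: it starts from points that are already in normalized 2D image coordinates, uses a fixed isotropic Gaussian, and all names, the grid size, and `sigma` are hypothetical.

```python
import math

def hard_project(points, size):
    """Write each 2D point (coordinates in [0, 1)) into its single containing pixel."""
    img = [[0.0] * size for _ in range(size)]
    for x, y in points:
        img[int(y * size)][int(x * size)] = 1.0
    return img

def soft_splat(points, size, sigma=1.5):
    """Spread each point over the grid with an isotropic Gaussian (sigma in pixels)."""
    img = [[0.0] * size for _ in range(size)]
    for x, y in points:
        cx, cy = x * size, y * size
        for r in range(size):
            for c in range(size):
                d2 = (c + 0.5 - cx) ** 2 + (r + 0.5 - cy) ** 2
                img[r][c] += math.exp(-d2 / (2 * sigma ** 2))
    return img

def support(img, eps=1e-3):
    """Fraction of pixels whose value exceeds eps."""
    flat = [v for row in img for v in row]
    return sum(v > eps for v in flat) / len(flat)

points = [(0.1, 0.2), (0.5, 0.5), (0.8, 0.3)]  # hypothetical projected points
hard = support(hard_project(points, 16))       # only 3 of 256 pixels are touched
soft = support(soft_splat(points, 16))         # a much larger share of the grid
```

The Gaussian weights vary smoothly with the point coordinates, so in an autograd framework gradients can reach the points, whereas the integer rounding in the hard projection blocks them.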

2605.00414 2026-05-22 cs.LG cond-mat.stat-mech cs.AI Version update

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

Sai Niranjan Ramachandran, Suvrit Sra

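Affiliations * School of Computation, Information and Technology, Technical University of Munich, Germany; Munich Center for Machine Learning (MCML)

AI summary This paper unifies decision trees and diffusion models by establishing a mathematical correspondence between hierarchical decision trees and diffusion processes, revealing a shared optimization principle, Global Trajectory Score Matching. It presents two practical instantiations: treeflow, which achieves competitive generation quality on tabular data with a computational speedup, and dsmtree, which transfers hierarchical decision logic into neural networks and comes close to teacher performance on many benchmarks.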

Comments 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026

Abstract

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a 2x computational speedup, and dsmtree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

2605.00185 2026-05-22 cs.LG cs.AI Version update

Fair Dataset Distillation via Cross-Group Barycenter Alignment

Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Bissan Ghaddar, Boyu Wang

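Affiliations * Western University; Vector Institute; IE University; Ivey Business School

AI summary This paper studies fairness problems in dataset distillation that arise because demographic groups exhibit distinct predictive patterns, and proposes cross-group barycenter alignment to reduce the resulting bias across groups and improve fairness.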

Comments Accepted by ICML 2026

Abstract

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation. Code is available at https://github.com/mhmoslemi/COBRA.

2604.24514 2026-05-22 cs.LG Version update

SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

Xinrun Wang, Deshun Xia, Yuxi Sun, Weijie Zhu

Affiliations * School of Computer Science, China University of Geosciences (Wuhan); School of Information Engineering, Wuhan University of Technology; School of Mathematics and Statistics, Wuhan University of Technology

AI summary This paper proposes SceneSelect, a scene-centric selective learning method that dynamically routes each input to the most suitable expert model to improve the accuracy and efficiency of trajectory prediction.

Comments This paper has been accepted by ICIC 2026

Abstract

Accurate trajectory prediction is fundamentally challenging due to high scene heterogeneity: the severe variance in motion velocity, spatial density, and interaction patterns across different real-world environments. However, most existing approaches train a single unified model, expecting a fixed-capacity architecture to generalize universally across all possible scenarios. This conventional model-centric paradigm is fundamentally flawed when confronting such extreme heterogeneity, inevitably leading to a severe generalization gap, degraded accuracy, and massive computational waste. To overcome this bottleneck, rather than refining restricted model-centric architectures, we propose selective learning, a novel scene-centric paradigm. It explicitly analyzes the characteristics of the underlying scene to dynamically route inputs to the most appropriate expert models. As a concrete implementation of this paradigm, we introduce SceneSelect. Specifically, SceneSelect utilizes unsupervised clustering on interpretable geometric and kinematic features to discover a latent scene taxonomy. A highly decoupled classification module is then trained to assign real-time inputs to these scene categories, and a highly extensible, plug-and-play scheduling policy automatically dispatches the trajectory sequence to the optimal expert predictor. Crucially, this decoupled design ensures excellent generalization capabilities, allowing seamless integration with different off-the-shelf models and robust adaptation across new datasets without requiring computationally expensive joint retraining. Extensive experiments on three public benchmarks (ETH-UCY, SDD, and NBA) demonstrate that our method consistently outperforms strong single-model and ensemble baselines, achieving an average improvement of 10.5%, showcasing the effectiveness of scene-aware selective learning.

2604.14084 2026-05-22 cs.LG cs.AI Version update

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

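AI summary This study asks which tokens carry the most useful learning signal in on-policy knowledge distillation, proposes a two-axis taxonomy over student entropy and teacher-student divergence, and shows experimentally that distilling on a small subset of tokens is effective under limited memory.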

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
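
A selection rule that combines the two regions described above might look like the following sketch. It is not the authors' code: the thresholds, the use of KL(teacher || student) as the divergence, and the function names are all assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary (q > 0 where p > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def select_tokens(student, teacher, h_thresh=0.5, d_thresh=0.5):
    """Keep positions where the student is uncertain, or confident but in disagreement."""
    keep = []
    for i, (s, t) in enumerate(zip(student, teacher)):
        if entropy(s) >= h_thresh or kl(t, s) >= d_thresh:
            keep.append(i)
    return keep

# Hypothetical next-token distributions at three positions.
student = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3], [0.98, 0.01, 0.01]]
teacher = [[0.97, 0.02, 0.01], [0.5, 0.25, 0.25], [0.02, 0.97, 0.01]]
kept = select_tokens(student, teacher)
```

Position 0 is dropped (confident and in agreement), position 1 is kept for its high entropy, and position 2 is kept because the student is confident but disagrees with the teacher, a case that an entropy-only rule would miss.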

2604.12325 2026-05-22 cs.LG cs.AI Version update

Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa

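Affiliations * School of EECS, Washington State University, Pullman, WA, USA

AI summary This paper proposes OptBias, a meta-learning framework that generates synthetic tasks to tackle black-box optimization from small offline datasets, learning a reusable optimization bias that improves performance in small-data regimes.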

Comments Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

Abstract

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.

2604.08872 2026-05-22 cs.LG cond-mat.dis-nn cond-mat.stat-mech Version update

How does Chain of Thought decompose complex tasks?

Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari

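Affiliations * University of Pennsylvania

AI summary This paper studies how chain of thought decomposes complex tasks. It finds that splitting a task into a sequence of smaller classification problems can substantially reduce prediction error, and identifies a critical threshold for the degree above which an optimal depth exists.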

Abstract

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
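
A toy calculation shows how an interior optimal depth can arise from a power-law error. The model below is an assumption made for illustration and is not the paper's analysis: each step over `degree` classes fails with probability `c * degree**alpha` (capped at 1), steps fail independently, and the constants `c` and `alpha` are hypothetical.

```python
def tree_error(num_classes, depth, c=0.05, alpha=0.5):
    """Overall error of a depth-`depth` decomposition under a toy power-law model."""
    degree = num_classes ** (1.0 / depth)        # classes per step
    step_error = min(1.0, c * degree ** alpha)   # assumed power-law error per step
    return 1.0 - (1.0 - step_error) ** depth     # the chain fails if any step fails

N = 4096
errors = {d: tree_error(N, d) for d in (1, 2, 3, 4, 6, 12)}
best_depth = min(errors, key=errors.get)
```

With these constants the error falls from depth 1 to depth 4 and rises again for deeper trees, because each extra step adds another chance to fail even though the individual steps get easier.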

2604.08571 2026-05-22 cs.LG cs.AI cs.CL Version update

Robust Reasoning Benchmark

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

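Affiliations * University of Toronto; Vector Institute

AI summary This study introduces the Robust Reasoning Benchmark (RRB), which evaluates 8 state-of-the-art models under 13 deterministic textual perturbations. Claude refuses many transformed prompts, while open-weights models show several failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with average accuracy drops of up to 54%. The study also isolates attention dilution caused by a model's own chain-of-thought, which it terms Intra-Query Attention Dilution: intermediate reasoning steps appear to pollute standard dense attention, suggesting that future architectures need explicit contextual resets for reliable reasoning.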

Abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

2603.21743 2026-05-22 cs.LG q-bio.QM Version update

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy

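Affiliations * Stanford University

AI summary This paper proposes CellFluxRL, which post-trains a virtual cell model with reinforcement learning so that its generations better respect constraints on biological function, structural validity, and morphological correctness, making virtual cell modeling more biologically meaningful.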

Abstract

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

2603.21717 2026-05-22 cs.LG Version update

Uncertainty-Aware Distribution-to-Distribution Flow Matching for Scientific Imaging

Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox

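Affiliations * Stanford University

AI summary This paper proposes an uncertainty-aware distribution-to-distribution flow matching method for scientific imaging. Stochastic Flow Matching improves generalization under distribution shift, while Bayesian Stochastic Flow Matching and AVUQ (Antithetic Variance-reduction Uncertainty Quantification) estimate epistemic and aleatoric uncertainty to detect unreliable generations.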

Abstract

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires reliability, or generalization across labs, devices, and experimental conditions, and accountability, or detecting out-of-distribution cases where predictions may be unreliable. We leverage Stochastic Flow Matching (SFM), a marginal-preserving stochastic extension of flow matching for improved generalization under distribution shift. SFM augments deterministic flows with a diffusion term together with a learned score-based drift correction, retaining the learned transport marginals while modeling conditional variability. Building on this SFM framework, we introduce Bayesian Stochastic Flow Matching (BSFM) as a companion uncertainty quantification mechanism and develop AVUQ (Antithetic Variance-reduction Uncertainty Quantification) to approximately estimate epistemic and aleatoric uncertainty via sample-efficient antithetic sampling with approximate posterior inference. We further use AVUQ to yield anomaly scores for unreliable generation detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse unseen scenarios show that SFM improves generalization while AVUQ provides effective uncertainty-based anomaly scores under practical sampling budgets.
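
The variance-reduction idea behind antithetic sampling can be shown on a one-dimensional Monte Carlo estimate. This is the generic technique only, not the AVUQ estimator. The test function, sample sizes, and seed are hypothetical.

```python
import math
import random
import statistics

def f(z):
    return math.exp(0.5 * z)  # hypothetical monotone test function

def plain_estimate(rng, n_pairs):
    """Average f over 2 * n_pairs independent standard normal draws."""
    return statistics.fmean([f(rng.gauss(0, 1)) for _ in range(2 * n_pairs)])

def antithetic_estimate(rng, n_pairs):
    """Average f over n_pairs mirrored draws (z, -z): same number of evaluations."""
    total = 0.0
    for _ in range(n_pairs):
        z = rng.gauss(0, 1)
        total += 0.5 * (f(z) + f(-z))
    return total / n_pairs

rng = random.Random(0)
plain = [plain_estimate(rng, 20) for _ in range(500)]
anti = [antithetic_estimate(rng, 20) for _ in range(500)]
var_plain = statistics.variance(plain)
var_anti = statistics.variance(anti)
```

Because `f(z)` and `f(-z)` are negatively correlated for a monotone `f`, the mirrored estimator has lower variance than the independent one for the same number of function evaluations, which is what makes the approach sample-efficient.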

2603.21610 2026-05-22 cs.LG cs.AI stat.ML Version update

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

Abdou-Raouf Atarmla

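Affiliations * Institut National des Postes et Télécommunications; Togo DataLab; Ministry of Digital Economy

AI summary This paper proposes Rule-State Inference (RSI), a Bayesian framework that addresses three structural challenges of compliance monitoring in rule-governed domains: no labeled outcomes at deployment, strategically missing observations from non-compliant entities, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through variational inference with exact coordinate-ascent updates.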

Comments 18 pages. Experimental validation forthcoming

Abstract

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.
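
The "posterior becomes the next prior" recursion mentioned for Sequential RSI has the same shape as a textbook Kalman filter. The sketch below is a generic scalar filter with random-walk dynamics, not the RSI model, whose latent state has several components. The noise variances, initial belief, and observations are hypothetical.

```python
def kalman_step(mean, var, obs, process_var=0.01, obs_var=0.25):
    """One predict-update cycle of a scalar Kalman filter with random-walk dynamics."""
    # Predict: last period's posterior becomes this period's prior, widened by drift.
    prior_mean, prior_var = mean, var + process_var
    # Update: blend the prior with the new observation according to the Kalman gain.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

mean, var = 0.5, 1.0            # hypothetical initial belief about a compliance level
for obs in [0.8, 0.75, 0.9]:    # hypothetical observations, one per regulatory period
    mean, var = kalman_step(mean, var, obs)
```

Each period narrows the variance and pulls the mean toward the observations, while the process variance keeps the filter responsive to drift between periods.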

2603.20228 2026-05-22 math.OC cs.LG 版本更新

Compact Lifted Relaxations for Low-Rank Optimization

紧凑的提升松弛方法用于低秩优化

Ryan Cory-Wright, Jean Pauphilet

发表机构 * Department of Analytics, Marketing and Operations, Imperial Business School(分析、营销与运营部,帝国商业学院) Management Science and Operations, London Business School(管理科学与运营,伦敦商业学院)

AI总结 本文为秩约束二次优化问题提出了易处理的紧凑凸松弛方法,通过引入提升半正定松弛,避免了传统方法中所需的谱结构项,并通过对冗余块的分析得到更紧凑的松弛形式,同时引入了新的有效不等式(投影割)以增强低秩松弛效果,适用于矩阵补全和降秩回归等问题。

Comments Part of this material previously appeared in arXiv:2501.02942v2, which was split into this paper and arXiv:2501.02942v3

详情
AI中文摘要

我们为n×m矩阵上的秩约束二次优化问题开发了易处理的凸松弛,在这类问题中,通常只有当目标函数或约束具有谱结构时才存在易处理的松弛。我们推导了不需要此类谱项的提升半正定松弛。尽管直接提升会引入维度为n² + nm + 1的大型半正定约束,我们证明了矩矩阵的许多块是冗余的,并推导出等价的紧凑松弛,仅涉及两个半正定约束,维度分别为nm + 1和n + m。我们还为低秩问题推导了一类新的有效不等式,称为投影割,它利用了低秩矩阵的线性像会继承秩约束这一事实,显著增强了我们的低秩松弛。对于矩阵补全和降秩回归等问题,我们利用额外的结构得到更紧凑的形式,其中半正定矩阵的维度至多为低秩决策矩阵两个维度之和(即大小至多为n + m)。总体而言,我们为一大类低秩二次问题获得了可扩展的半正定界。

英文摘要

We develop tractable convex relaxations for rank-constrained quadratic optimization problems over $n \times m$ matrices, a setting for which tractable relaxations are typically only available when the objective or constraints admit spectral structure. We derive lifted semidefinite relaxations that do not require such spectral terms. Although a direct lifting introduces a large semidefinite constraint in dimension $n^2 + nm + 1$, we prove that many blocks of the moment matrix are redundant and derive an equivalent compact relaxation that only involves two semidefinite constraints of dimension $nm + 1$ and $n+m$, respectively. We also derive a new class of valid inequalities for low-rank problems, which we call projection cuts, that exploit the fact that rank constraints are inherited by linear images of a low-rank matrix, to strengthen our low-rank relaxations substantially. For matrix completion and reduced-rank regression problems, among others, we exploit additional structure to obtain even more compact formulations involving semidefinite matrices of dimension at most the sum of the two dimensions of the low-rank decision matrix (i.e., of size at most $n+m$). Overall, we obtain scalable semidefinite bounds for a broad class of low-rank quadratic problems.

2603.16077 2026-05-22 cs.LG 版本更新

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

MDM-Prime-v2:二进制编码和索引洗牌使扩散语言模型能够扩展

Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) NVIDIA AI Technology Center(NVIDIA AI技术中心) National Taiwan University(国立台湾大学)

AI总结 本文提出MDM-Prime-v2,通过二进制编码和索引洗牌技术改进扩散语言模型,解决了子分词器功能形式与BPE分词器结合导致的交叉熵损失增加以及子分词器粒度超参数选择缺乏工具的问题,从而提升了模型在常识推理基准上的零样本准确率。

详情
英文摘要

Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we find that the functional form of the subtokenizer significantly increases the cross-entropy loss in the objective when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. Second, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. To address these limitations, we analyze the optimal design of the subtokenizer that minimizes MDM-Prime training objective and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our analysis characterizes how token granularity and sub-token entropy influence the training objective and downstream performance, providing principled criteria for subtokenizer design. When extending the model size to 1.1B parameters, MDM-Prime-v2 demonstrates superior average zero-shot accuracy across eight commonsense reasoning benchmarks, outperforming similar-sized baselines including GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.

2603.02604 2026-05-22 cs.LG 版本更新

Heterogeneous Agent Collaborative Reinforcement Learning

异质智能体协作强化学习

Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

发表机构 * Beihang University(北航) Bytedance China(字节跳动中国) Tsinghua University(清华大学) Peking University(北京大学) Apple(苹果公司)

AI总结 本文提出了一个新的基于可验证奖励的强化学习(RLVR)问题HACRL,让异质智能体在训练期间共享经验证的rollout以实现协同优化,解决了孤立的多智能体同策略(on-policy)优化效率低下的问题,并提出HACPO算法以最大化样本利用率和跨智能体知识迁移。

详情
AI中文摘要

我们提出异质智能体协作强化学习(HACRL),这是一个新的基于可验证奖励的强化学习(RLVR)问题,旨在解决孤立的多智能体同策略(on-policy)优化效率低下的问题。HACRL实现了协同优化与独立执行:异质智能体在训练期间共享经验证的rollout以相互提升,而在推理时独立运行。与基于LLM的多智能体强化学习(MARL)不同,HACRL不需要协同部署;与同策略/异策略蒸馏不同,它实现了异质智能体之间的双向相互学习,而非单向的同质教师到学生迁移。基于这一问题,我们提出HACPO,一种协作式强化学习算法,通过有原则的rollout共享来最大化样本利用率和跨智能体知识迁移。为缓解能力差异和策略分布偏移,HACPO引入了四种定制机制,并对无偏优势估计给出理论保证。在多种异质模型组合和推理基准上的大量实验表明,HACPO一致地提升了所有参与的智能体,平均比使用两倍rollout的GSPO高出3.6%,同时仅使用一半的rollout成本。

英文摘要

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional homogeneous teacher-to-student transfer. Building on this problem, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO with double rollouts by an average of 3.6% while using only half the rollout cost.

2602.23200 2026-05-22 cs.LG cs.CL 版本更新

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

InnerQ: 一种面向硬件的无需调优的KV缓存量化方法用于大语言模型

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本文提出InnerQ,一种面向硬件的KV缓存量化方法,旨在减少解码延迟而不影响评估性能,通过分组量化策略提高数据重用率,从而在Llama和Mistral模型上提升了少样本评估得分。

Comments 18 pages, 5 figures, 7 tables

详情
AI中文摘要

当基于Transformer的语言模型用于文本生成时,大部分推理时间消耗在解码阶段,其中依次生成输出token。因此,减少每个解码步骤的硬件成本对于高效的长上下文生成至关重要。主要瓶颈是键值(KV)缓存,其大小随序列长度增长,通常主导模型的内存足迹。先前工作提出了压缩KV缓存的同时最小化精度损失的量化方法。我们提出了InnerQ,一种面向硬件的KV缓存量化方案,能够在不牺牲评估性能的情况下减少解码延迟。InnerQ通过沿内维对缓存矩阵进行分组实现分组量化。这种分组策略使去量化与向量-矩阵乘法对齐,并在GPU计算单元之间增加数据重用。结果,InnerQ减少了内存访问并加速了去量化,实现了比先前KV缓存量化方法平均快1.3倍,比非量化基线快2.7倍。为了在剧烈压缩下保持精度,InnerQ结合了三种技术:(i) 混合量化,根据局部统计选择对每个组使用对称或非对称量化;(ii) 高精度窗口用于最近的token和注意力sink token以缓解异常值泄漏;(iii) 对key缓存的通道归一化,在prefill期间计算一次并折叠到模型参数中以消除运行时开销。除了减少延迟外,在Llama和Mistral模型上的实验表明,InnerQ还相对于先前的KV缓存量化方法提升了少样本评估得分。

英文摘要

When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units. As a result, InnerQ reduces memory access and accelerates dequantization, achieving an average $1.3\times$ speedup over prior KV cache quantization methods and $2.7\times$ over the non-quantized baseline. To maintain fidelity under aggressive compression, InnerQ incorporates three techniques: (i) hybrid quantization, which chooses symmetric or asymmetric quantization for each group based on local statistics; (ii) high-precision windows for both recent tokens and attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the model parameters to eliminate runtime overhead. Beyond reducing latency, experiments on Llama and Mistral models show that InnerQ also improves few-shot evaluation scores relative to prior KV cache quantization methods.
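To make the group-wise and hybrid quantization ideas concrete, here is a minimal pure-Python sketch. It quantizes contiguous groups of one vector and picks symmetric or asymmetric quantization per group. The selection rule used here (lower squared reconstruction error) is an assumption for illustration; the abstract only says the choice is "based on local statistics". The names `quantize_group` and `hybrid_quantize` are hypothetical.

```python
def quantize_group(values, bits=4, symmetric=True):
    """Quantize one group to `bits`-bit integer codes and return the dequantized floats."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4 bits
        scale = (max(abs(v) for v in values) / qmax) or 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
        return [c * scale for c in codes]
    levels = 2 ** bits - 1                               # e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = ((hi - lo) / levels) or 1.0
    codes = [max(0, min(levels, round((v - lo) / scale))) for v in values]
    return [lo + c * scale for c in codes]               # `lo` acts as the zero point


def hybrid_quantize(row, group_size=4, bits=4):
    """Quantize `row` group by group, choosing the mode per group.

    Assumption: the mode with the lower squared error wins.
    Returns (dequantized_row, list_of_chosen_modes).
    """
    out, modes = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        best = None
        for name, sym in (("symmetric", True), ("asymmetric", False)):
            deq = quantize_group(group, bits, sym)
            err = sum((a - b) ** 2 for a, b in zip(group, deq))
            if best is None or err < best[0]:
                best = (err, name, deq)
        out.extend(best[2])
        modes.append(best[1])
    return out, modes
```

A group with a large offset (all values near 10) benefits from the asymmetric zero point, while a zero-centered group is represented well by the symmetric grid. The hardware alignment, high-precision windows, and channel normalization described in the abstract are not modeled here.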

2602.18600 2026-05-22 cs.LG 版本更新

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

MapTab: MLLMs 是否已准备好在异构图中进行多标准路线规划?

Ziqiao Shang, Lingyue Ge, Zi-Jian Cheng, Shi-Yu Tian, Zhenyu Huang, Wenbo Fu, Weiming Wu, Yang Chen, Xiangwen Zhang, Yulan Hu, Bin Liu, Yu-Feng Li, Lan-Zhe Guo

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院) AMAP, Alibaba Group(阿里集团AMAP) School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院)

AI总结 本文提出MapTab基准测试,用于评估多模态大语言模型在多标准路线规划任务中的综合推理能力,发现当前模型在多模态推理方面存在显著挑战。

详情
AI中文摘要

系统评估多模态大语言模型(MLLMs)对于推进人工通用智能(AGI)至关重要。然而,现有基准测试仍不足以严格评估其在多标准约束下的推理能力。为弥合这一差距,我们引入MapTab,一个专门设计用于通过路线规划任务评估MLLMs的综合多标准推理能力的多模态基准测试。MapTab要求MLLMs感知并结合地图图像中的视觉线索与结构化表格数据中的路线属性(如时间、价格)。该基准测试涵盖两个场景:Metromap,涵盖52个国家160座城市的地铁网络;Travelmap,描绘19个国家的168个代表性旅游景点。总共包含328张图像、196,800个路线规划查询和3,936个问答查询,所有数据均包含4个关键标准:时间、价格、舒适度和可靠性。对15个代表性MLLMs的广泛评估表明,当前模型在多标准多模态推理方面面临重大挑战。值得注意的是,在视觉感知有限的条件下,多模态协作往往不如单模态方法表现优异。我们认为MapTab提供了一个具有挑战性和现实性的测试平台,以推进MLLMs的系统评估。我们的代码可在https://github.com/Ziqiao-Shang/MapTab上获得。

英文摘要

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.

2602.16169 2026-05-22 cs.LG cs.CL 版本更新

Discrete Stochastic Localization for Non-autoregressive Generation

非自回归生成的离散随机定位

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

发表机构 * University of California Riverside(加州大学河滨分校) New York University(纽约大学)

AI总结 本文提出离散随机定位(DSL),一种采用单位球面token嵌入的连续状态框架,其贝叶斯最优去噪器对名义信噪比保持不变,从而提升离散序列生成的分布忠实度,并在OpenWebText上验证了其有效性。

详情
AI中文摘要

连续扩散是非自回归生成的自然框架,但在离散序列生成上通常落后于掩码离散扩散模型(MDMs)。我们认为瓶颈不在于连续性本身,而在于去噪依赖于按时间步索引的噪声区间的表示方式。我们提出离散随机定位(DSL),一种采用单位球面token嵌入的连续状态框架,其贝叶斯最优去噪器在定位信道下对名义信噪比(SNR)保持不变。因此,一个训练好的网络即可支持一整族逐token的SNR路径,端点掩码扩散路径是其特例。用DSL微调预训练的MDLM检查点,可在从T=128到T=1024的所有步数预算下显著提升OpenWebText上的分布忠实度(MAUVE);同一检查点还支持随机顺序自回归采样,以及总步数低至T=48的先连续后离散混合采样器,且无需蒸馏或重新训练。

英文摘要

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps -- without distillation or retraining.

2602.15338 2026-05-22 cs.LG cs.CL 版本更新

Discovering Implicit Large Language Model Alignment Objectives

发现隐式大语言模型对齐目标

Edward Chen, Sanmi Koyejo, Carlos Guestrin

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出Obj-Disco框架,通过自动分解对齐奖励信号为可解释的目标,解决现有方法的不足,验证了框架在多种任务和模型上的鲁棒性,并发现潜在的对齐偏差。

Comments ICML 2026

详情
AI中文摘要

大语言模型(LLM)对齐依赖复杂的奖励信号,这些信号往往掩盖了被激励的具体行为,带来未对齐和奖励黑客的严重风险。现有解释方法通常依赖预定义的评分标准,可能遗漏“未知的未知”,或者无法识别既能全面覆盖模型行为、又对其有因果影响的目标。为解决这些局限,我们提出Obj-Disco框架,它能自动将对齐奖励信号分解为稀疏、加权的人类可解释自然语言目标组合。我们的方法利用迭代贪心算法分析各训练检查点之间的行为变化,识别并验证最能解释残差奖励信号的候选目标。在多种任务、模型规模和对齐算法上的大量评估证明了该框架的鲁棒性。对流行开源奖励模型的实验表明,该框架一致地捕获了超过90%的奖励行为,人类评估进一步证实了这一发现。此外,一项使用开源奖励模型进行对齐的案例研究显示,Obj-Disco能够成功识别与预期行为同时出现的潜在未对齐激励。我们的工作为揭示LLM对齐中的隐式目标提供了关键工具,为更透明、更安全的AI发展铺平道路。

英文摘要

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
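The numeric skeleton of "greedily pick the objective that best explains the residual reward" can be sketched as a matching-pursuit style loop. This is an assumption-laden illustration, not the Obj-Disco procedure: in the paper the candidates are natural language objectives that are proposed and validated, whereas here each candidate is simply a hypothetical vector of per-sample scores, and `greedy_decompose` is a made-up name.

```python
def greedy_decompose(reward, candidates, max_objectives=2):
    """Greedy sparse decomposition of a reward vector.

    reward:     list of per-sample reward values
    candidates: dict mapping objective name -> per-sample score list
    Returns (chosen, explained), where chosen is a list of (name, weight)
    and explained is the fraction of squared reward accounted for.
    """
    residual = list(reward)
    remaining = dict(candidates)
    chosen = []
    for _ in range(max_objectives):
        best_name, best_weight, best_gain = None, 0.0, 0.0
        for name, scores in remaining.items():
            norm = sum(s * s for s in scores)
            if norm == 0:
                continue
            # Least-squares weight of this objective on the current residual.
            weight = sum(r * s for r, s in zip(residual, scores)) / norm
            gain = weight * weight * norm       # drop in squared residual
            if gain > best_gain:
                best_name, best_weight, best_gain = name, weight, gain
        if best_name is None:                   # nothing explains the residual
            break
        scores = remaining.pop(best_name)
        residual = [r - best_weight * s for r, s in zip(residual, scores)]
        chosen.append((best_name, best_weight))
    total = sum(r * r for r in reward)
    explained = 1.0 - sum(r * r for r in residual) / total if total else 0.0
    return chosen, explained
```

The `explained` fraction plays the role of a coverage number in this toy setting; how the paper measures the share of reward behavior captured is not specified in the abstract.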

2602.10445 2026-05-22 cs.IR cs.LG 版本更新

End-to-End Semantic ID Generation for Generative Advertisement Recommendation

端到端语义ID生成用于生成式广告推荐

Jie Jiang, Xinxun Zhang, Enming Zhang, Yuling Xiong, Jun Zhang, Jingwen Wang, Huan Yu, Yuxiang Wang, Hao Wang, Xiao Yan, Jiawei Jiang

发表机构 * Tencent Inc.(腾讯公司) Wuhan University(武汉大学)

AI总结 本文提出UniSID框架,通过端到端优化广告数据中的嵌入和ID,直接将语义信息传递到ID空间,解决传统两阶段压缩方法的不足,并通过多粒度对比学习和基于摘要的广告重建机制提升ID的语义表达能力。

Comments Add the emails

详情
AI中文摘要

生成式推荐(GR)将推荐视为下一个token预测,取得了出色的效果。这一范式依赖语义ID(SIDs)将大规模物品token化为离散序列。现有GR方法主要通过残差量化(RQ)生成SIDs:先将物品编码为嵌入,再量化为离散SIDs。然而,这一范式存在固有局限:1)两阶段压缩导致的目标不一致和语义退化;2)RQ结构固有的误差累积。为解决这些局限,我们提出UniSID,一种用于生成式广告推荐的统一SID生成框架。具体而言,我们从原始广告数据出发,以端到端方式联合优化嵌入和SIDs,使语义信息直接流入SID空间,从而解决两阶段级联压缩范式的固有局限。为捕捉细粒度语义,我们引入多粒度对比学习策略,在各SID层级上对齐不同的物品。最后,我们提出基于摘要的广告重建机制,鼓励SIDs捕捉广告上下文中未显式出现的高层语义信息。实验表明,UniSID一致地优于最先进的SID生成方法,在下游广告场景的Hit Rate指标上比最强基线最多提升4.62%。

英文摘要

Generative Recommendation (GR) has excelled by framing recommendation as next-token prediction. This paradigm relies on Semantic IDs (SIDs) to tokenize large-scale items into discrete sequences. Existing GR approaches predominantly generate SIDs via Residual Quantization (RQ), where items are encoded into embeddings and then quantized to discrete SIDs. However, this paradigm suffers from inherent limitations: 1) Objective misalignment and semantic degradation stemming from the two-stage compression; 2) Error accumulation inherent in the structure of RQ. To address these limitations, we propose UniSID, a Unified SID generation framework for generative advertisement recommendation. Specifically, we jointly optimize embeddings and SIDs in an end-to-end manner from raw advertising data, enabling semantic information to flow directly into the SID space and thus addressing the inherent limitations of the two-stage cascading compression paradigm. To capture fine-grained semantics, a multi-granularity contrastive learning strategy is introduced to align distinct items across SID levels. Finally, a summary-based ad reconstruction mechanism is proposed to encourage SIDs to capture high-level semantic information that is not explicitly present in advertising contexts. Experiments demonstrate that UniSID consistently outperforms state-of-the-art SID generation methods, yielding up to a 4.62% improvement in Hit Rate metrics across downstream advertising scenarios compared to the strongest baseline.
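For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of plain Residual Quantization: each level picks the nearest codeword for whatever the previous levels left unexplained, and the sequence of indices forms the Semantic ID. It illustrates the two-stage baseline only, not UniSID. The codebooks are hand-written hypothetical values, whereas a real system would learn them.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def residual_quantize(vec, codebooks):
    """Map an embedding to a Semantic ID, one index per codebook level.

    Returns (sid, residual), where residual is what the chosen codewords
    still fail to explain after the last level.
    """
    residual, sid = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid), residual
```

Because every level only sees the residual of the levels before it, an early poor choice cannot be undone later, which is one way to read the "error accumulation" the abstract mentions.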

2602.10062 2026-05-22 cs.LG cs.CV 版本更新

Vendi Novelty Scores for Out-of-Distribution Detection

Amey P. Pasarkar, Adji Bousso Dieng

发表机构 * Lewis-Sigler Institute For Integrative Genomics, Princeton University(普林斯顿大学整合基因组学研究所) Department of Computer Science, Princeton University(普林斯顿大学计算机科学系)

AI总结 本文提出了一种基于Vendi Scores的Vendi Novelty Score(VNS)方法,从多样性角度解决分布外检测问题,该方法无需密度建模,具有线性时间复杂度和非参数特性,并在多个图像分类基准上实现了最先进的OOD检测性能。

详情
英文摘要

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
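The idea of scoring novelty by "how much a test sample increases the Vendi Score" can be sketched as follows. Several choices here are assumptions rather than the paper's definitions: the order-2 member of the Vendi Score family (which equals the inverse of the sum of squared entries of K/n and needs no eigendecomposition), an RBF similarity kernel, and a plain difference as the novelty value. The sketch is also a naive quadratic-time computation, not the linear-time method the abstract describes.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Similarity in (0, 1]; equals 1 when x == y."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def vendi_score_q2(points, gamma=1.0):
    """Order-2 Vendi Score: 1 / ||K/n||_F^2 for the similarity matrix K.

    Ranges from 1 (all points identical) to n (all points dissimilar).
    """
    n = len(points)
    total = sum(rbf_kernel(p, q, gamma) ** 2 for p in points for q in points)
    return n * n / total


def novelty(train, x, gamma=1.0):
    """Assumed novelty: change in diversity when x joins the in-distribution set."""
    return vendi_score_q2(train + [x], gamma) - vendi_score_q2(train, gamma)
```

A point far from the training cluster adds nearly one full "effective sample" of diversity, while a point inside the cluster adds almost nothing, so thresholding `novelty` gives a simple detector in this toy setting.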

2602.06264 2026-05-22 cs.LG 版本更新

Swap Regret Minimization Through Response-Based Approachability

通过基于响应的可接近性实现交换遗憾最小化

Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Massachusetts Institute of Technology(麻省理工学院) University of Southern California(南加州大学) Google Research(谷歌研究)

AI总结 本文提出了一种更简单、计算更高效的算法,对经约翰椭球预处理的一般凸集保证O(d√T)的线性交换遗憾,并建立了匹配的信息论下界,表明经典算法在最小化线性交换遗憾方面是存在性最优的,同时将该方法扩展到多项式维度的交换偏离集。

Comments V3 makes certain clarifications and improves the upper bound for general sets via symmetrization

详情
AI中文摘要

我们研究在线优化中最小化不同形式交换遗憾的问题。这些形式的遗憾与博弈中的相关均衡概念紧密相关,并且最近被证明能够保证面对策略性对手时的不可操纵性。目前唯一能在ℝ^d中的一般凸集上最小化线性交换遗憾的计算高效算法,是Daskalakis、Farina、Fishelson、Pipis和Schneider(STOC '25)最近提出的。然而,它的遗憾界为Ω(d⁴√T),远非最优,并且每次迭代都需要计算开销很大的椭球算法调用。本文提出一种显著更简单、计算高效的算法,对经约翰椭球预处理的一般凸集保证O(d√T)的线性交换遗憾。我们的算法利用了Bernstein和Shimkin(JMLR '15)提出的基于响应的可接近性框架(此前在交换遗憾最小化的研究中被忽视),并同时最小化profile交换遗憾,后者最近被证明能保证不可操纵性。此外,我们建立了匹配的信息论下界:即使集合是中心对称的,对于足够大的T,任何学习者的期望线性交换遗憾至少为Ω(d√T)。这也表明,Gordon、Greenwald和Marks(ICML '08)的经典算法在最小化线性交换遗憾方面是存在性最优的,尽管它在计算上效率低下。最后,我们将该方法扩展到相对于多项式维度的交换偏离集最小化遗憾,统一并加强了均衡计算和在线学习方面的近期结果。

英文摘要

We consider the problem of minimizing different notions of swap regret in online optimization. These forms of regret are tightly connected to correlated equilibrium concepts in games, and have been more recently shown to guarantee non-manipulability against strategic adversaries. The only computationally efficient algorithm for minimizing linear swap regret over a general convex set in $\mathbb{R}^d$ was developed recently by Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25). However, it incurs a highly suboptimal regret bound of $\Omega(d^4 \sqrt{T})$ and also relies on computationally intensive calls to the ellipsoid algorithm at each iteration. In this paper, we develop a significantly simpler, computationally efficient algorithm that guarantees $O(d \sqrt{T})$ linear swap regret for a general convex set that has been preconditioned via the John ellipsoid. Our algorithm leverages the powerful response-based approachability framework of Bernstein and Shimkin (JMLR '15) -- previously overlooked in the line of work on swap regret minimization -- and simultaneously minimizes profile swap regret, which was recently shown to guarantee non-manipulability. Moreover, we establish a matching information-theoretic lower bound: any learner must incur in expectation $\Omega(d \sqrt{T})$ linear swap regret for large enough $T$, even when the set is centrally symmetric. This also shows that the classic algorithm of Gordon, Greenwald, and Marks (ICML '08) is existentially optimal for minimizing linear swap regret, although it is computationally inefficient. Finally, we extend our approach to minimize regret with respect to the set of swap deviations with polynomial dimension, unifying and strengthening recent results in equilibrium computation and online learning.

2602.05286 2026-05-22 cs.LG cs.AI 版本更新

HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction

HealthMamba: 一种考虑不确定性的时空图状态空间模型用于有效可靠的医疗设施访问预测

Dahai Yu, Lin Jiang, Rongchao Xu, Guang Wang

发表机构 * Department of Computer Science, Florida State University(佛罗里达州立大学计算机科学系)

AI总结 本文提出HealthMamba,一种考虑不确定性的时空图状态空间模型,用于有效可靠的医疗设施访问预测。该模型包含三个关键组件:统一的时空上下文编码器、新的图状态空间模型GraphMamba以及综合的不确定性量化模块。实验结果显示,HealthMamba在预测准确性和不确定性量化方面分别比现有最佳基线提高了6.0%和3.5%。

Comments IJCAI 2026

详情
AI中文摘要

医疗设施访问预测对于优化医疗资源配置和为公共卫生政策提供依据至关重要。尽管已经采用了先进的机器学习方法以提高预测性能,但现有工作通常将此任务视为时间序列预测问题,而没有考虑不同类型的医疗设施的内在空间依赖性,且在公共紧急情况等异常情况下也无法提供可靠的预测。为了推进现有研究,我们提出了HealthMamba,一种考虑不确定性的时空框架,用于准确且可靠的医疗设施访问预测。HealthMamba包含三个关键组件:(i) 一个统一的时空上下文编码器,融合异构的静态和动态信息,(ii) 一种新的图状态空间模型称为GraphMamba用于分层时空建模,(iii) 一个综合的不确定性量化模块,整合三种不确定性量化机制以实现可靠的预测。我们在四个大规模真实世界数据集上评估了HealthMamba,这些数据集来自加州、纽约、得克萨斯州和佛罗里达州。结果表明,HealthMamba在预测准确性和不确定性量化方面分别比现有最佳基线提高了约6.0%和3.5%。

英文摘要

Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.

2602.03067 2026-05-22 cs.LG cs.AI cs.NA math.NA 版本更新

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

FlashSinkhorn: GPU上的IO感知熵最优传输

Felix X. -F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

发表机构 * Department of Mathematics & Statistics, University at Albany, Albany, NY, USA Department of Mathematics & Statistics, University of North Carolina at Charlotte, Charlotte, NC, USA Department of Computer Science, University at Albany, Albany, NY, USA IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

AI总结 本文提出FlashSinkhorn,一种基于GPU的熵最优传输求解器,将稳定化的对数域Sinkhorn更新改写为逐行的LogSumExp归约,与Transformer注意力使用相同的归一化,从而实现FlashAttention风格的融合和分块处理,显著降低HBM IO并保持线性内存操作。

详情
AI中文摘要

基于Sinkhorn迭代的熵最优传输(EOT)在现代机器学习中应用广泛,但GPU求解器在大规模情况下仍然效率低下。张量化实现因稠密的n×m交互而产生二次的HBM流量;现有的在线后端虽然避免存储稠密矩阵,但仍依赖通用的分块map-reduce归约内核,融合程度有限。我们提出FlashSinkhorn,一种针对平方欧几里得代价的IO感知EOT求解器,它将稳定化的对数域Sinkhorn更新改写为对带偏置点积分数的逐行LogSumExp归约,与Transformer注意力使用相同的归一化。这使FlashAttention风格的融合和分块成为可能:融合的Triton内核通过片上SRAM流式处理分块,并在单次遍历中更新对偶势,显著减少每次迭代的HBM IO,同时保持线性内存操作。我们还提供了用于传输应用的流式内核,支持可扩展的一阶和二阶优化。在A100 GPU上,FlashSinkhorn在点云OT上相对最先进的在线基线实现了最高32倍的前向加速和161倍的端到端加速,并提升了基于OT的下游任务的可扩展性。为便于复现,我们发布了开源实现,网址为https://github.com/ot-triton-lab/flash-sinkhorn。

英文摘要

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn.
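The rewriting of log-domain Sinkhorn updates as LogSumExp reductions over "biased dot-product scores" can be checked on a toy problem. The sketch below is a dense, pure-Python reference under the standard entropic OT formulation with squared Euclidean cost; it is not the fused Triton kernel and has none of its IO behavior. Function names and the parameterization of the potentials are illustrative assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def sinkhorn_log(x, y, a, b, eps=1.0, iters=200):
    """Log-domain Sinkhorn for cost C_ij = |x_i - y_j|^2 with weights a, b.

    Since C_ij = |x_i|^2 + |y_j|^2 - 2 x_i.y_j, each update is a LogSumExp
    over scores 2 x_i.y_j / eps plus a per-column (or per-row) bias, which
    has the same shape as attention logits.
    """
    x2 = [dot(p, p) for p in x]
    y2 = [dot(q, q) for q in y]
    n, m = len(x), len(y)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(iters):
        col_bias = [math.log(b[j]) + (g[j] - y2[j]) / eps for j in range(m)]
        f = [x2[i] - eps * logsumexp(
                 [2.0 * dot(x[i], y[j]) / eps + col_bias[j] for j in range(m)])
             for i in range(n)]
        row_bias = [math.log(a[i]) + (f[i] - x2[i]) / eps for i in range(n)]
        g = [y2[j] - eps * logsumexp(
                 [2.0 * dot(x[i], y[j]) / eps + row_bias[i] for i in range(n)])
             for j in range(m)]
    return f, g


def transport_plan(x, y, a, b, f, g, eps=1.0):
    """Dense plan P_ij = a_i b_j exp((f_i + g_j - C_ij) / eps); for checking only."""
    return [[a[i] * b[j] * math.exp(
                 (f[i] + g[j] - sum((s - t) ** 2 for s, t in zip(x[i], y[j]))) / eps)
             for j in range(len(y))]
            for i in range(len(x))]
```

Materializing the plan is only done here to verify marginals on a tiny example; avoiding that dense matrix is precisely what online and fused solvers are for.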

2602.01935 2026-05-22 cs.LG cs.AI cs.PL 版本更新

LiteCoOp: Lightweight Multi-LLM Shared-Tree Reasoning for Model-Serving Compiler Optimizations

LiteCoOp: 面向模型服务编译器优化的轻量级多LLM共享树推理

Annabelle Sujun Tang, Christopher Priebe, Lianhui Qin, Hadi Esmaeilzadeh

发表机构 * Alternative Computing Technologies (ACT) Lab(替代计算技术实验室) University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出LiteCoOp,一种轻量级框架,将优化搜索树本身作为多LLM协作机制,使异构语言模型能在编译器优化过程中协作,从而在降低编译成本的同时提升性能。

详情
AI中文摘要

LLM引导的编译器优化近来展现出潜力,但现有方法在整个搜索过程中依赖单一的大型LLM,成本高昂,也把较小的模型排除在外。我们提出这样一个研究问题:异构LLM能否在编译器优化过程中协作,同时使编译成本低于由单一大型LLM引导的优化。关键在于,这必须在不引入智能体框架开销的前提下实现,否则会与降低编译成本的目标背道而驰。为兼顾这些相互竞争的目标,我们提出LiteCoOp,一种轻量级框架,它将优化搜索树本身变为多LLM协作的机制,使异构模型无需外部智能体协调即可共享进展。在每个优化步骤中,LiteCoOp查询一个LLM,让它既提出一个编译器变换,又选择下一步要查询的LLM。这些LLM的提议被记录在共享的MCTS树中,因此所有模型虽然是串行调用的,却能参考彼此的决策。共享的MCTS对奖励进行反向传播,使一个模型取得的进展能够影响其他模型之后的决策。这样,MCTS树本身就成为协作推理的机制,避免了模型间通信、繁重的推理轨迹或智能体基础设施。我们用一种LLM感知的UCT来实例化这一想法,它使模型选择偏向较小的LLM以降低成本,同时仍保持编译器性能目标。在多样化的GPU和(CPU)基准测试上,LiteCoOp持续优于单模型基线,并在协作扩展到八个异构LLM时取得最佳结果。这一八模型配置将总编译时间减少1.95倍(1.74倍),将API成本减少4.47倍(4.32倍),并且调用最大模型的次数仅占总调用的23.1%(23.9%),同时展示了协作的可扩展性。

英文摘要

LLM-guided compiler optimization has recently shown promise, but existing approaches rely on a single large LLM throughout search, making them expensive and excluding smaller models. We pose the research question: whether heterogeneous LLMs can collaborate during compiler optimization while reducing compilation cost below optimization guided by a single large LLM. Crucially, this must be achieved without introducing overhead from agentic frameworks, which would run counter to the goal of lower compilation cost. To achieve these competing objectives, we introduce LiteCoOp, a lightweight framework that turns the optimization search tree itself into the mechanism for multi-LLM collaboration, enabling heterogeneous models to share progress without external agentic coordination. At each optimization step, LiteCoOp queries one LLM to both propose a compiler transformation and select the LLM to query at the next step. These LLM proposals are recorded in a shared MCTS tree, so all models are invoked serially and yet are informed by each other's decisions. The shared MCTS backpropagates the rewards, allowing progress made by one model to influence later decisions by others. This makes the MCTS tree the collaborative reasoning mechanism itself, avoiding inter-model communication, heavy reasoning traces, or agentic infrastructure. We instantiate this idea with an LLM-aware UCT that biases model selection toward smaller LLMs to reduce cost while still preserving the compiler performance objective. Across diverse GPU and (CPU) benchmarks, LiteCoOp consistently outperforms single-model baselines, with the best results obtained when scaling collaboration to eight heterogeneous LLMs. This eight-model config reduces total compilation time by 1.95x (1.74x), reduces API cost by 4.47x (4.32x), and invokes the largest model for only 23.1% (23.9%) of total calls while demonstrating collaboration scalability.
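One plausible reading of an "LLM-aware UCT" is the standard UCT score with a penalty that grows with the cost of the model attached to a child node. The sketch below is hypothetical: the abstract does not give the formula, so the linear cost penalty, the `cost_weight` parameter, and the dictionary layout of a child are all assumptions made for illustration.

```python
import math


def llm_aware_uct(mean_reward, visits, parent_visits, model_cost,
                  c=1.4, cost_weight=0.1):
    """UCT score minus an assumed linear penalty on the model's relative cost."""
    if visits == 0:
        return float("inf")                      # always try unvisited children
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + explore - cost_weight * model_cost


def select_child(children, parent_visits):
    """Pick the child (transformation plus model choice) with the best score."""
    return max(
        children,
        key=lambda ch: llm_aware_uct(
            ch["mean_reward"], ch["visits"], parent_visits, ch["cost"]),
    )
```

With equal reward statistics the cheaper model wins the tie, but a sufficiently better reward estimate still lets the larger model be selected, which matches the stated goal of reducing cost while preserving the performance objective.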

2602.01279 2026-05-22 cs.LG 版本更新

Richer Bayesian Last Layers with Subsampled NTK Features

更丰富的贝叶斯最后层与子采样NTK特征

Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Álvaro Cartea, Yarin Gal, Jose Miguel Hernández-Lobato, Kamil Ciosek

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) Oxford-Man Institute, University of Oxford(牛津大学奥克斯曼研究所) OATML, University of Oxford(牛津大学OATML研究所) Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 本文提出了一种改进贝叶斯最后层的方法,通过将神经切线核特征投影到由最后层特征张成的空间中,以更准确地估计不确定性,同时保持计算效率。

Comments Appearing in the Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情
AI中文摘要

贝叶斯最后层(BLL)为神经网络的不确定性估计提供了一种方便且计算高效的方法。然而,由于只对最后一层做贝叶斯处理,忽略了前面各层引入的不确定性,BLL会低估认知(epistemic)不确定性。我们提出一种改进BLL的方法:将神经切线核(NTK)特征投影到由最后层特征张成的空间上。这样既能在后验推断中考虑整个网络的变异性,又保留了标准BLL推断的低计算成本。我们证明,该方法得到的后验方差可证明地不小于标准BLL的后验方差,从而纠正了其低估认知不确定性的倾向。为进一步降低计算成本,我们引入一种均匀子采样方案,用于估计投影矩阵和进行后验推断,并为这两类子采样推导了近似误差界。在图像和表格数据集上的UCI回归、上下文老虎机(contextual bandits)、图像分类和分布外检测任务中,实证评估表明,与标准BLL和有竞争力的基线相比,该方法改进了校准和不确定性估计,同时降低了计算成本。

英文摘要

Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.
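For orientation, the standard BLL baseline that this abstract improves on amounts to Bayesian linear regression on fixed last-layer features. The sketch below shows only that baseline, not the proposed NTK projection; the isotropic Gaussian prior, the noise variance, and the random feature matrix are assumptions chosen for illustration.

```python
import numpy as np

def bll_predictive_variance(Phi, phi_star, noise_var=0.1, prior_var=1.0):
    """Epistemic predictive variance of Bayesian linear regression on features.

    Phi:      (n, d) last-layer features of the training inputs.
    phi_star: (m, d) last-layer features of the test inputs.
    """
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)  # posterior covariance of last-layer weights
    # Quadratic form phi^T cov phi for each test row.
    return np.einsum("nd,de,ne->n", phi_star, cov, phi_star)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))  # stand-in for learned features
near = Phi[:1]                  # a feature vector seen in training
far = 10.0 * Phi[:1]            # the same direction, ten times further out
var_near = bll_predictive_variance(Phi, near)
var_far = bll_predictive_variance(Phi, far)
```

Because the variance is a quadratic form in the last-layer features only, it depends on nothing upstream of that layer, which is the source of the underestimation the abstract describes.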

2601.22365 2026-05-22 cs.DM cs.LG 版本更新

Towards Solving the Gilbert-Pollak Conjecture via Large Language Models

通过大语言模型解决吉尔伯特-波拉克猜想

Yisi Ke, Tianyu Huang, Yankai Shu, Di He, Jingchu Gai, Liwei Wang

发表机构 * School of EECS, Peking University(北京大学信息科学技术学院) School of Mathematical Sciences, Peking University(北京大学数学科学学院) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China(北京大学通用人工智能国家重点实验室) Carnegie Mellon University, Machine Learning Department(卡内基梅隆大学机器学习系)

AI总结 本文提出一种新的AI系统,通过生成受规则约束的几何引理并构建专用函数,以获得更紧的Steiner比下界,展示了大语言模型在高级数学研究中的强大潜力。

Comments 44 pages, 11 figures

详情
AI中文摘要

吉尔伯特-波拉克猜想(Gilbert-Pollak Conjecture),也称为Steiner比猜想,指出对于欧几里得平面中的任意有限点集,Steiner最小树的长度至少是欧几里得最小生成树长度的√3/2≈0.866倍(该比值即Steiner比)。截至1980年代的一系列改进最终将下界提高到0.824,此后三十年未见实质性进展。近来LLM在竞赛级数学问题上表现出色,但其解决开放的研究级问题的潜力在很大程度上仍未被探索。本文提出一种新的AI系统,用于获得更紧的Steiner比下界。我们并不直接提示LLM去解决该猜想,而是让它们生成受规则约束的几何引理,并以可执行代码的形式实现。这些引理随后用于构建一组专用函数(我们称之为验证函数),从而得到理论上可认证的Steiner比下界。通过由反思驱动的逐步引理细化,该系统确立了Steiner比的新认证下界0.8559。整个研究过程仅涉及数千次LLM调用,展示了基于LLM的系统在高级数学研究中的强大潜力。

英文摘要

The Gilbert-Pollak Conjecture (Gilbert and Pollak, 1968), also known as the Steiner Ratio Conjecture, states that for any finite point set in the Euclidean plane, the Steiner minimum tree has length at least $\sqrt{3}/2 \approx 0.866$ times that of the Euclidean minimum spanning tree (the Steiner ratio). A sequence of improvements through the 1980s culminated in a lower bound of $0.824$, with no substantial progress reported over the past three decades. Recent advances in LLMs have demonstrated strong performance on contest-level mathematical problems, yet their potential for addressing open, research-level questions remains largely unexplored. In this work, we present a novel AI system for obtaining tighter lower bounds on the Steiner ratio. Rather than directly prompting LLMs to solve the conjecture, we task them with generating rule-constrained geometric lemmas implemented as executable code. These lemmas are then used to construct a collection of specialized functions, which we call verification functions, that yield theoretically certified lower bounds of the Steiner ratio. Through progressive lemma refinement driven by reflection, the system establishes a new certified lower bound of 0.8559 for the Steiner ratio. The entire research effort involves only thousands of LLM calls, demonstrating the strong potential of LLM-based systems for advanced mathematical research.
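The conjectured constant $\sqrt{3}/2$ is attained by the three corners of an equilateral triangle, which makes for a quick numerical check. The sketch below is a worked example of the ratio only; it is unrelated to the paper's lemma-generation system, and the point coordinates are chosen purely for illustration.

```python
import math
from itertools import combinations

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Corners of a unit equilateral triangle.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]

# Minimum spanning tree on three points: keep the two shortest edges.
edges = sorted(dist(p, q) for p, q in combinations(points, 2))
mst_length = edges[0] + edges[1]

# For an equilateral triangle the single Steiner point is the centroid.
centroid = (
    sum(p[0] for p in points) / 3.0,
    sum(p[1] for p in points) / 3.0,
)
steiner_length = sum(dist(centroid, p) for p in points)

ratio = steiner_length / mst_length  # equals sqrt(3) / 2 up to rounding
```

The certified bounds quoted in the abstract (0.824 and the new 0.8559) both sit below this value, so the gap that remains to the conjecture is between 0.8559 and roughly 0.866.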

2601.21025 2026-05-22 stat.ML cs.LG 版本更新

A Diffusive Classification Loss for Learning Energy-based Generative Models

一种用于学习基于能量的生成模型的扩散分类损失

RuiKang OuYang, Louis Grenioux, José Miguel Hernández-Lobato

发表机构 * CMAP, CNRS, École polytechnique, Institut Polytechnique de Paris, Palaiseau, France(CMAP应用数学中心,法国国家科学研究中心,巴黎综合理工学院,巴黎理工学院,法国帕莱索) Center for Computational Mathematics, Flatiron Institute, New York, NY, USA(Flatiron研究所计算数学中心,美国纽约州纽约市) Department of Engineering, University of Cambridge, Cambridge, United Kingdom(剑桥大学工程系,英国剑桥)

AI总结 本文提出了一种名为DiffCLF的扩散分类损失,用于学习基于能量的生成模型,通过将能量模型学习重新表述为跨噪声级别的监督分类问题,从而在保持计算效率的同时避免了模式盲区,提高了模型的保真度和应用范围。

Comments Accepted at ICML 2026

详情
AI中文摘要

基于分数的生成模型最近取得了显著的成功。虽然它们通常由分数参数化,但另一种方法是使用一系列时间依赖的能量模型(EBMs),其中分数是从能量的负输入梯度获得的。关键的是,EBMs不仅可以用于生成,还可以用于诸如组合采样或通过蒙特卡洛方法构建玻尔兹曼生成器等任务。然而,训练EBMs仍然具有挑战性。直接最大似然估计由于需要嵌套采样而计算上不可行,而分数匹配虽然高效,但存在模式盲区。为了解决这些问题,我们引入了扩散分类(DiffCLF)目标,这是一种简单的方法,可以避免盲区同时保持计算效率。DiffCLF将EBM学习重新表述为跨噪声级别的监督分类问题,并可以无缝结合标准的分数基目标。我们通过在分析高斯混合案例中将估计能量与真实值进行比较,以及通过应用训练好的模型到诸如模型组合和玻尔兹曼生成器采样等任务中,验证了DiffCLF的有效性。我们的结果表明,DiffCLF使EBM比现有方法具有更高的保真度和更广泛的应用范围。

英文摘要

Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.
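The relation the abstract relies on, that the score is the negative input-gradient of the energy, can be checked numerically on a case with a known answer. The sketch below uses an isotropic Gaussian energy and central finite differences; it illustrates that relation only and does not implement the DiffCLF objective, whose exact form is not given here.

```python
import numpy as np

def energy(x, mu, sigma):
    """Energy of an isotropic Gaussian, up to an additive constant."""
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2

def score_from_energy(x, mu, sigma, eps=1e-5):
    """Score = -dE/dx, estimated with central finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (energy(x + step, mu, sigma) - energy(x - step, mu, sigma)) / (2 * eps)
    return -grad

x = np.array([1.0, -2.0])
mu = np.array([0.5, 0.5])
sigma = 2.0
score = score_from_energy(x, mu, sigma)
closed_form = -(x - mu) / sigma**2  # analytic Gaussian score
```

In practice an automatic-differentiation framework would replace the finite-difference loop; it is used here only to keep the example dependency-free.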

2601.11079 2026-05-22 cs.LG 版本更新

Soft Bayesian Context Tree Models for Real-Valued Time Series

针对实值时间序列的软贝叶斯上下文树模型

Shota Saito, Yuta Nakahara, Toshiyasu Matsushima

发表机构 * Gunma University(群马大学) Waseda University(早稻田大学)

AI总结 本文提出了一种新的软贝叶斯上下文树模型(Soft-BCT),用于实值时间序列。该模型采用概率性分裂上下文空间,而非传统上下文树模型中确定性的上下文空间分裂。基于变分推断提出学习算法,实验结果表明Soft-BCT在某些数据集上优于传统上下文树模型。

详情
AI中文摘要

本文提出软贝叶斯上下文树模型(Soft-BCT),这是一种新的实值时间序列的上下文树模型。Soft-BCT考虑了上下文空间的软(概率)分裂,而不是传统上下文树模型中上下文空间的硬(确定性)分裂。基于变分推断提出Soft-BCT的学习算法。实验结果表明,Soft-BCT在某些数据集上优于传统上下文树模型。

英文摘要

This paper proposes the soft Bayesian context tree model (Soft-BCT), which is a novel BCT model for real-valued time series. The Soft-BCT considers soft (probabilistic) splits of the context space, instead of hard (deterministic) splits of the context space as in the previous BCT for real-valued time series. A learning algorithm of the Soft-BCT is proposed based on the variational inference. The results of experiments demonstrate the superiority of the Soft-BCT compared to the previous BCT for some datasets.
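The contrast between hard and soft splits can be made concrete with a one-dimensional sketch. The logistic gate, the `sharpness` parameter, and the two leaf means below are assumptions for illustration; the abstract does not state which gating function Soft-BCT uses.

```python
import math

def hard_split(x, threshold):
    """Deterministic split: the context goes entirely to one child."""
    return 1.0 if x > threshold else 0.0

def soft_split(x, threshold, sharpness=5.0):
    """Probabilistic split: probability of routing to the right child."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

def soft_prediction(x, threshold, left_mean, right_mean, sharpness=5.0):
    """Mix the two leaf predictions by the routing probability."""
    p_right = soft_split(x, threshold, sharpness)
    return (1.0 - p_right) * left_mean + p_right * right_mean

at_boundary = soft_prediction(0.0, 0.0, left_mean=-1.0, right_mean=1.0)
far_right = soft_prediction(3.0, 0.0, left_mean=-1.0, right_mean=1.0)
```

Near the threshold the soft model blends both leaves, while far from it the routing probability saturates and the prediction approaches the hard-split value.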

2601.10348 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Training-Trajectory-Aware Token Selection

基于训练轨迹的token选择

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出T3S方法,通过在token层面重构训练目标,清除未学习token的优化路径,从而在连续蒸馏中提升性能,实验表明在AR和dLLM设置中均取得显著效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

高效蒸馏是将昂贵的推理能力转化为可部署效率的关键途径,然而在学生模型已具备较强推理能力的前沿场景中,朴素的连续蒸馏往往收益有限,甚至导致性能退化。我们观察到一种典型的训练现象:即使损失单调下降,所有性能指标也可能在几乎相同的瓶颈处急剧下降,随后逐渐恢复。我们进一步揭示了token层面的机制:置信度分化为两类,一类是置信度稳步上升、迅速锚定优化的模仿锚点token(Imitation-Anchor Tokens),另一类是尚未学会的token,其置信度在瓶颈之前一直受到抑制。这两类token无法共存,正是连续蒸馏失败的根本原因。为此,我们提出基于训练轨迹的token选择(T3S),在token层面重构训练目标,为尚未学会的token扫清优化路径。T3S在AR和dLLM设置中均带来一致的收益:仅用数百个示例,Qwen3-8B即可在竞争性推理基准上超越DeepSeek-R1,Qwen3-32B接近Qwen3-235B,经T3训练的LLaDA-2.0-Mini超越其AR基线,在所有16B规模的no-think模型中达到最先进性能。

英文摘要

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two types of tokens to coexist is the root cause of the failure of continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

2601.05157 2026-05-22 cs.DS cs.LG stat.ML 版本更新

Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms

通过高效的高维稀疏傅里叶变换学习混合模型

Alkis Kalavasis, Pravesh K. Kothari, Shuchen Li, Manolis Zampetakis

发表机构 * Yale University(耶鲁大学) Princeton University(普林斯顿大学)

AI总结 本文提出了一种在高维空间中以多项式时间和样本复杂度学习混合模型参数的方法,适用于重尾分布的混合模型,包括那些协方差并非有限的分布,且不要求簇均值之间存在最小分离。

详情
AI中文摘要

在本文中,我们给出一种时间复杂度和样本复杂度均为${\rm poly}(d,k)$的算法,用于高效学习$d$维空间中$k$个球形分布的混合模型的参数。与此前的所有方法不同,我们的技术适用于重尾分布,其中甚至包括协方差不有限的例子。只要各簇分布的特征函数具有足够重的尾部,我们的方法就能成功。此类分布包括拉普拉斯分布,但关键的是不包括高斯分布。此前所有学习混合模型的方法都隐式或显式地依赖于低阶矩。即使对于拉普拉斯分布的情形,我们也证明了任何此类算法都必须使用超多项式数量的样本。因此,我们的方法成为少数能够绕过矩方法局限的技术之一。有些出人意料的是,我们的算法不要求簇均值之间存在任何最小分离。这与球形高斯混合模型形成鲜明对比:对后者而言,最小的$\ell_2$分离即使在信息论意义上也被证明是必要的[Regev and Vijayaraghavan '17]。我们的方法能与现有技术很好地组合,从而对如下混合模型获得“两全其美”的保证:其中每个分量要么具有重尾的特征函数,要么具有亚高斯尾部且特征函数为轻尾。我们的算法基于一种通过高效的高维稀疏傅里叶变换来学习混合模型的新方法。我们相信这种方法将在统计估计中找到更多应用。作为例子,我们给出一个针对噪声无感知(noise-oblivious)对手的一致鲁棒均值估计算法,该模型的实际动机来自多重假设检验的文献。该模型由本文作者之一在近期的硕士论文中正式提出,并已启发了后续工作。

英文摘要

In this work, we give a ${\rm poly}(d,k)$ time and sample algorithm for efficiently learning the parameters of a mixture of $k$ spherical distributions in $d$ dimensions. Unlike all previous methods, our techniques apply to heavy-tailed distributions and include examples that do not even have finite covariances. Our method succeeds whenever the cluster distributions have a characteristic function with sufficiently heavy tails. Such distributions include the Laplace distribution but crucially exclude Gaussians. All previous methods for learning mixture models relied implicitly or explicitly on the low-degree moments. Even for the case of Laplace distributions, we prove that any such algorithm must use super-polynomially many samples. Our method thus adds to the short list of techniques that bypass the limitations of the method of moments. Somewhat surprisingly, our algorithm does not require any minimum separation between the cluster means. This is in stark contrast to spherical Gaussian mixtures where a minimum $\ell_2$-separation is provably necessary even information-theoretically [Regev and Vijayaraghavan '17]. Our methods compose well with existing techniques and allow obtaining "best of both worlds" guarantees for mixtures where every component either has a heavy-tailed characteristic function or has a sub-Gaussian tail with a light-tailed characteristic function. Our algorithm is based on a new approach to learning mixture models via efficient high-dimensional sparse Fourier transforms. We believe that this method will find more applications to statistical estimation. As an example, we give an algorithm for consistent robust mean estimation against noise-oblivious adversaries, a model practically motivated by the literature on multiple hypothesis testing. It was formally proposed in a recent Master's thesis by one of the authors, and has already inspired follow-up works.
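The distinction between Laplace and Gaussian components comes down to how fast the characteristic function decays. The one-dimensional sketch below compares the two closed forms and checks an empirical characteristic function against the Laplace formula; it is a worked illustration under the assumption of unit scale, not the paper's sparse Fourier algorithm.

```python
import numpy as np

def empirical_cf(samples, t):
    """Empirical characteristic function: the sample mean of exp(i t X)."""
    return np.mean(np.exp(1j * t * samples))

def laplace_cf(t, b=1.0):
    """Characteristic function of a zero-mean Laplace with scale b."""
    return 1.0 / (1.0 + (b * t) ** 2)

def gaussian_cf(t, sigma=1.0):
    """Characteristic function of a zero-mean Gaussian with std sigma."""
    return np.exp(-0.5 * (sigma * t) ** 2)

rng = np.random.default_rng(0)
laplace_samples = rng.laplace(loc=0.0, scale=1.0, size=200_000)
ecf_at_2 = empirical_cf(laplace_samples, 2.0)  # compare with laplace_cf(2.0) = 0.2

# At frequency t = 5 the Laplace CF has decayed polynomially,
# while the Gaussian CF has decayed like exp(-t^2 / 2).
laplace_tail = laplace_cf(5.0)
gaussian_tail = gaussian_cf(5.0)
```

A heavy-tailed characteristic function keeps a usable signal at high frequencies, which is the regime where an estimate built from a fixed number of samples can still be separated from noise.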

2512.19131 2026-05-22 cs.DC cs.LG 版本更新

Evidential Trust-Aware Model Personalization in Decentralized Federated Learning for Wearable IoT

基于证据的信任感知模型个性化在可穿戴物联网的去中心化联邦学习中

Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya

发表机构 * Quantum Cloud Computing and Distributed Systems (qCLOUDS) Lab(量子云计算与分布式系统实验室) School of Computing and Information Systems(计算与信息系统学院) The University of Melbourne, Australia(墨尔本大学)

AI总结 本文提出Murmura框架,利用证据深度学习实现去中心化联邦学习中的信任感知模型个性化,通过Dirichlet基于的证据模型中的epistemic不确定性直接指示节点兼容性,从而减少非IID条件下的性能下降并加快收敛速度。

Comments v2. Addressed minor reviewer concerns

详情
AI中文摘要

去中心化联邦学习(DFL)使边缘设备能够在没有集中协调的情况下协作训练模型,从而避免单点故障。然而,本地数据非同分布所带来的统计异质性构成了一个根本性挑战:节点必须学习适应其本地分布的个性化模型,同时有选择地与兼容的对等节点协作。现有方法要么强制使用对任何节点都拟合不佳的单一全局模型,要么依赖启发式的对等节点选择机制,无法区分数据分布确实不兼容的节点与拥有有价值互补知识的节点。我们提出Murmura,一个利用证据深度学习在DFL中实现信任感知模型个性化的框架。我们的关键见解是,基于Dirichlet的证据模型给出的认知(epistemic)不确定性可以直接指示对等节点的兼容性:当对等节点的模型在本地数据上评估时出现较高的认知不确定性,说明分布不匹配,节点据此可以排除不兼容的影响,并通过有选择的协作保持个性化模型。Murmura引入了一种信任感知的聚合机制:通过在本地验证样本上的交叉评估计算对等节点的兼容性分数,并基于带自适应阈值的证据信任对模型聚合进行个性化。在三个可穿戴物联网数据集(UCI HAR、PAMAP2、PPG-DaLiA)上的评估表明,与基线相比,Murmura减小了从IID到非IID条件下的性能下降(0.9% vs. 19.3%),收敛速度提高7.4×,并在不同超参数选择下保持稳定的准确性。这些结果确立了证据不确定性作为去中心化异构环境中兼容性感知个性化的原则性基础。

英文摘要

Decentralized federated learning (DFL) enables collaborative model training across edge devices without centralized coordination, offering resilience against single points of failure. However, statistical heterogeneity arising from non-identically distributed local data creates a fundamental challenge: nodes must learn personalized models adapted to their local distributions while selectively collaborating with compatible peers. Existing approaches either enforce a single global model that fits no one well, or rely on heuristic peer selection mechanisms that cannot distinguish between peers with genuinely incompatible data distributions and those with valuable complementary knowledge. We present Murmura, a framework that leverages evidential deep learning to enable trust-aware model personalization in DFL. Our key insight is that epistemic uncertainty from Dirichlet-based evidential models directly indicates peer compatibility: high epistemic uncertainty when a peer's model evaluates local data reveals distributional mismatch, enabling nodes to exclude incompatible influence while maintaining personalized models through selective collaboration. Murmura introduces a trust-aware aggregation mechanism that computes peer compatibility scores through cross-evaluation on local validation samples and personalizes model aggregation based on evidential trust with adaptive thresholds. Evaluation on three wearable IoT datasets (UCI HAR, PAMAP2, PPG-DaLiA) demonstrates that Murmura reduces performance degradation from IID to non-IID conditions compared to baseline (0.9% vs. 19.3%), achieves 7.4$\times$ faster convergence, and maintains stable accuracy across hyperparameter choices. These results establish evidential uncertainty as a principled foundation for compatibility-aware personalization in decentralized heterogeneous environments.
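The idea of using epistemic uncertainty as a compatibility signal can be sketched with the usual Dirichlet parameterisation from evidential deep learning, where `alpha = evidence + 1` and the uncertainty mass is `K / sum(alpha)`. The trust score, the fixed `threshold`, and the evidence arrays below are assumptions for illustration; the abstract states that Murmura uses adaptive thresholds, which this sketch does not reproduce.

```python
import numpy as np

def epistemic_uncertainty(evidence):
    """Uncertainty mass K / sum(alpha) of a Dirichlet with alpha = evidence + 1."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0
    k = alpha.shape[-1]
    return k / alpha.sum(axis=-1)

def trust_weights(peer_evidence, threshold=0.5):
    """Hypothetical aggregation weights from mean uncertainty on local data."""
    trust = np.array(
        [1.0 - epistemic_uncertainty(ev).mean() for ev in peer_evidence]
    )
    kept = np.where(trust >= threshold, trust, 0.0)  # drop incompatible peers
    total = kept.sum()
    return kept / total if total > 0 else kept

# Evidence produced by two peers' models on two local validation samples.
compatible_peer = [[9.0, 0.0, 0.0], [0.0, 12.0, 0.0]]    # confident outputs
incompatible_peer = [[0.2, 0.1, 0.3], [0.0, 0.0, 0.0]]   # almost no evidence
weights = trust_weights([compatible_peer, incompatible_peer])
```

A peer whose model produces almost no evidence on local data gets an uncertainty close to 1 and is excluded, while the confident peer receives the full aggregation weight.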

2512.12744 2026-05-22 cs.LG 版本更新

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

静息神经元,主动洞察:通过自发性增强LLM中的激活稀疏性

Haotian Xu, Jiannan Yang, Tian Gao, Tsui-Wei Weng, Tengfei Ma

发表机构 * IBM Thomas J. Watson Research Center, Yorktown Heights, USA(IBM托马斯·J·沃森研究中心,美国约克敦海茨) Halıcıoğlu Data Science Institute, UC San Diego, La Jolla, USA(加州大学圣地亚哥分校Halıcıoğlu数据科学研究所,美国拉霍亚) Stony Brook University, Stony Brook, USA(石溪大学,美国石溪)

AI总结 本文提出了一种通过引入自发神经元(SPON)来增强LLM中激活稀疏性的方法,解决了高稀疏率下模型精度下降的问题,通过分布匹配训练SPON,使模型在稀疏计算中保持稳定和泛化能力。

Comments ICML 2026

详情
AI中文摘要

激活稀疏性提供了一种有吸引力的途径来加速大型语言模型(LLM)的推理过程,通过选择性地抑制隐藏激活。然而,现有方法在高稀疏率下表现出严重的准确性下降。我们发现,这种失败源于表征不稳定:*激活稀疏性破坏了预训练期间学习的输入依赖激活,导致隐藏状态的分布偏移。*我们通过将激活稀疏性重新定义为表征对齐问题,并引入**自发神经元(SPON)**,一种受生物系统中自发神经活动启发的轻量机制。SPON注入一组小的可学习、输入无关的激活向量,作为稀疏计算中的持久表征锚点。这些向量通过分布匹配训练与密集模型匹配,并在训练后可吸收进偏置项中,带来极小的推理开销。在多个LLM架构上,SPON一致地恢复了性能,稳定了潜在表征,并保持了泛化能力。我们的结果确立了SPON作为可靠激活稀疏推理的有效且原则性解决方案,并为LLM的知识保留提供了新的见解。

英文摘要

Activation sparsity offers a compelling route to accelerate large language model (LLM) inference by selectively suppressing hidden activations, yet existing approaches exhibit severe accuracy degradation at high sparsity. We show that this failure stems from representational instability: *activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.* We address this issue by reframing activation sparsity as a representational alignment problem and introducing **Spontaneous Neurons (SPON)**, a lightweight mechanism inspired by spontaneous neural activity in biological systems. SPON injects a small set of learnable, input-independent activation vectors that act as persistent representational anchors for sparse computation. These vectors are trained via distribution matching to the dense model and can be absorbed into bias terms after training, incurring negligible inference overhead. Across multiple LLM backbones, SPON consistently restores performance, stabilizes latent representations, and preserves generalization. Our results establish SPON as an effective and principled solution for reliable activation-sparse inference, and offer new insights into knowledge retention in LLMs.
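The claim that the learned vectors can be absorbed into bias terms follows from linearity. The sketch below assumes, for illustration only, that the input-independent vector `v` is added to the sparsified input of a linear layer; where SPON actually injects its vectors is not specified in this abstract, and all arrays here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # weights of a linear layer
b = rng.normal(size=3)        # its original bias
v = rng.normal(size=5)        # learned, input-independent activation vector

# A sparsified hidden activation: small entries are zeroed out.
h_sparse = rng.normal(size=5)
h_sparse[np.abs(h_sparse) < 0.5] = 0.0

# Training-time view: inject v next to the sparse activation.
out_injected = W @ (h_sparse + v) + b

# Inference-time view: fold W @ v into the bias once, then drop v.
b_absorbed = b + W @ v
out_absorbed = W @ h_sparse + b_absorbed
```

Both paths give the same output, so under this assumption the extra vector costs one precomputed matrix-vector product and no additional work per token.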

2512.11587 2026-05-22 cs.LG cs.NA math.NA math.OC 版本更新

Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration

梯度下降作为感知机算法:理解动态与隐式加速

Alexander Tyurin

发表机构 * Applied AI Institute, Moscow, Russia(应用人工智能研究所,莫斯科,俄罗斯)

AI总结 本文研究了梯度下降在神经网络训练中的优化动态和隐式加速现象,通过非线性模型分析显示梯度下降步骤等价于广义感知机算法,揭示了非线性模型在迭代复杂度上的优势。

详情
AI中文摘要

即使对于应用于神经网络训练的梯度下降(GD)方法,理解其优化动态,包括收敛速度、迭代轨迹、函数值振荡,尤其是其隐式加速,仍然是一个具有挑战性的问题。我们分析了采用逻辑损失的非线性模型,并证明GD的步骤可归约为广义感知机算法(Rosenblatt, 1958)的步骤,从而为其动态提供了新的视角。这一归约得到了显著更简单的算法步骤,我们用经典线性代数工具对其进行分析。借助这些工具,我们在一个极简示例上证明,双层模型中的非线性可证明地带来更快的迭代复杂度$\tilde{O}(\sqrt{d})$,而线性模型为$Ω(d)$,其中$d$是特征数量。这有助于解释神经网络中观察到的优化动态和隐式加速现象。理论结果得到了大量数值实验的支持。我们相信这一新的视角将进一步推动神经网络优化的研究。

英文摘要

Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including convergence rate, iterate trajectories, function value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity $\tilde{O}(\sqrt{d})$ compared to $Ω(d)$ achieved by linear models, where $d$ is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.
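The link between logistic-loss gradient descent and the perceptron is easiest to see in the linear, single-example case, where the GD step is a perceptron step scaled by a sigmoid of the negative margin. The sketch below shows only this textbook special case; the paper's reduction for nonlinear models and its generalized perceptron algorithms are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step_logistic(w, x, y, lr):
    """One GD step on log(1 + exp(-y * w.x)) for a single example."""
    margin = y * (w @ x)
    return w + lr * sigmoid(-margin) * y * x

def perceptron_step(w, x, y, lr):
    """Classical perceptron: update only when the example is misclassified."""
    margin = y * (w @ x)
    return w + lr * y * x if margin <= 0 else w

x = np.array([1.0, 0.0])
y = 1.0
w_wrong = np.array([-50.0, 0.0])  # badly misclassified: margin = -50
w_right = np.array([50.0, 0.0])   # confidently correct: margin = +50

gd_wrong = gd_step_logistic(w_wrong, x, y, lr=0.1)
pc_wrong = perceptron_step(w_wrong, x, y, lr=0.1)
gd_right = gd_step_logistic(w_right, x, y, lr=0.1)
pc_right = perceptron_step(w_right, x, y, lr=0.1)
```

For large negative margins the sigmoid factor is close to 1 and the two updates coincide; for large positive margins it is close to 0 and both leave the weights essentially unchanged.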

2512.09472 2026-05-22 cs.DC cs.LG 版本更新

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

WarmServe: 为多LLM服务实现一对多GPU预热

Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Xuanzhe Liu, Xin Jin

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出WarmServe系统,通过基于工作负载预测的一对多GPU预热技术,降低LLM服务中的尾部首token时间(TTFT)并提高请求吞吐量。

Comments Accepted at ICML 2026

详情
AI中文摘要

在共享GPU集群中部署多个模型是提高大型语言模型(LLM)服务资源效率的关键策略。现有的多LLM服务系统以推理性能下降为代价来提高GPU利用率,尤其是首token时间(TTFT)。我们将这种性能下降归因于对未来工作负载特征缺乏感知。与此相对,近期的分析表明,现实世界中的LLM服务工作负载具有很强的周期性和长期可预测性。在本文中,我们提出“一对多”GPU预热,根据工作负载预测主动将多个模型的参数加载到GPU上。这些预热的权重使系统能够在遇到请求突发时迅速实例化服务实例。我们设计并实现了WarmServe,一个多LLM服务系统,包含三项关键技术:(1)一种模型放置算法,优化预热决策以最小化跨模型的预热干扰;(2)一种KV缓存预留策略,将正在运行的GPU上空闲的KV缓存空间重新用于预热新模型;(3)一种用于张量管理的高效GPU内存切换机制。在真实世界数据集上的评估显示,与最先进的基于自动扩缩容的系统相比,WarmServe将尾部TTFT最多降低50.8倍,同时支持的请求吞吐量比GPU共享系统最多高2.5倍。

英文摘要

Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model placement algorithm that optimizes prewarming decisions to minimize cross-model prewarming interference, (2) a KV cache reservation strategy that repurposes idle KV cache space on running GPUs for prewarming new models, and (3) an efficient GPU memory switching mechanism for tensor management. Evaluation on real-world datasets shows that WarmServe reduces tail TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while supporting up to 2.5$\times$ higher request throughput than the GPU-sharing system.

2511.18159 2026-05-22 cs.LG 版本更新

Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

为扩散模型带来稳定性:分解和减少训练掩码扩散模型的方差

Mengni Jia, Mengyu Zhou, Yihao Liu, Xiaoxi Jiang, Guanjun Jiang

发表机构 * University of Cambridge(剑桥大学) Peking University(北京大学) Qwen Large Model Application Team, Alibaba(阿里巴巴通义大模型应用团队)

AI总结 本文研究了掩码扩散模型(MDMs)训练方差高导致不稳定的问题,通过分解方差来源并提出六种方差减少方法,显著提升了模型在复杂推理任务中的准确率,并将运行间变异性降低至自回归模型(ARMs)水平。

详情
英文摘要

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.
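The abstract says MIRROR reduces masking-pattern noise with negatively correlated samples. The sketch below shows the generic antithetic-variates idea behind that kind of variance reduction on a toy Monte Carlo estimate; the function `loss`, the uniform sampling of `t`, and the pairing `t, 1 - t` are assumptions for illustration and are not the paper's MIRROR construction.

```python
import numpy as np

def loss(t):
    """Toy monotone per-sample loss as a function of a uniform draw t."""
    return np.exp(t)

rng = np.random.default_rng(0)
n = 100_000
t1 = rng.uniform(size=n)
t2 = rng.uniform(size=n)

# Baseline: average two independent draws per estimate.
independent = 0.5 * (loss(t1) + loss(t2))

# Antithetic: pair each draw with its mirror image, which is negatively
# correlated with it because the loss is monotone in t.
antithetic = 0.5 * (loss(t1) + loss(1.0 - t1))

var_independent = independent.var()
var_antithetic = antithetic.var()
```

Both estimators target the same mean, but the negatively correlated pair cancels much of the fluctuation, so its variance is far smaller for the same number of loss evaluations.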

2511.10619 2026-05-22 cs.LG stat.ML 版本更新

Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem

改进多臂老虎机问题的算法设计及更强的保证

Avrim Blum, Marten Garicano, Kavya Ravichandran, Dravyansh Sharma

发表机构 * Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) University of Chicago(芝加哥大学) IDEAL Institute, Toyota Technological Institute at Chicago(IDEAL研究所,芝加哥丰田技术研究所)

AI总结 本文提出两种新的参数化老虎机算法家族,通过离线数据界定了学习近最优算法的样本复杂度,并在标准超参数调优基准上进行了实证评估。第一家族包含先前工作的最优随机算法,展示在满足额外凹性性质的臂奖励曲线下,可以实现更强的保证。第二家族算法在良好行为实例上保证最佳臂识别,在不良行为实例上退化为最坏情况保证。

Comments 36 pages

详情
AI中文摘要

改进多臂老虎机问题是一个在不确定性下分配努力的形式模型,受投资新技术研究努力、进行临床试验和从学习曲线中选择超参数等场景的启发。每次拉取臂提供奖励,该奖励以递减回报单调增加。已有大量工作设计了改进老虎机算法,但最坏情况保证较为悲观。事实上,已知确定性和随机性算法相对于最优臂的强下界分别为Ω(k)和Ω(√k)的乘法近似因子。在本文中,我们提出两个新的参数化老虎机算法家族,并利用离线数据界定了从每个家族学习近最优算法的样本复杂度。我们还在标准超参数调优基准上进行了实证评估。我们定义的第一家族包含先前工作的最优随机算法。我们证明,适当选择的算法从该家族中可以实现更强的保证,当臂奖励曲线下满足与凹性强度相关的额外性质时,具有最优的k依赖性。我们的第二家族包含在良好行为实例上保证最佳臂识别并在不良行为实例上退化为最坏情况保证的算法。

英文摘要

The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperparameter selection from learning curves. Each pull of an arm provides reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for improving bandits, albeit with somewhat pessimistic worst-case guarantees. Indeed, strong lower bounds of $Ω(k)$ and $Ω(\sqrt{k})$ multiplicative approximation factors are known for both deterministic and randomized algorithms (respectively) relative to the optimal arm, where $k$ is the number of bandit arms. In this work, we propose two new parameterized families of bandit algorithms and bound the sample complexity of learning the near-optimal algorithm from each family using offline data. We also perform empirical evaluations on standard hyperparameter tuning benchmarks. The first family we define includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family can achieve stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of concavity. Our second family contains algorithms that both guarantee best-arm identification on well-behaved instances and revert to worst-case guarantees on poorly-behaved instances.

2511.07885 2026-05-22 cs.DC cs.AI cs.CL cs.LG 版本更新

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

每瓦智能:衡量本地AI的智能效率

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

发表机构 * Stanford University(斯坦福大学) Together AI

AI总结 本文研究了本地AI在能源效率和性能上的表现,提出了一种统一的衡量指标IPW,展示了本地推理在重新分配需求方面的能力,并揭示了本地加速器的优化潜力。

详情
AI中文摘要

大型语言模型(LLM)查询主要由集中式云基础设施中的前沿模型处理。需求增长比提供商能够扩展的速度更快。两项进展创造了重新思考这一范式的机会:小型本地LM(<=20B活跃参数)在许多任务上能与前沿模型竞争性地表现,而本地加速器(如Apple M4 Max)可以以交互延迟支持这些模型。这引发了问题:本地推理能否在能源受限的设备上有效重新分配需求?这需要测量本地LM是否能准确回答现实查询以及是否在能源受限的设备上高效。我们提出了智能每瓦(IPW),即任务准确度每单位功率,作为衡量本地推理能力与效率的统一指标。我们评估了20多个最先进的本地LM、8种硬件加速器(本地和云)以及100万条现实单轮聊天和推理查询。对于每个查询,我们测量了准确性(本地LM对前沿模型的胜率)、能耗、延迟和功率。我们发现三个关键结果。首先,本地LM成功回答了88.7%的这些查询,准确性因领域而异。其次,2023-2025年的纵向分析显示IPW提高了5.3倍,由算法和加速器的改进驱动,本地可服务查询覆盖范围从23.2%增加到71.3%。第三,本地加速器在相同模型上实现的IPW至少比云加速器低1.4倍,揭示了本地加速器优化的巨大潜力。这些发现表明,本地推理可以对集中式基础设施的大量查询需求进行有意义的重新分配,IPW是跟踪这一转变的关键指标。

英文摘要

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Demand growth strains this paradigm faster than providers can scale. Two advances create an opportunity to rethink it: small, local LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can host these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? This requires measuring both whether local LMs can accurately answer real-world queries and whether they can do so efficiently on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric for the capability and efficiency of local inference across model-accelerator configurations. We evaluate 20+ state-of-the-art local LMs, 8 hardware accelerators (local and cloud), and 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy (local LM win rate against frontier models), energy, latency, and power. We find three key results. First, local LMs successfully answer 88.7% of these queries, with accuracy varying by domain. Second, longitudinal analysis from 2023-2025 shows IPW improved 5.3x, driven by both algorithmic and accelerator advances, with locally-serviceable query coverage rising from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for local accelerator optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure for a substantial subset of queries, with IPW serving as the critical metric for tracking this transition.
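The metric defined in this abstract, task accuracy per unit of power, reduces to simple arithmetic once energy and latency are measured. The sketch below uses made-up numbers for two hypothetical configurations; how the paper aggregates accuracy and power across queries is not stated here, so the per-configuration averaging is an assumption.

```python
def average_power_watts(energy_joules, latency_seconds):
    """Average power drawn while serving a query: energy divided by time."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return energy_joules / latency_seconds

def intelligence_per_watt(accuracy, power_watts):
    """IPW as described in the abstract: accuracy per unit of power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return accuracy / power_watts

# Hypothetical measurements for two model-accelerator configurations.
power_a = average_power_watts(energy_joules=120.0, latency_seconds=3.0)  # 40 W
power_b = average_power_watts(energy_joules=90.0, latency_seconds=3.0)   # 30 W
ipw_a = intelligence_per_watt(accuracy=0.80, power_watts=power_a)
ipw_b = intelligence_per_watt(accuracy=0.84, power_watts=power_b)
improvement = ipw_b / ipw_a
```

Because the metric is a ratio, a configuration can improve it either by answering more queries correctly or by drawing less power for the same accuracy.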

2511.04838 2026-05-22 cs.LG math.SP q-bio.MN 版本更新

SPECTRA: Spectral Domain-Aware Graph Generation for Imbalanced Molecular Property Regression

SPECTRA: 用于不平衡分子属性回归的谱域感知图生成

Brenda Nogueira, Gisela A. Gonzalez-Montiel, Meng Jiang, Nitesh V. Chawla, Nuno Moniz

发表机构 * University of Notre Dame, Dept. of Computer Science(圣母大学计算机科学系) University of Notre Dame, Dept. of Chemistry(圣母大学化学系) University of Notre Dame, Lucy Family Institute for Data & Society, Notre Dame, Indiana, USA(圣母大学Lucy Family数据与社会研究所,美国印第安纳州诺特丹)

AI总结 本文提出SPECTRA方法,通过结合稀缺性感知预算方案、目标邻居图对齐和拉普拉斯谱插值,提升对相关但数据稀缺的分子属性值的预测能力,在相关目标范围内取得与领先的最先进方法相比具有竞争力的性能,同时计算时间减少约4倍。

详情
AI中文摘要

分子属性回归在化学相关的目标范围内遇到困难,因为这些范围在数据集中代表性不足。标准的平均误差最小化方法在这些高相关性情况下表现不佳,而过采样方法会导致分子表示失去意义。本文提出SPECTRA,一种谱域感知的图生成方法,旨在提高对相关但数据稀缺的分子属性值的预测能力。它结合了稀缺性感知的预算方案以聚焦数据稀缺区域,目标邻居图对齐以建立结构对应关系,以及拉普拉斯谱、节点特征和目标的插值。结合使用谱图神经网络和边缘感知的切比雪夫卷积,SPECTRA在属性预测基准测试中表现出色,在相关目标范围内与最先进的方法竞争,同时计算时间减少约4倍。

英文摘要

Molecular property regression struggles with cases in chemically relevant target ranges that are underrepresented in datasets. Standard average error minimization approaches underperform in these highly relevant cases, and oversampling approaches lead to meaningless molecular representations. In this paper, we propose SPECTRA, a spectral, domain-aware graph generation method designed to improve the prediction of underrepresented but relevant molecular property values. It combines a rarity-aware budgeting scheme to focus generation where data are scarce, target-neighbors graph alignment to establish structural correspondence, and interpolation of Laplacian spectra, node features, and targets. Coupled with spectral GNN using edge-aware Chebyshev convolutions, SPECTRA shows its effectiveness in property prediction benchmarks with competitive performance over leading state-of-the-art methods in relevant target ranges, while requiring ~4x less computational time.

2511.02043 2026-05-22 cs.LG cs.PF 版本更新

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Flashlight: PyTorch 编译器扩展以加速注意力变种

Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

发表机构 * Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country(匿名机构,匿名城市,匿名地区,匿名国家)

AI总结 本文提出Flashlight,一种基于PyTorch的编译器框架,能够自动生成融合的FlashAttention风格内核,支持任意注意力程序,无需静态模板或预定义内核专有化,从而在保持性能的同时提供灵活性。

详情
AI中文摘要

注意力是大型语言模型(LLMs)的基本构建块,因此有很多努力去高效地实现它。例如,FlashAttention利用分块和内核融合来优化注意力。最近,一些注意力变种被引入以提高模型质量和效率。支持它们仍然困难,因为它们通常需要专门的内核或手动调优的实现。FlexAttention最近通过使用静态编程模板来支持FlashAttention-like内核来解决部分这一差距。在本文中,我们介绍了Flashlight,一种位于PyTorch生态系统中的编译器原生框架,能够自动生成融合的FlashAttention风格内核,适用于任意注意力程序,而无需依赖静态模板或预定义的内核专有化。Flashlight利用PyTorch的编译流程来透明地融合和分块注意力计算,使各种注意力模式能够高效执行。不仅支持FlexAttention模型中所有可表达的变种,还处理更一般、数据依赖的注意力公式,这些超出了FlexAttention的能力范围。我们的结果表明,Flashlight生成的内核在性能上与FlexAttention具有竞争力或更优,同时提供原生PyTorch代码的灵活性,使开发人员能够快速探索新的注意力模型,而不会牺牲性能。

英文摘要

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
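The tiling idea mentioned above can be sketched for a single query vector: keys and values are consumed tile by tile while a running maximum, a running normaliser, and a running weighted sum are maintained (the online-softmax trick). This is a minimal pure-Python illustration of that general technique, not Flashlight's compiler or any generated kernel; the function names and the tile size are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query, computed in a single pass."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but keys/values are processed one tile at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

Because the full score vector is never materialised, the tiled form is the kind of computation that can be fused into a single kernel.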

2510.04280 2026-05-22 cs.LG cs.AI cs.RO 版本更新

A KL-regularization Framework for Learning to Plan with Adaptive Priors

一种基于KL正则化的学习规划框架:具有自适应先验的规划

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

发表机构 * LIACS, Leiden University, Leiden, The Netherlands(莱顿大学莱顿分校,荷兰) Google Deepmind, London, United Kingdom(谷歌DeepMind,英国伦敦) University of Oxford, Oxford, United Kingdom(牛津大学,英国牛津)

AI总结 本文提出了一种基于KL正则化的学习规划框架,通过将规划器的动作分布作为先验整合到策略优化中,提升了在高维连续控制任务中模型驱动强化学习的样本效率和长期性能。

Comments Published at ICML 2026

详情
AI中文摘要

有效的探索仍然是模型驱动强化学习(MBRL)中的核心挑战,尤其是在高维连续控制任务中,样本效率至关重要。近期的一项重要工作利用学习的策略作为模型预测路径积分(MPPI)规划的提案分布。初始方法在更新采样策略时独立于规划器分布,通常通过确定性策略梯度和熵正则化最大化学习的价值函数。然而,由于训练过程中遇到的状态依赖于MPPI规划器,使采样策略与规划器对齐可以提高价值估计的准确性以及长期性能。为此,近期的方法通过最小化KL散度到规划器分布或引入规划器引导的正则化来更新采样策略。在本文中,我们通过引入策略优化-模型预测控制(PO-MPC),将这些基于MPPI的强化学习方法统一到一个框架中,这是一种整合规划器动作分布作为先验的KL正则化MBRL方法家族。通过使学习的策略与规划器的行为对齐,PO-MPC允许在回报最大化和KL散度最小化之间更灵活的策略更新。我们澄清了先前方法如何作为该家族的特殊案例出现,并探索了之前未研究的变体。我们的实验表明,这些扩展配置产生了显著的性能提升,推动了基于MPPI的强化学习的前沿。

英文摘要

Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off return maximization against KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
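The trade-off between return maximization and KL divergence to a prior has a well-known closed form when the action set is finite: the maximiser of E_pi[Q] - lam * KL(pi || prior) is proportional to prior(a) * exp(Q(a) / lam). The sketch below illustrates only that textbook result; it is not the PO-MPC update itself, and the prior, the action values, and the temperatures are hypothetical.

```python
import math


def kl_regularized_policy(prior, q_values, lam):
    """Closed-form maximiser of E_pi[Q] - lam * KL(pi || prior) over a
    finite action set: pi(a) is proportional to prior(a) * exp(Q(a) / lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [p * math.exp((q - q_max) / lam) for p, q in zip(prior, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


planner_prior = [0.5, 0.3, 0.2]  # hypothetical planner action distribution
q_values = [1.0, 2.0, 0.0]       # hypothetical learned action values

# Large lam: the KL term dominates and the policy stays close to the prior.
close_to_prior = kl_regularized_policy(planner_prior, q_values, lam=100.0)
# Small lam: return dominates and the policy concentrates on the best action.
greedy_like = kl_regularized_policy(planner_prior, q_values, lam=0.05)
```

The single parameter `lam` moves the policy between imitating the planner and greedily following the value estimate.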

2509.12610 2026-05-22 cs.DB cs.AI cs.LG 版本更新

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc: 在大规模文档集合上扩展基于大语言模型的谓词

Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出ScaleDoc系统,通过将谓词执行分为离线表示阶段和优化的在线过滤阶段,解决了大规模文档分析中大语言模型高推理成本的问题,实现了端到端速度提升和LLM调用成本降低。

详情
AI中文摘要

谓词是数据分析系统中的基础组件。然而,现代工作负载越来越多地涉及无结构文档,这需要语义理解,而不仅仅是传统基于值的谓词。鉴于巨大的文档和随机查询,尽管大语言模型(LLMs)显示出强大的零样本能力,但其高推理成本导致不可接受的开销。因此,我们引入ScaleDoc,一种新的系统,通过将谓词执行分解为离线表示阶段和优化的在线过滤阶段来解决这一问题。在离线阶段,ScaleDoc利用LLM为每个文档生成语义表示。在线阶段,对于每个查询,它在这些表示上训练一个轻量级代理模型来过滤大多数文档,只将有歧义的案例转发给LLM进行最终决策。此外,ScaleDoc提出了两个核心创新来实现显著的效率:(1)基于对比学习的框架,训练代理模型生成可靠的预测决策分数;(2)自适应级联机制,确定有效的过滤策略,同时满足特定的准确率目标。我们在三个数据集上的评估表明,ScaleDoc实现了超过2倍的端到端速度提升,并将昂贵的LLM调用减少了高达85%,使大规模语义分析变得实用和高效。

英文摘要

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but given enormous document collections and ad-hoc queries, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
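The online filtering phase described above can be pictured as a two-threshold cascade: confident proxy scores are decided immediately and only the ambiguous middle band is forwarded to the LLM. The sketch below is a minimal illustration under that assumption; the function name, the thresholds, the scores, and the stand-in judge are hypothetical, and the paper's adaptive threshold selection is not reproduced.

```python
def cascade_filter(scores, low, high, llm_judge):
    """Route each document by its proxy score.

    Scores below `low` are rejected, scores at or above `high` are accepted,
    and everything in between is deferred to the expensive `llm_judge`.
    Returns the decisions and the number of LLM calls made.
    """
    decisions = {}
    llm_calls = 0
    for doc_id, score in scores.items():
        if score < low:
            decisions[doc_id] = False
        elif score >= high:
            decisions[doc_id] = True
        else:
            decisions[doc_id] = llm_judge(doc_id)
            llm_calls += 1
    return decisions, llm_calls


def fake_llm(doc_id):
    """Stand-in for an LLM call; a real system would query a model here."""
    return doc_id == "d4"


proxy_scores = {"d1": 0.05, "d2": 0.48, "d3": 0.93, "d4": 0.61}
decisions, llm_calls = cascade_filter(proxy_scores, low=0.2, high=0.8, llm_judge=fake_llm)
```

Widening the band between `low` and `high` trades more LLM calls for higher accuracy, which is the knob an accuracy target would constrain.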

2509.09088 2026-05-22 cs.LG math.DG math.DS 版本更新

An entropy formula for the Deep Linear Network

深度线性网络的熵公式

Govind Menon, Tianmin Yu

发表机构 * Division of Applied Mathematics, Brown University(布朗大学应用数学系) School of Mathematics, Institute for Advanced Study(高级研究院数学系) Department of Mathematics, Northwestern University(西北大学数学系)

AI总结 本文研究深度线性网络的黎曼几何,以建立学习过程的热力学描述。通过群作用分析过参数化,并利用从参数空间到可观测空间的黎曼淹没,定义并计算玻尔兹曼熵。主要技术步骤是利用雅可比矩阵理论显式构造平衡流形切空间的标准正交基。

Comments Final version of accepted paper in SIAM Journal on Mathematical Analysis. Includes fixes of minor typos (especially equations (3.13), (6.35) and (6.36))

详情
AI中文摘要

我们研究深度线性网络(DLN)的黎曼几何,作为建立学习过程热力学描述的基础。主要工具是利用群作用分析过参数化,以及利用从参数空间到可观测空间的黎曼淹没。参数空间中的平衡流形被群轨道叶状化,据此定义并计算玻尔兹曼熵。我们还证明,[2]中定义在可观测空间上的黎曼几何可由平衡流形的黎曼淹没得到。主要技术步骤是利用雅可比矩阵理论显式构造平衡流形切空间的标准正交基。

英文摘要

We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.

2508.06884 2026-05-22 math.OC cs.LG 版本更新

Near-Optimal Convergence of Accelerated Gradient Methods under Generalized and $(L_0, L_1)$-Smoothness

在一般化和$(L_0, L_1)$-光滑条件下加速梯度方法的近最优收敛性

Alexander Tyurin

发表机构 * Applied AI Institute, Moscow, Russia(应用人工智能研究所,莫斯科,俄罗斯)

AI总结 本文研究了在满足最近提出的$\ell$-光滑性条件的凸优化问题中的一阶方法。虽然加速梯度下降法(AGD)在$L$-光滑性下能达到最优复杂度$O(\sqrt{L} R / \sqrt{\varepsilon})$,但现有方法在$\ell$-光滑性下的扩展要么引入额外的初始梯度依赖,要么有指数因子$ L_1 R $,或者需要昂贵的辅助子程序。本文解决了这一开放问题,通过新的Lyapunov函数和设计新的算法,实现了$O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$的oracle复杂度,对于小$\varepsilon$和几乎任何$\ell$。例如,在$(L_0, L_1)$-光滑性下,我们的界$O(\sqrt{L_0} R / \sqrt{\varepsilon})$在小$\varepsilon$范围内被证明是最佳的,并去除了先前加速算法中所有非常数的乘法因子。

详情
AI中文摘要

我们研究了在满足最近提出的$\ell$-光滑性条件的凸优化问题中的一阶方法。该条件$||\nabla^{2}f(x)|| \le \ell\left(||\nabla f(x)||\right)$扩展了$L$-光滑性和$(L_{0},L_{1})$-光滑性。虽然加速梯度下降法AGD在$L$-光滑性下能达到最优复杂度$O(\sqrt{L} R / \sqrt{\varepsilon})$,其中$\varepsilon$是误差容忍度,$R$是起始点与最优解之间的距离,但现有方法在$\ell$-光滑性下的扩展要么引入额外的初始梯度依赖,要么有指数因子$ L_1 R $,或者需要昂贵的辅助子程序,留下开放问题:是否可能在小$\varepsilon$下实现AGD型$O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$的速率,即使在$(L_{0},L_{1})$-光滑性情况下。我们解决了这一开放问题。通过新的Lyapunov函数和设计新的算法,我们实现了对于小$\varepsilon$和几乎任何$\ell$的$O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$的oracle复杂度。例如,在$(L_{0},L_{1})$-光滑性下,我们的界$O(\sqrt{L_0} R / \sqrt{\varepsilon})$在小$\varepsilon$范围内被证明是最佳的,并去除了先前加速算法中所有非常数的乘法因子。

英文摘要

We study first-order methods for convex optimization problems with functions $f$ satisfying the recently proposed $\ell$-smoothness condition $||\nabla^{2}f(x)|| \le \ell\left(||\nabla f(x)||\right),$ which generalizes the $L$-smoothness and $(L_{0},L_{1})$-smoothness. While accelerated gradient descent AGD is known to reach the optimal complexity $O(\sqrt{L} R / \sqrt{\varepsilon})$ under $L$-smoothness, where $\varepsilon$ is an error tolerance and $R$ is the distance between a starting and an optimal point, existing extensions to $\ell$-smoothness either incur extra dependence on the initial gradient, suffer exponential factors in $L_{1} R$, or require costly auxiliary sub-routines, leaving open whether an AGD-type $O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$ rate is possible for small-$\varepsilon$, even in the $(L_{0},L_{1})$-smoothness case. We resolve this open question. Leveraging a new Lyapunov function and designing new algorithms, we achieve $O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$ oracle complexity for small-$\varepsilon$ and virtually any $\ell$. For instance, for $(L_{0},L_{1})$-smoothness, our bound $O(\sqrt{L_0} R / \sqrt{\varepsilon})$ is provably optimal in the small-$\varepsilon$ regime and removes all non-constant multiplicative factors present in prior accelerated algorithms.
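To make the generalization claim above concrete, the following worked lines spell out how the two classical conditions arise as special cases of $\ell$-smoothness and why the stated rate reduces to the $L_{0}$-dependent bound. The choice $\ell(s) = L_{0} + L_{1} s$ is the standard form of $(L_{0},L_{1})$-smoothness and is assumed here; the abstract itself does not write it out.

```latex
\begin{align*}
\ell(s) \equiv L
  &\;\Longrightarrow\; ||\nabla^{2} f(x)|| \le L
  && \text{($L$-smoothness)},\\
\ell(s) = L_{0} + L_{1} s
  &\;\Longrightarrow\; ||\nabla^{2} f(x)|| \le L_{0} + L_{1}\,||\nabla f(x)||
  && \text{($(L_{0},L_{1})$-smoothness)},\\
\ell(0) = L_{0}
  &\;\Longrightarrow\; O\!\left(\sqrt{\ell(0)}\, R / \sqrt{\varepsilon}\right)
   = O\!\left(\sqrt{L_{0}}\, R / \sqrt{\varepsilon}\right).
\end{align*}
```

In particular, $L_{1}$ does not appear in the leading term, which is what removing the $L_{1} R$ factors means in the small-$\varepsilon$ regime.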

2507.20268 2026-05-22 cs.LG eess.SP stat.ML 版本更新

Reliable Wireless Indoor Localization via Cross-Validated Prediction-Powered Calibration

通过交叉验证的预测驱动校准实现可靠的无线室内定位

Seonghoon Yoo, Houssem Sifaou, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science and Technology(韩国科学技术院电子工程学院) King’s Communications, Learning & Information Processing (KCLIP) Lab, Centre for Intelligent Information Processing Systems (CIIPS), Department of Engineering, King’s College London(伦敦国王学院信息与通信实验室,智能信息处理系统中心,工程系) Institute for Intelligent Networked Systems, Northeastern University London(伦敦东北大学智能网络系统研究所)

AI总结 本文提出一种利用有限校准数据同时优化预测器和估计合成标签偏差的方法,通过交叉验证预测驱动校准提高无线室内定位的可靠性。

详情
AI中文摘要

使用预测模型和接收信号强度信息(RSSI)进行无线室内定位需要适当的校准以获得可靠的定位估计。一种解决方法是使用由(通常不同的)预测模型生成的合成标签。但微调额外的预测器以及估计合成标签的残差偏差需要额外的数据,加剧了无线环境中的校准数据稀缺问题。本文提出了一种方法,能够高效利用有限的校准数据,同时微调预测器并估计合成标签的偏差,从而获得具有严格覆盖保证的预测集。在指纹数据集上的实验验证了所提出方法的有效性。

英文摘要

Wireless indoor localization using predictive models with received signal strength information (RSSI) requires proper calibration for reliable position estimates. One remedy is to employ synthetic labels produced by a (generally different) predictive model. But fine-tuning an additional predictor, as well as estimating residual bias of the synthetic labels, demands additional data, aggravating calibration data scarcity in wireless environments. This letter proposes an approach that efficiently uses limited calibration data to simultaneously fine-tune a predictor and estimate the bias of synthetic labels, yielding prediction sets with rigorous coverage guarantees. Experiments on a fingerprinting dataset validate the effectiveness of the proposed method.

2506.19500 2026-05-22 cs.AI cs.CL cs.LG 版本更新

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

NaviAgent: 一种基于图的双层规划用于可扩展的工具编排

Yan Jiang, Hao Zhou, Lizhong GU, Tianlong Li, Ruinan Jin, Wanqi Zhou, Ai Han

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University, USA(电气与计算机工程系,俄亥俄州立大学,美国)

AI总结 本文提出NaviAgent,一种基于图的双层规划框架,通过解耦任务规划与工具执行,提升大规模工具编排的可扩展性和鲁棒性,实验表明其在任务成功率和实际应用中表现优异。

Comments Accepted to ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
AI中文摘要

大型语言模型(LLMs)越来越多地作为功能调用代理,通过调用外部工具来处理超出其静态知识的任务。然而,它们通常逐个调用工具,缺乏对任务结构的整体视图。由于工具之间往往相互依赖,这导致了错误累积和可扩展性差,尤其是在扩展到数百或数千个工具时。为了解决这些限制,我们提出了NaviAgent,一种显式的双层架构,通过基于工具关系的图建模来解耦任务规划与工具执行。在规划层,基于LLM的代理决定是否直接回应、澄清意图或检索并执行独立于工具间复杂度的工具链。在执行层,工具世界导航模型(TWNM)编码工具之间的结构和行为关系,引导代理生成可扩展且鲁棒的调用序列。通过整合真实工具交互的反馈,NaviAgent实现了规划与执行之间的闭环对齐,使代理能够在大规模工具生态系统中实现自适应导航。在API-Bank和ToolBench上的评估显示,任务成功率(TSR)有持续改进,TWNM在复杂任务上平均提升13.1个百分点。进一步在50个真实API跨7个领域的测试中,展示了4.3-12.0个百分点的持续收益,步骤更少且延迟更低,证明了其在真实世界动态下的鲁棒泛化能力。

英文摘要

Large Language Models (LLMs) increasingly act as function-call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph-based modeling of tool relations. At the planning level, the LLM-based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter-tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed-loop alignment between planning and execution, enabling adaptive navigation in large-scale tool ecosystems. Evaluations on API-Bank and ToolBench show consistent improvements in task success rate (TSR), with TWNM yielding an average gain of 13.1 points on complex tasks. Further tests on 50 real APIs across 7 domains show consistent gains of 4.3--12.0 points, with fewer steps and lower latency, demonstrating robust generalization under real-world dynamics.

2506.16659 2026-05-22 cs.LG cs.AI math.OC 版本更新

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

通过最小化优化器设计实现内存高效的LLM预训练

Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

发表机构 * Department of Electrical and Computer Engineering, University of Minnesota, USA(电气与计算机工程系,明尼苏达大学,美国) School of Mathematics and Statistics, University of Sydney, Australia(数学与统计学学院,悉尼大学,澳大利亚)

AI总结 本文研究了如何通过简单的优化器设计改进,使SGD在预训练中达到最先进的性能,提出了SCALE优化器,在内存使用上比Adam更高效,并在多个模型上表现优于现有内存高效的优化器。

Comments Accepted at ICML 2026

详情
AI中文摘要

训练大型语言模型(LLMs)依赖于自适应优化器,如Adam,这些优化器引入了额外的操作,并需要比SGD更多的内存来维护一阶和二阶矩量。尽管最近的工作如GaLore、Fira和APOLLO提出了状态压缩的内存高效变体,但一个根本性的问题仍然存在:plain SGD需要哪些最小的修改才能达到最先进的预训练性能?我们通过自底向上的方法系统地研究了这个问题,并识别出两种简单但高度(内存和计算)高效的技巧:(1)列级梯度归一化(沿输出维度归一化梯度),在没有动量的情况下提升SGD性能;(2)仅在输出层应用一阶动量,因为梯度方差最高。结合这两种技术得到SCALE(Stochastic Column-normAlized Last-layer momEntum),一种简单的优化器,用于内存高效的预训练。在多个模型(60M-1B)上,SCALE的内存使用仅为Adam的35-45%,并且在多个模型上表现优于Adam。它还一致优于内存高效的优化器如GaLore、Fira和APOLLO,使其成为在内存限制下的大规模预训练的强大候选者。对于LLaMA 7B,SCALE在困惑度和内存消耗方面都优于最先进的内存高效方法APOLLO和Muon。

英文摘要

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
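The two ingredients named above (column-wise gradient normalization, and momentum kept only for the last layer) can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the authors' implementation: the choice of which matrix axis counts as the "column", the order of momentum and normalization, the hyperparameters, and the function names are all hypothetical.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide every column of a 2-D gradient (a list of rows) by its L2 norm."""
    n_cols = len(grad[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in grad)) + eps for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in grad]


def scale_like_step(weights, grads, momentum, lr=0.1, beta=0.9):
    """One update over a list of layers.

    Momentum state exists only for the last layer; all other layers use the
    raw gradient. Every layer's update direction is column-normalised.
    """
    last = len(weights) - 1
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        if i == last:
            # Exponential moving average of the last-layer gradient.
            momentum = [[beta * m + (1 - beta) * gij for m, gij in zip(m_row, g_row)]
                        for m_row, g_row in zip(momentum, g)]
            direction = column_normalize(momentum)
        else:
            direction = column_normalize(g)
        new_weights.append([[wij - lr * dij for wij, dij in zip(w_row, d_row)]
                            for w_row, d_row in zip(w, direction)])
    return new_weights, momentum


# Two hypothetical 2x2 layers; only the second (last) one carries momentum.
weights = [[[1.0, 1.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
grads = [[[3.0, 0.0], [4.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
momentum = [[0.0, 0.0], [0.0, 0.0]]
new_weights, momentum = scale_like_step(weights, grads, momentum)
```

The memory saving comes from the optimizer state: only one momentum buffer, for one layer, is stored instead of two moment buffers for every parameter.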

2505.24333 2026-05-22 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG 版本更新

Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

深度变换器的两种失效模式及如何避免它们:初始化下信号传播的统一理论

Alessio Giorlandino, Sebastian Goldt

发表机构 * International School of Advanced Studies (SISSA)(国际先进研究学校(SISSA))

AI总结 本文研究了深度变换器中自注意力层的两种失效模式——秩坍缩和熵坍缩,并提出了一种统一的信号传播理论,通过分析初始化对训练稳定性的影响,提供了一种计算训练性图的简单算法,以确定正确初始化超参数的选择。

详情
Journal ref
ICLR 2026
AI中文摘要

找到正确的初始化对于确保神经网络的平稳训练和良好性能至关重要。在变换器中,错误的初始化可能导致自注意力层的两种失效模式:秩坍缩,其中所有标记坍缩为相似的表示,以及熵坍缩,其中高度集中的注意力分数导致训练不稳定。尽管之前的研究所研究了变换器的不同缩放领域,但迄今为止,关于如何初始化变换器的渐近精确、到常数的处方仍然缺乏。在这里,我们提供了一种分析深度变换器中信号通过自注意力、层归一化、跳跃连接和MLP传播的理论。我们的理论产生了一种简单的算法,用于计算训练性图,以确定给定架构的正确初始化超参数选择。我们通过建立与统计物理中随机能模型的正式平行,克服了处理自注意力层的关键挑战。我们还分析了反向路径中的梯度,并确定了梯度在初始化时消失的区域。我们通过三个案例研究展示了我们框架的通用性。我们的理论框架为自注意力的两种失效模式提供了统一的视角,并对权重和残差连接的尺度提供了定量预测,以确保平稳训练。

英文摘要

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

2505.20349 2026-05-22 physics.flu-dyn cs.LG 版本更新

FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation

FD-Bench: 一种模块化且公平的用于数据驱动流体模拟的基准测试

Haixin Wang, Ruoyan Li, Fred Xu, Fang Sun, Kaiqiao Han, Zijie Huang, Ching Chang, Xiao Luo, Wei Wang, Yizhou Sun

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Meta University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出FD-Bench,一个模块化、公平、全面且可重复的数据驱动流体模拟基准测试,通过统一的实验设置评估85个基线模型,解决可重复性和可比性问题,为未来数据驱动流体模型的稳健评估奠定基础。

Comments 32 pages, 20 figures, paper accepted by KDD 2026

详情
AI中文摘要

数据驱动的流体动力学建模随着神经PDE求解器的快速发展而迅速进步,但公平且强大的基准测试仍然碎片化,由于缺乏统一的PDE数据集和标准化的评估协议。尽管架构创新丰富,但公平评估进一步受制于空间、时间和损失模块之间缺乏明确分离。在本文中,我们引入FD-Bench,这是首个公平、模块化、全面且可重复的数据驱动流体模拟基准测试。FD-Bench在统一的实验设置下系统评估了85个基线模型,涵盖10种代表性流场场景。它提供了四个关键贡献:(1) 模块化设计,使空间、时间和损失函数模块之间能够公平比较;(2) 首个系统框架,用于与传统数值求解器的直接比较;(3) 在不同分辨率、初始条件和时间窗口下的细粒度泛化分析;(4) 用户友好的、可扩展的代码库,以支持未来研究。通过严谨的实证研究,FD-Bench建立了迄今为止最全面的排行榜,解决了长期存在的可重复性和可比性问题,并为未来数据驱动流体模型的稳健评估奠定了基础。代码已开源在https://github.com/WillDreamer/FD-Bench。

英文摘要

Data-driven modeling of fluid dynamics has advanced rapidly with neural PDE solvers, yet a fair and strong benchmark remains fragmented due to the absence of unified PDE datasets and standardized evaluation protocols. Although architectural innovations are abundant, fair assessment is further impeded by the lack of clear disentanglement between spatial, temporal and loss modules. In this paper, we introduce FD-Bench, the first fair, modular, comprehensive and reproducible benchmark for data-driven fluid simulation. FD-Bench systematically evaluates 85 baseline models across 10 representative flow scenarios under a unified experimental setup. It provides four key contributions: (1) a modular design enabling fair comparisons across spatial, temporal, and loss function modules; (2) the first systematic framework for direct comparison with traditional numerical solvers; (3) fine-grained generalization analysis across resolutions, initial conditions, and temporal windows; and (4) a user-friendly, extensible codebase to support future research. Through rigorous empirical studies, FD-Bench establishes the most comprehensive leaderboard to date, resolving long-standing issues in reproducibility and comparability, and laying a foundation for robust evaluation of future data-driven fluid models. The code is open-sourced at https://github.com/WillDreamer/FD-Bench.

2505.15844 2026-05-22 q-bio.QM cs.LG stat.AP 版本更新

Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy

通过一种新型混合架构和特征选择协同效应推进表格性中风建模

Yousuf Islam, Md. Jalal Uddin Chowdhury, Sumon Chandra Das

发表机构 * Department of Computer Science and Engineering, Leading University, Sylhet 3112, Bangladesh(计算机科学与工程系,领先大学,锡尔赫特 3112,孟加拉国) DeepNet Research and Development Lab, Sylhet 3100, Bangladesh(深网研究与发展实验室,锡尔赫特 3100,孟加拉国)

AI总结 本文提出了一种数据驱动且可解释的机器学习框架,利用十种常规获取的人口统计学、生活方式和临床变量,通过详尽的探索性数据分析、数据预处理和特征选择,构建出一个准确率达到97.2%的中风风险评估模型,显著优于现有模型。

详情
Journal ref
IEEE Conference Publication, 2025
AI中文摘要

脑中风仍然是全球死亡和残疾的主要原因之一,但大多数表格数据预测模型仍低于95%的准确率阈值,限制了实际应用。为解决这一差距,本文开发并验证了一个完全数据驱动且可解释的机器学习框架,旨在使用来自4981条记录的公共队列中十种常规获取的人口统计学、生活方式和临床变量来预测中风。我们通过详尽的探索性数据分析(EDA)来理解数据集的结构和分布,随后进行严格的数据预处理,包括处理缺失值、去除异常值和使用合成少数类过采样技术(SMOTE)纠正类别不平衡。为了简化特征选择,使用了点二列相关性和随机森林Gini重要性,并利用分层五折交叉验证优化了包括树集成、提升、核方法和多层神经网络在内的十种不同算法。它们基于概率的预测帮助我们构建了所提出的模型,包括随机森林、XGBoost、LightGBM和一个支持向量分类器,其中逻辑回归作为元学习器。所提出的模型实现了97.2%的准确率和97.15%的F1分数,表明其显著优于领先的单个模型LightGBM,其准确率为91.4%。本研究的结果表明,严格的预处理与多样化的混合模型相结合,可以将低成本的表格数据转化为几乎临床级别的中风风险评估工具。

英文摘要

Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. Their predictions based on probabilities helped us build the proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.

2503.06115 2026-05-22 stat.ML cs.IT cs.LG math.IT math.PR 版本更新

On Statistical Estimation of Edge-Reinforced Random Walks

关于边缘增强随机游走的统计估计

Qinghua Ding, Venkat Anantharam

发表机构 * Department of Electrical Engineering and Computer Sciences(电气工程与计算机科学系) University of California at Berkeley(加州大学伯克利分校)

AI总结 本文研究了边缘增强随机游走初始边权重的统计估计问题,利用随机环境中的双曲高斯结构来分析估计器的样本复杂性。

Comments This is the full version of the conference paper in submission to ISIT 2025

详情
AI中文摘要

增强型随机游走(RRWs),包括顶点增强随机游走(VRRWs)和边缘增强随机游走(ERRWs),是一种随机游走模型,其转移概率根据先前访问历史演变~\cite{mgr, fmk, tarres, volkov}。这些模型已在网络表示学习~\cite{xzzs}、增强型PageRank~\cite{gly}和动物行为建模~\cite{smouse}等领域得到应用。然而,对RRW参数的统计估计仍不充分。本文聚焦于利用观测轨迹数据估计ERRW的初始边权重。通过利用ERRW与随机环境中的随机游走(RWRE)~\cite{mr, mr2}之间的联系,即所谓的“magic formula”,我们提出了一种基于广义矩方法的估计器。为分析该估计器的样本复杂度,我们利用随机环境中蕴含的双曲高斯结构来界定底层随机边电导的波动。

英文摘要

Reinforced random walks (RRWs), including vertex-reinforced random walks (VRRWs) and edge-reinforced random walks (ERRWs), model random walks where the transition probabilities evolve based on prior visitation history~\cite{mgr, fmk, tarres, volkov}. These models have found applications in various areas, such as network representation learning~\cite{xzzs}, reinforced PageRank~\cite{gly}, and modeling animal behaviors~\cite{smouse}, among others. However, statistical estimation of the parameters governing RRWs remains underexplored. This work focuses on estimating the initial edge weights of ERRWs using observed trajectory data. Leveraging the connections between an ERRW and a random walk in a random environment (RWRE)~\cite{mr, mr2}, as given by the so-called "magic formula", we propose an estimator based on the generalized method of moments. To analyze the sample complexity of our estimator, we exploit the hyperbolic Gaussian structure embedded in the random environment to bound the fluctuations of the underlying random edge conductances.
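For readers unfamiliar with the model, the sketch below simulates a linearly edge-reinforced random walk: the walker picks an incident edge with probability proportional to its current weight, and the traversed edge's weight then increases. It only generates the kind of trajectory data the abstract refers to; it does not implement the paper's estimator. The unit reinforcement increment, the triangle graph, and the function name are assumptions made for illustration.

```python
import random


def simulate_errw(initial_weights, start, steps, seed=0, reinforcement=1.0):
    """Simulate a linearly edge-reinforced random walk on an undirected graph.

    `initial_weights` maps frozenset({u, v}) to a positive initial edge weight
    (no self-loops). Returns the visited vertices and the final edge weights.
    """
    rng = random.Random(seed)
    weights = dict(initial_weights)
    path = [start]
    current = start
    for _ in range(steps):
        incident = [edge for edge in weights if current in edge]
        chosen = rng.choices(incident, weights=[weights[e] for e in incident])[0]
        weights[chosen] += reinforcement  # reinforce the edge just traversed
        (current,) = chosen - {current}   # move to the other endpoint
        path.append(current)
    return path, weights


# Hypothetical triangle graph with equal initial edge weights.
triangle = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"b", "c"}): 1.0,
    frozenset({"a", "c"}): 1.0,
}
path, final_weights = simulate_errw(triangle, start="a", steps=50)
```

The estimation problem in the abstract runs in the opposite direction: given paths like `path`, recover the initial weights.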

2502.21194 2026-05-22 stat.ML cs.LG 版本更新

Prior shift estimation for positive unlabeled data through the lens of kernel embedding

通过核嵌入视角对正样本无标签数据的先验偏移估计

Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre

发表机构 * Polish Academy of Sciences(波兰科学院) Nicolaus Copernicus University(尼古拉·哥白尼大学) Warsaw University of Technology(华沙理工大学)

AI总结 本文研究了在目标无标签样本的先验分布估计问题,假设其可能与源群体不同,并且源数据部分可观察:只有正类样本和整个群体的样本可用(PU学习场景)。提出了一种新的直接估计先验分布的方法,避免了对两个群体后验概率的估计,并具有简单的几何解释。该方法基于分布匹配技术与再生核希尔伯特空间中的核嵌入,并作为优化任务的显式解获得。建立了其渐近一致性以及对其与未知先验偏差的显式非渐近界,该界在实践中可计算。通过合成和实际数据研究有限样本行为,证明该方法在性能上与竞争对手相当或更优。

详情
AI中文摘要

我们研究了在目标无标签样本的先验分布估计问题,假设其可能与源群体不同,并且源数据部分可观察:只有正类样本和整个群体的样本可用(PU学习场景)。我们引入了一种新的直接估计先验分布的方法,避免了对两个群体后验概率的估计,并具有简单的几何解释。它基于分布匹配技术以及再生核希尔伯特空间中的核嵌入,并作为优化任务的显式解获得。我们建立了其渐近一致性以及对其与未知先验偏差的显式非渐近界,该界在实践中可计算。我们研究了合成和实际数据的有限样本行为,并证明该方法在性能上与竞争对手相当或更优。

英文摘要

We study estimation of a class prior for unlabeled target samples which possibly differs from that of source population. Moreover, it is assumed that the source data is partially observable: only samples from the positive class and from the whole population are available (PU learning scenario). We introduce a novel direct estimator of a class prior which avoids estimation of posterior probabilities in both populations and has a simple geometric interpretation. It is based on a distribution matching technique together with kernel embedding in a Reproducing Kernel Hilbert Space and is obtained as an explicit solution to an optimisation task. We establish its asymptotic consistency as well as an explicit non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. We study finite sample behaviour for synthetic and real data and show that the proposal works consistently on par or better than its competitors.

2502.09487 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Internal narratives parameterise affective states

内部叙事参数化情感状态

Jakub Onysk, Quentin J. M. Huys

发表机构 * Applied Computational Psychiatry Lab(应用计算精神病学实验室) Max Planck UCL Centre for Computational Psychiatry and Ageing Research(马克斯·普朗克UCL计算精神病学与衰老研究中心) Queen Square Institute of Neurology and Mental Health(女王广场神经病学与心理健康研究所) Neuroscience Department(神经科学系) Division of Psychiatry(精神病学系)

AI总结 本文通过量化参与者内部叙事的大语言模型表示及其子空间,研究了叙事与情感状态之间的关系,发现特定症状的描述性思维能够预测标准化的抑郁评分,并强调保持症状间的协方差对构建效度至关重要。

详情
AI中文摘要

描述我们如何用语言表达感受对于心理评估和干预至关重要,但叙事与情感状态之间的映射仍然理解不足。在两个大规模研究(n=1257)中,我们通过大语言模型表示及其子空间量化了参与者内部叙事的结构和动态,以参数化抑郁状态。在第一项研究中,我们发现对特定症状的描述性思维捕捉了预测标准化、自我报告抑郁评分的细粒度信息。关键的是,我们显示保持症状之间的特定协方差对于构效效度至关重要,这表明高维文本表示镜像了疾病的潜在几何结构。第二项研究探讨了这种关系的时间动态,当参与者与情感叙事互动时。我们发现量化内部叙事的变化导致自我报告的变化,而基线叙事严重性预测了后续情感变化的幅度。通过将情感视为计算状态,我们的结果强调了其核心、治疗相关功能:约束内部叙事的结构并整合上下文以塑造自我报告。

英文摘要

Characterising how we verbalise our feelings is central to psychological assessment and intervention, yet the mapping between narrative and affective state remains poorly understood. Across two large studies (n=1257), we parameterised the structure and dynamics of depressive states by quantifying participants' internal narratives through large-language-model representations and their subspaces. In Study 1, we found verbal descriptions of symptom-specific thoughts captured granular information predictive of standardised, self-reported depression scores. Critically, we show preserving the specific covariance between symptoms is essential for construct validity, suggesting high-dimensional text representations mirror the latent geometry of the disorder. Study 2 probed the temporal dynamics of this relationship as participants engaged with emotional narratives. We found quantified changes in internal narratives led to changes in self-report, while the baseline narrative severity predicted the magnitude of subsequent affective change. By framing affect as a computational state, our results highlight its core, therapeutically pertinent functions: constraining the structure of internal narratives and integrating context to shape self-report.

2501.00677 2026-05-22 cs.LG cs.CV cs.IT cs.NA math.IT math.NA stat.ML 版本更新

Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

深度学习鲁棒矩阵补全用于大规模低秩数据恢复

HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin

Affiliations: School of Data, Mathematical, and Statistical Sciences and the Department of Computer Science, University of Central Florida; School of Data, Mathematical, and Statistical Sciences, University of Central Florida; Damo Academy, Alibaba US

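AI summary: This paper proposes Learned Robust Matrix Completion (LRMC), a scalable and learnable non-convex method for large-scale robust matrix completion. LRMC has low computational complexity and linear convergence, its free parameters are learned effectively via deep unfolding to achieve optimal performance, and its empirical performance is validated on synthetic datasets and real applications.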

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(6): 6541-6556, 2026
AI中文摘要

鲁棒矩阵补全(RMC)是一种广泛使用的机器学习工具,同时解决低秩数据分析中的两个关键问题:缺失数据条目和极端异常值。本文提出了一种新颖的可扩展且可学习的非凸方法,称为学得鲁棒矩阵补全(LRMC),用于大规模RMC问题。LRMC具有低计算复杂度和线性收敛性。受所提出定理的启发,LRMC的自由参数可通过深度展开有效学习以达到最佳性能。此外,本文提出了一种灵活的前馈-递归-混合神经网络框架,将深度展开从固定次数迭代扩展到无限次数迭代。通过在合成数据集和实际应用中的广泛实验,验证了LRMC的优越的实验性能,包括视频背景减除、超声成像、面部建模和卫星图像云去除。

英文摘要

Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimal performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinitely many iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art methods on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

2411.02813 2026-05-22 cs.LG 版本更新

Sparse Orthogonal Parameters Tuning for Continual Learning

稀疏正交参数调优用于持续学习

Kun-Peng Ning, Hai-Jian Ke, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Li Yuan

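AI summary: This paper proposes SoTU, a new method that tackles catastrophic forgetting in continual learning through sparse orthogonal parameter tuning and achieves optimal feature representation for streaming data.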

详情
AI中文摘要

基于预训练模型(PTM)的持续学习方法近年来引起了广泛关注,这些方法能够适应连续的下游任务而无需灾难性遗忘。这些方法通常不更新预训练参数,而是使用额外的适配器、提示和分类器。在本文中,我们从新的角度研究了稀疏正交参数对持续学习的益处。我们发现,合并来自多个流任务的模型所学习的稀疏正交性在解决灾难性遗忘方面具有巨大潜力。利用这一见解,我们提出了一种新颖且有效的称为SoTU(稀疏正交参数调优)的方法。我们假设SoTU的有效性在于将多个领域学到的知识转换为正交delta参数的融合。在多样化的CL基准测试中评估了所提出的方法的有效性。值得注意的是,SoTU在不需要复杂分类器设计的情况下实现了流数据的最优特征表示,使其成为一种即插即用的解决方案。

英文摘要

Continual learning methods based on pre-trained models (PTMs), which adapt to successive downstream tasks without catastrophic forgetting, have recently gained attention. These methods typically refrain from updating the pre-trained parameters and instead employ additional adapters, prompts, and classifiers. In this paper, we investigate, from a novel perspective, the benefit of sparse orthogonal parameters for continual learning. We find that merging the sparse orthogonal parameters of models learned from multiple streaming tasks has great potential in addressing catastrophic forgetting. Leveraging this insight, we propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning). We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters. Experimental evaluations on diverse CL benchmarks demonstrate the effectiveness of the proposed approach. Notably, SoTU achieves optimal feature representation for streaming data without necessitating complex classifier designs, making it a plug-and-play solution.
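The idea of fusing sparse delta parameters can be illustrated with a toy sketch. The abstract does not say how the deltas are made sparse or how they are merged, so the random masking (`sparsify`) and the additive `merge` below are assumptions made for illustration, and all function names are hypothetical rather than taken from the SoTU code.

```python
import random

def delta(finetuned, pretrained):
    """Delta parameters: what fine-tuning changed relative to the pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def sparsify(d, keep_prob, rng):
    """Assumed sparsification: keep each entry with probability keep_prob, zero the rest."""
    return [x if rng.random() < keep_prob else 0.0 for x in d]

def overlap(a, b):
    """Number of positions where both sparse deltas are non-zero."""
    return sum(1 for x, y in zip(a, b) if x != 0.0 and y != 0.0)

def merge(pretrained, deltas):
    """Assumed fusion rule: add every task delta onto the shared pre-trained weights."""
    merged = list(pretrained)
    for d in deltas:
        merged = [m + x for m, x in zip(merged, d)]
    return merged
```

When two highly sparse deltas share few non-zero positions, their inner product is close to zero, so adding them onto the same pre-trained weights leaves each task's update largely undisturbed.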

2411.02776 2026-05-22 cs.LG stat.AP 版本更新

Deep learning-based modularized loading protocol for parameter estimation of Bouc-Wen class models

基于深度学习的模块化加载协议用于Bouc-Wen类模型参数估计

Sebin Oh, Junho Song, Taeyong Kim

Affiliations: Department of Civil and Environmental Engineering, University of California, Berkeley, CA, United States; Department of Civil Systems Engineering, Ajou University, Suwon, Republic of Korea

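AI summary: This paper proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen class models. The protocol has two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules for distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), which makes the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these behaviors. Training them on diverse loading histories identifies minimal loading sequences, termed loading history modules, which are combined into an optimal loading history, and the three trained CNN models serve as rapid parameter estimators. Numerical evaluation, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, shows that the protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.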

详情
Journal ref
Engineering Structures 339, 120458 (2025)
AI中文摘要

本研究提出了一种模块化的深度学习基于加载协议,用于Bouc-Wen(BW)类模型的最佳参数估计。该协议由两个关键组成部分组成:最佳加载历史构建和基于CNN的快速参数估计。每个组成部分被分解为独立的子模块,针对不同的滞回行为——基本滞回、结构退化和咬合效应——使协议能够适应多种滞回模型。开发了三种独立的CNN架构以捕捉这些滞回行为的路径依赖性。通过在多样化的加载历史上训练这些CNN架构,识别出最小的加载序列,称为加载历史模块,然后将其组合以构建最优的加载历史。三种训练好的CNN模型,分别在相应的加载历史模块上训练,用作快速参数估计器。协议的数值评估,包括三栋钢结构框架的非线性时间历史分析和三栋钢筋混凝土框架的脆弱性曲线构建,表明所提出的协议显著减少了总分析时间,同时保持或提高了估计精度。所提出的协议可以扩展到其他滞回模型,表明了一种系统的方法来识别通用滞回模型。

英文摘要

This study proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen (BW) class models. The protocol consists of two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these hysteretic behaviors. By training these CNN architectures on diverse loading histories, minimal loading sequences, termed loading history modules, are identified and then combined to construct an optimal loading history. The three CNN models, trained on the respective loading history modules, serve as rapid parameter estimators. Numerical evaluation of the protocol, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, demonstrates that the proposed protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The proposed protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.
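For readers unfamiliar with the model class, the sketch below simulates the basic Bouc-Wen hysteresis (no degradation or pinching) under a prescribed displacement history, using the commonly cited form dz/dt = A·dx/dt − β·|dx/dt|·|z|^(n−1)·z − γ·dx/dt·|z|^n with restoring force f = α·k·x + (1 − α)·k·z. The abstract does not state the equations or the parameters the protocol estimates, so this form, the forward-Euler integration, the function name, and the default parameter values are all assumptions for illustration.

```python
import math

def bouc_wen_force(x_hist, dt, k=1.0, alpha=0.1, A=1.0, beta=0.5, gamma=0.5, n=1.0):
    """Restoring-force history of a basic Bouc-Wen element (assumes n >= 1).

    x_hist: displacement samples spaced dt apart.
    Returns one force value per displacement sample.
    """
    z = 0.0                      # hysteretic internal variable
    prev_x = x_hist[0]
    forces = []
    for x in x_hist:
        dx = (x - prev_x) / dt   # backward-difference velocity
        dz = (A * dx
              - beta * abs(dx) * abs(z) ** (n - 1) * z
              - gamma * dx * abs(z) ** n)
        z += dz * dt             # forward-Euler update of z
        forces.append(alpha * k * x + (1 - alpha) * k * z)
        prev_x = x
    return forces

# Two cycles of sinusoidal loading; the force at x = 0 depends on the loading direction.
dt = math.pi / 1000
x_hist = [2.0 * math.sin(i * dt) for i in range(4001)]
forces = bouc_wen_force(x_hist, dt)
```

The force at zero displacement is negative while unloading and positive while reloading, which is the path dependence that the loading histories have to excite for the parameters to be identifiable.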

2411.01332 2026-05-22 cs.LG cs.AI 版本更新

A Mechanistic Explanatory Strategy for XAI

为XAI的解释性策略机制

Marcin Rabiza

Affiliations: Institute of Philosophy and Sociology, Polish Academy of Sciences; Institute for Philosophy, Leiden University

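AI summary: This paper proposes a mechanistic explanatory strategy that uses decomposition, localization, and recomposition to reveal the mechanisms behind the functional organization of deep learning systems, with the aim of improving the theoretical foundations and practical application of explainable AI.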

详情
AI中文摘要

尽管在XAI领域取得了显著进展,学者们指出缺乏坚实的理论基础和与更广泛科学解释 discourse 的整合仍是持续存在的问题。为此,新兴研究借鉴了各种科学和科学哲学文献中的解释策略来填补这些空白。本文概述了一种用于解释深度学习系统功能组织的机制性策略,将近期的可解释人工智能发展置于更广泛的哲学背景下。根据机制方法,对不透明AI系统的解释涉及识别驱动决策的机制。对于深度神经网络,这意味着辨别功能相关组件,如神经元、层、电路或激活模式,并通过分解、定位和重组来理解其作用。图像识别和语言模型的证明原理案例研究将这些理论方法与OpenAI和Anthropic的机制可解释性研究相结合。研究结果表明,追求机制性解释可以揭示传统可解释性技术可能忽略的元素,最终促进更彻底的可解释人工智能。

英文摘要

Despite significant advancements in XAI, scholars note a persistent lack of solid conceptual foundations and integration with broader scientific discourse on explanation. In response, emerging research draws on explanatory strategies from various sciences and the philosophy of science literature to fill these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent developments in explainable AI within a broader philosophical context. According to the mechanistic approach, the explanation of opaque AI systems involves identifying mechanisms that drive decision making. For deep neural networks, this means discerning functionally relevant components, such as neurons, layers, circuits, or activation patterns, and understanding their roles through decomposition, localization, and recomposition. Proof-of-principle case studies from image recognition and language modeling align these theoretical approaches with mechanistic interpretability research from OpenAI and Anthropic. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook, ultimately contributing to more thoroughly explainable AI.

2410.04753 2026-05-22 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver: Agent-Based Automated Proof Optimization

ImProver:基于代理的自动证明优化

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

Affiliations: Carnegie Mellon University

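AI summary: This paper studies automated proof optimization and proposes ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary criteria such as length or readability. Experiments show that it makes proofs substantially shorter, more modular, and more readable.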

Comments Published as a conference paper at ICLR 2025

详情
AI中文摘要

大型语言模型(LLMs)已被用于在证明助手如Lean中生成数学定理的正式证明。然而,我们通常希望根据不同的下游用途优化正式证明,例如使其符合某种风格、易于阅读、简洁或模块化。适当优化的证明对于学习任务也非常重要,尤其是因为人工撰写的证明可能不适用于此目的。为此,我们研究了一个新的问题:自动证明优化,即重写证明以使其正确并优化任意标准,如长度或可读性。作为自动证明优化的一种初步方法,我们提出了ImProver,这是一个能够重写证明以优化任意用户定义指标的大型语言模型代理。我们发现直接应用LLMs进行证明优化效果有限,并在ImProver中引入了各种改进,例如新颖的链式状态技术中的符号Lean上下文使用,以及错误校正和检索。我们测试了ImProver在重写真实世界中的本科、竞赛和研究级数学定理方面的性能,发现ImProver能够重写证明使其显著更短、更模块化和更易读。

英文摘要

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME 版本更新

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

增强大语言模型中的因果推理:一种用于精确微调的因果归因模型

Hengrui Cai, Shengjie Liu, Rui Song

Affiliations: University of California, Irvine; North Carolina State University; Amazon

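AI summary: This paper proposes a causal attribution model that improves the interpretability and causal reasoning of large language models through precise fine-tuning, and demonstrates its effectiveness on causal discovery tasks across domains.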

Comments A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM

详情
AI中文摘要

本文介绍了一种因果归因模型,旨在通过精确微调增强大语言模型(LLMs)的可解释性并提高其因果推理能力。尽管LLMs在多种任务中表现出色,但其推理过程往往仍是一个黑箱,限制了有针对性的增强。我们提出了一种新的因果归因模型,利用“do-运算符”构建干预场景,使我们能够系统地量化LLMs因果推理过程中不同组件的贡献。通过在各种领域中进行因果发现任务来评估所提出的归因分数,我们证明了LLMs在因果发现中的有效性严重依赖于提供的上下文和领域特定知识,但也可以利用数值数据进行有限的相关性推理,而非因果性。这促使了所提出的微调LLM用于成对因果发现,有效且正确地利用了知识和数值信息。

英文摘要

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components in LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery heavily relies on provided context and domain-specific knowledge, but that LLMs can also utilize numerical data through limited calculations of correlation, not causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively and correctly leveraging both knowledge and numerical information.

2311.04938 2026-05-22 cs.CV cs.AI cs.LG 版本更新

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

改进的DDIM采样与矩匹配高斯混合模型

Prasad Gabbur

Affiliations: Independent Researcher; Apple

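AI summary: This paper proposes using a Gaussian mixture model (GMM) as the reverse transition operator in the DDIM framework, constraining the GMM parameters to match the moments of the DDPM forward marginals. This improves sample quality when few sampling steps are used; in experiments the GMM kernel outperforms the standard Gaussian kernel on FID and IS.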

Comments 34 pages, 12 figures; Accepted to TMLR; Code open sourced

详情
Journal ref
Transactions on Machine Learning Research, 05/2026
AI中文摘要

我们提出在去噪扩散隐式模型(DDIM)框架中使用高斯混合模型(GMM)作为反向转换操作符(核),这是用于加速从预训练去噪扩散概率模型(DDPM)采样的最广泛使用的 approaches 之一。具体而言,我们通过约束GMM参数来匹配DDPM前向边缘的一阶和二阶中心矩。我们发现矩匹配足以获得与原始DDIM高斯核相等或更好的样本质量。我们分别在无条件模型(训练于CelebAHQ和FFHQ)、类条件模型(训练于ImageNet)以及使用Stable Diffusion v2.1在COYO700M数据集上进行文本到图像生成实验。我们的结果表明,当采样步骤数较小时,使用GMM核可显著提升生成样本的质量,如在ImageNet 256x256上,使用10个采样步骤时,GMM核的FID为6.94,IS为207.85,而高斯核分别为10.15和196.73。此外,我们还为修正流匹配模型推导了新的SDE采样器,并对所提出的方法进行了实验。我们发现使用1-修正流和2-修正流模型均有所改进。代码:https://github.com/pgabbur/ddim-gmm。

英文摘要

We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on the COYO700M dataset. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example, on ImageNet 256x256, using 10 sampling steps, we achieve an FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.
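The moment-matching constraint can be illustrated in one dimension. For a mixture with weights π_k, means μ_k, and a shared component variance s², the overall mean is Σ π_k μ_k and the overall variance is s² + Σ π_k (μ_k − mean)². The sketch below picks mixture parameters so that both equal a given target. The 1-D setting, the shared variance, the user-chosen `offsets`, and the function names are assumptions for illustration; the paper's actual parameterisation may differ.

```python
def moment_matched_gmm(mean, var, weights, offsets):
    """Build a 1-D GMM whose overall mean and variance equal (mean, var).

    weights: positive mixture weights (normalised here).
    offsets: desired displacements of the component means; they are re-centred
             so that their weighted sum is zero, which preserves the mean.
    Returns (weights, component means, shared component variance).
    """
    total = sum(weights)
    w = [x / total for x in weights]
    centre = sum(wi * oi for wi, oi in zip(w, offsets))
    d = [oi - centre for oi in offsets]
    spread = sum(wi * di * di for wi, di in zip(w, d))  # variance carried by the means
    comp_var = var - spread                             # remainder goes to the components
    if comp_var <= 0:
        raise ValueError("offsets too large for the target variance")
    return w, [mean + di for di in d], comp_var

def gmm_moments(w, means, comp_var):
    """Overall mean and variance of a 1-D GMM with a shared component variance."""
    m = sum(wi * mi for wi, mi in zip(w, means))
    v = comp_var + sum(wi * (mi - m) ** 2 for wi, mi in zip(w, means))
    return m, v

w, means, comp_var = moment_matched_gmm(0.5, 2.0, [0.3, 0.7], [1.0, -0.2])
m, v = gmm_moments(w, means, comp_var)  # recovers the target mean and variance
```

The mixture keeps the same first two moments as the single Gaussian it replaces, while the offsets and weights remain free to shape the kernel.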

2306.05905 2026-05-22 cs.LG math.OC 版本更新

TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

TreeDQN: 一种用于组合优化的高效离策略强化学习方法

D. Sorokin, A. Kostin, L. Savchenko, G. Gusev, A. V. Savchenko

Affiliations: Sber AI Lab; Laboratory for Theoretical Foundations of AI Models, HSE University

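AI summary: TreeDQN improves the sample efficiency of off-policy reinforcement learning on combinatorial optimization tasks by optimizing the geometric mean of expected return, and performs strongly on synthetic tasks and in the ML4CO competition.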

Comments Accepted in Knowledge-Based Systems

详情
AI中文摘要

解决组合优化任务的一种方便方法是分支定界法。其分支启发式可以学习以解决大量相似任务。在这里取得的有希望的结果是通过最近出现的基于树马尔可夫决策过程的在线策略强化学习方法实现的。为了克服其主要缺点,即训练时间非常大和不稳定,我们提出了TreeDQN(树深度Q网络),一种样本效率高的离策略RL方法,通过优化预期回报的几何平均来训练。为了理论支持我们的方法的训练过程,我们证明了树MDP中Bellman算子的收缩性质。结果表明,我们的方法所需的训练数据最多减少10倍,并在合成任务上比已知的在线策略方法运行更快。此外,TreeDQN在ML4CO竞赛中的挑战性实际任务上显著优于最先进的技术。

英文摘要

A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. Promising results have been achieved by a recent on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very long training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.
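The difference between a geometric and an arithmetic mean can be seen on a small example. The geometric mean is the exponential of the mean logarithm, so a single very large value moves it far less than it moves the arithmetic mean. The tree sizes below are made-up illustrative numbers, and treating the return as a positive Branch-and-Bound tree size is an assumption; the abstract only states that the geometric mean of expected return is optimized.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """exp of the mean log; requires strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical tree sizes over four instances, one of them much harder than the rest.
tree_sizes = [10, 10, 10, 10000]
am = arithmetic_mean(tree_sizes)  # dominated by the single hard instance
gm = geometric_mean(tree_sizes)   # equals 10 ** 1.75, far closer to the typical case
```

Averaging in log space keeps one outlier instance from dominating the training signal.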