arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA version update

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

MUSE-Autoskill: 通过技能创建、记忆、管理和评估实现自我进化智能体

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

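Affiliations: ByteDance Inc.; Rochester Institute of Technology

AI summary: Proposes the MUSE-Autoskill framework, which lets LLM agents continuously improve their task-solving ability through a unified skill lifecycle (creation, memory, management, evaluation, and refinement). Experiments indicate that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer.

Comments: 30 pages, 8 figures, 13 tables, work in progress

AI-generated Chinese abstract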

大型语言模型(LLM)智能体依赖可复用技能来解决复杂任务。然而,现有的技能创建方法将技能视为孤立和静态的工件,限制了其可复用性、可靠性和长期改进。我们提出了MUSE-Autoskill智能体(记忆利用技能进化),一个以技能为中心的智能体框架,让智能体通过统一的技能生命周期(创建、记忆、管理、评估和优化)持续提升任务解决能力。我们的框架使智能体能够按需创建技能,跨任务存储和复用技能,高效组织和选择技能,并通过单元测试和运行时反馈评估技能以进行持续优化。我们进一步引入了技能级记忆,为每个技能跨任务积累经验,从而实现更有效的复用和随时间适应。在SkillsBench上的实验提供了初步证据,表明生命周期管理的技能可以提高任务成功率、效率、复用性和跨智能体迁移,突出了将技能视为长期存在、具有经验意识和可测试资产的重要性。

English abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2605.27358 2026-05-27 cs.LG cs.AI cs.CL version update

MobileMoE: Scaling On-Device Mixture of Experts

MobileMoE: 扩展设备端混合专家模型

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

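Affiliations: Meta AI

AI summary: Targets on-device deployment with MobileMoE, a family of MoE language models with sub-billion active parameters. Through joint architecture optimization and four-stage training, it matches or exceeds leading dense and MoE models on 14 benchmarks and runs efficiently on smartphones.

AI-generated Chinese abstract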

混合专家(MoE)已成为千亿参数语言模型的事实标准架构,但其在十亿以下规模用于设备端部署的优势尚未得到充分探索。为弥补这一差距,我们提出MobileMoE,一系列设备端MoE语言模型,具有子十亿激活参数(0.3-0.9B激活,1.3-5.3B总参数),为设备端LLM建立了新的帕累托前沿。我们首先制定了一个设备端MoE缩放定律,在移动内存和计算约束下联合优化MoE架构,识别出一个设备端最佳点——具有细粒度和共享专家的适度稀疏性——同时实现内存和计算最优。基于推导出的架构,我们采用四阶段方案训练MobileMoE,包括预训练、中期训练、指令微调和量化感知训练,全部使用开源数据集。在14个基准上,MobileMoE匹配或超越领先的设备端密集LLM,推理FLOPs减少2-4倍,并以最多60%的参数匹配或超越最先进的MoE模型OLMoE-1B-7B。为弥合移动部署的最后一步,我们提供了首个在商用智能手机上的高效MoE推理,并进行了全面的设备端性能分析。在相当的INT4权重内存下,MobileMoE-S的预填充速度比密集基线MobileLLM-Pro快1.8-3.8倍,解码速度快2.2-3.4倍。

English abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

2605.27354 2026-05-27 cs.LG cs.AI cs.CL version update

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

利用稀疏自编码器的模型内部状态指导LLM后训练数据工程

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

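Affiliations: Tsinghua University

AI summary: Proposes SAERL, a framework that uses model internals extracted with sparse autoencoders to model data diversity, difficulty, and quality for reinforcement-learning data engineering, improving accuracy and reducing training steps.

AI-generated Chinese abstract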

模型内部状态编码了大型语言模型(LLM)处理其训练数据时的丰富信息;然而,后训练数据工程主要依赖外部信号,忽略了模型内部状态中丰富的内在信号。我们提出了SAERL,一个用于LLM强化学习(RL)的数据工程框架。它使用稀疏自编码器(SAE)这一先进的机制可解释性工具提取的模型内部状态,建模三种内在数据属性:多样性、难度和质量。每个属性支撑一个具体的数据工程操作:用于批次多样性控制的SAE空间聚类与适度批次混合、用于从易到难课程排序的难度代理,以及用于数据过滤的质量探针。SAERL在Qwen2.5-Math-1.5B上相比原始GRPO平均准确率提升3.00%,并以减少20%的训练步数达到目标准确率,在模型规模和RL算法上均有一致收益。实验表明,SAE在不同模型家族和规模间有效迁移,作为一种轻量级且可重用的数据工程工具。这些结果证明,模型内部状态是后训练数据工程中强大且实用的信号来源。

English abstract

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

2605.27352 2026-05-27 cs.LG stat.ML version update

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

从分数到吉布斯校正器:加速均匀速率离散扩散模型

Yuchen Liang, Ness Shroff, Yingbin Liang

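Affiliations: The Ohio State University

AI summary: Proposes Gibbs-Accelerated Discrete Diffusion (GADD), which uses the concrete score function to construct Gibbs posterior likelihoods and accelerates sampling for uniform-rate discrete diffusion models without additional training, achieving $\mathcal{O}(\mathrm{polylog}(\varepsilon^{-1}))$ sampling complexity.

AI-generated Chinese abstract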

离散扩散模型在文本和其他符号领域取得了强大的实证表现,但特别是对于均匀速率模型,它们通常需要许多步骤才能生成单个样本。现有的加速方法要么依赖训练额外的量,要么遭受慢混合问题。在这项工作中,我们提出了一种新颖的基于吉布斯的离散扩散模型校正器,称为吉布斯加速离散扩散(GADD)。GADD利用具体分数函数的结构直接构建吉布斯后验似然,除了标准分数估计外不需要任何额外训练。我们证明GADD实现了$\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$的整体采样复杂度,为均匀速率离散扩散模型的基于扩散的采样器提供了第一个这样的速率。我们还进行了数值实验,展示了GADD在合成数据、零样本文本采样和零样本条件音乐生成中的实际优势。这些结果证实了理论,并表明GADD在样本质量和墙钟效率上始终优于标准基线,包括原始欧拉方法和CTMC校正器。除此之外,我们的理论分析引入了一个新颖的框架,用于分析离散扩散模型中的预测器-校正器方法,这可能具有独立的意义。与依赖Girsanov测度变换技术的现有方法不同,我们的方法基于一个归纳论证,该论证在考虑校正器更新不准确性的同时,跟踪预测器迭代中的误差传播。

English abstract

Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.

2605.27343 2026-05-27 cs.CV cs.LG version update

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

通过表示条件扩散模型实现可控图像生成

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

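AI summary: Conditions diffusion models on representations from a pre-trained self-supervised model to achieve controllable image generation without extensively annotated data, and explores the smoothness and disentanglement of the representation space.

AI-generated Chinese abstract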

扩散模型已成为高质量图像生成和编辑的强大工具,但引导这些模型产生特定输出仍然是一个挑战。传统方法依赖于条件机制,如文本提示或语义图,这些需要大量标注的数据集。在这项初步工作中,我们探索了以预训练自监督模型的表示为条件的扩散模型。自条件机制不仅提高了无条件图像生成的质量,还提供了一个可用于控制生成的表示空间。我们通过识别变化方向来探索这个条件空间,并展示了在平滑性和分离性方面的有前景的特性。

English abstract

Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.

2605.27316 2026-05-27 cs.LG math.OC version update

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

基于比率单调变换的概率平滑用于全局优化

Kukyoung Jang, Taehyun Cho, Junrui Zhang, Ping Xu, Kyungjae Lee

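Affiliations: Department of Statistics, Seoul, Korea University; Department of Electrical and Computer Engineering, Seoul, Seoul National University; Shandong University at Weihai, Weihai, China

AI summary: Proposes a general probabilistic smoothing framework that combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. Under mild conditions it preserves the global maximizer and comes with convergence guarantees, and experiments show improved robustness and competitive performance.

AI-generated Chinese abstract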

概率平滑是全局优化的标准工具,但现有方法依赖高斯核和特定变换,通常导致强超参数敏感性和有限的鲁棒性。我们提出一个通用平滑框架,将灵活的对称单峰核与基于单调比率的变换相结合。在温和条件下,我们证明平滑后的目标函数保持全局最大值,并且所有驻点都集中在真实最优值附近,无需递减的平滑调度。我们进一步为随机梯度上升提供了显式的复杂度界,并证明留一法基线可证明地减少方差。在高维基准测试和黑盒对抗攻击上的实验表明,该方法具有改进的鲁棒性和竞争性能。

English abstract

Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. Under mild conditions, we show that the smoothed objective preserves the global maximizer and that all stationary points concentrate near the true optimum for sufficiently large amplification, without requiring a decreasing smoothing schedule. We further provide explicit complexity bounds for stochastic gradient ascent and show that a leave-one-out baseline provably reduces variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness and competitive performance.
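
As a rough illustration of the ingredients named in this abstract, the sketch below estimates the gradient of a smoothed objective with a score-function estimator and a leave-one-out baseline, then runs stochastic gradient ascent. It is not the paper's method: the Gaussian kernel, the identity `transform`, the toy `objective`, and all hyperparameters are assumptions chosen for the example, and the paper's ratio-based transform is not reproduced.

```python
import math
import random


def objective(x):
    # Hypothetical toy objective with ripples; its global maximum is at the origin.
    return -sum(v * v for v in x) + 0.5 * sum(math.cos(3.0 * v) for v in x)


def loo_gradient(mu, sigma, n_samples, rng, transform=lambda y: y):
    """Estimate the gradient of E[transform(f(x))] for x ~ N(mu, sigma^2 I).

    Each sample's weight is centered by the mean of the OTHER samples'
    values (leave-one-out), which keeps the estimator unbiased because the
    baseline does not depend on the sample it is applied to.
    """
    dim = len(mu)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    values = [
        transform(objective([m + sigma * e for m, e in zip(mu, eps)]))
        for eps in noise
    ]
    total = sum(values)
    grad = [0.0] * dim
    for eps, value in zip(noise, values):
        baseline = (total - value) / (n_samples - 1)  # leave-one-out mean
        for d in range(dim):
            grad[d] += (value - baseline) * eps[d] / sigma
    return [g / n_samples for g in grad]


def ascend(mu, sigma=0.5, lr=0.05, steps=300, n_samples=32, seed=0):
    # Plain stochastic gradient ascent with a fixed (non-decreasing) sigma.
    rng = random.Random(seed)
    mu = list(mu)
    for _ in range(steps):
        grad = loo_gradient(mu, sigma, n_samples, rng)
        mu = [m + lr * g for m, g in zip(mu, grad)]
    return mu
```

Calling `ascend([2.0, -1.5])` moves the mean toward the origin. Replacing the identity `transform` with a monotone amplification is where a transform of the kind the abstract describes would plug in.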

2605.27309 2026-05-27 cs.LG cs.OH version update

Greening AI Inference with Accuracy and Latency-aware User Incentives

通过准确性和延迟感知的用户激励实现绿色AI推理

Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, Konstantinos Varsos, Ramin Khalili

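Affiliations: Department of Informatics, School of Information Sciences and Technology, Athens University of Economics and Business; Huawei Heisenberg Research Center, Munich, Germany

AI summary: Proposes an incentive framework based on users' valuation of inference quality and latency and on their environmental consciousness, offering a two-tier service subscription that balances carbon emissions against these QoE parameters.

Journal reference: IEEE Internet Computing, 2026

AI-generated Chinese abstract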

AI服务的广泛使用引发了对其环境可持续性的担忧,最近的研究表明AI推理的碳排放是主要贡献者。本文介绍了一个框架,基于用户对推理质量和延迟的估值以及他们的环境意识,同时考虑碳排放与这两个QoE参数之间的权衡,来设计AI推理激励。我们的方法可以适应不同的权衡,这取决于AI模型的大小和复杂性以及用于服务推理请求的资源分配。这些激励可以通过一个实用的双层级服务订阅来提供,该订阅为用户提供折扣以换取减少的碳排放。折扣服务选项使AI提供商能够在高碳强度期间以较低的质量和较高的延迟服务一定比例的推理请求。

English abstract

The widespread use of AI services has raised concerns for its environmental sustainability, towards which recent studies have identified carbon emissions of AI inference as the major contributor. This paper introduces a framework for designing AI inference incentives based on the users' valuation for inference quality and latency, together with their environmental consciousness, while accounting for the tradeoff between carbon emissions and the two QoE parameters. Our approach can accommodate different tradeoffs, that depend on the size and complexity of the AI models and the allocation of resources to serve inference requests. The incentives can be offered through a practical two-tier service subscription that offers users a discount in exchange for reduced carbon emissions. The discounted service option gives the AI provider the flexibility to serve some percentage of inference requests at a lower quality and higher latency during periods of high carbon intensity.

2605.27306 2026-05-27 cs.LG version update

Normal Guidance is what Attention Needs

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

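Affiliations: Department of Computer Science

AI summary: Proposes Normal Guidance, a regularization technique that enables attention-based multiple instance learning methods to surpass existing methods at slice-level localization in 3D medical images while remaining competitive at whole-scan classification.

AI-generated Chinese abstract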

我们考虑仅使用整个体积的一个二元标签(而不是每个2D切片的标签)来训练3D医学图像的分类器。在这种弱监督设置下,我们能否学习准确的切片级预测分类器?基于注意力的多实例学习(MIL)可以为每个切片生成注意力分数。然而,最近的研究表明,一个忽略图像内容的简单中心聚焦基线在3D脑部扫描的切片级分类上可以胜过基于注意力和基于Transformer的MIL。我们证明该基线在胸部和腹部CT扫描的切片级分类上也优于现有的MIL。受此基线启发,我们提出了Normal Guidance,一种正则化技术,鼓励学习的注意力分布遵循钟形曲线。在三个总计超过400万张2D切片的医学影像数据集上,我们展示了Normal Guidance使基于注意力和基于Transformer的MIL方法在切片级定位上显著优于现有技术,同时在全扫描分类上保持竞争力。

English abstract

We consider training classifiers for 3D medical images using only one binary label for the entire volume rather than a label for each 2D slice. In such weakly supervised settings, can we learn accurate classifiers for slice-level predictions? Attention-based multiple instance learning (MIL) can produce an attention score for every slice. Yet recent work demonstrates that a simple center-focused baseline that ignores image content can outperform attention-based and transformer-based MIL at slice-level classification of 3D brain scans. We show this baseline also outperforms existing MIL at slice-level classification of thoracic and abdominal CT scans. Motivated by this baseline, we propose Normal Guidance, a regularization technique that encourages the learned attention distribution to follow a bell-shaped curve. Across three medical imaging datasets totaling over 4 million 2D slices, we show our Normal Guidance enables attention-based and transformer-based MIL methods to deliver significantly better slice-level localization than the state-of-the-art while remaining competitive at whole-scan classification.
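
One way to picture a regularizer that "encourages the learned attention distribution to follow a bell-shaped curve" is a divergence between the attention weights over slices and a discretized Gaussian over slice indices. The sketch below does exactly that and nothing more. The KL form, the default `center` and `width`, and the function names are all assumptions for illustration; the abstract does not state the paper's actual loss.

```python
import math


def softmax(scores):
    # Numerically stable softmax over per-slice attention scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bell_prior(n_slices, center, width):
    # Discretized, normalized Gaussian over slice indices 0..n_slices-1.
    weights = [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_slices)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def bell_penalty(scores, center=None, width=None):
    # Hypothetical regularization term: zero when attention matches the bell curve.
    n = len(scores)
    center = (n - 1) / 2 if center is None else center
    width = n / 6 if width is None else width
    return kl_divergence(softmax(scores), bell_prior(n, center, width))
```

In training, a term like `bell_penalty(scores)` would be added to the classification loss with some weight. Attention concentrated on an edge slice is penalized heavily, while attention already shaped like the prior incurs almost no penalty.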

2605.27299 2026-05-27 cs.CR cs.AI cs.HC cs.LG cs.SY eess.SY version update

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

使用次正态高斯模糊模型的IDS风险规避警报优先级排序

Murat Moran

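AI summary: Proposes an alert prioritization framework based on subnormal Gaussian fuzzy numbers that models three sources of uncertainty (threat severity, detection confidence, and organizational risk attitude) and uses ranking indices to provide a tunable security posture. Experiments show greater robustness than baselines under detector degradation.

AI-generated Chinese abstract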

现代入侵检测系统每天生成数千条警报,但由于误报或低影响事件过多,警报疲劳严重限制了安全运营的有效性。我们通过提出一个基于次正态高斯模糊数的原则性警报优先级排序框架来解决这个问题,该框架明确建模了三种不确定性来源:威胁严重性、检测置信度和组织风险态度。每个警报被表示为一个模糊数,其核心表示严重性,展度表示不确定性,高度反映检测可靠性。我们应用排序指数对警报进行优先级排序,允许组织通过风险态度参数调整安全姿态。在CIC-IDS2017和NSL-KDD上的实验验证表明,在检测器退化下,该方法比基线方法具有更强的鲁棒性(NDCGrel@100为0.9963对比0.8215),在中等置信度警报中具有明显区分度,在稳健检测器下与基线方法接近。该框架具有理论基础、计算效率高、提供可解释推理,并且在检测器系列和校准错误场景下保持鲁棒性。

English abstract

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
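
The alert representation described here can be sketched as a Gaussian membership function whose peak is scaled below 1: the core is the severity, the spread is the uncertainty, and the height is the detection reliability. The `rank_score` below is an illustrative index invented for this example, not one of the ranking indices used in the paper, and the `FuzzyAlert` class and `risk_aversion` parameter are likewise hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyAlert:
    name: str
    core: float    # most plausible severity
    spread: float  # uncertainty about the severity
    height: float  # detection reliability, at most 1 (subnormal when below 1)

    def membership(self, x):
        # Subnormal Gaussian membership: peaks at `height` instead of 1.
        return self.height * math.exp(-0.5 * ((x - self.core) / self.spread) ** 2)


def rank_score(alert, risk_aversion):
    # Illustrative index (an assumption): a risk-averse analyst treats
    # uncertainty as potential extra severity, scaled by detection reliability.
    return alert.height * (alert.core + risk_aversion * alert.spread)


def prioritize(alerts, risk_aversion=1.0):
    # Highest score first.
    return sorted(alerts, key=lambda a: rank_score(a, risk_aversion), reverse=True)
```

With `risk_aversion=0` the ordering depends only on severity and reliability; raising it promotes alerts whose severity is uncertain, which is one simple way a single parameter can tune the security posture.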

2605.27293 2026-05-27 cs.LG stat.ML version update

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

BASIS: 基于单次采样信息共享的批量优势估计用于LLM推理

Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi

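Affiliations: University of Science and Technology of China; London School of Economics and Political Science; University of Oxford

AI summary: Proposes BASIS, which improves value function estimation using a single rollout per prompt and information sharing across the batch, improving policy optimization while reducing computational cost.

Comments: 17 pages, 7 figures

AI-generated Chinese abstract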

基于可验证奖励的强化学习已成为提升大型语言模型推理能力的标准方法。现有算法在价值估计和策略学习中面临计算效率与样本效率之间的权衡。我们引入BASIS,一种无评论家的后训练算法,旨在解决这一权衡。在每个在线训练步骤中,BASIS每个提示仅采样一次,但利用整个批次中跨提示的丰富信息来改进价值函数估计。实验表明,与代表性单次采样基线REINFORCE++相比,BASIS将价值函数估计的MSE降低了69%,并且使用一次采样达到的MSE低于使用8次采样的组均值估计器。价值估计的改进转化为更好的策略优化:使用显著更少的训练时间,BASIS达到了接近多次采样GRPO型基线的性能,并且通常优于单次采样REINFORCE型基线。

English abstract

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
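
For context, the group mean estimator that this abstract uses as a comparison point can be sketched in a few lines: each prompt's value is estimated as the mean reward of its own rollouts, and each rollout's advantage is its reward minus that mean. This shows only the baseline, not BASIS itself, whose cross-prompt estimator is not specified in the abstract. The function name and the nested-list input layout are assumptions; GRPO-type methods typically also divide by the group standard deviation, which is omitted here.

```python
def group_mean_advantages(rewards_per_prompt):
    """Compute advantages with a per-prompt group-mean baseline.

    rewards_per_prompt: one list of rollout rewards per prompt.
    Returns a list of the same shape holding reward minus group mean.
    """
    advantages = []
    for rewards in rewards_per_prompt:
        value_estimate = sum(rewards) / len(rewards)  # group-mean value estimate
        advantages.append([r - value_estimate for r in rewards])
    return advantages
```

Note that with a single rollout per prompt this estimator returns an advantage of zero for every sample, which is why single-rollout methods need some other source of information for the baseline.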

2605.27288 2026-05-27 cs.CL cs.AI cs.LG version update

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

并非总是谄媚:基于认知不确定性测量LLM的从众行为

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

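Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary: Proposes the MUSE framework, which separates sycophantic conformity from uncertainty-driven conformity to explain why LLMs change their stance under user pushback, and finds that both grow with the user's perceived expertise and the plausibility of the user's suggestions.

AI-generated Chinese abstract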

大型语言模型(LLMs)已知会放弃初始立场以适应用户的反驳。虽然先前研究主要将此行为归因于从人类反馈强化学习中习得的谄媚,但我们假设从众行为也受模型在推理时的认知不确定性驱动。本文提出MUSE,一个两阶段评估框架,用于解开驱动LLM从众行为的机制。具体而言,MUSE将模型回答查询时的认知不确定性与其在后续轮次中屈服于用户反驳的可能性进行映射。我们证明驱动从众的机制不仅限于谄媚。具体来说,我们刻画了共同驱动从众的两个不同因素:谄媚从众,即模型即使对其初始回答绝对确定也会与用户反驳保持一致;以及不确定性驱动从众,即模型从众可能性随其不确定性增加而增加。此外,我们进行消融研究,证明谄媚从众和不确定性驱动从众均随1)LLM对用户感知专业性的增加和2)用户建议的合理性增加而增长。更广泛地说,MUSE通过区分对齐诱导的谄媚和训练语料驱动的不确定性,为更有针对性的干预策略提供信息。

English abstract

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

2605.27281 2026-05-27 cs.LG stat.ML version update

Causal Risk Minimization for High-Dimensional Treatments

高维处理变量的因果风险最小化

Nikita Dhawan, Arnav Paruthi, Andrew Kim, Lovedeep Gondara, Jekaterina Novikova, Chris J. Maddison

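Affiliations: University of Toronto; Vector Institute; Vanguard

AI summary: For causal inference over high-dimensional treatment spaces such as text, decomposes causal error into a series of moment-balancing errors and optimizes higher-order balance objectives, and projects high-dimensional treatments onto lower-dimensional attributes so that causal estimates need no attribute-specific training.

Comments: 18 pages, 4 figures

AI-generated Chinese abstract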

预测具有多种可能变化的干预效果(例如,影响心理健康结果的治疗内容或推动股价变动的财报电话会议记录)在多个领域中非常有用。然而,经典的因果估计量通常假设所有可能的干预都被观察到,这在干预变化广泛的情况下(例如,在所有文本字符串的空间中)是不可行的。我们采用了一种将因果推断重新表述为学习问题的著名方法,以处理高维处理空间。具体来说,在标准假设(如无未观测混杂)下,我们证明因果误差可分解为一系列递增阶数的矩平衡误差,并设计了直接改进因果估计的目标函数。我们还展示了如何将高维处理的效果投影到低维处理属性上,这使得单个模型能够回答多个因果问题,而无需额外的属性特定训练。我们在高维连续、离散和文本处理设置中经验性地评估了我们的估计量,其中文本处理使用了亚马逊评论的半合成数据集。我们的实验证明了高阶平衡误差优化的优势以及投影因果估计与属性特定估计的竞争性能。

English abstract

Predicting the effect of interventions with many possible variations, e.g., therapeutic content that affects mental health outcomes or an earnings call transcript that drives movement in share price, is useful across several domains. However, classical causal estimators tend to assume that all possible interventions are observed, which is infeasible when interventions vary widely, for instance, in the space of all text strings. We adapt a well-known approach of recasting causal inference as a learning problem, to address high-dimensional treatment spaces. Specifically, under standard assumptions like no unobserved confounding, we show that causal error decomposes into a series of moment-balancing errors of increasing order, and design objectives that directly improve causal estimation. We also show how to project the effect of a high-dimensional treatment onto lower-dimensional treatment attributes, which allows a single model to answer several causal questions without additional attribute-specific training. We empirically evaluate our estimators in settings with high-dimensional continuous, discrete, and text treatments, the last of which used a semi-synthetic dataset of Amazon Reviews. Our experiments demonstrate the benefit of higher-order balance error optimization and competitive performance of projected causal estimates with attribute-specific estimators.

2605.27269 2026-05-27 cs.LG stat.AP version update

Transfer Learning using 66 Diseases for Disease Forecasting Applications

使用66种疾病的迁移学习进行疾病预测应用

Lauren J Beesley, Alexander C Murph, Dave Osthus, Lauren A Castro

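Affiliations: Statistics, Los Alamos National Laboratory; Computational Intelligence & Modeling, Los Alamos National Laboratory

AI summary: Uses transfer learning across 66 infectious diseases and multiple data streams, finds that adding other data streams improves forecasts in most cases but that the quality of the added data matters, and compiles a publicly available database.

AI-generated Chinese abstract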

疾病预测模型通常依赖于单一数据流,这使得模型在历史数据短或噪声大时变得脆弱。最近表现最佳的模型表明,综合同一疾病的多个报告系统可以提升性能。其他近期工作进一步扩展了这一想法,使用迁移学习利用不同疾病的数据来训练某一疾病的预测模型。我们极大地扩展了这些方法,在涵盖66种传染病和多个数据流的数据上训练机器学习模型。我们研究了整合不同数据流对预测20种不同疾病数据流的价值。我们发现,在绝大多数(84.9%)考虑的时间序列和模型结构中,整合其他数据流改善了预测。然而,我们的工作强调,添加数据的质量很重要,添加与目标数据流极其不同的数据有时会降低预测性能。这项工作的一个主要贡献是编制了一个公开可用的数据库,供传染病预测社区使用。

English abstract

Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease using data from a different disease. We expand upon each of these approaches greatly, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating different data streams for forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of time series and model structures considered. However, our work highlights that the quality of the added data matters, where adding data extremely different from the target data stream can sometimes degrade forecast performance. A major contribution of this work is in compiling a publicly-available database of data for use by the infectious disease forecasting community.

2605.27259 2026-05-27 cs.LG version update

Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

Kan扩展变换器:注意力、扩散和预测-分离自条件的范畴统一

Sridhar Mahadevan

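Affiliations: Adobe Research; University of Massachusetts Amherst

AI summary: Proposes Kan Extension Transformers (KETs) as a unifying categorical framework for diverse Transformer implementations, viewing a Transformer layer as a weighted structured extension operator and obtaining a valid self-conditioning mechanism through the predict-detach regime. Experiments show that the predict-detach regime yields larger gains than changing the neighborhood family.

Comments: 30 pages

AI-generated Chinese abstract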

我们提出Kan扩展变换器(KETs)作为多种Transformer实现的统一范畴框架。核心主张是,Transformer层可以被视为加权结构化扩展算子:标准注意力是单邻域情况,几何Transformer风格的关联混合是稀疏边限制情况,而KET是高阶单纯形情况。这一视角也阐明了与扩散式补全的桥梁。当扩展算子作用于分离的预测载体而非教师强制隐藏状态时,它成为一种有效的自条件化机制,在不泄露未来黄金令牌的情况下暴露非因果结构。我们在Penn Treebank、WikiText-2和WikiText-103上对12种不同的Transformer实现进行了全面的实验验证,这些实现在严格因果和预测-分离机制上有所不同。在严格因果设置中,二次KET是WikiText-2和WikiText-103上比较的因果架构中最强的模型。然而,在所有数据集上,最大的收益来自预测-分离机制,而非仅改变邻域族。

English abstract

We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 different Transformer implementations varying across strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.

2605.27254 2026-05-27 cs.LG cs.AI version update

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

LUCoS: 表格基础模型的潜在无监督上下文选择

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero

Affiliations: Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI); Department of Computer Science and Artificial Intelligence (DECSAI); Research Center in Information and Communication Technologies (CITIC); Instituto de Investigación Biosanitaria Ibs.GRANADA, University of Granada, Granada, 18071, Spain

AI summary: Addresses context selection for tabular foundation models in low-label settings with LUCoS, which uses the latent geometry of an unsupervised Prior-Fitted Network (PFN) to select representative medoids as context and outperforms random selection and original-space methods on 67 datasets.

Comments: 18 pages, 4 figures, supplementary appendices included

AI-generated Chinese abstract

选择哪些实例进行标注是低标签表格学习中的一个关键挑战。对于最近的表格基础模型(如TabPFN),上下文选择直接决定预测性能。有监督的oracle实验表明,在相同标注预算下,精心选择的标注上下文集可以显著优于随机选择。然而,在TFM文献中,冷启动设置(即必须在任何标签可用之前选择实例)很少受到关注。这个问题本质上是几何问题。在视觉和语言领域,基础模型诱导出嵌入空间,其中简单的几何选择方法是有效的。相比之下,表格实例选择迄今为止主要是在原始表格空间中进行,而该空间缺乏自然的度量;异构类型、混合尺度以及非线性交互使得原始空间距离对于上下文构建不可靠,并且随着预算增加,原始空间选择在大多数数据集上表现低于随机。我们提出LUCoS(潜在无监督上下文选择),该方法用无监督先验拟合网络(PFN)诱导的潜在几何替换原始特征几何,并选择代表性medoids作为上下文。在67个OpenML-CC18数据集上,跨六个低标签预算评估,LUCoS在平均AUC、ACC和F1上排名第一,结论在指标和数据集级别的稳健性检查中保持稳定。增益分解揭示了一个简单机制:在最小预算下,主要收益来自强制覆盖;随着预算增加,决定性因素变为衡量覆盖的表示空间。LUCoS缓解了原始特征空间选择的失败,表明可靠的无监督上下文选择更少依赖于选择器的复杂性,而更多依赖于在有意义的表示几何中定义代表性。

English abstract

Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that carefully chosen labeled context sets can strongly outperform random selection under the same labeling budget. However, the cold-start setting, where instances must be selected before any labels are available, has received little attention in the TFM literature. This problem is fundamentally geometric. In vision and language, foundation models induce embedding spaces where simple geometric selection methods are effective. In contrast, tabular instance selection has so far been performed predominantly in the original tabular space, which lacks a natural metric; heterogeneous types, mixed scales, and nonlinear interactions make raw-space distances unreliable for context construction, and original-space selection falls below random on the majority of datasets as the budget grows. We propose LUCoS (Latent Unsupervised Context Selection), which replaces raw-feature geometry with the latent geometry induced by embeddings from an unsupervised Prior-Fitted Network (PFN) and selects representative medoids as context. Evaluated on 67 OpenML-CC18 datasets across six low-label budgets, LUCoS ranks first under mean AUC, ACC, and F1, with conclusions stable across metrics and dataset-level robustness checks. A gain decomposition reveals a simple mechanism: at the smallest budgets, the main benefit comes from enforcing coverage; as the budget increases, the decisive factor becomes the representation space in which coverage is measured. LUCoS mitigates failures of original feature space selection, showing that reliable unsupervised context selection depends less on selector sophistication than on defining representativeness in a meaningful representation geometry.
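
The selection step described here, picking representative medoids in an embedding space as the rows to label, can be sketched with a small k-medoids routine. This is a generic illustration under stated assumptions, not the LUCoS implementation: the embeddings are taken as given (in the paper they come from an unsupervised PFN), and the Euclidean distance, farthest-point initialization, and function names are choices made for this example.

```python
import math


def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_medoids(embeddings, budget, n_iter=20):
    """Return sorted indices of `budget` medoids chosen from `embeddings`."""
    n = len(embeddings)
    d = [[dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

    # Farthest-point initialization: deterministic and spread out.
    medoids = [0]
    while len(medoids) < budget:
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))

    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: d[i][m])].append(i)
        # Move each medoid to the member minimizing total in-cluster distance.
        updated = [
            min(members, key=lambda c: sum(d[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)
```

The returned indices identify the instances to send for labeling under a given budget. Because the routine only sees distances, swapping raw features for latent embeddings changes the result without changing the selector, which matches the abstract's point that the representation space matters more than selector sophistication.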

2605.27245 2026-05-27 cs.LG version update

Symbolic Regression via Latent Iterative Refinement

通过潜在迭代细化的符号回归

Xieting Chu, Sriram Vishwanath, Vijay Ganesh

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出潜在方程嵌入(LEE)框架,通过迭代推断在功能基础化的潜在空间中缩小符号回归的推断差距,生成更简单且准确的表达式。

Comments Preprint. 21 pages, 11 figures

详情
AI中文摘要

符号回归(SR)旨在寻找拟合观测数据的封闭形式数学表达式。神经SR方法通过训练编码器将观测数据直接映射到表达式来摊销搜索,但这种摊销推断在其一次性预测与真实后验之间留下了残余的摊销差距。我们提出潜在方程嵌入(LEE),这是一个通过在功能基础化的潜在空间中进行迭代摊销推断来缩小这一差距的框架。LEE学习一个共享的潜在空间Z,配备三个组件:编码器f_theta,将符号标记和数值观测联合嵌入到单个潜在向量z中;表达式解码器g_expr,从z重建公式;以及评估解码器g_eval,从z预测函数值,明确地将潜在空间基于功能行为。在推断时,LEE通过将解码后的表达式与观测数据联合重新编码来执行迭代细化,逐步改进潜在估计。LEE将编码器本身用作学习到的推断优化器:每个重新编码步骤隐式计算候选与数据之间的不匹配。由于g_eval在z上是可微的,我们另外将连续梯度下降与离散重新编码交错进行,产生一个混合迭代和梯度细化过程。在SRBench上,跨三个噪声水平,针对涵盖遗传规划、符号-神经混合和预训练Transformer的19个基线,LEE生成的表达式比最强精度导向的基线(包括Operon、GP-GOMEA、TPSR、RAG-SR和GenSR)简单2-10倍,复杂度为8-11,而后者为20-90。这些结果推进了精度-复杂度帕累托前沿的低复杂度区域,并显示出随着噪声增加而优雅退化。

English abstract

Symbolic regression (SR) seeks closed-form mathematical expressions that fit observed data. Neural SR methods amortize the search by training an encoder to map observations directly to expressions in a single pass, but this amortized inference leaves a residual amortization gap between its one-shot prediction and the true posterior. We propose Latent Equation Embedding (LEE), a framework that closes this gap through iterative amortized inference in a functionally grounded latent space. LEE learns a shared latent space Z equipped with three components: an encoder f_theta that jointly embeds symbolic tokens and numerical observations into a single latent vector z; an expression decoder g_expr that reconstructs formulas from z; and an evaluation decoder g_eval that predicts function values from z, explicitly grounding the latent space in functional behavior. At inference, LEE performs iterative refinement by re-encoding decoded expressions jointly with observations, progressively improving the latent estimate. LEE uses the encoder itself as a learned inference optimizer: each re-encoding step implicitly computes the mismatch between the candidate and the data. Because g_eval is differentiable in z, we additionally interleave continuous gradient descent with discrete re-encoding, yielding a hybrid iterative and gradient refinement procedure. On SRBench across three noise levels, against 19 baselines spanning genetic programming, symbolic-neural hybrids, and pre-trained Transformers, LEE produces expressions 2--10x simpler than the strongest accuracy-oriented baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, with complexity 8--11 versus 20--90. These results advance the low-complexity region of the accuracy-complexity Pareto frontier and show graceful degradation as noise increases.

2605.27236 2026-05-27 cs.LG physics.ao-ph 版本更新

Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening

用于TROPOMI甲烷羽流筛选的基于特征模型与深度学习模型的可解释比较

Solomiia Kurchaba, Joannes D. Maasakkers, Berend J. Schuit, Ilse Aben

发表机构 * SRON Space Research Organisation Netherlands(SRON荷兰空间研究所) GHGSat Inc.(GHGSat公司) Department of Earth Sciences, Vrije Universiteit Amsterdam(阿姆斯特丹自由大学地球科学系)

AI总结 本研究比较了基于特征(SVC、随机森林、XGBoost)和基于图像(ResNet-18、ResNet-34)的模型在甲烷羽流-伪影分类中的性能,并通过SHAP可解释性分析为业务化筛选提供指导。

详情
AI中文摘要

连续且全球性地检测大规模甲烷排放是减缓全球变暖的关键步骤。卫星观测(例如来自S5P/TROPOMI)结合羽流检测算法可以在这一努力中发挥关键作用。然而,并非所有看起来像甲烷排放羽流的TROPOMI羽流检测结果都源于实际排放。数据中相当一部分类似羽流的特征是反演伪影。此类伪影可能由海拔或反照率梯度的变化、高浓度气溶胶、海岸线、水体等引起。先前的工作使用支持向量机分类器(SVC)处理羽流-伪影分类问题,该分类器在由领域专家设计的大量基于观测的标量特征上训练。然而,这种方法将算法接收的信息范围限制在专家认为重要的内容上,破坏了像素之间的空间关系,并在统计聚合过程中丢失信息。在本研究中,我们在平衡和不平衡评估设置下比较了基于特征(SVC、随机森林、XGBoost)和基于图像(ResNet-18、ResNet-34)的模型在甲烷羽流-伪影分类中的表现。为了解释结果,我们将基于SHAP的可解释性方法应用于两个模型家族。我们的发现为业务化甲烷筛选工作流程(如CAMS甲烷热点探索器)中的模型选择提供了实用指导。

英文摘要

Continuous and global detection of large methane emissions is a crucial step for global warming mitigation. Satellite observations, such as from S5P/TROPOMI, combined with plume detection algorithms, can play a key role in this effort. However, not all TROPOMI plume detections that look like methane emission plumes are the result of actual emissions. A significant part of the plume-like features in the data are retrieval artifacts. Such artifacts could be the result of variations in elevation or albedo gradients, high concentrations of aerosols, coastal lines, water bodies, etc. Previous work approached the problem of plume-artifact classification by means of a Support Vector Machine Classifier (SVC), trained on an extensive set of observation-based scalar features designed by domain experts. However, such an approach limits the information scope received by the algorithm to what is deemed to be important by the experts, breaks the spatial relationship between pixels, and loses information during the process of statistical aggregation. In this study, we compare feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for methane plume-artifact classification under balanced and imbalanced evaluation settings. To interpret the results, we apply SHAP-based explainability to both model families. Our findings provide practical guidance for model selection in operational methane-screening workflows such as the CAMS Methane Hotspot Explorer.

2605.27219 2026-05-27 cs.LG stat.ML 版本更新

Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis

基于核方法的非线性数据整合用于数据协作分析

Yamato Suetake, Yuta Kawakami, Shunnosuke Ikeda, Yuichi Takano

发表机构 * Graduate School of Science and Technology, University of Tsukuba(筑波大学科学技术研究生院) Institute of Systems and Information Engineering, University of Tsukuba(筑波大学系统与信息工程研究所)

AI总结 针对分散保密数据协作分析中线性整合方法重建风险高且无法对齐非线性变换的问题,提出非线性核整合(NKI)方法,通过核岭回归和特征值问题获得全局最优解,并引入图正则化和中心化约束以捕获几何和目标变量信息,在图像分类任务中提升了准确率并降低了重建风险。

Comments 50 pages, 7 figures

详情
AI中文摘要

分散保密数据集的协作分析很重要,但原始数据集的直接共享常受隐私和机构限制。数据协作(DC)分析通过各方特定的混淆函数将每个数据集转换为隐私保护的中间表示,并使用锚数据集将它们整合为公共协作表示。然而,许多现有的DC分析方法依赖线性变换进行数据混淆和整合,这可能增加重建风险。尽管非线性降维可以缓解这一风险,但传统的线性整合方法无法准确对齐非线性变换产生的中间表示。此外,现有的整合方法主要最小化各方之间的差异,并未明确纳入对下游分析有用的几何或目标变量信息。为克服这些限制,我们首先将线性核整合(LKI)公式化为一种线性整合方法,然后对其进行核化以获得非线性核整合(NKI)。NKI通过核岭回归和特征值问题获得全局最优解。我们还引入了图正则化和中心化约束,使得目标表示能够捕获对下游分析有用的几何和目标变量信息。在图像分类任务上的实验表明,在非线性降维下,NKI比现有的线性整合方法提高了分类准确率,而目标变量感知的图正则化和中心化进一步带来了增益。结果还表明,降维选择显著影响分类准确率和重建风险。

英文摘要

Collaborative analysis of decentralized confidential datasets is important, but direct sharing of original datasets is often restricted by privacy and institutional constraints. Data collaboration (DC) analysis transforms each dataset into privacy-preserving intermediate representations via party-specific obfuscation functions and integrates them into common collaboration representations using an anchor dataset. However, many existing DC analysis methods rely on linear transformations for data obfuscation and integration, which may increase reconstruction risk. Although nonlinear dimensionality reduction can mitigate this risk, conventional linear integration methods cannot accurately align intermediate representations produced by nonlinear transformations. Moreover, existing integration methods mainly minimize discrepancies among parties and do not explicitly incorporate geometric or target-variable information useful for downstream analysis. To overcome these limitations, we first formulate linear kernel integration (LKI) as a linear integration method and then kernelize it to obtain nonlinear kernel integration (NKI). NKI admits a globally optimal solution via kernel ridge regression and an eigenvalue problem. We also introduce graph regularization and a centering constraint so that the target representation can capture geometric and target-variable information useful for downstream analysis. Experiments on image classification tasks demonstrate that NKI improves classification accuracy over existing linear integration methods under nonlinear dimensionality reduction, with further gains from target-variable-aware graph regularization and centering. The results also show that dimensionality reduction choices substantially affect both classification accuracy and reconstruction risk.

2605.27194 2026-05-27 cs.CL cs.CV cs.LG 版本更新

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

并非所有标记都同等重要:基于关键标记监督的动态上下文向量蒸馏用于长医学报告生成

Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen, Jinxi Xiang, Tao Wei, Lina Yao, Mingjie Li

发表机构 * UNSW Sydney(新南威尔士大学) University of Technology Sydney(悉尼科技大学) School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) Stanford University(斯坦福大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出DIVE框架,通过关键标记监督和状态条件动态引导,解决长文本生成中标记级蒸馏忽略关键标记的问题,在医学报告生成任务上取得最佳性能。

Comments Preprint. 20 pages, 6 figures

详情
AI中文摘要

将示范效果蒸馏到隐藏空间干预中提供了一种轻量级的替代全微调的方法。然而,现有的多模态变体主要是在短文本任务上评估的,其中输出在几个标记后结束。将这些方法扩展到长文本生成暴露了一个基本但未充分研究的局限性:标记级蒸馏隐式地将所有输出标记视为同等信息量,但长文本输出由高频模板和语法标记主导,而实际决定输出质量的标记稀疏分布。在医学报告生成(MRG)中,有两种这样的关键标记突出:决定诊断内容的病理相关标记和决定终止的序列结束(EOS)事件。两者在均匀交叉熵下都受到不足的监督,自回归解码通过偏离教师强制轨迹进一步加剧了问题。我们提出DIVE,一个冻结骨干的蒸馏框架,通过两种与这些失败相匹配的互补机制来解决长文本报告生成。关键标记监督通过提高病理相关标记和EOS事件的交叉熵贡献来恢复监督平衡,确保内容保真度和终止在训练期间学习,而不是在解码时施加。状态条件动态引导用隐藏状态相关的适配器替换固定的开环残差,允许注入信号随着解码漂移而适应。在MIMIC-CXR和CheXpert Plus上使用两个医学VLM骨干的实验表明,DIVE在词汇和临床代理指标中始终位列最强方法之一。我们的方法在所有数据集-骨干设置中实现了最佳的BLEU-4、ROUGE-L和RadGraph F1,同时在粗粒度标签级CheXbert F1上保持竞争力。

英文摘要

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.
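The decisive-token idea in the abstract above can be illustrated with a small weighted cross-entropy sketch. This is not the DIVE implementation: the function name, the `decisive_weight` value of 3.0, the toy probabilities, and the normalization by the sum of weights are all assumptions chosen for illustration. It only shows how upweighting pathology-related tokens and the EOS event shifts the loss toward the tokens that are predicted poorly.

```python
import math


def weighted_cross_entropy(token_log_probs, decisive_mask, decisive_weight=3.0):
    """Weighted mean negative log-likelihood over one target sequence.

    token_log_probs: log-probability the model assigns to each reference token.
    decisive_mask:   True where the token is treated as decisive
                     (pathology-related or EOS in the abstract's terms).
    decisive_weight: hypothetical upweighting factor, not taken from the paper.
    """
    if not token_log_probs:
        raise ValueError("sequence must not be empty")
    if len(token_log_probs) != len(decisive_mask):
        raise ValueError("inputs must have the same length")
    weights = [decisive_weight if d else 1.0 for d in decisive_mask]
    total = sum(-w * lp for w, lp in zip(weights, token_log_probs))
    # Normalize by the total weight so the loss stays on the same scale.
    return total / sum(weights)


# Toy report: three template tokens predicted well, then one pathology
# token and the EOS event predicted poorly (made-up probabilities).
log_probs = [math.log(0.9), math.log(0.9), math.log(0.9),
             math.log(0.2), math.log(0.3)]
mask = [False, False, False, True, True]

uniform_loss = weighted_cross_entropy(log_probs, mask, decisive_weight=1.0)
weighted_loss = weighted_cross_entropy(log_probs, mask, decisive_weight=3.0)
```

With a weight of 1.0 the function reduces to ordinary mean cross-entropy, so the many well-predicted template tokens dilute the signal from the two decisive tokens. Raising the weight makes the same errors account for a larger share of the loss.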

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD 版本更新

Learning When to Think While Listening in Large Audio-Language Models

在大音频语言模型中学习何时在聆听时思考

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出一种可学习的等待-思考-回答控制机制,通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机,在提升准确率的同时减少响应延迟。

Comments 19 pages, 4 figures, 6 tables

详情
AI中文摘要

近期大音频语言模型(LALMs)的进展使得实时、流式的语音交互越来越实用。在这种场景下,推理质量和响应速度紧密耦合:将推理延迟到语音端点可以提高答案质量,但会将思考时间转移到用户可见的响应延迟中,而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发,控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型,我们从语音推理数据中构建对齐的等待-思考-回答轨迹,使用监督微调(SFT)训练控制器,然后应用解耦裁剪和动态采样策略优化(DAPO)。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性,优化完整的等待-思考-回答轨迹,而不仅仅是最终答案。在一个六任务合成语音推理问答(SRQA)基准上,六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%,同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准(Real Audio Bench)上,作为超越文本转语音(TTS)渲染语音的迁移检查,控制器家族仍然有效:SFT实现了最强的准确率,而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明,流式模型应该学习在音频流中何时使中间推理显式化。

英文摘要

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

2605.27189 2026-05-27 cs.CL cs.LG cs.SD eess.AS q-bio.NC 版本更新

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

超越二元:认知评分层级中的语音表征

Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter

发表机构 * 1 Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany 2 Tübingen AI Center, University of Tübingen, Tübingen, Germany 3 Department of Psychology, Humboldt-Universität zu Berlin 4 Geriatric Center, Tübingen University Hospital, Tübingen, Germany 5 Tübingen Center for Mental Health (TüCMH), Department of Psychiatry Psychotherapy, Tübingen University Hospital, Tübingen, Germany 6 German Center for Mental Health (DZPG), Partner Site Tübingen, Tübingen, Germany 7 Department of Neurology, University Medical Center Schleswig-Holstein Kiel University, Kiel, Germany 8 Center for Neurology, University Hospital Tübingen Hertie Institute for Clinical Brain Research, Tübingen, Germany 9 Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany 10 Charité–Universitätsmedizin, Department of Psychiatry

AI总结 本研究利用5,754份德语神经心理学评估录音,比较手工声学特征与自监督学习嵌入在轻度认知障碍认知评估层级(任务、领域、全局)中的表现,发现任务约束与评估层级之间的关联。

详情
AI中文摘要

本研究考察了轻度认知障碍中语音表征与认知评估层级结构之间的关系。利用5,754份德语神经心理学评估录音,我们在三个评分层级(任务、领域和全局)上评估了六项认知任务。我们比较了手工声学特征与自监督学习(SSL)嵌入。结果表明,尽管SSL表示在较低层级通常优于手工特征,但这种趋势在MCI分类中发生逆转。此外,任务特定约束影响性能:响应自由度较大的任务随着层级增加表现出性能稀释,表明“专家”表示,而高度结构化任务的性能向更高层级增加,表明“通才”表示。这些发现揭示了自动临床语音分析中任务约束与评估层级之间的联系。

英文摘要

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.

2605.27178 2026-05-27 cs.CV cs.AI cs.LG cs.RO 版本更新

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj: 自监督基础模型作为无标签3D物体分割的奖励

Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang

发表机构 * Shenzhen Research Institute, The Hong Kong Polytechnic University(深圳研究院,香港理工大学) vLAR Group, The Hong Kong Polytechnic University(vLAR小组,香港理工大学)

AI总结 提出FoundObj框架,利用自监督2D/3D基础模型的语义和几何先验作为奖励,通过强化学习引导超点合并,实现无标注复杂场景3D物体分割。

Comments ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: https://github.com/vLAR-group/FoundObj

详情
AI中文摘要

我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体,这主要是由于学习过程中物体先验不足。在本文中,我们提出了FoundObj,一个新颖的框架,其特点是基于超点的物体发现代理,该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验,为物体发现代理提供互补反馈,并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明,我们的方法始终优于现有基线。值得注意的是,我们的方法在零样本和长尾场景中表现出强大的泛化能力,突显了其在可扩展、无标签3D物体分割方面的潜力。

英文摘要

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

2605.27163 2026-05-27 cs.LG stat.ML 版本更新

The Role of Causal Features in Strategic Classification for Robustness and Alignment

因果特征在策略分类中的作用:鲁棒性与对齐

Antonio Gois, Sophia Gunluk, Nir Rosenfeld, Nidhi Hegde, Simon Lacoste-Julien, Dhanya Sridhar

发表机构 * Mila & Université de Montréal(Mila与蒙特利尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席) Faculty of Computer Science, Technion - Israel Institute of Technology(以色列理工学院计算机科学系) Dept. of Computing Science, Amii & University of Alberta, Canada(加拿大Amii与阿尔伯塔大学计算科学系)

AI总结 本文通过因果模型分析策略分类中的分布偏移,证明因果分类在噪声有界时达到最优误差,并分解OOD交叉熵风险,揭示因果特征在长期激励对齐中的优势。

Comments Accepted at AISTATS 2026. 20 pages, 5 figures

详情
AI中文摘要

在策略分类中,机构(例如银行)预期用户会通过改变自身特征来适应分类器,以提高其在分类任务(例如贷款偿还)中的效用。由于关键挑战是用户引起的分布偏移,我们转向因果模型(已有研究表明它可以约束最坏情况下的分布外(OOD)风险),并建立了几个将因果关系与策略分类联系起来的新结果。首先,我们证明,当噪声以某种方式有界时,因果分类在任何足够大的适应之后都能达到最优分类误差。其次,当这些假设不成立时,我们证明最优分类器的OOD交叉熵风险可分解为一个OOD偏差项和一个因未使用全部可观测特征而产生的项,从而使我们能够理解因果分类器何时具有优势。最后,我们证明使用因果特征可以实现机构与用户之间的长期激励对齐,这与先前强调此类方法社会成本的工作形成对比。我们在合成数据上通过实验验证了我们的理论,发现我们的结果能够预测实际中的行为。

英文摘要

In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to understand when causal classifiers have an advantage. Finally, we show that the use of causal features can allow alignment of long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.

2605.27144 2026-05-27 cs.CV cs.LG 版本更新

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

图像是否也值得16x16=256个超像素?一个用于注意力图像分类的框架

Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb

发表机构 * UFRGS(南里奥格兰德联邦大学) Institute of Informatics(信息学研究所) Federal University of Rio Grande do Sul(南里奥格兰德联邦大学) Division of Informatics(信息学部) School of Health Sciences(健康科学学院) Imaging and Data Science(成像与数据科学) Faculty of Biology, Medicine and Health(生物、医学与健康学院) University of Manchester(曼彻斯特大学) Vaughan House, Portsmouth St(朴茨茅斯街沃恩楼)

AI总结 提出超像素变换器(SPT)框架,统一超像素图像分类与视觉变换器,通过多维正弦余弦位置编码和增强的补丁数据结构,在多个数据集上优于超像素图神经网络方法,与视觉变换器竞争。

详情
AI中文摘要

基于超像素的图像分类传统上利用图神经网络(GNN)处理不规则图像表示。计算机视觉的最新进展,由视觉变换器(ViT)驱动,引入了自注意力模型的新范式,在各种任务中超越了卷积神经网络(CNN)。然而,GNN、超像素和变换器之间的协同联系仍未探索。在这项工作中,我们提出了超像素变换器(SPT),这是一个统一超像素图像分类和ViT的新框架。SPT将超像素图像分类与图注意力网络(SICGAT)模型和ViT泛化,以支持任意超像素分块策略、连接图和位置编码。我们引入了改进,包括多维正弦余弦位置编码和完全包含超像素形状和颜色信息的增强补丁数据结构。通过在CIFAR10、FashionMNIST和Imagenette等数据集上测试SPT,采用各种超像素生成和图连接策略,我们证明SPT相比以前的超像素GNN方法实现了优越的性能,并与ViT保持竞争力。值得注意的是,我们的方法解决了SICGAT的局限性,例如像素聚合过程中的信息丢失,并展示了受限图连接如何增强ViT性能。SPT弥合了基于超像素和变换器模型之间的差距,为跨领域泛化和混合注意力框架的未来创新开辟了道路,并表明图像也值得$16\times16$个超像素。

英文摘要

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth $16\times16$ superpixels.
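The abstract above mentions a multidimensional sine-cosine positional encoding for superpixels but does not give its formula. One plausible reading, sketched here purely as an assumption, is to apply the standard Transformer sine-cosine encoding to each coordinate of a superpixel centroid and concatenate the results. The function names, the `base` constant of 10000, and the choice of the centroid as the position are hypothetical and may differ from the SPT paper.

```python
import math


def sincos_1d(value, dim, base=10000.0):
    """Standard sine-cosine encoding of one scalar position into `dim` values."""
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    encoding = []
    for i in range(dim // 2):
        # Frequencies decrease geometrically from 1 down to about 1/base.
        frequency = 1.0 / (base ** (2 * i / dim))
        encoding.append(math.sin(value * frequency))
        encoding.append(math.cos(value * frequency))
    return encoding


def sincos_nd(coords, dim_per_axis):
    """Concatenate per-axis encodings, e.g. for an (x, y) superpixel centroid."""
    encoding = []
    for coordinate in coords:
        encoding.extend(sincos_1d(coordinate, dim_per_axis))
    return encoding


# A superpixel centroid can sit at a fractional, irregular position,
# unlike the regular grid index of a ViT patch.
centroid = (12.5, 7.0)
encoding = sincos_nd(centroid, dim_per_axis=8)
```

Because the encoding takes real-valued coordinates, it applies equally to regular ViT patch centers and to irregular superpixel centroids, which is one way a single framework could cover both cases.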

2605.27133 2026-05-27 cs.LG cs.AI 版本更新

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

基本前向-后向分裂诱导网络的深层极限与稳定性分析(II):学习问题

Xuan Lin, Chunlin Wu

发表机构 * China Academy of Aerospace System(中国航天系统研究院) School of Mathematical Sciences(数学科学学院) Nankai University(南开大学)

AI总结 本文研究基本前向-后向分裂(FBS)诱导网络的训练问题,证明其收敛到深层极限系统的学习问题,并给出扰动稳定性分析。

Comments 38 pages, 1 figure

详情
AI中文摘要

源自迭代优化方案和数值常/偏微分方程(ODE/PDE)的深度展开神经网络在过去十年中引起了数据科学界的广泛关注。其中,许多重要的网络架构是从基本的前向-后向分裂(FBS)算法构建的。在本文中,我们继续研究最基本的FBS诱导网络,该网络通过引入直接参数松弛从原始FBS算法展开。基于我们先前前向系统分析中的差分/微分包含公式,我们在此考虑相应学习问题的一些理论方面。在一些温和假设下,我们建立了基本FBS诱导网络的训练问题收敛到深层极限系统的学习问题的一般收敛性质,这意味着一个$\Gamma$-收敛论证,表明网络最优学习参数的任意聚点是深层极限系统学习问题的解。还对这些学习问题的扰动稳定性进行了定性分析。进行了一个简单的数值实验以验证我们的主要一般收敛结果。

英文摘要

Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-backward-splitting (FBS) algorithm. In this paper, we continue our research on the most basic FBS-induced network, an architecture unrolled from the original FBS algorithm by incorporating direct parameter relaxations. Following the difference/differential inclusion formulations in our previous forward system analyses, we here consider some theoretical aspects of corresponding learning problems. Under some mild assumptions, we establish a general convergence property of the training problem of the basic FBS-induced network to the learning problem of the deep-layer limit system, implying a $\Gamma$-convergence argument showing that any cluster point of the optimal learning parameters for the network is a solution to the learning problem of the deep-layer limit system. A qualitative analysis of perturbation stabilities of these learning problems is also presented. A simple numerical experiment is conducted to validate our main general convergence result.

2605.27130 2026-05-27 cs.LG cs.AI 版本更新

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

DEI:质量-多样性搜索中的进化推理多样性

John Donaghy, Shikhar Rastogi

AI总结 提出DEI框架,通过异构大语言模型作为变异算子进行分布式质量-多样性搜索,实验表明模型多样性比并行性更能提升搜索性能。

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

详情
AI中文摘要

我们提出DEI:进化推理中的多样性,一个分布式质量-多样性(QD)搜索框架,该框架将异构大语言模型(LLM)分配为变异算子,在通过非阻塞集合操作通信的对等节点间运行。与同质并行搜索(在所有工作节点上复制单一模型的归纳偏差)不同,DEI将每个LLM独特的创造性先验视为行为新颖性的互补来源。通过DEI扩展数字红皇后框架,节点在每轮结束时共享局部最优解,以播种下一轮的种群。这产生了跨模型的对抗压力,推动了超越模型内自博弈的鲁棒性。在Core War领域(一个竞争性编程基准,其中Redcode战士程序在模拟机器中战斗)上评估,一个四节点异构集成(GPT-5.4-mini、Claude Sonnet 4.6、GPT-5.2和Claude Haiku 4.5)在相等的总LLM调用预算下,相比单节点基线,实现了124%更高的合并存档QD分数(45.90 vs. 20.46)和28%更高的覆盖率(80.6% vs. 63.0%的单元格)。异构集成还在QD分数、覆盖率和所有四个模型家族的保留解泛化性上优于同等预算的同质集成。这些结果首次提供了经验证据,表明模型多样性(而非仅仅是并行性)是分布式基于LLM的QD搜索中增益的关键驱动因素。

英文摘要

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model's inductive biases across all workers, DEI treats each LLM's distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round's population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
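The percentage gains quoted in the abstract above are relative improvements over the single-node baseline, which is easy to misread for the coverage figure. The short check below recomputes both from the reported numbers. The helper name is hypothetical; the inputs are the values stated in the abstract.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


# Merged-archive QD-Score: heterogeneous ensemble vs. single-node baseline.
qd_gain = relative_gain_percent(45.90, 20.46)

# Coverage is itself a percentage of cells, so there are two ways to
# express the difference: relative gain and absolute percentage points.
coverage_gain = relative_gain_percent(80.6, 63.0)
coverage_points = 80.6 - 63.0
```

The relative figures round to the 124 percent and 28 percent stated in the abstract. In absolute terms the coverage difference is 17.6 percentage points.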

2605.27128 2026-05-27 cs.CV cs.LG 版本更新

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

PILOT: 一种基于边界引导的无数据持续学习方法用于实时语义分割

Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu

发表机构 * Department of Mathematics, College of Arts and Sciences, Embry-Riddle Aeronautical University(数学系,文理学院,埃姆布里-里德航空大学) Department of Electrical Engineering and Computer Science, College of Engineering, Embry-Riddle Aeronautical University(电气工程与计算机科学系,工程学院,埃姆布里-里德航空大学)

AI总结 提出PILOT框架,通过冻结原网络参数并引入并行导数分支捕获新类边界信息,实现实时语义分割模型在无需旧数据情况下的增量学习,有效缓解灾难性遗忘。

详情
AI中文摘要

实时语义分割模型在准确性和推理速度之间取得了极好的平衡。然而,将这些模型部署在动态的真实世界环境中,通常需要能够在不重新训练整个数据集的情况下增量地学习新类别。这种能力被称为持续学习。在这方面,深度学习中的标准微调方法常常因灾难性遗忘而失败,即模型学习新信息但忘记了先前训练和学习的类别。针对这一关键领域,本文提出了一种针对PIDNet的新型持续学习框架,PIDNet是一种被广泛引用的最先进的实时语义分割模型。我们的方法PILOT(并行增量学习随时间)通过实现一个并行导数分支(D-branch)引入了一种实时且轻量级的策略,该分支旨在捕获新类别的高频边界信息,同时冻结原始分割网络的训练参数。这种新颖的设置允许模型适应新的语义类别,同时保留先前学习类别的知识。通过仅使用与新类别相关的数据,我们的模型显著减少了训练开销。实验结果表明,我们的方法成功分割了新类别,同时在原始基类上保持了较高的平均交并比(mIoU),从而在该领域轻松超越了所有主要的持续学习方法。总体而言,PILOT被证明能有效缓解灾难性遗忘,同时对推理延迟影响最小,从而保持实时性能。

英文摘要

Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real-world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously trained and learned classes. Contributing to this crucial domain, the current paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.
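The abstract above reports results as mean Intersection over Union (mIoU). As a reminder of what that metric measures, here is a minimal sketch on flattened label lists. The function name and the toy labels are made up for illustration, and skipping classes absent from both prediction and ground truth is one common convention rather than necessarily the paper's.

```python
def per_class_iou(pred, target, num_classes):
    """Return {class_id: IoU} for every class present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = {}
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious[c] = intersection / union
    return ious


def mean_iou(ious, class_ids=None):
    """Average IoU over all scored classes, or over a chosen subset."""
    selected = list(ious) if class_ids is None else [c for c in class_ids if c in ious]
    if not selected:
        raise ValueError("no classes to average over")
    return sum(ious[c] for c in selected) / len(selected)


# Toy example: classes 0 and 1 play the role of base classes,
# class 2 the newly added class (made-up pixel labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, target, num_classes=3)
miou_all = mean_iou(ious)
miou_base = mean_iou(ious, class_ids=[0, 1])
```

Averaging separately over base classes and new classes is a convenient way to see forgetting: a drop in the base-class average after learning the new class indicates that earlier knowledge was lost.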

2605.27113 2026-05-27 cs.LG cs.AI 版本更新

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

使用GAN-扩散框架的高质量合成金融时间序列

Giuseppe Masi, Andrea Coletta, Novella Bartolini

发表机构 * Sapienza University of Rome(罗马第一大学)

AI总结 提出一种结合GAN和扩散模型的质量感知生成框架,通过GAN的Critic引导扩散过程,生成更真实且保留金融时间序列典型事实和资产间相关结构的合成数据。

详情
AI中文摘要

近年来,金融机构和公司越来越多地采用合成数据来解决数据稀缺问题并生成反事实市场情景。然而,再现金融时间序列的所有统计特性(通常称为典型事实)对于许多现有的通用架构来说仍然是一个开放的挑战。在本文中,我们提出了一种质量感知生成框架,该框架结合了两类生成方法,展示了它们的集成如何解决现有局限性,同时增强合成数据的真实性。具体来说,我们首先引入CoMeTS-GAN(相关多变量时间序列GAN),这是一种条件生成对抗网络(C-GAN),旨在联合生成相关股票的中价和成交量时间序列。然后,我们展示了如何将我们的GAN架构整合到最先进的扩散模型中,以提高生成的相关结构的质量。具体来说,GAN的Critic作为一个质量评估模块,指导扩散过程,在生成的时间序列中强制执行学习到的相关结构。我们的框架为真实的股票市场模拟提供了一种轻量级且响应迅速的解决方案,明确建模了资产间的相关结构。我们通过实验将我们的框架与领先的生成架构进行了比较,表明它更有效地捕捉了股票市场的典型事实并建模了资产间的相关性。

英文摘要

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

2605.27097 2026-05-27 cs.LG stat.ML 版本更新

Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias

正交数据上的轻度过参数化ReLU网络:增量学习与隐式偏差

James Town, Etienne Boursier, Ben Lewis, Matthias Englert, Ranko Lazic

发表机构 * University of Warwick(华威大学) INRIA LMO, Université Paris-Saclay(巴黎-萨克雷大学INRIA LMO)

AI总结 研究从微小初始化出发的两层ReLU网络在正交数据上的梯度流动力学,揭示了当初始化尺度趋近零时极限流收敛到鞍点间跳跃过程,并证明网络在宽度m约大于log(n)时高概率插值训练数据,且学习到的插值器的平方ℓ2范数缩放为√n,与最小ℓ2范数插值器相差常数因子。

Comments 66 pages, 6 figures

详情
AI中文摘要

神经网络的成功训练依赖于一阶优化方法的使用,但这些方法的理论刻画仍不完整,尤其是在轻度过参数化设置下。本文研究从微小初始化出发的两层ReLU网络在正交训练数据上的梯度流动力学。我们证明,当初始化尺度趋近零时,极限流收敛到鞍点间跳跃过程,揭示了在每个鞍点处激活一个新神经元的增量学习现象。该分析恢复了Dana等人(2025, arXiv:2502.16977)的已知结果:只要$m \gtrsim \log(n)$(其中$m$是网络宽度,$n$是训练样本数),网络就以高概率插值训练数据。这一增量过程刻画还使我们能够推导出一个新的隐式偏差结果:学习到的插值器具有平方$\ell_2$范数缩放为$\sqrt{n}$,这处于最小$\ell_2$范数插值器的常数因子内。更广泛地,我们的工作为ReLU网络的增量学习过程提供了首个严格证明,同时表明轻度过参数化网络可以收敛到复杂度与最优插值器同阶的插值解。

英文摘要

The successful training of neural networks hinges on the use of first order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared $\ell_2$-norm scaling as $\sqrt{n}$, which is within a constant factor of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.

2605.27093 2026-05-27 stat.ML cs.LG 版本更新

Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

基于高斯过程的学习:相关矩阵上Wishart先验的新MCMC实现

Kane Warrior, Dalia Chakrabarty

发表机构 * Department of Mathematics(数学系) University of York(约克大学)

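AI summary Proposes a "self-assembled" Wishart prior on the covariance matrix, combined with MCMC-based Bayesian inference on the kernel hyperparameters. A look-back window over recent iterations makes the chain adaptive, and the prior helps diagnose weakly informative inputs.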

详情
AI中文摘要

在输入-输出关系的概率监督学习中(作为高斯过程(GP)的样本函数),通常为核的超参数指定先验,这些超参数参数化GP的协方差函数,其中(所得多元正态)似然的诱导协方差矩阵控制学习和预测。当所寻求的函数高度多元时,必须同时学习多个长度尺度参数,使得推断困难。我们为协方差矩阵开发了一种“自组装”Wishart先验,同时使用MCMC对核超参数进行贝叶斯推断。该构造使用最近MCMC迭代的回溯窗口来定义依赖于时间步长的尺度矩阵,从而为链引入自适应性。结果表明,在基于GP的学习范式中,对协方差矩阵的直接先验指定可用于诊断弱信息输入。我们通过两个不同的实证示例支持我们的先验开发——一个基于合成数据,另一个基于真实世界数据集。

英文摘要

In probabilistic supervised learning of an input-output relationship - as a sample function of a Gaussian Process (GP) - priors are typically specified for the hyperparameters of the kernel that parametrises the covariance function of the GP, where the induced covariance matrix of the (resulting multivariate Normal) likelihood governs the learning and prediction. When the sought function is highly multivariate, multiple lengthscale parameters must be learnt simultaneously, making inference difficult. We develop a "self-assembled" Wishart prior for the covariance matrix, while undertaking Bayesian inference on the kernel hyperparameters using MCMC. The construction uses a look-back window over recent MCMC iterations to define a time-step dependent scale matrix, thereby introducing adaptiveness to the chain. Results suggest that direct prior specification on the covariance matrix can be useful for diagnosing weakly informative inputs within the GP-based learning paradigm. We support our prior development with two distinct empirical illustrations - one on synthetic data, and another on a real-world dataset.

2605.27088 2026-05-27 cs.CL cs.LG 版本更新

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

LLMs 已经是好导师:面向教学数学辅导的无训练提示优化

Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee, Jeongsu Moon, Kyungtae Joo, Eunjoo Lee, Hoilym Kwon

发表机构 * Korea University Sejong Campus(韩国大学世宗校区) Gyeonggi Institute of Education(京畿教育学院) Indiana University Bloomington(印第安纳大学布卢明顿分校) Opentutorials Chosun University(全州大学) Korea University Korean Studies Center(韩国大学韩学研究中心)

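AI summary Investigates training-free prompt optimization, which evolves only the system prompt via API calls. Proposes 5 education-specialized methods and evaluates 12 methods on 2 OOD benchmark suites. The best configuration of every method surpasses the strongest RL-trained baseline, and ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness.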

Comments 17 pages, 5 figures

详情
AI中文摘要

将LLMs与数学辅导对齐通常需要基于RL的训练和多GPU基础设施。我们研究无训练提示优化——仅通过API调用演化系统提示——是否可以作为实用替代方案。我们改编了7种已发表方法并提出了5种教育专用方法,在2个OOD基准套件上的5种条件下评估这12种方法。所有12种最佳方法配置均超越了最强的RL训练基线(R_total = 0.633),我们的ParetoGrad在事后解决率、泄漏控制和有用性上实现了最佳帕累托平衡,而非在任何单一组件上占优。使用包含82个代码的教育代码本进行行为分析发现,无训练方法依赖教学知识模式的频率是RL训练模型的2-3倍,同时意图级脚手架减少了约10个百分点。我们还发现一个任务依赖的推理模式效应,在无训练和基于RL的范式中一致。我们的方法仅通过提示和最小计算即可高效开发教学对齐的LLM导师。

英文摘要

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization (evolving only the system prompt via API calls) can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.

2605.27081 2026-05-27 cs.LG cs.AI cs.DC 版本更新

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

ReMoE: 在内存受限的MoE大模型推理中通过路由器微调提升专家重用

Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing 100191, China(北京航空航天大学计算机科学与工程学院) Huawei Technologies Ltd(华为技术有限公司)

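AI summary Proposes ReMoE, a router fine-tuning framework that biases the router toward recently selected experts, producing temporally stable routing and fewer expert fetches from external storage. It improves expert reuse by 26% while maintaining downstream task performance, and in real systems yields an 8.4% output-throughput gain and a 1.77-1.99x decode speedup.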

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

细粒度混合专家(MoE)模型对每个token仅稀疏激活一部分专家,在保持高模型容量的同时减少激活计算。然而,在内存受限的推理场景中,只能缓存少量专家。未缓存的专家必须从慢速外部存储(如UFS)获取,导致频繁的驱逐和大量的I/O开销。我们提出ReMoE,一个路由器微调框架,旨在提升token级别的专家重用。ReMoE使路由器偏向近期选中的专家,产生时间稳定的路由,更好地匹配缓存局部性约束。通过增加短时专家重用,ReMoE减少了从存储中获取专家,且不增加推理计算开销。在DeepSeek和Qwen模型上的实验表明,ReMoE在保持下游任务性能的同时将专家重用提升了26%。实际系统评估进一步证实了这些优势:在vLLM GPU-CPU专家卸载下,输出吞吐量提升8.4%;在Jetson Orin NX上的llama.cpp中,TPOT降低43.6-49.8%,对应不同工作负载下1.77-1.99倍的解码加速。检查点和使用说明见https://github.com/BUAA-OSCAR/ReMoE。

英文摘要

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
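The two headline latency figures in the abstract are related by simple arithmetic: if time per output token (TPOT) falls by a fraction r, decode throughput rises by a factor of 1 / (1 - r). A minimal sketch of that conversion follows; the function name `decode_speedup` is a hypothetical helper for illustration and is not part of ReMoE, vLLM, or llama.cpp.

```python
def decode_speedup(tpot_reduction: float) -> float:
    """Speedup implied by a fractional reduction in time per output token (TPOT).

    A reduction r means the new TPOT is (1 - r) times the old one,
    so tokens are produced 1 / (1 - r) times as fast.
    """
    if not 0.0 <= tpot_reduction < 1.0:
        raise ValueError("tpot_reduction must be in [0, 1)")
    return 1.0 / (1.0 - tpot_reduction)


# The two endpoints of the TPOT reduction range quoted in the abstract.
for reduction in (0.436, 0.498):
    print(f"{reduction:.1%} lower TPOT -> {decode_speedup(reduction):.2f}x decode speedup")
```

Under this relation, the quoted 43.6-49.8% TPOT reductions are consistent with the quoted 1.77-1.99x speedups.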

2605.27079 2026-05-27 cs.LG cs.AI cs.RO 版本更新

Trust Region Q Adjoint Matching

信任区域Q伴随匹配

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

发表机构 * KAIST AI(韩国科学技术院人工智能) Seoul National University(首尔国立大学) RLWRLD

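AI summary To address the instability of off-policy reinforcement learning for pretrained flow policies, proposes Trust Region Q-Adjoint Matching (TRQAM), which adaptively controls the path-space KL divergence through projected dual descent for stable fine-tuning. It reaches a 68% overall success rate in offline RL across 50 OGBench tasks.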

详情
AI中文摘要

由于多步采样过程带来的优化不稳定性,预训练流策略的离策略强化学习仍然具有挑战性。最近,带有伴随匹配的Q学习(QAM)通过将问题重新表述为一个具有学习评论家的无记忆随机最优控制(SOC)问题来解决这一问题。然而,QAM继承了评论家引导改进的根本脆弱性:当评论家病态时,小的评论家误差会被放大,通常导致模型崩溃。本文引入了信任区域Q伴随匹配(TRQAM),一种稳定的离策略微调算法,通过投影对偶下降自适应地控制与预训练流策略的路径空间KL散度。具体来说,我们优化SOC动力学中的信任区域参数$λ$,并从理论上证明路径空间KL可以用$λ$的闭式函数表示。因此,我们的方法可以精确控制与预训练流策略的精确偏差,实现稳定的离策略强化学习。通过在50个OGBench任务上的实验,TRQAM在离线强化学习和离线到在线强化学习中都持续优于先前的方法。特别是,TRQAM在离线强化学习中实现了68%的总体成功率,显著提高了最强基线的46%。

英文摘要

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the optimization instability arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in the SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from pretrained flow policies, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline at 46%.

2605.27076 2026-05-27 cs.MA cs.LG 版本更新

Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

审查反馈下结构学习的代价:一种阈值-老虎机方法

Michael Ledford, William Regli

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

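AI summary For the censored-feedback setting in which a task yields reward only when the coalition reaches an unknown size threshold, proposes the Threshold-Activated Cooperative Multi-Armed Bandit model. The centralized algorithm C-TAC achieves O(log T) cumulative regret, and the decentralized event-triggered protocol D-TAC cuts communication 23x while preserving feasibility alignment.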

详情
AI中文摘要

在许多多智能体应用中,任务仅当由满足未知规模阈值的联盟执行时才产生奖励;否则,反馈完全被审查。这种审查造成了可识别性问题:智能体无法区分随机失败与协调不足。我们将此设置形式化为阈值激活合作多臂老虎机(TAC-MAB),并在集中式和去中心化协调下进行分析。我们证明集中式算法(C-TAC)实现了累积遗憾O(log T),该遗憾分解为结构搜索项(捕获在审查反馈下解决可行性的代价)和统计监控项(用于价值估计)。然后我们引入D-TAC,一种去中心化事件触发协议,其中智能体仅在其结构信念改变时进行同步。实验表明,在保守信念融合下,D-TAC相对于集中式基线实现了23倍的通信减少,同时保持了可行性对齐。这些结果刻画了在审查反馈下学习的协调代价,并表明无需持续同步即可实现接近集中式的通信效率。

英文摘要

In many multi-agent applications, tasks yield rewards only when executed by a coalition meeting an unknown size threshold; otherwise, feedback is fully censored. This censorship creates an identifiability problem: agents cannot distinguish stochastic failure from insufficient coordination. We formalize this setting as the Threshold-Activated Cooperative Multi-Armed Bandit (TAC-MAB) and analyze it under both centralized and decentralized coordination. We show that a centralized algorithm (C-TAC) achieves cumulative regret O(log T), decomposed into a structural-search term that captures the cost of resolving feasibility under censored feedback and a statistical-monitoring term for value estimation. We then introduce D-TAC, a decentralized event-triggered protocol in which agents synchronize only when their structural beliefs change. Empirically, D-TAC achieves a 23x reduction in communication relative to the centralized baseline while preserving feasibility alignment under conservative belief fusion. These results characterize the coordination cost of learning under censored feedback and show that near-centralized communication efficiency is achievable without continuous synchronization.
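The identifiability problem described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions: rewards are taken to be Bernoulli, and `censored_reward` with its parameters is a hypothetical helper, not the TAC-MAB formulation or the C-TAC/D-TAC algorithms from the paper.

```python
import random


def censored_reward(mean: float, threshold: int, coalition_size: int,
                    rng: random.Random) -> int:
    """Toy threshold-activated arm (Bernoulli rewards are an assumption).

    Below the unknown threshold the feedback is fully censored and the
    agents observe 0; at or above it they observe a Bernoulli(mean) draw.
    """
    if coalition_size < threshold:
        return 0  # censored: no information about `mean` leaks out
    return 1 if rng.random() < mean else 0


rng = random.Random(0)

# A good arm attempted by a coalition that is too small (needs 3, has 2) ...
too_small = [censored_reward(0.9, 3, 2, rng) for _ in range(5)]
# ... versus a feasible coalition on an arm that never pays out.
weak_arm = [censored_reward(0.0, 1, 2, rng) for _ in range(5)]

# Both streams are all zeros, so the agents cannot tell them apart
# without deliberately varying the coalition size.
print(too_small, weak_arm)
```

In this toy setting the only way to separate the two cases is to try larger coalitions, which corresponds loosely to the structural-search cost the abstract refers to.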

2605.27073 2026-05-27 cs.LG 版本更新

Learning to Orchestrate Agents under Uncertainty

学习在不确定性下编排智能体

Mary Chriselda Antony Oliver, Lan Jiang, Aaron Bundi Anampiu, Elaf Almahmoud, Francesco Quinzan, Umang Bhatt

发表机构 * Department of Applied Mathematics and Theoretical Physics, University of Cambridge(应用数学与理论物理系,剑桥大学) Centre for Human-Inspired Artificial Intelligence, University of Cambridge(启发式人工智能中心,剑桥大学) African Institute for Mathematical Sciences, South Africa(南非数学科学研究所) Department of Engineering Science, University of Oxford(工程科学系,牛津大学)

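AI summary Proposes BOT-Orch, a framework that recasts orchestration as a regularised bandit problem over agents, enabling adaptive orchestration of heterogeneous agents under uncertainty, with an O(√T) regret guarantee and better performance than baselines.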

详情
AI中文摘要

异构智能体的自适应编排需要在不确定且不断演化的智能体行为下做出顺序委派决策,例如协调具有不同可靠性、成本和响应质量的专门AI模型。虽然先前关于智能体编排的工作侧重于性能或成本,但通常未在编排层面显式建模智能体可靠性和输出分布的不确定性。在这项工作中,我们研究了不确定性下异构智能体的自适应编排问题,其中元控制器必须决定何时委派给某个智能体,同时考虑可靠性、成本和不确定性。我们提出了BOT-Orch,一个轻量级框架,将编排重新表述为智能体上的赌博机问题,并通过智能体输出分布与任务特定参考分布之间的OT距离进行正则化。我们证明,在标准假设下,正则化编排享有O(√T)的遗憾界,并能在具有相同平均奖励但分布对齐不同的智能体之间可证明地诱导偏好排序。实验上,我们展示了BOT-Orch在具有异构、非独立同分布智能体行为的合成但对抗性任务分配设置中优于标准赌博机和启发式基线。

英文摘要

Adaptive orchestration of heterogeneous agents requires making sequential delegation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to an agent, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that recasts orchestration as a bandit problem over agents, regularized by OT distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys $\mathcal{O}(\sqrt{T})$ regret under standard assumptions, and provably induces preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
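To make the idea of an OT-regularised preference concrete, the sketch below scores agents by mean reward minus a penalty proportional to a one-dimensional Wasserstein-1 distance between the agent's outputs and a reference sample. It is an illustrative simplification, not the BOT-Orch algorithm: the scalar outputs, the equal sample sizes, the weight `lam`, and the function names are all assumptions.

```python
def wasserstein1_1d(xs: list[float], ys: list[float]) -> float:
    """Wasserstein-1 distance between two equal-size empirical samples in 1D.

    For equal-size samples on the real line this is the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def regularised_score(mean_reward: float, outputs: list[float],
                      reference: list[float], lam: float) -> float:
    """Mean reward penalised by distance from the reference distribution."""
    return mean_reward - lam * wasserstein1_1d(outputs, reference)


reference = [0.4, 0.5, 0.6]   # hypothetical task-specific reference outputs
aligned = [0.4, 0.5, 0.6]     # agent whose outputs match the reference
spread = [0.0, 0.5, 1.0]      # agent with the same mean but a wider spread

# Both agents are given the same mean reward; only the penalty differs.
score_aligned = regularised_score(0.5, aligned, reference, lam=1.0)
score_spread = regularised_score(0.5, spread, reference, lam=1.0)
print(score_aligned > score_spread)
```

With identical mean rewards, the penalty alone breaks the tie in favour of the better-aligned agent, which is the kind of preference ordering the abstract describes.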

2605.27063 2026-05-27 cs.LG 版本更新

Learning Dynamic Graph Representations through Timespan View Contrasts

通过时间跨度视图对比学习动态图表示

Yiming Xu, Zhen Peng, Bin Shi, Xu Hua, Bo Dong

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) School of Distance Education(继续教育学院)

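AI summary Proposes CLDG and CLDG++, dynamic graph representation frameworks built on temporal translation invariance. They use contrastive learning across timespans and a multi-scale contrastive objective to improve node classification and dynamic graph anomaly detection.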

Comments Accepted by Neural Networks

详情
AI中文摘要

图蕴含的丰富信息激发了对无监督图表示的进一步研究。现有研究主要依赖静态图中的节点特征和拓扑属性来创建自监督信号,忽略了真实世界图数据携带的时间成分,例如边的时间戳。为了克服这一局限,本文探索了如何在动态图上优雅地建模时间演化。具体地,我们引入了一种新的归纳偏置,即时间平移不变性,它说明了同一节点在不同时间跨度上倾向于保持相似标签。基于这一假设,我们开发了一个动态图表示框架CLDG,通过在不同时间跨度上进行对比学习,鼓励节点保持局部一致的时间平移不变性。除了仅考虑显式拓扑链接的标准CLDG,我们进一步提出的CLDG++额外采用图扩散来揭示节点之间的全局上下文相关性,并设计了一个由局部-局部、局部-全局和全局-全局对比组成的多尺度对比学习目标,以增强表示能力。有趣的是,通过测量不同时间跨度之间的一致性来形成异常指标,CLDG和CLDG++无缝集成到动态图异常检测任务中,这在金融、网络安全和医疗保健等许多高影响力领域具有广泛应用。实验表明,CLDG和CLDG++在节点分类和动态图异常检测等下游任务中均表现出理想的性能。此外,CLDG通过隐式利用时间线索而不是复杂的序列模型,显著降低了时间和空间复杂度。

英文摘要

The rich information underlying graphs has inspired further investigation of unsupervised graph representation. Existing studies mainly depend on node features and topological properties within static graphs to create self-supervised signals, neglecting the temporal components carried by real-world graph data, such as timestamps of edges. To overcome this limitation, this paper explores how to model temporal evolution on dynamic graphs elegantly. Specifically, we introduce a new inductive bias, namely temporal translation invariance, which captures the tendency of the same node to keep similar labels across different timespans. Based on this assumption, we develop a dynamic graph representation framework, CLDG, that encourages each node to maintain locally consistent temporal translation invariance through contrastive learning on different timespans. Whereas standard CLDG considers only explicit topological links, our further proposed CLDG++ additionally employs graph diffusion to uncover global contextual correlations between nodes, and designs a multi-scale contrastive learning objective composed of local-local, local-global, and global-global contrasts to enhance representation capabilities. Interestingly, by measuring the consistency between different timespans to shape anomaly indicators, CLDG and CLDG++ are seamlessly integrated with the task of spotting anomalies on dynamic graphs, which has broad applications in many high-impact domains, such as finance, cybersecurity, and healthcare. Experiments demonstrate that CLDG and CLDG++ both exhibit desirable performance in downstream tasks including node classification and dynamic graph anomaly detection. Moreover, CLDG significantly reduces time and space complexity by implicitly exploiting temporal cues instead of complicated sequence models.

2605.27062 2026-05-27 cs.CL cs.LG 版本更新

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

FalAR: 一个大规模说话人标注的欧洲葡萄牙语议会会议语音语料库

Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

发表机构 * INESC-ID Instituto Superior Técnico(理工学院)

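AI summary To address the shortage of European Portuguese speech resources, builds the FalAR corpus of 5,800 hours of parliamentary speech with speaker annotations. Experiments show that using it as pre-training data yields up to 14% relative WER improvement.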

Comments Published in LREC2026

详情
AI中文摘要

自动语音识别(ASR)的最先进性能在很大程度上依赖于大规模标注语料库的可用性。这增加了数据收集工作的需求,特别是对于代表性不足的语言和方言变体。由于欧洲葡萄牙语(EP)的说话人数量较少(约1100万),在目前可用的大规模语音数据资源中,它被巴西葡萄牙语(BP)(约2亿说话人)所掩盖,导致EP用户的语音系统性能不佳。为了弥补这一差距,并遵循其他语言的类似数据收集工作,我们提出了FalAR,一个大规模、说话人标注的欧洲葡萄牙语议会会议语音语料库。FalAR涵盖约20年,包含5800小时的语音数据。此外,4850小时具有说话人身份标注,总共1180个说话人,附带元数据包括年龄、性别、政治派别和议会角色。该语料库使用最先进的EP CAMÕES ASR模型进行转录参考对齐。在本文中,我们描述了数据收集过程以及FalAR语料库的主要特征。此外,我们评估了数据量和对齐准确性对ASR性能的权衡,实验表明,将FalAR作为预训练数据可以使基线模型的词错误率相对降低高达14%。

英文摘要

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

2605.27050 2026-05-27 cs.CL cs.LG 版本更新

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

BhashaSetu:一种以数据为中心的低资源机器翻译方法

Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

发表机构 * Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai(孟买韦尔马塔·吉贾拜技术学院计算机工程与信息技术系) Tübingen AI Center, University of Tübingen, Germany(图宾根大学图宾根人工智能中心,德国)

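AI summary Presents BhashaSetu, a large-scale, multi-domain, morphology-aware English-Marathi parallel corpus, and shows that corpus-level deduplication has a key impact on low-resource neural machine translation quality.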

详情
AI中文摘要

我们提出了BhashaSetu,一个语言丰富的英语-马拉地语平行数据集,解决了低资源神经机器翻译(NMT)中持续存在的数据限制问题。马拉地语有超过9500万使用者,但在不同领域的高质量平行语料库中仍然代表性不足。我们的数据集包含来自新闻、政治、医疗、文学和文化等异构来源的278万个句子对,并提供了词干化和词形还原表示以支持形态感知分析。我们使用BLEU、spBLEU、chrF++和TER指标对多个最先进的翻译模型进行了基准测试,并使用LoRA对NLLB-200-distilled-600M进行了参数高效微调。我们消融实验的一个关键发现是:语料库级去重是预处理中对下游质量贡献最大的单一因素(去除它会使性能降低1.17 BLEU和2.21 chrF++),这表明对于低资源、形态丰富的语言,有纪律的跨源语料库卫生是一种低成本、高影响力的干预措施。该数据集已公开发布,以促进可重复且语言信息丰富的低资源NMT研究。

英文摘要

We present BhashaSetu, a linguistically enriched English-Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
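As a rough illustration of what corpus-level (cross-source) deduplication can look like, the sketch below merges sentence pairs from several sources and keeps the first occurrence of each normalised pair. The abstract does not specify the normalisation or matching rule, so the whitespace and case folding here, and the names `normalise` and `deduplicate`, are assumptions for illustration only.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and fold case (an assumed normalisation)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each normalised (source, target) pair."""
    seen: set[tuple[str, str]] = set()
    unique: list[tuple[str, str]] = []
    for src, tgt in pairs:
        key = (normalise(src), normalise(tgt))
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Hypothetical pairs drawn from two different sources.
news = [("Hello", "नमस्कार"), ("Thank you", "धन्यवाद")]
literature = [("hello ", "नमस्कार"), ("Thank you", "धन्यवाद")]

merged = deduplicate(news + literature)
print(len(merged))
```

Deduplicating after merging, rather than within each source separately, is what removes pairs that recur across sources.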

2605.27043 2026-05-27 stat.ML cs.LG stat.ME 版本更新

Causal Representation Learning for Generalisable Recommendation

因果表示学习用于可泛化推荐

Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee

发表机构 * University of Warwick(沃里克大学) Spotify(Spotify公司) University College London(伦敦大学学院)

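AI summary To address the generalisation gap caused by the mismatch between training and deployment distributions in recommender systems, proposes an information-theoretic disentanglement criterion motivated by causal representation learning, together with a tractable variational lower bound. Using only confounded logs, it improves generalisation under distribution shift, as validated in a Spotify A/B test, on the KuaiRand dataset, and on a synthetic benchmark.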

详情
AI中文摘要

基于观测数据训练的预测模型在部署时往往无法泛化到所遇到的分布,尤其是当训练数据是被优化系统的产物时。推荐系统是一个典型例子:它们是在被部署策略、过去用户行为和平台过滤混淆的交互日志上训练的。因此,训练分布与在服务时评分的候选分布存在显著差异,这种差距使得离线指标无法可靠预测在线性能。我们通过一种受因果表示学习(CRL)启发的方法来解决分布偏移问题。我们提出了一种信息论解缠标准,并证明其最优值仅取决于输入的因果成分。然后,我们推导出一个可处理的变分下界,使得该标准仅从有限观测数据中即可优化。我们的方法范围比大多数CRL文献更窄,因为我们目标是改善分布偏移下的泛化能力,而非完全识别所有潜在因果因素。这个更窄的目标使得该方法实用,仅需要现有的混淆日志,适用于任何标准监督模型,且不增加推理时间成本。我们的主要评估是在Spotify上对数百万用户进行的A/B测试,应用于个性化播放列表生成的排序器。一个容量匹配的CRL变体在离线性能上相当,但在在线听众参与度上带来了显著提升。在公开的KuaiRand推荐数据集和具有已知因果结构的合成基准上的补充证据显示了相同模式:与基线离线持平,在分布偏移下获得收益。在所有三种设置中,加入我们的因果解缠目标都带来了更有意义的分布外泛化。

英文摘要

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

2605.27033 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) ICREA

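AI summary Proposes the s-Trace method for estimating the subgraph of size s that best approximates the full model output. Finds that LLM computation splits into two phases, a sparse early-layer core followed by denser later-layer refinement, and that the amount of computation needed correlates with model uncertainty.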

详情
AI中文摘要

基于Transformer的大型语言模型(LLMs)由数十亿个参数组成,这些参数排列在深度和宽度都很大的计算图中,但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法,以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法,我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点(主要位于后期层,且越来越多地由注意力头组成)会导致近似完整输出分布的逐步细化。此外,我们发现每个输入所需的计算量与模型不确定性相关,并且更稀疏的子图编码浅层统计信息,例如单字频率。总体而言,我们的结果表明,有效的LLM计算中存在一致的模块化组织,其中稀疏的早期层核心提供粗略预测,然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

2605.27028 2026-05-27 cs.LG cs.AI 版本更新

Less is More: Early Stopping Rollout for On-Policy Distillation

少即是多:用于在线策略蒸馏的早停展开

Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Beijing Institute of General Artificial Intelligence(北京通用人工智能研究院)

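AI summary To address the "Off-policy Teacher Decay" problem in on-policy distillation, proposes Early Stopping Rollout (ESR), which restricts rollout generation to the first few response tokens and improves performance, GPU efficiency, and training stability.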

详情
AI中文摘要

在线策略蒸馏最近成为标准序列级模仿的有前途的替代方案,通过使用教师模型对学生自身的展开进行评分来训练学生。然而,我们观察到这种范式中的“离策略教师衰减”问题:对于后面的token,由于学生的早期轨迹作为上下文对于教师来说是离策略的,教师产生纠正性分数的能力会衰减,并可能退回到预训练阶段学习的token补全行为。我们通过实验验证了这个问题,并提出了早停展开(ESR)来解决它:一种简单而有效的蒸馏策略,仅限制展开生成到前几个响应token。我们表明,ESR在模型大小、家族、任务和训练制度上均超越了全展开在线策略蒸馏的性能,并且在跨模型家族场景下表现出更高的GPU效率和训练稳定性。我们进一步研究了这一惊人性能背后的机制,发现了ESR的“级联对齐”和“子模式承诺”效应,这可能解释其为何有效,甚至有时超过教师模型性能。此外,我们表明这种基于位置的token选择策略不能完全由KL散度和熵信号解释。

英文摘要

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy to the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to token-completion behavior learned in the pre-training stage. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout on-policy distillation (OPD) across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and discover the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.

2605.27027 2026-05-27 cs.LG 版本更新

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

SQARL: 一种适用于分布式量子架构中电路分配的大小无关强化学习方法

Víctor Carballo, Júlia López-Closa, Mario Martin

发表机构 * Computer Science Department, Universitat Politècnica de Catalunya - BarcelonaTech (UPC) High Performance Artificial Intelligence group, Barcelona Supercomputing Center

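AI summary For the qubit allocation problem in distributed quantum computing, proposes a flexible transformer-based reinforcement learning architecture that handles arbitrary numbers of qubits and cores without retraining. It reduces allocation cost by 33% relative to Hungarian Qubit Allocation (HQA) on the Cuccaro Adder and by 25% on average for random circuits.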

详情
AI中文摘要

量子处理器的扩展目前受到退相干和串扰等技术挑战的限制。随着量子比特数量的增加,干扰会增大计算噪声。分布式量子计算通过互连更小、更易处理的量子处理器(核心)来解决这些限制,但引入了最小化缓慢且易出错的核间通信的挑战。在最小化通信成本的同时将量子电路分配到核心的任务被称为量子比特分配问题。本文致力于开发一种深度学习方法来解决该问题,强调对量子硬件拓扑的灵活性,并提升现有最优性能。 启发式和非学习算法,如匈牙利量子比特分配(HQA),目前代表了最优水平。强化学习(RL)方法利用学习到的分配策略,但通常缺乏灵活性,当硬件配置改变时需要重新训练,并且其解的质量不如非学习方法。然而,学习机制可能超越人工设计的启发式方法。 为克服这些限制,本文提出一种灵活的基于Transformer的架构,无需重新训练即可处理任意数量的量子比特和核心。结果表明,训练后的策略持续优于先前的RL最优水平,并缩小了RL与HQA在大多数常见电路上的差距。对于Cuccaro加法器,它相对于HQA实现了33%的分配成本降低,对于随机电路平均降低25%。这些发现表明,基于学习的方法可以有效地匹配手工启发式方法的性能,这是向实际应用迈出的关键一步。

英文摘要

The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios.

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML 版本更新

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

发表机构 * CREST, ENSAE Institut Polytechnique de Paris(CREST,巴黎高等理工学院) Ekimetrics France(法国Ekimetrics) Centre Inria de l’Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学信息研究院)

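AI summary Through a systematic empirical study, evaluates the association between uncertainty estimators (information-theoretic, sampling-based, and reflexive) and LLM hallucinations. The association is highly variable and often weak, which challenges the use of uncertainty as a direct signal of hallucination.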

Comments 35 pages, 7 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)容易产生幻觉,即与输入或训练数据不符的陈述,阻碍了可靠部署。同时,许多不确定性估计(UE)方法被提出来量化模型置信度,并常被隐含地视为模型失败的代理。然而,不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联,而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器,包括信息论、基于采样和反思性估计器,并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉(违反输入忠实性)和外在幻觉(相对于训练数据的无根据主张),使用了四个互补基准,包括RAGTruth和HalluLens。我们发现,这种关联性高度可变且通常较弱,取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法,并阐明了何时它能提供可操作的信息。

英文摘要

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.

2605.27009 2026-05-27 cs.LG 版本更新

SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception

SCENT: 将质谱与分子结构对齐用于嗅觉感知

Ziqi Zhang, Eunyeong Jin, Miguel Vasco, Farzaneh Taleb, Nona Rajabi, Alexandra Gutmann, Jonathan Williams, Antônio H. Ribeiro, Danica Kragic

发表机构 * Dept. of Intelligent Systems, KTH Royal Institute of Technology(智能系统系,皇家理工学院) Atmospheric Chemistry Dept., Max Planck Institute for Chemistry(大气化学部,马克斯·普朗克化学研究所) Dept. of Information Technology, Uppsala University(信息科技系,乌普萨拉大学) Science for Life Laboratory (SciLifeLab), Uppsala(生命科学实验室(SciLifeLab),乌普萨拉)

AI总结 提出SCENT多模态对比学习框架,通过将电子电离质谱表示与预训练化学结构嵌入对齐,在无需分子结构的情况下实现与结构模型相当的嗅觉预测性能。

详情
AI中文摘要

从分子结构预测人类嗅觉感知已取得显著进展,但这些方法在推理时需要明确的化学结构,而这在实际传感场景中并不可用。我们通过探索直接电子电离质谱(EI-MS)作为嗅觉预测的替代输入模态来弥补这一差距,该传感技术可在数秒内获取化学信息丰富的碎片指纹。我们提出了谱图到化学嵌入对齐(SCENT),这是一个多模态对比学习框架,它将EI-MS表示与预训练的化学结构嵌入对齐,同时在推理时仅需要质谱。在多标签气味描述符预测任务中,SCENT显著优于仅使用MS的基线,并实现了与基于结构的模型相当的性能,尽管在测试时不需要明确的分子结构。学习到的表示还能更好地逼近连续的人类感知评分,并泛化到真实实验室测量的谱图,表明跨模态对齐是将分析谱图嵌入化学语义的有效策略。

英文摘要

Predicting human olfactory perception from molecular structure has seen remarkable progress, yet these approaches require explicit chemical structure at inference, which is not available in practical sensing settings. We address this gap by exploring direct electron ionization mass spectrometry (EI-MS), a sensing technique that acquires chemically informative fragmentation fingerprints in seconds, as an alternative input modality for olfactory prediction. We contribute Spectrum-to-Chemical Embedding alignmeNT (SCENT), a multi-modal contrastive learning framework that aligns EI-MS representations with pretrained chemical structure embeddings, while requiring only mass spectra at inference. On the multi-label odor descriptor prediction task, SCENT significantly outperforms MS-only baselines and achieves performance comparable to structure-based models, despite requiring no explicit molecular structure at test time. The learned representations also better approximate continuous human perceptual ratings and generalize to real-world lab-measured spectra, suggesting that cross-modal alignment is an effective strategy for grounding analytical spectra in chemical semantics.

2605.27006 2026-05-27 cs.LG cond-mat.dis-nn stat.ML 版本更新

Sampling Data with Chains of Forward-Backward Diffusion Steps

通过前向-反向扩散步骤链采样数据

Hyunmo Kang, Noam Itzhak Levi, Corinna Elena Wegner, Daniel J. Korchinski, Matthieu Wyart

发表机构 * Johns Hopkins University(约翰霍普金斯大学) EPFL(洛桑联邦理工学院) University of Göttingen(哥廷根大学)

AI总结 提出U-turn链,通过扩散模型的短前向-反向步骤迭代构造马尔可夫链,结合Metropolis-Hastings校正从能量修正目标中采样,并发现最小U-turn动力学经历由数据流形碎片化驱动的遍历性破缺相变。

详情
AI中文摘要

从学习到的高维分布中采样是一个基础的计算问题。我们引入U-turn链:通过迭代扩散模型的短前向-反向步骤获得的马尔可夫链,其中每一步提出一个保持在所学数据流形上的移动,并与Metropolis-Hastings校正配对,从能量修正目标中采样。对于合成语言,我们表明最小U-turn动力学经历由数据流形碎片化驱动的遍历性破缺相变;在更大的U-turn幅度下遍历性得以恢复。在非遍历区域,低层特征比高层特征松弛得更快,这种顺序仅在足够大的U-turn幅度下才会反转。我们在自然语言和自然图像上测试这些预测。在两种模态中,最小U-turn松弛缓慢,尤其是对于由CNN或LLM中深层表示近似的高层特征。层序反转仅在噪声足够大且混合高效时出现——这些特征与强约束、弱混合的局部动力学一致。我们讨论了这些结果对使用扩散模型采样的启示。

英文摘要

Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient -- signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.

2605.26998 2026-05-27 cs.LG q-bio.NC 版本更新

Probabilistic Recurrent Intention Switching Model

概率递归意图切换模型

Wenyuan Sheng, Hao Zhu, Joschka Boedecker

发表机构 * Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 提出PRISM模型,利用轻量级递归网络建模非平稳意图切换,实现精确EM分解和闭式求解,在网格世界、小鼠迷宫和机器人操作任务中取得最优似然并恢复可解释意图。

详情
AI中文摘要

逆强化学习(IRL)从观察到的行为中恢复奖励函数,但传统方法假设单一固定奖励,无法捕捉一个回合内的目标切换。最近的多意图IRL方法通过分割轨迹来解决这一问题,但将意图转换建模为无记忆马尔可夫链或通过固定历史窗口的手动状态增强。我们提出概率递归意图切换模型(PRISM),该模型用轻量级递归网络替代这两种机制,将观察历史映射到每步意图分布。我们证明由此产生的EM目标可以精确分解为独立的每意图奖励子问题,每个子问题可闭式求解,从而得到$\mathcal{O}(nK)$的E步,无需变分近似。我们在非马尔可夫网格世界、小鼠迷宫和BridgeData V2机器人操作(首个大规模多意图IRL机器人应用)上评估PRISM。在所有设置中,PRISM在保持最高留出对数似然的同时,从未标记的演示中恢复出可命名、时间上连贯的意图,表明离散目标切换存在于生物和人工智能体中。

英文摘要

Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.

2605.26990 2026-05-27 stat.ML cs.LG 版本更新

Constrained Bayesian Experimental Design via Online Planning

通过在线规划的约束贝叶斯实验设计

Yujia Guo, Daolang Huang, Xinyu Zhang, Sammie Katt, Samuel Kaski, Ayush Bharti

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Department of Computer Science, Aalto University, Finland(芬兰阿尔托大学计算机科学系) Department of Computer Science, University of Manchester, UK(英国曼彻斯特大学计算机科学系)

AI总结 提出一种结合离线预训练摊销策略和后验网络与在线多步前瞻规划(场景树)的方法,以在动态约束下优化贝叶斯实验设计,相比现有方法获得更优信息序列且计算开销适中。

Comments 24 pages, 9 figures. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

贝叶斯实验设计(BED)是一个用于数据高效顺序实验设计的理论框架。然而,现有的BED方法无法适应实际任务中由于预算限制、成本变化或物理约束(限制设计随时间演化)而产生的动态约束。在本文中,我们介绍了一种新的BED方法,通过将离线预训练的摊销策略和后验网络与使用场景树的在线多步前瞻规划相结合,实现了实验设计的约束优化。我们通过实验证明,在多种约束BED任务中,我们的方法相比现有方法产生了更信息丰富的设计序列,同时仅增加了适度的额外计算开销。

英文摘要

Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.

2605.26984 2026-05-27 cs.LG 版本更新

TED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph

TED:基于关联方交易的异构图偷漏税检测

Yiming Xu, Bin Shi, Bo Dong, Jiaxiang Wang, Hua Wei, Qinghua Zheng

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) School of Distance Education, Xi’an Jiaotong University(西安交通大学继续教育学院) School of Computing and Augmented Intelligence, Arizona State University(亚利桑那州立大学计算与增强智能学院)

AI总结 针对现有偷漏税检测方法未能充分利用税务场景中丰富交互信息的问题,提出一种基于异构图神经网络的TED模型,通过关联方交易组过滤噪声并设计层次注意力机制捕获深层语义,在真实数据集上显著优于现有方法。

Comments Accepted by Data Mining and Knowledge Discovery (DMKD25)

详情
AI中文摘要

偷漏税导致政府收入严重损失并扰乱公平竞争的经济秩序。为缓解这一问题,最新的偷漏税检测解决方案利用专家知识提取特征,然后训练分类器判断公司是否涉嫌偷漏税。然而,现有方案主要关注公司的统计特征,未能利用税务场景中丰富的交互信息,从而影响检测性能。在本文中,我们首先将税务场景建模为异构图,并研究异构图模型下的偷漏税检测问题。为了提高偷漏税检测的性能,提出了一种新颖的图神经网络模型来提取异构图的综合信息。具体来说,我们利用异构且复杂的关联方交易组来过滤低层噪声信息。此外,设计了一种层次注意力机制来捕获关联方交易组中隐藏的更深层次结构和语义信息。我们将该方法应用于税务局的真实风险管理系统,并在两个人工标注的真实世界税务数据集上进行评估。结果表明,我们的方法在偷漏税检测任务上显著优于现有最先进方法。

英文摘要

Tax evasion causes severe losses of government revenue and disturbs the economic order of fair competition. To help alleviate this problem, the latest tax evasion detection solutions utilize expert knowledge to extract features and then train classifiers to determine whether a company is suspected of tax evasion. However, existing solutions mainly focus on the statistical features of the company and fail to exploit the rich interactive information in tax scenarios, which hurts detection performance. In this paper, we first model the tax scenario as a heterogeneous graph and study the tax evasion detection problem under the heterogeneous graph model. To improve the performance of tax evasion detection, a novel graph neural network model is proposed to extract the comprehensive information of heterogeneous graphs. Specifically, we use heterogeneous and complex related party transaction groups to filter low-level noise information. Moreover, a hierarchical attention mechanism is designed to capture the deeper structure and semantic information hidden in the related party transaction group. We apply our method to the real risk management system of the tax bureau, and evaluate it on two human-labeled real-world tax datasets. The results demonstrate that our method significantly outperforms the state-of-the-art in the tax evasion detection task.

2605.26977 2026-05-27 cs.LG math.OC 版本更新

Convergence of Spectral Descent for Non-smooth Optimization

非光滑优化的谱下降收敛性

Yixuan Yang, Yuqing He, Song Li

发表机构 * School of Mathematical Sciences, Zhejiang University, Hangzhou, China(浙江大学数学科学学院,杭州,中国)

AI总结 研究Muon优化器的简化变体谱下降(SD)及其截断版本(TSD)在非光滑凸优化中的全局线性收敛性,并应用于鲁棒低秩矩阵恢复。

详情
AI中文摘要

Muon优化器最近在训练大型语言模型方面展示了显著的经验成功。然而,对其机制的理论理解仍然有限。目前Muon的收敛保证严重依赖于光滑性假设,其非光滑收敛行为在很大程度上未被探索。在这项工作中,我们通过研究谱下降(SD)(Muon的简化变体)及其截断版本截断谱下降(TSD),朝着弥合这一差距迈出了一步。在凸性、Lipschitz连续性和尖锐性条件下,我们建立了SD和TSD在非光滑凸公式中的全局线性收敛性。我们还研究了配备解耦权重衰减的正则化变体,并通过它们与Frank-Wolfe方法的联系推导出次线性收敛保证。最后,我们将我们的理论框架应用于混合稀疏和密集噪声下的鲁棒低秩矩阵恢复,并提供了严格的恢复保证。数值实验支持理论发现,并展示了Muon类型方法在非光滑优化中的有效性。

英文摘要

The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
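A minimal sketch of what a spectral-descent step might look like on a non-smooth convex objective follows. It assumes SD takes the usual steepest-descent form under the spectral norm, where the update direction is the orthogonal polar factor $UV^\top$ of a (sub)gradient $G = U\Sigma V^\top$. The entrywise $\ell_1$ objective, the geometric step-size decay, and all function names are assumptions made for illustration; the abstract does not state the exact update rule or schedule, and the truncated variant (TSD) is not shown.

```python
import numpy as np

def spectral_direction(grad):
    """Polar factor U @ Vt of the (sub)gradient: every singular value is set to 1."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def spectral_descent(x0, target, steps=60, step0=1.0, decay=0.9):
    """Run SD-style updates on f(X) = sum |X - target| (non-smooth, convex).

    The geometrically decaying step size is an assumed schedule, chosen
    because it is common for subgradient methods under sharpness.
    """
    x = x0.copy()
    for t in range(steps):
        subgrad = np.sign(x - target)  # one valid subgradient of the l1 objective
        x = x - step0 * decay**t * spectral_direction(subgrad)
    return x

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))   # hypothetical minimiser
x0 = np.zeros((4, 3))
x = spectral_descent(x0, target)

# Objective before and after; no particular convergence behaviour is claimed
# for this toy run.
print(np.abs(x0 - target).sum(), np.abs(x - target).sum())
```

The direction satisfies $\langle G, UV^\top\rangle = \|G\|_*$ (the nuclear norm), which is why it is the steepest-descent direction among matrices of unit spectral norm.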

2605.26973 2026-05-27 stat.ML cond-mat.dis-nn cs.LG cs.NE q-bio.NC 版本更新

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

信噪比与样本量控制神经网络中的表征对齐

Ali Hussaini Umar, Alessandro Laio

发表机构 * SISSA Trieste, Italy(意大利的里雅斯特SISSA研究所) Theoretical and Scientific Data Science (TSDS) group at the International School for Advanced Studies (SISSA)(国际先进研究学院(SISSA)理论与科学数据科学(TSDS)小组) Condensed Matter and Statistical Physics section at the International Centre for Theoretical Physics (ICTP)(国际理论物理中心(ICTP)凝聚态与统计物理部门)

AI总结 通过理论和实验证明,信噪比和训练样本量以单调和非单调方式分别影响神经网络表征对齐,且对齐程度在插值阈值附近最小,与泛化误差解耦。

详情
AI中文摘要

已知神经网络会发展出潜在表征,这些表征是“对齐”的,即在不同架构、训练协议或训练数据集训练的网络之间结构相似。我们在一个受控环境中研究这一现象,使用被噪声过程的独立实现扰动的训练集,训练一组网络执行回归和分类任务。我们表明,信噪比(SNR)和训练样本量以定性相似的方式影响对齐,无论是在真实世界数据集上训练的网络,还是在极其简单的具有单个隐藏层的线性网络中(其对齐可以解析估计)。在线性和非线性网络、回归和分类任务以及合成和真实数据中,我们一致观察到,对齐随SNR单调变化,但随训练样本量非单调变化。特别地,对齐在插值阈值附近最小,且更强的对齐不一定对应更好的泛化误差。这些发现揭示了数据质量和数量对对齐的非平凡依赖关系,且与泛化性能解耦。

英文摘要

Neural networks are known to develop latent representations that are aligned, namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple linear network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.

2605.26971 2026-05-27 cs.LG 版本更新

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

RLVR 数据集及其查找方法:通过数据溯源寻找更好的训练数据

Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu

发表机构 * National Key Laboratory for Multimedia Information Processing, Peking University(北京大学多媒体信息处理国家重点实验室) School of Computer Science, Peking University(北京大学计算机科学学院) LLM Department, Tencent(腾讯LLM部门)

AI总结 针对可验证奖励强化学习(RLVR)数据集来源不清的问题,提出基于谱系感知搜索的原子源追踪框架(ATLAS),追溯超过99.7%的实例至20个原子源,并基于源级反事实归因(SCA)原则构建去污染数据集DAPO++,其质量分数Q与下游RLVR性能强相关。

Comments 7 figures, 12 tables

详情
AI中文摘要

可验证奖励强化学习(RLVR)数据集的激增加剧了来源崩溃问题,原因是现有数据集之间的谱系不明确。为弥合这一碎片化的RLVR数据格局,我们提出了基于谱系感知搜索的原子源追踪(ATLAS),这是一个系统框架,用于将RLVR数据集追溯至其原子源,将145万个实例中的超过99.7%归因于20个原子源。我们的分析表明,大多数RLVR数据集是一小组共享上游源的变体,很少有引入真正新数据的,许多面临数据污染风险。这些发现自然促使我们策划一个新的RLVR数据集DAPO++,并从谱系感知的角度对现有数据集进行基准测试。为此,我们提出源级反事实归因(SCA)作为指导原则,以策划一个具有集中学习信号的去污染训练数据集。本质上,SCA通过比较每个原子源的RL检查点与共享基模型来测量样本的边际效用。基于这些归因信号,我们进一步设计了一个复合数据集质量分数Q,该分数与下游RLVR性能强相关。在Qwen3系列模型上的实验验证了DAPO++在保留基准上持续提升性能,而Q可靠地预测了下游RLVR训练效果。我们的代码和数据可在https://github.com/Celine-hxy/ATLAS获取。

英文摘要

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.

2605.26925 2026-05-27 quant-ph cs.LG 版本更新

Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization

自适应强化学习用于鲁棒开放量子系统控制:一种带有时间优化的多任务框架

Haftu W. Fentaw, Steve Campbell, Simon Caton

发表机构 * Centre for Quantum Engineering, Science, and Technology, University College Dublin(都柏林大学量子工程、科学与技术中心)

AI总结 提出一种多任务软演员-评论家(SAC)强化学习框架,用于开放量子系统控制,同时学习最优脉冲序列并发现特定问题的演化时间T和控制脉冲段数N,在51种哈密顿量变化下实现高保真度状态转移,并展现出优于GRAPE的鲁棒性。

详情
AI中文摘要

我们提出了一种多任务软演员-评论家(SAC)强化学习框架,用于跨不同哈密顿量的开放系统量子控制,该框架学习最优脉冲序列,同时发现特定问题的演化时间T和控制脉冲段数N。在51种哈密顿量变化上的实验结果表明,多任务SAC模型能够生成控制脉冲,在环境噪声下将系统从初始状态驱动到目标状态,并具有高保真度,为适用于实际噪声量子器件的通用量子控制奠定了必要基础。通过逐步扩展训练哈密顿量集,我们研究了使用给定数量样本哈密顿量训练的单个多任务模型是否能够成功完成来自同一哈密顿量空间但训练中未遇到的哈密顿量的状态转移任务。此外,我们的鲁棒性不保真度度量(RIM)分析表明,与GRAPE优化的控制相比,SAC训练的策略对脉冲幅度扰动和退相干率变化表现出更优越的鲁棒性。

英文摘要

We present a Multi-task Soft Actor-Critic (SAC) Reinforcement Learning framework designed for open-system quantum control across diverse Hamiltonians, which learns optimal pulse sequences while simultaneously discovering problem-specific evolution time T and number of control pulse segments N. Experimental results across 51 Hamiltonian variations demonstrate that the multi-task SAC model is able to generate control pulses that can drive a system, under environment noise, from its initial state to its target state with high fidelities, establishing essential foundations for universal quantum control applicable to realistic noisy quantum devices. Through progressive expansion of the training Hamiltonian set, we investigate if a single multi-task model trained using a given number of sample Hamiltonians can successfully accomplish state-transfer tasks for Hamiltonians drawn from the same Hamiltonian space but not encountered during training. In addition, our Robustness Infidelity Measure (RIM) analysis reveals that SAC trained policies exhibit superior robustness to pulse amplitude perturbations and decoherence rate variations compared to GRAPE-optimized controls.

2605.26908 2026-05-27 cs.AI cs.DS cs.LG 版本更新

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

发表机构 * Institute for Humanities-Centered Artificial Intelligence, University of Hamburg, Germany(人文导向人工智能研究所,汉堡大学,德国)

AI总结 本文重新审视了因子图中可交换因子检测的理论基础,指出现有算法依赖的定理仅为必要条件而非充分条件,并提出了修正算法以保证正确性和效率。

详情
AI中文摘要

利用概率图模型(如因子图)中对象的不可区分性是提升(lifted)概率推理算法的关键,并使概率推理问题在领域规模上可处理。在因子图中利用不可区分对象的核心是识别可交换因子,即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言,我们表明,在其当前形式下,最先进算法依赖于一个中心定理,该定理被错误地视为识别可交换因子的充分条件,而实际上它仅意味着必要条件。因此,正如我们在本文中所展示的,最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷,我们证明了上述定理的一个略微修改版本,该版本作为识别可交换因子的必要条件。此外,我们提出了最先进算法的修正版本,在保持其效率的同时确保正确性,并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies a necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness, and introduce a complementary algorithm with tighter worst-case bounds.
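The definition of a commutative factor given in the abstract can be made concrete with a brute-force check. The sketch below is an illustration of the definition only, not the algorithm studied in the paper: it enumerates every assignment and every permutation, so its cost is exponential in the number of arguments. The factor `phi`, the function name, and the table representation (a dict keyed by assignment tuples) are hypothetical.

```python
from itertools import permutations, product

def is_commutative(factor, positions, arity, domain=(True, False)):
    """Return True if permuting the values at `positions` never changes the output.

    `factor` maps each full assignment (a tuple of length `arity`) to a value.
    """
    for assignment in product(domain, repeat=arity):
        base = factor[assignment]
        values = [assignment[i] for i in positions]
        for perm in permutations(values):
            permuted = list(assignment)
            for i, v in zip(positions, perm):
                permuted[i] = v
            if factor[tuple(permuted)] != base:
                return False
    return True

# Hypothetical factor over three Boolean arguments: the output depends on the
# first argument and on how many of the last two arguments are True.
phi = {
    a: (3.0 if a[0] else 1.0) + sum(a[1:])
    for a in product((True, False), repeat=3)
}

print(is_commutative(phi, positions=(1, 2), arity=3))  # True
print(is_commutative(phi, positions=(0, 1), arity=3))  # False
```

Swapping the last two arguments leaves the count of True values unchanged, so `phi` is commutative with respect to them; swapping the first two changes the output, for example `phi[(True, False, False)]` is 3.0 while `phi[(False, True, False)]` is 2.0.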

2605.26900 2026-05-27 cs.LG 版本更新

SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

SPHERE-JEPA: 均匀嵌入的球面预测

Léo Nicollier, Max Dunitz, Marc Pic, Pablo Musé, Enric Meinhardt-Llopis, Gabriele Facciolo

发表机构 * Université Paris-Saclay, CNRS, Advanced Track and Trace(巴黎萨克雷大学,国家科学研究中心,先进跟踪与追溯) ENS Paris-Saclay, Centre Borelli, Advanced Track and Trace(巴黎萨克雷高等师范学院,博雷利中心,先进跟踪与追溯) Université Paris-Saclay, CNRS(巴黎萨克雷大学,国家科学研究中心) ENS Paris-Saclay, Centre Borelli(巴黎萨克雷高等师范学院,博雷利中心)

AI总结 本文提出SPHERE-JEPA框架,通过将Cramér-Wold投影机制调整为强制超球面均匀性而非高斯先验,解决了自监督学习中高斯嵌入导致各向异性k-NN邻域的问题,在纹理检索和ImageNet-1K线性探测上取得显著提升。

详情
AI中文摘要

自监督学习中的一个基本开放问题是明确表征学习表示的最优几何。最近,LeJEPA将各向同性高斯嵌入确定为在欧几里得空间中最小化下游预测风险的最优解。然而,对于支撑在低维流形(如超球面)上的分布,相应问题仍未探索。在这项工作中,我们证明将这种极小极大分析扩展到黎曼流形上的光滑分布会根本性地改变最优解。我们表明,在最坏情况公式下,k近邻和核岭回归都诱导超球面均匀性。更精确地说,我们证明流形上的均匀分布对于k近邻是最优的,而球面上的均匀分布对于使用指数点积核和线性核的核岭回归是最优的。这一理论见解揭示了高斯嵌入的一个根本局限:其非均匀密度导致各向异性的k-NN邻域,严重偏置估计器。为纠正这一点,我们引入了SPHERE-JEPA,一个理论基础的SSL框架。我们调整LeJEPA的Cramér-Wold投影机制以强制超球面均匀性而非高斯先验。实验上,SPHERE-JEPA取得了显著改进,将纹理检索mAP提升了超过6%,同时在标准基准上持续匹配或超越LeJEPA——包括在ImageNet-1K(ViT-B/14)上+1.8%的线性探测增益。

英文摘要

A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA's Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks, including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).

2605.26895 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

微不足道的大小,显著的效果:大型语言模型中的尺度向量

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

发表机构 * Peking University(北京大学)

AI总结 本文系统研究了大型语言模型中的尺度向量,发现其虽参数占比极小但对预训练至关重要,通过自放大预条件效应优化优化过程,并提出了三种轻量级改进策略,在多种模型规模上一致提升性能。

Comments 36 pages

详情
AI中文摘要

现代大型语言模型(LLM)中的归一化层由确定性归一化操作和可学习的尺度向量组成。尽管归一化操作已被广泛研究,但尺度向量尽管被普遍使用,其作用仍未被充分理解。在这项工作中,我们从表达能力、优化和架构结构的角度对LLM中的尺度向量进行了系统研究。首先,我们通过实验表明,虽然尺度向量仅占模型参数的极小部分,但移除它们会显著降低LLM的预训练效果。我们的理论进一步表明,在Pre-Norm架构中,尺度向量并不增加表达能力;相反,它们通过对后续线性映射产生自放大预条件效应来改善优化。其次,我们研究了权重衰减对尺度向量的作用。通过区分Input-Norm和Output-Norm层,我们从理论上证明,由于它们在优化和表达能力中的不同作用,权重衰减对前者有益但对后者有害。第三,受此理解的启发,我们提出了三种轻量级且互补的尺度向量改进方法:分支特异性异质性、线性映射周围的改进放置以及幅度-方向重参数化。理论和实验均表明,每种改进都能带来一致的收益。最后,我们将这些改进整合为一个统一的尺度向量策略,并通过在0.12B到2B参数的密集和混合专家模型上进行大规模LLM预训练实验,使用多种优化器和学习率调度,在工业级token预算下进行评估。该统一策略始终比精心调整的基线获得更低的终端损失,并展现出更有利的扩展行为,同时增加可忽略的参数和计算开销。

英文摘要

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
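The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a standard folding argument: when the normalized output feeds directly into a linear map, the scale vector can be absorbed into that map's weights. The sketch below checks this numerically for an RMSNorm-style normalization. The choice of RMSNorm, the shapes, and the variable names are assumptions for illustration, and the paper's own proof may proceed differently.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Deterministic part of an RMSNorm-style layer: rescale to unit root-mean-square."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_out = 8, 5
x = rng.standard_normal((3, d_in))          # a small batch of hidden states
gamma = rng.uniform(0.5, 1.5, size=d_in)    # learnable scale vector
w = rng.standard_normal((d_in, d_out))      # subsequent linear mapping

# Forward pass with an explicit scale vector ...
with_scale = (rms_normalize(x) * gamma) @ w
# ... and with the scale vector folded into the rows of the weight matrix.
folded = rms_normalize(x) @ (gamma[:, None] * w)

print(np.allclose(with_scale, folded))  # True
```

The two parameterizations compute the same function but are different objects for a gradient-based optimizer, which is consistent with the abstract's point that the benefit of scale vectors lies in optimization.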

2605.26886 2026-05-27 cs.DS cs.LG 版本更新

Parsimonious Learning-Augmented Online Metric Matching

简约学习增强的在线度量匹配

Yongho Shin, Phanu Vajanopath

发表机构 * Institute of Computer Science, University of Wrocław, Wrocław, Poland(波兰弗罗茨瓦夫大学计算机科学研究所)

AI总结 针对在线度量匹配问题,提出一种简约学习增强算法,通过虚拟预测填补缺失预测,并建立性能下界,实验验证了其有效性。

Comments To appear in ICML 2026

详情
AI中文摘要

近年来,学习增强算法受到了广泛关注,尤其是在在线优化领域。由于生成预测的高计算成本,越来越多的研究关注于学习增强算法中性能保证与预测使用数量之间的权衡,例如缓存和度量任务系统问题。在本文中,我们将这一研究方向扩展到在线度量匹配,开发了简约学习增强算法并建立了其性能下界。我们的方法将“跟随预测”框架扩展到简约设置,通过在缺乏实际预测时使用一种在线度量匹配算法来填充虚拟预测,该算法在执行过程中保持良好中间匹配。我们通过实证评估补充了理论结果,证明了我们方法的实际有效性。

英文摘要

Learning-augmented algorithms have received significant attention in recent years, particularly in the context of online optimization. Motivated by the high computational cost of generating predictions, a growing line of work studies the tradeoff between performance guarantees and the number of predictions used in learning-augmented algorithms for problems such as caching and metrical task systems. In this paper, we extend this line of research to online metric matching by developing parsimonious learning-augmented algorithms and establishing lower bounds on their performance. Our approach extends the Follow-the-Prediction framework to the parsimonious setting by filling in a virtual prediction in the absence of an actual prediction, using an online metric matching algorithm that maintains good intermediate matchings throughout its execution. We complement our theoretical results with an empirical evaluation, demonstrating the practical effectiveness of our approach.

2605.26857 2026-05-27 cs.LG 版本更新

Generalist Graph Anomaly Detection via Prototype-Based Distillation

基于原型蒸馏的通才图异常检测

Yiming Xu, Zihan Chen, Zhen Peng, Song Wang, Bin Shi, Bo Dong, Chao Shen

发表机构 * School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China(西安交通大学计算机科学与技术学院) National Engineering Research Center for Visual Information and Applications, Xi'an, China(视觉信息与应用国家工程研究中心) University of Virginia, Charlottesville, USA(弗吉尼亚大学) University of Central Florida, Orlando, USA(中佛罗里达大学) School of Distance Education, Xi’an Jiaotong University, Xi'an, China(西安交通大学继续教育学院) School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, China(西安交通大学网络安全学院)

AI总结 提出首个无监督通才图异常检测框架ProMoS,通过知识蒸馏从冻结的自监督图神经网络教师模型中提取正常性先验,并利用原型引导的软标签蒸馏实现跨图零样本异常检测。

Comments Accepted by ICML 2026

详情
AI中文摘要

在高风险领域对图异常检测(GAD)的迫切需求驱动下,通才GAD范式(训练一个可迁移到新图的单一检测器)近年来日益受到关注。然而,现有方法通常依赖稀缺且昂贵的标注进行训练,有时甚至需要在推理时提供少量样本支持,这限制了其对多样且未见异常模式的鲁棒性。为解决这一局限,我们提出了ProMoS,首个无监督通才GAD框架,通过建模未标注数据中丰富的正常性来检测异常。ProMoS采用知识蒸馏范式,将正常性先验从冻结的自监督图神经网络(GNN)教师模型蒸馏到具有共享全局和轻量个性化分支的混合学生模型中,无需从头学习即可实现高效且富有表现力的正常性建模。我们进一步提出原型引导的软标签蒸馏,在共享原型空间中对齐教师和学生,增强跨图泛化能力。在推理时,ProMoS通过蒸馏偏差和原型几何偏差对未见图进行零样本异常检测。大量实验证明了ProMoS的有效性和高效性,为迈向无标签、零样本的通才GAD开辟了一条实用路径。

英文摘要

Driven by the pressing demand for graph anomaly detection (GAD) in high-stakes domains, the generalist GAD paradigm, which trains a single detector transferable across new graphs, has recently gained growing attention. However, existing methods often rely on scarce and costly annotations for training and sometimes even require few-shot support at inference, which limits their robustness to diverse and unseen anomaly patterns. To address this limitation, we introduce ProMoS, the first unsupervised generalist GAD framework, which detects anomalies by modeling the abundant normality in unlabeled data. ProMoS adopts a knowledge-distillation paradigm to distill normality priors from a frozen self-supervised graph neural network (GNN) teacher to a mixture-of-students model with shared global and lightweight personalized branches, enabling efficient and expressive normality modeling without learning from scratch. We further propose prototype-guided soft-label distillation to align teacher and student in a shared prototype space, enhancing cross-graph generalizability. During inference, ProMoS performs zero-shot anomaly detection on unseen graphs via distillation bias and prototype geometric deviation. Extensive experiments show the effectiveness and efficiency of ProMoS, charting a practical path toward label-free, zero-shot generalist GAD.

2605.26854 2026-05-27 cs.LG 版本更新

RAPNet: Accelerating Algebraic Multigrid with Learned Sparse Corrections

RAPNet: 通过学习的稀疏校正加速代数多重网格

Yali Fink, Ido Ben-Yair, Lars Ruthotto, Eran Treister

发表机构 * Institute for Interdisciplinary Computational Sciences, Faculty of Computer and Information Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel(交叉学科计算科学研究所,计算机与信息科学学院,内盖夫本·古里安大学,以色列贝尔谢瓦) Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA(数学与计算机科学系,埃默里大学,美国亚特兰大,GA)

AI总结 提出图神经网络框架RAPNet,通过从稀疏代数系统中学习生成稀疏且鲁棒的粗网格算子,解决了代数多重网格中稀疏性与收敛质量之间的权衡问题,并采用逐层训练策略实现大规模泛化。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. Code available at https://github.com/idoby/rapnet

详情
AI中文摘要

大规模稀疏线性系统的可扩展求解是科学计算和图分析中的瓶颈。虽然代数多重网格提供了最优的线性扩展,但其性能受到粗网格算子稀疏性与收敛质量之间权衡的严重限制。经典的代数多重网格启发式方法难以平衡这些目标,常常为了稀疏性而牺牲稳定性或性能。我们提出了RAPNet,一个图神经网络框架,通过学习直接从稀疏代数系统生成稀疏、鲁棒的粗算子来解决这一权衡。我们方法的关键是一种逐层训练策略,该策略能够从小型子图中学习并泛化到百万节点规模的域,绕过了先前神经代数多重网格尝试的瓶颈。RAPNet仅在求解器设置阶段执行,确保求解阶段保持其有利的计算特性。我们展示了我们的方法在多种PDE离散化和图拉普拉斯矩阵上优于经典的非Galerkin基线,使其特别适用于多查询任务,如特征值问题、时间依赖模拟以及逆问题或设计问题。

英文摘要

The scalable solution of large sparse linear systems is a bottleneck in scientific computing and graph analysis. While algebraic multigrid (AMG) offers optimal linear scaling, its performance is severely constrained by the trade-off between the sparsity and convergence quality of coarse-grid operators. Classical AMG heuristics struggle to balance these objectives, often sacrificing stability or performance for sparsity. We propose RAPNet, a graph neural network (GNN) framework that resolves this trade-off by learning to generate sparse, robust coarse operators directly from the sparse algebraic system. Key to our approach is a level-wise training strategy that enables learning from small subgraphs and generalization to million-node domains, bypassing the bottlenecks of prior neural AMG attempts. RAPNet executes exclusively during the solver setup phase, ensuring that the solve phase retains its favorable computational properties. We show that our method outperforms classical non-Galerkin baselines on diverse PDE discretizations and graph Laplacians, making it particularly effective for multi-query tasks such as eigenproblems, time-dependent simulations, and inverse or design problems.
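For context on the coarse-grid operators the abstract refers to, the classical Galerkin construction forms the coarse operator as the triple product $A_c = R A P$, with prolongation $P$ and restriction $R$ (often $R = P^\top$); the name RAPNet presumably alludes to this product. The sketch below builds it for a 1D Poisson matrix with linear interpolation. It is a textbook toy example with hypothetical function names, and it does not show the learned, sparsified operators that the paper proposes.

```python
import numpy as np

def poisson_1d(n):
    """Tridiagonal 1D Poisson matrix with stencil [-1, 2, -1]."""
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def linear_interpolation(n_coarse):
    """Prolongation from n_coarse coarse points to 2 * n_coarse + 1 fine points."""
    n_fine = 2 * n_coarse + 1
    p = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1            # fine index that coincides with coarse point j
        p[i, j] = 1.0
        p[i - 1, j] = 0.5        # fine neighbours take half of the coarse value
        p[i + 1, j] = 0.5
    return p

a = poisson_1d(7)
p = linear_interpolation(3)
r = p.T                          # restriction as the transpose of prolongation
a_coarse = r @ a @ p             # Galerkin coarse-grid operator

print(a_coarse)
print(np.count_nonzero(a), np.count_nonzero(a_coarse))
```

In this 1D case the coarse operator stays tridiagonal (it equals half of the 3-point Poisson matrix). On wider stencils and unstructured graphs the triple product typically fills in with additional nonzeros at each level, which is the sparsity problem that non-Galerkin approaches try to control.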

2605.26850 2026-05-27 cs.LG 版本更新

Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences

从随机插值中学习基于能量的模型:利用时空差异

Hanlin Yu, RuiKang OuYang, Partha Kaushik, Arto Klami, Michael U. Gutmann, Omar Chehab

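Affiliations: University of Helsinki; University of Cambridge; Carnegie Mellon University; University of Edinburgh

AI summary: Proposes Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework that learns the energy function from stochastic interpolants through joint spatiotemporal differences, unifying existing methods and achieving performance competitive with state-of-the-art density estimation methods.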

详情
AI中文摘要

从数据样本中学习基于能量的模型是机器学习中的一个核心问题。许多近期流行的方法,如用于训练基于能量的扩散模型的去噪分数匹配,使用随机插值器通过时间变量索引的不同噪声水平来破坏数据样本。这定义了数据空间和时间上的联合密度,大多数方法通过空间或时间差异来学习其能量。我们识别了这两种方法各自的失败模式。为了解决这些问题,我们提出了时空噪声对比估计(stNCE),一个通过联合时空差异来学习能量的框架。stNCE统一了许多现有方法,并产生了新的训练目标。在图像和分子上的实验表明,其性能与最先进的密度估计方法相竞争。

英文摘要

Learning an energy-based model from data samples is a central problem in machine learning. Many recent and popular methods, such as denoising score matching for training energy-based diffusion models, use stochastic interpolants to corrupt data samples at different noise levels indexed by a time variable. This defines a joint density over both the data space and time, and most methods learn its energy through either spatial or temporal differences. We identify distinct failure modes for both of these approaches. To solve them, we propose Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework for learning the energy through joint spatiotemporal differences. stNCE unifies many existing methods and leads to new training objectives. Experiments on images and molecules demonstrate performance competitive with state-of-the-art density estimation methods.

2605.26844 2026-05-27 cs.LG 版本更新

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

并非所有分歧都是可学习的:在线策略蒸馏中的Token可教学性

Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University; Daya Bay Technology and Innovation Research Institute

AI summary: Proposes Teachability-Aware On-Policy Distillation (TA-OPD), which identifies and selects the token positions where the teacher signal is learnable, and often surpasses full-token distillation while retaining only 5% of tokens.

详情
AI中文摘要

在线策略蒸馏(OPD)使用token级别的教师监督在学生的自身轨迹上训练学生。最近的OPD选择性方法通过优先考虑高熵或高分歧token来利用OPD信号的非均匀性。我们重新审视这一原则并问:哪些token级别的教师信号实际上是可学习的?使用固定上下文诊断(测量相同上下文下教师-学生KL散度减少),我们表明原始KL分歧是学习价值的粗略代理。它将可学习分歧(教师将纠正质量分配给学生的top-K候选)与不兼容分歧(教师将质量主要放在学生当前支持范围之外)混为一谈。我们将这种局部兼容性形式化为token可教学性,并表明它比单独的原始KL更好地预测固定上下文的改进。受此发现启发,我们提出可教学性感知的在线策略蒸馏(TA-OPD),一种轻量级的token位置选择方法,无需奖励模型或验证器即可将OPD损失应用于高可教学性位置。在Qwen2.5和Qwen 3教师-学生设置中,TA-OPD通常仅用5%的保留token就超越了全token OPD,并优于基于熵和散度的基线。我们的结果将选择性OPD重新定义为选择可学习的教师信号,而不仅仅是选择显著的token。

英文摘要

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
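
The distinction between learnable and incompatible disagreement can be illustrated with a toy proxy. The abstract does not give the paper's teachability definition, so the sketch below is an assumption: it measures how much teacher probability mass lands on the student's top-K tokens, alongside KL(teacher || student). The KL direction, the function names, and the distributions are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(student, teacher, k):
    """Teacher probability mass that falls on the student's k most likely tokens."""
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)

student = [0.50, 0.30, 0.15, 0.05]
compatible = [0.20, 0.70, 0.05, 0.05]    # teacher prefers the student's second choice
incompatible = [0.05, 0.05, 0.10, 0.80]  # teacher prefers a token the student rarely picks

for name, teacher in [("compatible", compatible), ("incompatible", incompatible)]:
    kl = kl_divergence(teacher, student)
    mass = teacher_mass_on_student_topk(student, teacher, k=2)
    print(name, round(kl, 3), round(mass, 3))
```

Both teachers disagree with the student, so both have nonzero KL, but only the first places most of its mass inside the student's top-2 candidates.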

2605.26842 2026-05-27 cs.LG cs.CL 版本更新

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

MONA: 基于Nesterov加速的Muon优化器用于可扩展语言模型训练

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

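Affiliations: Meituan

AI summary: Proposes the MONA optimizer, which integrates a Nesterov acceleration term into Muon's gradient processing pipeline to obtain curvature-aware acceleration that helps escape sharp local minima, and achieves better convergence and downstream task performance in Mixture-of-Experts pretraining from 1B to 68B parameters.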

详情
AI中文摘要

Muon优化器最近为大型语言模型训练提供了一种有希望的AdamW替代方案,利用矩阵正交化产生几何感知更新。然而,与所有一阶方法一样,Muon可能会陷入尖锐的局部最小值。在这项工作中,我们提出了MONA,一种将Muon的正交化框架与曲率感知加速相结合的优化器。MONA直接将加速项添加到Muon的梯度处理流程中。该加速项根据梯度差异的指数移动平均计算得出。我们提供了MONA的详细收敛性分析,表明加速项能够在保持Muon谱范数正则化的同时逃离尖锐最小值。实验上,在从1B到68B参数的三个规模的混合专家预训练中(最大模型在1万亿tokens上训练),MONA在收敛性和下游任务性能上均优于Muon和AdamW。此外,我们在MOE-68B-A3B模型上进行了监督微调,并在通用能力、数学推理和代码生成基准上评估,MONA达到了最先进的性能。

英文摘要

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.
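
The one mechanism the abstract spells out is that the acceleration term comes from an exponential moving average of gradient differences. A minimal sketch of that quantity follows. How MONA scales the term and where it enters relative to Muon's orthogonalization step are not stated in the abstract, so `accelerated_gradient`, `alpha`, and `beta` are hypothetical, and the orthogonalization itself is omitted.

```python
def ema_gradient_difference(grads, beta=0.9):
    """EMA of successive gradient differences g_t - g_{t-1}, starting from zero."""
    ema = [0.0] * len(grads[0])
    prev = grads[0]
    for g in grads[1:]:
        diff = [a - b for a, b in zip(g, prev)]
        ema = [beta * e + (1.0 - beta) * d for e, d in zip(ema, diff)]
        prev = g
    return ema

def accelerated_gradient(grad, accel, alpha=1.0):
    """Hypothetical combination: current gradient plus a scaled acceleration term."""
    return [g + alpha * a for g, a in zip(grad, accel)]

grads = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
accel = ema_gradient_difference(grads, beta=0.5)
update_input = accelerated_gradient(grads[-1], accel, alpha=1.0)
print(accel, update_input)
```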

2605.26833 2026-05-27 cs.LG cs.AI 版本更新

Periodic Topological Deep Learning for Polymer Design and Discovery

周期性拓扑深度学习用于聚合物设计与发现

Yasharth Yadav, Tze Kwang Gerald Er, Atsushi Goto, Kelin Xia

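Affiliations: School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; School of Chemistry, Chemical Engineering and Biotechnology (CCEB), Nanyang Technological University, Singapore 637371

AI summary: Proposes Periodic-TDL, a deep learning framework based on periodic Vietoris-Rips complexes and hierarchical simplicial message passing that captures many-body interactions and long-range information, outperforms existing models on polymer property prediction tasks, and validates that ester-to-amide substitution and α-methylation improve thermal stability.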

Comments 19 pages, 3 figures, 3 tables

详情
AI中文摘要

聚合物支撑着能源、医疗和材料科学领域的应用，但其广阔的化学空间使得系统性发现充满挑战。大多数机器学习方法将聚合物表示为单个重复单元的分子图，从而忽略了聚合物链的周期性和超越成对键的多体相互作用。我们提出了Periodic-TDL，一个基于周期性Vietoris-Rips复形的深度学习框架，该复形捕捉跨多个空间尺度的多体相互作用，随后通过层次单纯形消息传递（HSMP）编码器将信息从长程相互作用传播到共价键，产生由高阶拓扑特征增强的表征。Periodic-TDL在涵盖电子、光学、物理和热学目标的聚合物性质预测任务中优于所有最先进的模型。此外，我们定量验证了酯到酰胺取代和α-甲基化如何增强热稳定性。使用通过系统取代丙烯酸酯和丙烯酰胺聚合物生成的计算合成数据集（48,208个结构），我们观察到在匹配的聚合物对中，酯到酰胺取代的平均$T_g$增加约$55^\circ$C，主链α-甲基化的平均$T_g$增加约$14^\circ$C。为了验证这些预测趋势，我们使用Periodic-TDL模型分析了来自独立实验测量的六对新型聚合物，其中包括三种此前文献中未报道的新合成聚合物。实验数据成功证实了模型的预测。最终，这些发现表明Periodic-TDL捕捉了特定官能团修饰的潜在物理效应，而不仅仅是优化基准数据集上的预测性能。

英文摘要

Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery challenging. Most machine learning approaches represent polymers as molecular graphs of a single repeating unit, thereby missing both the periodicity of polymer chains and many-body interactions beyond pairwise bonds. We introduce Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes that capture many-body interactions across multiple spatial scales, followed by a hierarchical simplicial message-passing (HSMP) encoder that propagates information from long-range interactions to covalent bonds, yielding representations enriched by higher-order topological features. Periodic-TDL outperforms all state-of-the-art models across polymer property prediction tasks spanning electronic, optical, physical, and thermal targets. Furthermore, we quantitatively validate how ester-to-amide substitution and $\alpha$-methylation enhance thermal stability. Using a computationally synthesized dataset of 48,208 structures (generated via systematic substitution of acrylate and acrylamide polymers), we observed a mean $T_g$ increase of $\sim 55^\circ$C for ester-to-amide substitutions and $\sim 14^\circ$C for backbone $\alpha$-methylation across matched polymer pairs. To verify these predicted trends, we use our Periodic-TDL model to analyze six novel polymer pairs from independent experimental measurements, including three newly synthesized polymers previously unreported in the literature. The experimental data successfully confirmed the model's predictions. Ultimately, these findings demonstrate that Periodic-TDL captures the underlying physical effects of specific functional group modifications, rather than merely optimizing predictive performance on benchmark datasets.

2605.26830 2026-05-27 cs.LG cs.AI cs.CV 版本更新

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

卡尔曼演化:通过可解释算法发现缩小卡尔曼滤波的差距

Vasileios Saketos, Ming Xiao

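Affiliations: KTH Royal Institute of Technology

AI summary: To address the degradation of Kalman filtering in nonlinear sensing settings, proposes Kalman Evolve, a framework that jointly optimizes noise parameters and the update structure and uses large language models to generate interpretable non-affine modifications, achieving up to 12% RMSE reduction on multiple benchmarks.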

详情
AI中文摘要

状态估计是控制和信号处理中的一个基本问题,卡尔曼滤波器在线性动力学、高斯噪声和已知噪声协方差下提供最优解。然而,这些假设在多普勒雷达和LiDAR等实际传感场景中常常不成立。在这些情况下,最优估计器本质上是非线性的,导致系统性能下降。这产生了一个仅通过调整噪声协方差参数(即卡尔曼滤波器中的过程噪声和测量噪声)无法消除的性能差距。为了解决这一限制,我们提出了Kalman Evolve,一个通过联合优化噪声参数和更新结构来发现改进滤波算法的框架。我们的方法利用大语言模型作为程序空间上的结构化先验,能够生成对经典卡尔曼滤波器的可解释、非仿射修改,同时保留其递归形式。我们提供了分析结果,证明了在常见非线性传感模型下仿射估计器的次优性,从而激发了结构感知更新的必要性。在一系列合成和真实跟踪基准测试中,包括多普勒雷达、基于LiDAR的定位和行人跟踪,所发现的算法始终优于强基线(如优化卡尔曼滤波器),实现了高达12%的RMSE降低。这些结果表明,优化卡尔曼滤波器的结构而不仅仅是其参数,提供了一种实用且可解释的方式来改进状态估计。

英文摘要

State estimation is a fundamental problem in control and signal processing, for which the Kalman filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
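
For reference, the classical Kalman measurement update is affine in the measurement, which is the structure the abstract argues is suboptimal under nonlinear sensing. A minimal scalar sketch follows, assuming a direct observation model z = x + noise. The function name and numbers are illustrative, and this is the textbook update, not one of the discovered algorithms.

```python
def kalman_update(x_prior, p_prior, z, r):
    """Scalar Kalman measurement update for a direct observation z = x + noise.

    x_prior, p_prior: prior state estimate and its variance
    z, r: measurement and measurement-noise variance
    """
    k = p_prior / (p_prior + r)            # Kalman gain
    x_post = x_prior + k * (z - x_prior)   # affine in the measurement z
    p_post = (1.0 - k) * p_prior           # posterior variance
    return x_post, p_post, k

x, p, k = kalman_update(x_prior=0.0, p_prior=1.0, z=2.0, r=1.0)
print(x, p, k)
```

Tuning `r` (or the process noise in the prediction step) changes the gain but leaves the update affine in `z`, which is why the abstract distinguishes parameter tuning from structural changes.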

2605.26821 2026-05-27 hep-ph cs.LG hep-ex 版本更新

Particle-Lund Multimodality in Jet Taggers

喷注标记器中的粒子-Lund多模态

Loukas Gouskos, Benedikt Maier

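Affiliations: Brown University; Imperial College of Science, Technology and Medicine

AI summary: Proposes PLuM, a multimodal architecture that jointly processes particle constituents and Lund plane splittings and uses cross-attention to study whether explicit QCD hierarchical structure complements raw particle representations; finds systematic gains for top-quark and H→bb tagging, with 25% higher background rejection in the HH(4b) analysis.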

详情
AI中文摘要

Lund平面提供了喷注内QCD辐射的物理动机层次表示，而基于变换器的标记器通过直接从原始粒子成分及其成对关系中学习达到了最先进的性能。我们研究变换器是否从成分级输入隐式捕获层次QCD结构，或者显式物理表示是否仍然具有互补性。为了测试这一点，我们引入了PLuM，一种多模态架构，将粒子成分和Lund平面分裂投影到共享潜在空间，并用统一变换器联合处理两者。交叉注意力允许模型探测结构化QCD信息是否提供了超出粒子单独编码的区分能力。我们观察到顶夸克和H→bb标记的系统性增益，而在H→cc或H→4q拓扑中没有发现可比改进。这种选择性增强表明，即使在高度表达性的架构中，关于b喷注形成的显式层次信息仍然与原始粒子表示互补，而其他拓扑已经在成分级被很好地捕获。对于高影响LHC分析，如洛伦兹增强的双希格斯玻色子搜索中的四b夸克末态（HH(4b)），增益显著：在25%的双希格斯效率工作点，PLuM的背景抑制比基线高25%。我们的结果表明，在变换器时代，QCD辐射的物理结构化表示仍然保留区分价值，激励进一步研究深度学习算法如何编码喷注动力学的不同方面。

英文摘要

The Lund plane offers a physics-motivated, hierarchical representation of QCD radiation within jets, while transformer-based taggers have reached state-of-the-art performance by learning directly from raw particle constituents and their pairwise relations. We investigate whether transformers implicitly capture hierarchical QCD structure from constituent-level inputs, or whether explicit physics representations remain complementary. To test this, we introduce PLuM, a multimodal architecture that projects particle constituents and Lund plane splittings into a shared latent space, processing both jointly with a unified transformer. Cross-attention allows the model to probe whether structured QCD information provides discriminating power beyond what particles alone encode. We observe systematic gains for top-quark and $\mathrm{H}\to\mathrm{b}\bar{\mathrm{b}}$ tagging, while finding no comparable improvement for $\mathrm{H}\to\mathrm{c}\bar{\mathrm{c}}$ or $\mathrm{H}\to 4\mathrm{q}$ topologies. This selective enhancement suggests that explicit hierarchical information about b-jet formation remains complementary to raw particle representations even in highly expressive architectures, while other topologies are already well captured at the constituent level. For high-impact LHC analyses such as Lorentz-boosted di-Higgs searches in the four $\mathrm{b}$ quark final state ($\mathrm{H}\mathrm{H}(4\mathrm{b})$), the gains are substantial: at a $25\%$ di-Higgs efficiency working point, PLuM achieves $25\%$ higher background rejection than the baseline. Our results indicate that physically structured representations of QCD radiation retain discriminating value in the transformer era, motivating further study into how different aspects of jet dynamics are encoded by deep learning algorithms.

2605.26808 2026-05-27 cs.LG cs.AI cs.IT math.IT 版本更新

Innovation: An Almost Characterization of Hallucination

创新:幻觉的几乎刻画

Nishant P. Das, Piyush Srivastava

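Affiliations: School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, Maharashtra - 400 005, India

AI summary: Introduces the property of "innovation" to characterize the inevitability of hallucination in large language models, shows that innovation and hallucination are almost equivalent, and derives new lower bounds on the hallucination rate based on the innovation rate.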

详情
AI中文摘要

幻觉是大语言模型(LLMs)的一个核心局限,大量工作致力于理解和缓解它。为此,Kalai 和 Vempala(STOC 2024)引入了一个概率框架来形式化校准和幻觉,并证明高概率下,校准的 LLM 大致以“缺失质量”(衡量训练数据相对于其来源的不完整程度)的速率产生幻觉。这引出了两个基本问题:(i) 校准的 LLM 的什么属性使得幻觉不可避免?(ii) 能否通过放弃校准来避免幻觉?我们通过引入一个更简单的属性——我们称之为“创新”——来回答这些问题,该属性衡量模型产生训练数据之外输出的倾向。我们证明,创新由 Kalai 和 Vempala 识别的幻觉条件蕴含,并且进一步,它是幻觉的几乎刻画:幻觉蕴含创新,反之,创新高概率地蕴含幻觉。我们还基于“创新率”给出了幻觉率的下界,并通过将创新率与缺失质量联系起来,获得了基于缺失质量的新的幻觉率下界,扩展了 Kalai 和 Vempala 的结果。

英文摘要

Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigating it. Towards this, Kalai and Vempala (STOC 2024) introduced a probabilistic framework formalizing calibration and hallucination, and showed that, with high probability, calibrated LLMs hallucinate roughly at the rate of the "missing mass", a measure of how incomplete the training data is relative to its source. This raises two fundamental questions: (i) what property of a calibrated LLM makes hallucinations unavoidable? and (ii) can hallucinations be avoided by giving up calibration? We answer these questions by introducing a simpler property we call innovation that measures the tendency of a model to produce outputs outside the training data. We show that innovation is implied by the condition for hallucination identified by Kalai and Vempala, and, further, that it is an almost characterization of hallucination: hallucination implies innovation, and conversely, innovation implies hallucination with high probability. We also provide lower bounds on the hallucination rate based on the "innovation rate", and by relating innovation rate back to missing mass, we obtain new hallucination rate lower bounds based on missing mass that extend the results of Kalai and Vempala.
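
Two quantities in this abstract have simple empirical counterparts. The sketch below uses the standard Good-Turing estimate of missing mass (the fraction of the sample made up of items seen exactly once) and a naive empirical innovation rate (the fraction of generated outputs absent from the training data). The paper's formal definitions may differ; the function names and toy data are assumptions for illustration.

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate: number of items seen exactly once, divided by sample size."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)

def empirical_innovation_rate(outputs, training_set):
    """Fraction of generated outputs that do not appear in the training data."""
    seen = set(training_set)
    return sum(1 for o in outputs if o not in seen) / len(outputs)

training = ["a", "a", "b", "c", "c", "d"]
generated = ["a", "e", "c", "f"]

missing_mass = good_turing_missing_mass(training)
innovation = empirical_innovation_rate(generated, training)
print(missing_mass, innovation)
```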

2605.26802 2026-05-27 cs.LG 版本更新

PATE-TabTransGAN: Differentially Private Synthetic Tabular Data Generation via Transformer-Based Student Discrimination

PATE-TabTransGAN:基于Transformer学生鉴别的差分隐私合成表格数据生成

M. Youssef, M. Woźniak

发表机构 * Wrocław University of Science and Technology(沃拉布大学科学与技术学院)

AI总结 提出PATE-TabTransGAN框架,结合教师集成私有聚合(PATE)机制与基于Transformer的学生鉴别器,在正式差分隐私保证下生成高质量合成表格数据,并在四个基准数据集上取得最优或并列最优的AUROC。

Comments 16 pages, 3 figures, 4 tables. Submitted for publication

详情
AI中文摘要

在正式差分隐私保证下生成高保真合成表格数据仍然是一个开放挑战。提供强理论保护的方法通常牺牲了真实合成所需的特征间依赖建模,而擅长捕获复杂列关系的架构仅提供经验隐私保证。我们提出PATE-TabTransGAN,一个生成框架,将教师集成私有聚合(PATE)机制与基于Transformer的学生鉴别器相结合,以共同满足这两个要求,并采用GNMax RDP会计进行数值稳定的隐私核算。在不相交分区上训练的Logistic回归教师集成通过噪声聚合标签监督学生,残差生成器针对这个差分隐私学生进行优化,通过后处理继承正式的(ε, δ)-DP保证。将PATE-TabTransGAN与PATE-GAN、DP-GAN和DP-CTGAN(被认为是差分隐私表格合成的最先进方法)进行比较。在四个表格基准(Adult、Breast、Cardio、Cervical)上进行的实验证实了所提方法的高质量:PATE-TabTransGAN在所有四个数据集上达到最佳或并列最佳的AUROC。在AUCPR上,它在Cardio上与最强基线持平,在Cervical上领先,在Breast上落后;在Adult上,我们证明AUCPR对正类惯例高度敏感,观察到的差距与评估流程之间的惯例差异一致,而非合成缺陷。

英文摘要

Generating high-fidelity synthetic tabular data under formal differential privacy guarantees remains an open challenge. Methods that provide strong theoretical protection typically sacrifice the modeling of inter-feature dependencies required for realistic synthesis, while architectures that excel at capturing complex column relationships offer only empirical privacy guarantees. We present PATE-TabTransGAN, a generative framework that integrates the Private Aggregation of Teacher Ensembles (PATE) mechanism with a Transformer-based student discriminator to jointly address both requirements, and employs a GNMax RDP accountant for numerically stable privacy accounting. An ensemble of Logistic Regression teachers trained on disjoint partitions supervises the student via noisy-aggregated labels, and a residual generator is optimized against this differentially private student, inheriting formal (ε, δ)-DP guarantees by post-processing. PATE-TabTransGAN was compared with PATE-GAN, DP-GAN, and DP-CTGAN, which are considered state-of-the-art in differentially private tabular synthesis. Experiments conducted on four tabular benchmarks (Adult, Breast, Cardio, Cervical) confirmed the high quality of the proposed method: PATE-TabTransGAN attains the best or tied-best AUROC on all four datasets. On AUCPR, it matches the strongest baseline on Cardio, leads on Cervical, and trails on Breast; on Adult, we demonstrate that AUCPR is highly sensitive to positive-class convention, and that the observed gap is consistent with a convention difference between evaluation pipelines rather than a synthesis deficit.
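
The noisy-aggregated labels mentioned above can be sketched as a Gaussian noisy-max vote, the aggregator commonly called GNMax in the PATE literature. This is a generic illustration, not the paper's pipeline: the function name, vote data, and `sigma` are assumptions, and the RDP privacy accounting is omitted entirely.

```python
import random

def gnmax_label(teacher_votes, num_classes, sigma, rng=random):
    """Noisy-argmax aggregation: add Gaussian noise to per-class vote counts."""
    counts = [0] * num_classes
    for v in teacher_votes:
        counts[v] += 1
    noisy = [c + rng.gauss(0.0, sigma) for c in counts]
    return max(range(num_classes), key=lambda j: noisy[j])

votes = [1, 1, 0, 1, 1, 0, 1]   # one predicted label per teacher

# With sigma = 0 the aggregation reduces to a plain majority vote.
plain_label = gnmax_label(votes, num_classes=2, sigma=0.0)

# With sigma > 0 the released label is randomized; a seeded generator keeps the run repeatable.
noisy_label = gnmax_label(votes, num_classes=2, sigma=2.0, rng=random.Random(0))
print(plain_label, noisy_label)
```

Larger `sigma` gives stronger privacy per query at the cost of more label noise for the student.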

2605.26797 2026-05-27 cs.LG cs.CL 版本更新

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

潜在循环Transformer:架构探索、训练策略与扩展行为

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

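Affiliations: Microsoft; University of Wisconsin-Madison; University of Washington

AI summary: Proposes the Latent Recurrent Transformer (LRT), which reuses the previous token's high-level hidden state as memory through a cross-layer recurrent latent pathway, without pause tokens or extra depth loops; it trains in parallel at roughly twice the baseline compute, improves language-modeling loss and in-context learning under matched effective compute, and adds as little as 0.3% additional parameters.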

详情
AI中文摘要

我们研究潜在循环Transformer(LRT),一种自回归Transformer的轻量级增强,它重用来自前一个token的高层源层隐藏状态作为下一个token的循环记忆。由于该源状态在普通解码过程中已经计算,LRT跨位置添加跨层循环潜在路径,无需插入暂停token或额外深度循环,并且保留了标准注意力机制和KV-cache接口。为了在不顺序展开Transformer的情况下大规模预训练这种循环,我们引入了交错并行训练:一次完整的全序列初始化前向传播构建共享缓冲区;然后不相交的位置子集并行细化并写回,使得所有token在约2倍基线计算下获得循环记忆感知的监督。在nanochat风格的主干网络和广泛的每参数token预算范围内,LRT在匹配有效计算下改进了语言建模损失和上下文学习,同时仅增加0.3%的参数。

英文摘要

We study the Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly twice the baseline compute. Across nanochat-style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% additional parameters.

2605.26786 2026-05-27 cs.CY cs.AI cs.LG 版本更新

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

大数据分析在糖尿病管理中的应用:卢旺达医疗系统需求评估

Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere

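Affiliations: University of Rwanda

AI summary: Assesses, through a stakeholder workshop, the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management, and proposes a practical framework based on explainable machine learning models.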

详情
AI中文摘要

糖尿病是一种慢性代谢疾病,如果不及早诊断和管理,可能导致严重的健康问题。大数据分析和机器学习为分析大型健康数据集、支持早期发现和更好的治疗决策提供了实用工具。然而,它们在常规临床实践中的使用仍然有限。本研究考察了卢旺达医疗系统采用大数据分析管理糖尿病的准备情况。随着该国不断扩大电子病历和健康信息系统的使用,改善预测、监测和临床决策的新机遇随之出现。我们举办了一个为期五天的研讨会,涉及25名关键利益相关者,包括临床医生、数据管理员、政策制定者、医学研究人员、营养学家和技术提供商,以评估准备情况并识别现有差距。研究结果突出了大数据分析实施的潜力和主要挑战。基于这些结果,本文提出了一个实用的大数据分析框架,利用可解释的机器学习模型支持糖尿病管理策略。

英文摘要

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

2605.26784 2026-05-27 cs.LG cs.AI 版本更新

Ratio-Variance Regularized Policy Optimization

比率方差正则化策略优化

Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li

Affiliations: Department of Foundation Model, 2012 Labs, Huawei; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; Department of Computer Science and Technology, Tsinghua University; College of Intelligence and Computing, Tianjin University

AI summary: Proposes R$^2$VPO, which replaces heuristic clipping with a constraint on the policy ratio variance as a local approximation to the trust region, improving performance and sample efficiency on LLM and robotic control tasks.

详情
AI中文摘要

标准的同策略强化学习依赖启发式裁剪来强制信任区域,但这种机制通过不加区分地截断高回报但高散度的更新而施加了严重代价。我们证明,显式约束策略比率方差为信任区域约束提供了原则性的局部近似,消除了二元硬裁剪的需要。通过作为分布式的“软刹车”,这种方法保留了来自新颖发现的关键梯度信号,同时自然降低权重并允许重用陈旧的离策略数据。我们引入了${\bf R}^2{\bf VPO}$(比率方差正则化策略优化),它通过原始-对偶优化框架实现这一约束。在跨越快速和慢速推理范式的$7$个LLM规模以及$10$个机器人控制任务上的广泛评估证明了所提出方法的通用性。R$^2$VPO在数学推理基准上取得了显著的性能提升,特别是在较小模型上改进尤为明显,同时显著提高了样本效率。此外,它在连续控制领域(特别是稀疏奖励和动态环境)中始终优于PPO基线。这些发现共同确立了比率方差正则化作为稳定且数据高效策略优化的原则性基础。

英文摘要

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce ${\bf R}^2{\bf VPO}$ (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
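
A generic primal-dual treatment of a variance constraint on the policy ratio can be sketched as follows. This is not the paper's implementation: the Lagrangian form, the budget `delta`, the step size, and all function names are assumptions chosen to illustrate the idea of penalizing ratio variance instead of clipping.

```python
def ratio_variance(ratios):
    """Population variance of the importance ratios pi_new / pi_old."""
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)

def regularized_objective(ratios, advantages, lam, delta):
    """Lagrangian-style loss: negative surrogate plus a weighted variance violation."""
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return -surrogate + lam * (ratio_variance(ratios) - delta)

def dual_update(lam, ratios, delta, step=0.1):
    """Projected dual ascent: raise lam when the variance exceeds the budget delta."""
    return max(0.0, lam + step * (ratio_variance(ratios) - delta))

ratios = [0.5, 1.0, 1.5]
advantages = [1.0, -1.0, 2.0]
lam = 1.0
loss = regularized_objective(ratios, advantages, lam, delta=0.1)
lam_next = dual_update(lam, ratios, delta=0.1)
print(loss, lam_next)
```

Unlike a hard clip, every sample keeps a nonzero gradient contribution here; the penalty only grows as the batch of ratios spreads out.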

2605.26776 2026-05-27 cs.LG cs.AI 版本更新

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

面向泛化的混合专家车辆路径问题模型

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

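Affiliations: State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; School of AI, Beijing Institute of Technology

AI summary: Proposes Residual Refined Experts with Instance-level Gating (R2E-IG), built on a Mixture-of-Experts architecture, which improves generalization under distribution shift for vehicle routing problems through a modular policy network and training with Dynamic Weight Adaption.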

详情
AI中文摘要

近年来,深度强化学习(DRL)在车辆路径问题(VRPs)上取得了显著进展。然而,现有的基于DRL的方法通常是在均匀分布生成的实例上训练的,这限制了它们在真实世界分布偏移下的性能。在本文中,我们旨在开发一个面向泛化的模型,该模型将策略网络划分为多个模块,并在推理过程中自适应地重组模块以形成特定策略。具体来说,我们提出了具有实例级门控的残差细化专家(R2E-IG)以改进跨分布泛化。我们的贡献有三方面:(1)我们引入了一种残差细化专家(R2E)架构,通过残差细化增强专家表达能力;(2)我们设计了一种实例级门控机制,学习分布感知的实例表示并将输入路由到合适的模块;(3)我们提出了一种配备动态权重适应(DWA)的混合分布训练机制,该机制动态地重新加权来自不同分布的训练数据,以强调更具信息量的数据。大量实验表明,R2E-IG在合成和基准数据集的分布内和分布外实例上均取得了与最先进基线相竞争的性能。此外,R2E-IG是通用的,可以轻松集成到现有的基于DRL的方法中,以进一步提高性能。

英文摘要

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhances expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.

2605.26763 2026-05-27 cs.LG cs.AI 版本更新

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

对抗训练用于最坏设施损失下的鲁棒覆盖网络

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

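Affiliations: State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; School of AI, Beijing Institute of Technology

AI summary: For the Maximal Covering Location-Interdiction Problem, proposes a dual-agent deep reinforcement learning framework based on adversarial learning that delivers efficient solving and robust decisions.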

详情
AI中文摘要

最大覆盖选址-阻断问题(MCLIP)是一个经典的双层优化问题,对于韧性基础设施规划至关重要,但计算上仍然难以处理。具体来说,上层确定设施位置以最大化覆盖范围,而下层执行最坏情况下的阻断以最小化覆盖范围。上下层之间的强耦合以及各自的高组合复杂性使得传统方法无效。为了弥补这一差距,我们提出了一种基于对抗学习的双智能体深度强化学习(DADRL)框架,包括对应于上层的选址智能体和对应于下层的阻断智能体。我们的贡献有三方面:(1)选址智能体同时针对不断演化的阻断智能体进行训练,使其有效捕捉上下层之间的动态竞争相互作用;(2)为了充分利用阻断智能体的学习能力,我们提出了一种基于替代的集成推理策略,利用训练好的阻断智能体作为高保真替代来指导选址智能体的决策;(3)在合成和真实世界数据集上的大量实验表明,与其他基线相比,我们的方法在保持高度竞争力的解质量的同时,实现了卓越的计算效率。此外,我们的DADRL框架对网络结构是模型无关的,而其底层的对抗学习范式在解决其他双层优化问题方面显示出强大的潜力。

英文摘要

The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, so that it effectively captures the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of the location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.

2605.26733 2026-05-27 cs.LG cs.AI 版本更新

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

循环语言模型中测试时可扩展潜在推理的稳定循环动力学

Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Nanjing, China

AI summary: Proposes STARS, a training framework that uses Jacobian spectral radius regularization to constrain latent states to approach asymptotically stable fixed points, addressing the performance collapse of looped language models under deep recurrence and enabling reliable test-time scaling with improved peak performance.

Comments ICML 2026

详情
AI中文摘要

循环语言模型(LoopLMs)通过深度递归实现高效的潜在推理,但表现出不可靠的测试时缩放行为:性能通常在某个迭代深度达到峰值,然后随着进一步递归而崩溃。通过潜在动力学分析,我们发现现有架构和策略在稳定性和有效性之间存在固有的权衡。通过将推理概念化为不确定性减少,我们提出收敛到稳定不动点同时保持有效性是一种有前景的方法。为此,我们提出了STARS(稳定性驱动的递归缩放),一种训练框架,约束潜在状态趋近渐近稳定不动点。这通过高效的雅可比谱半径正则化和随机循环采样实现,使STARS能够在确保严格稳定性的同时最大化有效性。在算术任务上的实验表明,STARS实现了可靠的测试时缩放,在复杂数学推理中,它显著减轻了随着递归深度增加而出现的性能退化,同时提高了峰值性能。

英文摘要

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness is a promising approach. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.
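
The stability notion here rests on a standard fact: a fixed point of an iterated map is asymptotically stable when the spectral radius of the map's Jacobian at that point is below 1. The abstract does not say how STARS estimates the spectral radius, so the sketch below simply uses textbook power iteration on a small hand-written Jacobian. It assumes the dominant eigenvalue is real and simple; the matrix and function name are illustrative.

```python
import math

def spectral_radius_power_iteration(J, iters=200):
    """Estimate the dominant eigenvalue magnitude of a square matrix J."""
    n = len(J)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(J[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = math.sqrt(sum(x * x for x in w))   # growth factor of this step
        if rho == 0.0:
            return 0.0
        v = [x / rho for x in w]                 # renormalize to unit length
    return rho

# Jacobian of a toy linear recurrence h <- J h + b (eigenvalues 0.6 and 0.3).
J = [[0.5, 0.2],
     [0.1, 0.4]]
rho = spectral_radius_power_iteration(J)
is_stable = rho < 1.0
print(rho, is_stable)
```

A regularizer in this spirit would penalize estimates of `rho` that approach or exceed 1, so that additional loop iterations contract toward the fixed point instead of diverging.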

2605.26732 2026-05-27 cs.LG 版本更新

APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction

APEX:面向目标数据稀缺的更高频波预测的幅度锚点与相位先验

Yifan Sun, Lei Cheng, Sijie Chen, Ting Zhang, Jianlong Li, Shikai Fang

发表机构 * College of Information Science and Electronic Engineering(信息科学与电子工程学院)

AI总结 提出APEX框架,通过低频神经算子预测幅度作为锚点,结合格林函数启发的相位先验和条件流匹配增强器,在目标数据稀缺时实现高频波场预测,在多个基准上优于直接外推和联合生成方法。

详情
AI中文摘要

基于学习的替代模型在波场预测中日益有效,特别是神经算子在观测频率范围内表现出色。然而,在目标监督稀缺的情况下,高频预测仍相对未被充分探索,尤其是在高频数据模拟或测量成本远高于低频数据的波动问题中。一个核心困难是跨频率迁移本质上是不对称的:粗粒度幅度结构在不同频率间保持相对稳定,而相位敏感的振荡结构随着频率增加而迅速恶化。受此不对称性启发,我们提出APEX(从外推粗预测中进行的幅度锚定和相位先验引导增强),一个针对目标稀缺高频波场预测的框架。低频神经算子首先在目标频率范围内提供粗预测,我们仅保留幅度作为可迁移的结构锚点。然后,条件流匹配增强器在格林函数启发的相位先验指导下重建目标高频场。在SimpleWave、Helmholtz和Maxwell基准上的实验表明,在有限的目标频率监督下,APEX始终优于直接的低频到高频外推、目标自适应算子和联合生成基线。我们的结果表明,振荡波场的可靠高频预测不应依赖于完整复数场的直接端到端迁移,而应显式重用可迁移的粗粒度结构,同时单独恢复缺失的振荡细节。

英文摘要

Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently asymmetric: coarse amplitude structure remains relatively stable across frequencies, whereas phase-sensitive oscillatory structure deteriorates much more rapidly as frequency increases. Motivated by this asymmetry, we propose APEX, Amplitude-anchored and Phase-prior-guided Enhancement from eXtrapolated coarse predictions, a framework for target-scarce higher-frequency wave-field prediction. A lower-frequency neural operator first provides a coarse prediction in the target-frequency regime, from which we retain only the amplitude as a transferable structural anchor. A conditional flow-matching enhancer then reconstructs the target higher-frequency field under the guidance of a Green's-function-inspired phase prior. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks show that APEX consistently outperforms direct lower-to-higher extrapolation, target-adapted operator, and joint generative baselines under limited target-frequency supervision. Our results suggest that reliable higher-frequency prediction of oscillatory wave fields should not rely on direct end-to-end transfer of the full complex field, but instead on explicitly reusing transferable coarse structure while separately recovering the missing oscillatory detail.

2605.26718 2026-05-27 cs.LG 版本更新

MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction

MTL-FNO:一种用于稀疏场重建的轻量级多任务傅里叶神经算子

Siyu Ye, Shihang Li, Zhiqiang Gong, Benrong Zhang, Weien Zhou, Yiyong Huang, Wen Yao

发表机构 * Defense Innovation Institute, Academy of Military Science, Beijing, 100071, China(国防科技创新研究院,军事科学院,北京,100071,中国) Intelligent Game and Decision Laboratory, Beijing, 100071, China(智能博弈与决策实验室,北京,100071,中国)

AI总结 针对航空航天飞行器多场稀疏重建中模型庞大且难以利用跨场相关性的问题,提出基于硬参数共享的轻量级多任务傅里叶神经算子MTL-FNO,通过极坐标解耦优化和Cayley变换实现高效联合训练,在少样本条件下模型大小减少76%和60%且精度相当或更优。

详情
AI中文摘要

高效的星载多场稀疏重建对于航空航天飞行器的自主运行至关重要。虽然现有的深度学习模型在单场重建中表现出潜力,但部署多个独立模型会导致模型尺寸急剧增长,并且无法利用跨场相关性,尤其是在少样本条件下。为了解决这些挑战,我们首先提出了一种轻量级多任务傅里叶神经算子(MTL-FNO),这是一种基于硬参数共享的端到端联合训练框架。在每一层中,参数被分为共享部分和任务特定部分,以捕获各场之间的共同特征,同时保留任务特定特征。此外,任务特定的微调参数被实现为低秩项,实现了显著的模型压缩。其次,为了解决共享参数和任务特定参数及其实部和虚部联合优化的困难,我们从极坐标形式的角度重新审视了FNO的谱权重,并设计了一种具有物理意义的解耦优化方案。具体地,我们应用极分解将谱权重逐片解耦为编码相位信息的酉张量和表征振幅的半正定张量。通过解耦相位和振幅的优化,我们的方法可以有效缓解任务冲突。同时,为了在训练过程中保持酉几何保真度,引入Cayley变换对酉张量进行重参数化,将约束优化问题转化为无约束优化问题。最后,在两个代表性工程案例上验证了所提方法在少样本条件下的有效性。结果表明,MTL-FNO达到了与标准FNO相当甚至更优的精度,同时分别将总模型大小减少了76%和60%。

英文摘要

Efficient onboard multi-field sparse reconstruction is essential for the autonomous operation of aerospace vehicles. While existing deep learning models show promise for single-field reconstruction, deploying multiple independent models leads to prohibitive model size growth and fails to exploit cross-field correlations, particularly under few-shot conditions. To address these challenges, we first propose a lightweight multi-task Fourier neural operator (MTL-FNO), an end-to-end joint training framework based on hard parameter sharing. In each layer, the parameters are divided into shared and task-specific components to capture common features across fields while preserving task-specific characteristics. Moreover, the task-specific fine-tuning parameters are implemented as low-rank terms, achieving substantial model compression. Second, to address the difficulty of co-optimizing shared and task-specific parameters along with their real and imaginary parts, we revisit the FNO's spectral weight from a polar-form perspective and devise a physically meaningful decoupled optimization scheme. Specifically, we apply polar decomposition to disentangle the spectral weight, slice by slice, into a unitary tensor encoding phase information and a positive semi-definite tensor characterizing amplitude. By decoupling the optimization of phase and amplitude, our method can effectively mitigate task conflict. Meanwhile, to preserve unitary geometric fidelity during training, the Cayley transform is introduced to reparameterize the unitary tensor, converting the constrained optimization problem into an unconstrained one. Finally, the effectiveness of the proposed method under few-shot conditions is validated on two representative engineering cases. Results show that MTL-FNO achieves accuracy comparable to or even surpassing that of standard FNO, while reducing total model size by 76% and 60%, respectively.
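The Cayley reparameterization mentioned above can be sketched for a single 2x2 slice. This is a generic illustration, not the MTL-FNO code: the function names, the choice of a 2x2 slice, the particular Cayley convention `Q = (I + A)^{-1}(I - A)`, and the diagonal amplitude factor `P` (a special case of a positive semi-definite matrix) are assumptions made for the example.

```python
def matmul2(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def conj_transpose2(X):
    """Conjugate transpose of a 2x2 complex matrix."""
    return [[X[j][i].conjugate() for j in range(2)] for i in range(2)]

def cayley_unitary(a, b, z):
    """Map unconstrained parameters (real a, b; complex z) to a unitary Q.

    A = [[i*a, z], [-conj(z), i*b]] is skew-Hermitian, so
    Q = (I + A)^{-1} (I - A) is unitary for any parameter values.
    """
    m00, m01 = 1 + 1j * a, z
    m10, m11 = -z.conjugate(), 1 + 1j * b
    det = m00 * m11 - m01 * m10  # never zero for skew-Hermitian A
    m_inv = [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]
    n = [[1 - 1j * a, -z], [z.conjugate(), 1 - 1j * b]]
    return matmul2(m_inv, n)

# Phase factor from unconstrained parameters (values chosen arbitrarily).
Q = cayley_unitary(0.3, -1.2, 0.7 + 0.4j)
# Amplitude factor: a diagonal PSD matrix, assumed here for simplicity.
P = [[2.0, 0.0], [0.0, 0.5]]
# Polar form of one spectral-weight slice: W = Q P.
W = matmul2(Q, P)
gram_Q = matmul2(conj_transpose2(Q), Q)
gram_W = matmul2(conj_transpose2(W), W)
```

Any gradient step on `a`, `b`, or `z` keeps `Q` exactly unitary, which is why the constrained problem becomes unconstrained. The amplitude is recovered from `W` alone, since `W^H W = P^2`.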

2605.26715 2026-05-27 cs.LG 版本更新

Image Feature Fusion-based Federated Client Unlearning (FCU)

基于图像特征融合的联邦客户端遗忘 (FCU)

Hangyi Shen, Yizhi Pan, Tiansuo Li, Weiqi Jiang, Guanqun Sun

AI总结 针对联邦遗忘中灾难性遗忘导致全局泛化下降的问题,提出基于线性图像特征融合机制(Mixup)的联邦客户端遗忘方法,通过动态生成混合样本弥合遗忘与保留分布,在医学影像基准上实现了与重训练标准相当的遗忘效果。

详情
AI中文摘要

主要数据保护法规都提到了“被遗忘权”,这推动了联邦遗忘技术的发展。但一个顽固的问题仍然存在:灾难性遗忘——你擦除了目标知识,但同时也丢弃了必要的保留知识,从而损害了模型的全局泛化能力。为了在遗忘效果和泛化能力之间取得更好的平衡,我们提出了基于图像特征融合的联邦客户端遗忘(IFF-FCU)。其思想是引入线性图像特征融合机制(Mixup),动态创建混合样本,弥合遗忘分布和保留分布之间的差距。该策略不仅仅是删除几个离散的数据点——它在理论上拓宽并正则化了遗忘边界。我们在医学影像基准(RSNA-ICH 和 ISIC2018)上进行了大量实验,结果表明我们的方法实现了相当好的遗忘效果。例如,在 ICH 数据集上,IFF-FCU 实现了与重训练黄金标准高度竞争的误差偏差,显示出对现有基线的稳健改进。

英文摘要

Major data protection regulations all mention the "right to be forgotten," and that's what pushed federated unlearning (FU) techniques forward. But one stubborn issue remains: catastrophic forgetting--you erase the target knowledge, yet somehow you also end up throwing out essential retained knowledge, which then hurts the model's global generalization. To get a better balance between unlearning effectiveness and generalization ability, we propose something called Image Feature Fusion-based Federated Client Unlearning (IFF-FCU). The idea is to bring in a linear Image Feature Fusion mechanism (Mixup) that dynamically creates mixed samples, bridging the gap between forget-distribution and retain-distribution. What this strategy does isn't just deleting a few discrete data points--it theoretically widens and regularizes the forgetting boundary. We ran extensive experiments on medical imaging benchmarks (RSNA-ICH and ISIC2018), and the results show that our approach achieves reasonably good unlearning. For instance, on the ICH dataset, IFF-FCU achieves a highly competitive Error deviation from the retrained gold standard, demonstrating robust improvements over existing baselines.
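The Mixup step named in the abstract is a convex combination of two samples. The sketch below shows only that generic operation applied to a forget-set sample and a retain-set sample; how IFF-FCU uses the mixed samples in its unlearning objective is not described in the abstract, so the helper names, the Beta(0.4, 0.4) sampling, and the toy feature vectors are assumptions.

```python
import random

def mixup(x_forget, x_retain, lam):
    """Convex combination of two equally sized feature vectors."""
    return [lam * f + (1.0 - lam) * r for f, r in zip(x_forget, x_retain)]

def sample_mixed_batch(forget_set, retain_set, n, alpha=0.4, seed=0):
    """Draw n mixed samples, each with its own mixing weight lam ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        lam = rng.betavariate(alpha, alpha)
        x = mixup(rng.choice(forget_set), rng.choice(retain_set), lam)
        batch.append((x, lam))
    return batch

# Toy feature vectors standing in for image features.
forget_set = [[1.0, 0.0], [0.9, 0.1]]
retain_set = [[0.0, 1.0], [0.2, 0.8]]
fixed = mixup([1.0, 0.0], [0.0, 1.0], 0.25)
batch = sample_mixed_batch(forget_set, retain_set, n=8)
```

Each mixed sample lies on the segment between a forget sample and a retain sample, which is one way to read the abstract's claim that the method bridges the two distributions rather than deleting isolated points.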

2605.26713 2026-05-27 stat.ML cs.LG 版本更新

Transformers Can Learn Posterior Predictive Distributions In-Context

Transformer可以在上下文中学习后验预测分布

Gyeonghun Kang, Changwoo J. Lee, Xiang Cheng

发表机构 * Department of Statistical Science, Duke University, Durham, NC, USA(统计科学系,杜克大学,达勒姆,北卡罗来纳州,美国) Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA(电气与计算机工程系,杜克大学,达勒姆,北卡罗来纳州,美国)

AI总结 本文通过构造证明Transformer能够实现针对后验预测均值和方差的梯度下降算法,并研究其逼近后验预测分布的误差界,揭示了归一化和注意力深度对泛化能力的关键作用。

详情
AI中文摘要

先验数据拟合网络(PFN)最近已成为贝叶斯预测任务的一种强大方法,通过上下文学习近似后验预测分布(PPD)。尽管它们具有强大的实证性能和超越点预测的能力,但对Transformer在上下文中学习分布的算法能力的理论理解仍然缺乏。聚焦于高斯过程回归问题,我们通过构造证明Transformer可以实现针对后验预测均值和方差的梯度下降算法,随后通过非线性映射产生PPD的分箱概率。我们根据注意力深度和分箱分辨率研究了近似PPD的误差界。基于这些结果,我们进一步证明了归一化和注意力深度的选择在使Transformer能够超越预训练样本大小范围进行外推中的关键作用。我们进行了模拟实验,验证了我们的发现,为针对PPD的PFN的表达能力以及架构选择如何影响泛化能力提供了见解。

英文摘要

Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.

2605.24041 2026-05-27 cs.LG cs.AI 版本更新

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

迭代精化神经算子:一种学习型不动点求解器——频谱偏差缓解的原则性方法

Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang

发表机构 * Dartmouth College(达特茅斯学院) CUHK Shenzhen(香港中文大学(深圳)) Lawrence Berkeley National Lab(劳伦斯伯克利国家实验室)

AI总结 提出迭代精化神经算子(IRNO),通过固定点迭代应用学习精化模块,结合渐进频谱损失,有效缓解神经算子的频谱偏差,在湍流和活性物质等物理系统中显著降低高频误差。

Comments 47 pages; accepted to ICML 2026 as a Spotlight

详情
AI中文摘要

神经算子作为科学建模的快速数据驱动替代方法,通常依赖于单一前向推理过程,难以解析高频细节,这一局限性称为频谱偏差。我们引入迭代精化神经算子(IRNO),通过固定点迭代反复应用学习精化模块来增强预训练算子。IRNO将预测分解为粗初始化及随后的残差校正,类似于经典数值求解器。在局部假设下,我们建立了诱导算子的收缩性,确保收敛到唯一不动点。为明确针对高频误差,我们提出渐进频谱损失,在训练过程中自适应地增加对高频分量的惩罚。在物理系统中,IRNO持续降低误差,在湍流中提升高达56.05%。在活性物质中,频谱分析显示,相对于基础算子,归一化误差比在低频降至27.72-36.10%,中频降至5.07-6.68%,高频降至1.48-2.04%,且在训练迭代次数之外保持稳定。代码见 https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator。

英文摘要

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module iteratively applied via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72-36.10% in low-, 5.07-6.68% in mid-, and 1.48-2.04% in high-frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator
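The "coarse initialization followed by successive residual corrections" pattern parallels classical iterative refinement, which can be sketched on a small linear system. This is an analogy only: the damped residual step in `refine` stands in for IRNO's learned refinement module, and the matrix `A`, right-hand side `f`, and step size `omega` are assumptions chosen so that the update map is a contraction.

```python
def residual(A, u, f):
    """Residual r = f - A u of the current estimate."""
    return [fi - sum(a * x for a, x in zip(row, u)) for row, fi in zip(A, f)]

def refine(u, r, omega=0.25):
    """Stand-in for a learned refinement module: a damped residual step."""
    return [x + omega * ri for x, ri in zip(u, r)]

def iterative_refinement(A, f, u0, steps):
    """Apply the refinement step repeatedly, recording the residual size."""
    u, history = list(u0), []
    for _ in range(steps):
        r = residual(A, u, f)
        history.append(max(abs(x) for x in r))
        u = refine(u, r)
    return u, history

# With omega = 0.25, I - omega * A has spectral norm of about 0.40,
# so the update is a contraction with a unique fixed point A u = f.
A = [[4.0, 1.0], [1.0, 3.0]]
f = [1.0, 2.0]
u_coarse = [0.0, 0.0]  # coarse initialization
u_final, history = iterative_refinement(A, f, u_coarse, steps=60)
```

Because the map is a contraction, running more steps than originally planned keeps the iterate at the same fixed point, which mirrors the abstract's observation that errors remain stable beyond the trained iteration count.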

2605.22557 2026-05-27 cs.LG cs.NA math.NA 版本更新

Neural Flow Operators can Approximate any Operator: Abstract Frameworks and Universal Approximations

神经流算子可以逼近任意算子:抽象框架与通用逼近

Shuang Chen, Juncai He, Xue-Cheng Tai

发表机构 * Qiuzhen College, Tsinghua University(清华大学求真书院) Yau Mathematical Sciences Center, Tsinghua University(清华大学丘成桐数学科学中心) Norwegian Research Center(挪威研究中心)

AI总结 提出神经流抽象框架,涵盖组合与分离结构的连续深度模型,证明其在有限维和无限维空间中的通用逼近性质,并通过时间离散化统一残差与普通架构。

详情
AI中文摘要

我们为神经网络和神经算子引入了一个抽象的神经流框架。该框架包含两种连续深度模型,即具有组合和分离结构的神经流,并涵盖了有限维函数逼近和无限维算子逼近。我们证明了相应神经流的适定性和通用逼近性质,包括据我们所知,首个无限维空间之间基于流的模型的通用逼近结果。我们还获得了卷积神经流模型的通用逼近结果。通过适当的时间离散化,组合结构恢复了ResNet类型的架构,而分离结构通过基于分裂的离散化产生了普通架构。这为具有全连接或卷积线性层的神经网络和神经算子的残差和普通架构提供了一条统一的基于流的路径。

英文摘要

We introduce an abstract neural flow framework for neural networks and neural operators. The framework contains two continuous-depth models, namely neural flows with composition and separation structures, and covers both finite-dimensional function approximation and infinite-dimensional operator approximation. We prove well-posedness and universal approximation properties for the corresponding neural flows, including, to the best of our knowledge, the first universal approximation result for flow-based models between infinite-dimensional spaces. We also obtain universal approximation results for convolutional neural flow models. Through suitable time discretizations, the composition structure recovers ResNet-type architectures, while the separation structure, via a splitting-based discretization, yields plain architectures. This gives a unified flow-based route to both residual and plain architectures for neural networks and neural operators with fully connected or convolutional linear layers.

2605.22468 2026-05-27 cs.LG cs.AI 版本更新

BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

BioFormer: 通过频谱结构对齐重新思考生物医学时间序列中的跨主体泛化

Guikang Du, Haoran Li, Xinyu Liu, Zhibo Zhang, Xiaoli Gong, Jin Zhang

发表机构 * College of Computer Science, Nankai University, Tianjin, China(南开大学计算机科学学院,中国天津) College of Cyber Science, Tianjin Key Laboratory of Interventional Brain-Computer Interface and Intelligent Rehabilitation, Key Lab of Data and Intelligent System Security, Frontiers Science Center for New Organic Matter, Nankai University, Tianjin, China(网络空间安全学院,天津市介入脑机接口与智能康复重点实验室,数据与智能系统安全重点实验室,有机新物质创造前沿科学中心,南开大学,中国天津)

AI总结 提出BioFormer模型,通过频谱漂移视角显式建模主体特异性变异,利用频带对齐模块和样本条件层归一化对齐频谱结构,在六个数据集上F1分数提升6%。

详情
AI中文摘要

生物医学时间序列中的跨主体泛化指在一些主体数据上训练并在未见主体上测试。关键挑战是抑制BTS表示中的主体特异性变异。大多数现有方法通过模型构建或主体对抗学习隐式抑制变异,但很少显式建模。我们引入频谱漂移作为表征主体特异性变异的新视角。具体来说,相同标签下的BTS信号通常共享一致的振荡结构,但在特定频率分量上表现出依赖于主体的幅度或相位偏移,我们将其解释为主体特异性变异。基于这一见解,我们提出BioFormer。其核心是频带对齐模块(FBAM),该模块从频谱分布生成带级调制因子,并自适应调整幅度和相位以对齐频谱结构,从而减轻变异。我们进一步将FBAM与样本条件层归一化配对,该归一化从内在信号统计量而非主体身份推断归一化参数,稳定跨主体表示。在六个数据集上的大量实验表明,BioFormer优于12个基线,绝对F1分数提升6%。

英文摘要

Cross-subject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS representations. Most existing methods implicitly suppress the variability through model building or subject-adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective to characterize subject-specific variability. Specifically, BTS signals under the same label often share consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating variability. We further pair FBAM with Sample Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject representations. Extensive experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.

2605.21617 2026-05-27 cs.LG q-bio.QM 版本更新

BlockFormer: Transformer-based inference from interaction maps

BlockFormer:基于交互图的Transformer推断

Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel

发表机构 * Univ. Grenoble Alpes(格勒诺布尔阿尔卑斯大学) Inria(法国国家信息与自动化研究所) CNRS(法国国家科学研究中心) Grenoble INP(格勒诺布尔综合理工学院) LJK(Jean Kuntzmann实验室) TIMC

AI总结 提出BlockFormer,一种基于Transformer架构的数据驱动方法,通过模拟器生成合成数据训练,解决从交互图中推断可变数量和大小实体参数的反问题,并成功应用于多种物种的着丝粒定位。

详情
AI中文摘要

从交互图中进行推理,例如从全基因组染色体构象捕获技术(特别是Hi-C)中识别着丝粒,可以表述为一个通用的反问题:给定一个通过可变数量和大小的块总结实体间成对相互作用的图,推断一组参数。在这项工作中,我们引入了一种数据驱动的方法,利用这些图之间的共享结构(例如局部模式的全局对齐),同时处理真实数据中实体数量和大小可变性。我们的方法依赖于能够处理这种可变性的Transformer架构,以及一个自定义模拟器,用于生成丰富且计算成本低廉的合成数据进行训练。应用于着丝粒定位问题,该方法能够准确恢复各种基因组大小的多种物种的着丝粒基因组位置。

英文摘要

Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between entities through blocks of variable numbers and sizes. In this work, we introduce a data-driven approach that leverages shared structure between these maps, such as global alignment between localized patterns, while handling the variability in number and size of entities arising in real-world data. Our approach relies on a transformer architecture capable of handling such variability and a custom simulator to generate abundant, yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers their genomic positions across a wide range of species of various genome sizes.

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE 版本更新

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas:超越LLM智能体的结果排行榜

Parsa Mazaheri, Kasra Mazaheri

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出AgentAtlas框架,通过控制决策分类法和轨迹故障词汇表,将智能体评估从结果成功分离为控制决策质量和轨迹质量,并揭示仅依赖结果排行榜的测量风险。

详情
AI中文摘要

大型语言模型智能体现在可以操作代码库、浏览器、操作系统、日历、文件和工具生态系统,但它们的评估通常将行为简化为最终任务成功。AgentAtlas将智能体评估重新定义为一种诊断词汇和审计协议,用于将结果成功与控制决策质量和轨迹质量分离。本文贡献了:(i) 一个六状态控制决策分类法(行动/询问/拒绝/停止/确认/恢复);(ii) 一个包含主要错误源和下游影响的轨迹失败词汇表;(iii) 对十五个智能体基准的0/1/2基准覆盖审计;(iv) 一个在合成1,342项数据集上进行的说明性协议研究,使用八种模型在分类法感知和分类法盲提示格式下进行评估。该合成演示不是公开基准发布,不应被视为确定的模型比较。相反,它说明了两个测量风险:当显式标签菜单被移除时,映射标签一致性可能发生显著变化,并且轴选择可能改变表观排名。AgentAtlas旨在帮助基准设计者说明他们覆盖的行为,并帮助评估者诊断仅结果排行榜隐藏的失败。

英文摘要

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

2605.04932 2026-05-27 stat.ML cs.LG 版本更新

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

协变量漂移下部署风险的雅可比-速度界

Jonathan R. Landers

发表机构 * Independent Researcher(独立研究者)

AI总结 针对动态协变量漂移下冻结预测器的长期部署风险,提出基于时域庞加莱不等式和雅可比-速度定理的路径控制方法,并设计漂移对齐切线正则化(DTR)以降低风险波动。

Comments 8 pages, 4 figures, 4 tables

详情
AI中文摘要

我们研究了动态协变量漂移下冻结预测器的长期部署问题。时域庞加莱不等式首先将时间风险波动降低为导数能量。然后,雅可比-速度定理提供了相应的路径控制。在明确的规则性和支配假设下,该定理将沿部署路径的方向切线能量识别为控制量。在低秩漂移下,该量减少为漂移子空间中的方向雅可比能量,从而激发了漂移对齐切线正则化(DTR)和匹配的监测代理。DTR不是各向同性地平滑网络,而是仅沿估计的漂移方向惩罚敏感性。我们通过四个实验验证了从定理到方法的流程:一个用于时域不等式的合成基准,一个与各向同性雅可比正则化对比的受控合成实验,以及在UCI空气质量数据集和Tetouan电力消耗数据集上的两个冻结部署研究。DTR在受控低秩区域降低了风险波动和方向增益,并优于各向同性平滑。它还在两个真实数据集上给出了验证选择的部署增益,其中空气质量子空间是从目标正交传感器运动估计的。适度的漂移子空间错误指定是可容忍的,而正交错误指定则基本消除了收益。

英文摘要

We study long-horizon deployment of a frozen predictor under dynamic covariate shift. A time-domain Poincare inequality first reduces temporal risk volatility to derivative energy. A Jacobian-velocity theorem then supplies the corresponding pathwise control. Given explicit regularity and domination assumptions, the theorem identifies directional tangent energy along the deployment path as the governing quantity. Under low-rank drift, that quantity reduces to directional Jacobian energy in the drift subspace, motivating drift-aligned tangent regularization (DTR) and a matched monitoring proxy. Rather than smoothing the network isotropically, DTR penalizes sensitivity only along estimated drift directions. We validate the theorem-to-method pipeline in four experiments: a synthetic benchmark for the time-domain inequality, a controlled synthetic comparison against isotropic Jacobian regularization, and two frozen-deployment studies on the UCI Air Quality and Tetouan power-consumption datasets. DTR reduces risk volatility and directional gain in the controlled low-rank regime and beats isotropic smoothing there. It also gives validation-selected deployment gains on both real datasets, with the Air Quality subspace estimated from target-orthogonal sensor motion. Moderate drift-subspace misspecification is tolerable while orthogonal misspecification largely removes the benefit.
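The contrast between penalizing sensitivity along a drift direction and smoothing isotropically can be made concrete with a finite-difference estimate of the directional Jacobian energy. The sketch below is an illustration rather than the paper's DTR implementation: the toy `predictor`, the drift direction `v_drift`, and the central-difference estimator are assumptions.

```python
def directional_energy(f, x, v, eps=1e-6):
    """Squared norm of J_f(x) v, estimated by a central finite difference."""
    xp = [a + eps * b for a, b in zip(x, v)]
    xm = [a - eps * b for a, b in zip(x, v)]
    jv = [(p - m) / (2 * eps) for p, m in zip(f(xp), f(xm))]
    return sum(c * c for c in jv)

def isotropic_energy(f, x, eps=1e-6):
    """Squared Frobenius norm of the Jacobian, summed over coordinate axes."""
    n = len(x)
    total = 0.0
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        total += directional_energy(f, x, e_i, eps)
    return total

def predictor(x):
    """Toy frozen predictor with Jacobian [3, x[1]]."""
    return [3.0 * x[0] + 0.5 * x[1] ** 2]

x = [1.0, 2.0]
v_drift = [1.0, 0.0]  # assumed rank-one drift subspace
drift_penalty = directional_energy(predictor, x, v_drift)
full_penalty = isotropic_energy(predictor, x)
```

At this point the drift-aligned penalty is about 9 while the isotropic penalty is about 13. A drift-aligned regularizer would therefore leave the sensitivity along the second coordinate unpenalized, which is the distinction the abstract draws against isotropic smoothing.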

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG 版本更新

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态:利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

发表机构 * Peking University(北京大学) Nanyang Technological University(南洋理工大学) Beijing Jiaotong University(北京交通大学)

AI总结 通过因果追踪识别出稀疏的“拒绝轨迹”激活模式,并提出轻量级白盒检测器SALO,基于隐藏状态窗口实现鲁棒越狱检测。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情
AI中文摘要

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪,我们识别出一个“拒绝轨迹”(Refusal Trajectory):一种稀疏的上游激活模式,即使当诸如GCG的攻击抑制终端拒绝信号时,该模式也常常持续存在。基于这一观察,我们提出了SALO(稀疏激活定位算子),一种轻量级白盒检测器,它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上,SALO在固定的XSTest校准工作点下,改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况,阐明了拒绝轨迹监测的前景和局限性。

英文摘要

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a Refusal Trajectory: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

2605.01817 2026-05-27 cs.LG 版本更新

Skipping the Zeros in Diffusion Models for Sparse Data Generation

跳过扩散模型中的零值以生成稀疏数据

Phil Sidney Ostheimer, Mayank Nagda, Andriy Balinskyy, Gabriel Vicente Rodrigues, Jean Radig, Carl Herrmann, Stephan Mandt, Marius Kloft, Sophie Fellenz

发表机构 * RPTU University Kaiserslautern-Landau(凯泽斯劳滕-兰道大学RPTU) Heidelberg University(海德堡大学) University of California, Irvine(加州大学尔湾分校)

AI总结 提出稀疏利用扩散(SED)方法,通过仅建模非零值来保持稀疏性,在训练和推理中跳过零值以节省计算并提升生成质量。

Comments Accepted to ICML 2026

详情
AI中文摘要

扩散模型(DMs)在密集连续数据上表现出色,但并非为稀疏连续数据设计。它们无法建模代表信号有意缺失的精确零值。因此,它们会抹去稀疏模式,并对大部分为零的条目执行不必要的计算。通过稀疏利用扩散(SED),我们仅对非零值建模,从而保持稀疏性。SED通过在训练和推理过程中跳过零值,在保持或提高生成质量的同时节省计算。在物理和生物学基准测试中,SED匹配或超越了传统DMs和领域特定基线,而视觉实验则提供了对密集DMs局限性及SED优势的直观理解。

英文摘要

Diffusion models (DMs) excel on dense continuous data, but are not designed for sparse continuous data. They do not model exact zeros that represent the deliberate absence of a signal. As a result, they erase sparsity patterns and perform unnecessary computation on mostly zero entries. With Sparsity-Exploiting Diffusion (SED), we model only non-zero values, preserving sparsity. SED delivers computational savings while maintaining or improving generation quality by skipping zeros during training and inference. Across physics and biology benchmarks, SED matches or surpasses conventional DMs and domain-specific baselines, while vision experiments provide intuitive insights into the limitations of dense DMs and the benefits of SED.

2605.26693 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Model Merging on Loss Landscape: A Geometry Perspective

损失景观上的模型合并:几何视角

Juanwu Lu, Anand Bhaskar, Brian Axelrod, Ekaterina Tolstaya, Tristan Emrich

发表机构 * Purdue University(普渡大学) Waymo LLC(Waymo公司)

AI总结 提出EpiMer框架,将模型合并视为黎曼流形上的Fréchet均值,利用任务向量张成的低秩子空间和期望Hessian度量,理论证明曲率感知合并优于平坦几何方法,并在八个图像分类任务上验证了性能提升。

Comments CVPR 2026 Findings Track. 18 pages, 4 figures, 6 tables

详情
AI中文摘要

模型合并为无需重新训练的知识集成和并行开发提供了有前景的途径。然而,现有方法要么忽略损失景观的几何结构,要么依赖于难以处理的全空间Hessian近似。我们提出EpiMer,一个将模型合并视为黎曼流形上Fréchet均值求解的框架,并将计算限制在由任务向量张成的低秩子空间内。以期望Hessian作为度量,我们揭示了局部曲率与参数认知不确定性之间的联系。我们的理论分析将合并误差界分解为子空间Fréchet方差和残差能量,并提供了曲率感知合并何时在理论上优于平坦几何方法的闭式刻画。此外,我们的框架将曲率感知方法和最近的谱方法统一为不同几何度量下子空间Fréchet均值的特例。在八个图像分类任务上合并微调的CLIP-ViT模型,Epistemic Merging在匹配秩下严格优于所有三个CLIP-ViT骨干网络的基线,提高了每个骨干网络上的跨任务平均准确率和最差任务准确率。

英文摘要

Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.

2605.26690 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

面向Oracle预算约束下蛋白质设计的生物引导搜索自改进模仿

Ashima Khanna, Dominik Grimm

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Applied Sciences Weihenstephan-Triesdorf(魏恩施蒂芬-特里斯多夫应用科学大学)

AI总结 提出SILO框架,通过层次化编辑策略、增量随机束搜索和UCB代理集成,在有限oracle预算下实现蛋白质序列优化,在8个蛋白质适应度景观上达到最优性能。

详情
AI中文摘要

在严格的oracle预算下进行蛋白质序列优化需要探索巨大的组合空间,同时使每次评估都具有信息量。现有的强化学习和离策略生成方法在代理噪声下性能下降,且位置无关的突变提议可能破坏功能关键残基。我们提出了SILO,一个用于oracle预算蛋白质设计的轨迹级自改进模仿框架。SILO使用层次化编辑策略,将每个突变分解为位置选择后跟残基选择。在每个主动学习轮次中,策略通过增量随机无放回束搜索(SBS)采样候选轨迹,结合基于UCB的代理集成和丙氨酸扫描适应度分数(AFS),选择具有功能相关编辑的候选进行计算机oracle评估。然后,通过在轮次中最佳oracle标记轨迹上的下一动作交叉熵模仿来更新策略,避免值函数估计。在八个复现的蛋白质适应度景观和来自先前工作的五个强基线上,SILO在我们的评估中在8/8的景观上实现了最高的最大和top-100平均适应度,通常表现出更快的早期改进。在每种设置两个景观的低数据和噪声代理压力测试中,当多个基线退化时,SILO保持竞争力或最佳。消融实验表明,SBS与AFS贡献了大部分增益,迭代模仿提供了额外改进。代码可在:https://github.com/grimmlab/SILO.git 获取。

英文摘要

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
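Two pieces of the loop described above, the position-then-residue edit and the UCB-based selection under an oracle budget, can be sketched in a few lines. This is not the SILO code: the proxy models are toy letter-counting functions, the UCB score is taken as ensemble mean plus `beta` times ensemble standard deviation, and the alanine-scan fitness score and stochastic beam search are omitted. All names and values are assumptions.

```python
import statistics

def apply_edit(seq, position, residue):
    """Hierarchical edit: choose a position first, then the new residue."""
    return seq[:position] + residue + seq[position + 1:]

def ucb_score(predictions, beta=1.0):
    """Optimistic score: ensemble mean plus beta times ensemble spread."""
    return statistics.fmean(predictions) + beta * statistics.pstdev(predictions)

def select_for_oracle(candidates, ensemble, budget, beta=1.0):
    """Keep the `budget` candidates with the highest UCB score."""
    scored = [(ucb_score([m(seq) for m in ensemble], beta), seq)
              for seq in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [seq for _, seq in scored[:budget]]

# Toy proxy ensemble: two hypothetical fitness predictors that disagree on "W".
ensemble = [
    lambda s: float(s.count("A")),
    lambda s: float(s.count("A") + 2 * s.count("W")),
]
parent = "AAAA"
mutant = apply_edit(parent, 1, "W")
candidates = [parent, mutant, "GGGG"]
chosen = select_for_oracle(candidates, ensemble, budget=2)
```

The mutant and its parent have the same mean proxy score, but the ensemble disagrees on the mutant, so the UCB bonus ranks it first. That is the exploration effect an oracle-budgeted selection step relies on.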

2605.26675 2026-05-27 stat.ML cs.LG 版本更新

CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk

CART随机森林作为随机机会集上的序贯分配:集成风险的随机控制理论

Tianxing Mei, Yingying Fan, Mingming Leng, Jinchi Lv

发表机构 * Faculty of Business, Lingnan University(岭南大学商学院) Data Sciences and Operations Department, University of Southern California(南加州大学数据科学与运营部门)

AI总结 本文从随机控制视角将CART随机森林建模为随机机会集上的序贯分配过程,通过分离特征子采样和信息分裂策略两个设计杠杆,揭示了森林均方误差的构成,并证明了CART策略的局部稳定性与全局次优性。

Comments 69 pages, 1 figure

详情
AI中文摘要

CART随机森林是最广泛使用的现代预测方法之一,具有充分记录的经验成功。然而,在机制层面,由于其复杂性,该算法通常被视为黑箱。在本文中,我们发展了特征子采样CART随机森林的随机控制视角,称为CART随机机会集分配(CART-ROSA)。在每个节点,特征的随机子集被解释为随机可行动作集,CART分裂规则被解释为掩码动作分配策略。该策略在信息性分裂计数状态上诱导出一个受控的随机过程,其终末分布决定了森林均方误差(MSE)中的单棵树误差和树间交互项。这种表示通过分离两个设计杠杆——特征子采样引起的信息性机会率和掩码内分裂策略的收缩强度——打开了CART森林的黑箱。我们证明CART策略是局部稳定的:它收缩了信息性分裂分配中的不平衡,并集中了终末树的几何结构。然而,在系统层面,它对森林目标可能是全局次优的。针对线性模型,我们显式推导了MSE风险展开。我们的结果表明,运筹学视角如何使从CART森林的标准算法描述难以触及的理论缺口变得可处理。

英文摘要

CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we develop a stochastic-control perspective on feature-subsampled CART random forests, named CART random opportunity-set allocation (CART-ROSA). At each node, the random subset of features is interpreted as a random feasible action set, and the CART split rule as a masked-action allocation policy. This policy induces a controlled stochastic process over informative split-count states, whose terminal law determines both single-tree error and cross-tree interaction terms in the forest mean squared error (MSE). Such representation opens the black box of CART-forests by separating two design levers: the informative-opportunity rate induced by feature subsampling, and the contraction strength from the within-mask split policy. We establish that the CART policy is locally stabilizing: it contracts imbalances in informative split allocations and concentrates terminal tree geometry. At the system level, however, it can be globally suboptimal for the forest objective. Specializing to the linear model, we derive the MSE risk expansion explicitly. Our results show how an operations-research perspective makes tractable a theoretical gap difficult to access from the standard algorithmic description of CART forests.

2605.26667 2026-05-27 cs.AI cs.LG 版本更新

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail: LLM记忆系统的故障模式压力测试

Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MemFail基准测试,通过形式化记忆系统为摘要、存储和检索三个操作并构建对抗性数据集,系统性地评估和诊断LLM记忆系统的故障模式。

详情
AI中文摘要

大型语言模型(LLM)代理越来越依赖外部记忆系统以在长程交互中保持一致性,但关于这些系统具体故障模式和设计选择的实证研究很少。现有基准报告聚合的问答准确率,将记忆系统视为黑箱,无法将错误答案归因于系统的特定故障模式。我们引入MemFail,一个诊断性基准,用于隔离现代LLM记忆系统的故障模式。我们首先将记忆系统形式化为三个规范操作的组合——摘要、存储和检索——并识别每个操作可能引发的故障模式。基于这些假设的故障模式,我们构建了跨越四个任务的五个数据集,每个数据集都经过对抗性设计以测试记忆系统的特定操作。使用这些数据集,我们在MemFail上评估了四种最先进的记忆系统,并展示了MemFail如何用于实证理解记忆系统架构差异带来的权衡。

英文摘要

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.

2605.26655 2026-05-27 cs.CL cs.LG cs.NE 版本更新

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

为什么提示优化有效,以及为什么有时无效:一种因果启发的编辑级分析

Shuzhi Gong, Hechuan Wen

发表机构 * The University of Melbourne(墨尔本大学) The University of Queensland(昆士兰大学)

AI总结 本文通过因果推断方法分析自动提示优化在不同任务和模型上的泛化失败原因,发现编辑类型与任务特性之间的系统性交互作用。

Comments 17 pages, 4 figures, 8 tables

详情
AI中文摘要

自动提示优化方法(例如 DSPy、TextGrad)可以显著提升大语言模型(LLM)的性能,然而,它们在不同任务上的泛化能力仍然不足。在实践中,优化后的提示在一个基准上的优势往往无法迁移到另一个基准,即使切换不同的 LLM 骨干网络,这种局限性依然存在。为了探究提示性能中未被充分探索的异质性来源,我们对跨多种优化框架、LLM 骨干网络和 NLP 基准的优化提示进行了因果推断启发的观察性分析。为此,我们基于倾向调整的关联分析以及提示编辑的多种互补表示,识别出一致的任务条件编辑模式。我们发现,增加复杂性和元指令的编辑与数学和多跳推理性能呈负相关,而逐步和元认知的编辑则改善了逻辑和顺序推理任务。这些效应在认知负荷标注、表面文本特征和编辑主题分析中均具有鲁棒性,并且可以跨优化框架泛化。总体而言,这些结果表明,提示优化失败源于编辑族与任务特性之间的系统性交互,而非随机的优化伪影,从而提供了优化器行为的特征级表征,并激励了未来任务条件优化器的设计。

英文摘要

Automated prompt optimization methods (e.g., DSPy, TextGrad) can substantially improve the performance of large language models (LLMs); however, their ability to generalize across tasks remains limited. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To this end, we combine propensity-adjusted associational analysis with multiple complementary representations of prompt edits to identify consistent task-conditioned edit patterns. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.

2605.26654 2026-05-27 cs.LG cs.AI math.OC stat.ML 版本更新

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

零和马尔可夫博弈鞍点上的双层优化

Zihao Zheng, Irwin King, Songtao Lu

发表机构 * Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong(香港中文大学先进工程学院) Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系)

AI总结 针对下层为零和马尔可夫博弈的双层优化问题,提出基于惩罚的Nikaido-Isoda下降-上升方法(PANDA),避免计算超梯度且无需二阶信息,在无凸性假设下收敛到平稳点,达到与单策略下层MDP双层RL相当的最优速率。

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

强化学习(RL)通常具有层次结构,其中上层(UL)学习器选择模型参数,下层(LL)决策过程做出响应,自然形成双层优化问题。大多数现有的双层RL方法假设下层为单策略马尔可夫决策过程(MDP),因此无法捕捉激励设计等应用中出现的竞争结构,其中多个策略相互交互。我们研究了下层问题为正则化极小极大零和马尔可夫博弈、上层目标通过下层博弈诱导的鞍点均衡进行优化的双层优化问题。在这项工作中,我们提出了惩罚增强的Nikaido-Isoda下降-上升(PANDA),一种基于Nikaido-Isoda函数的惩罚一阶策略梯度方法。通过利用极小极大博弈结构,PANDA避免了计算上层超梯度,且不需要二阶信息。我们证明了PANDA在无需对上层或下层目标做凸性假设的情况下收敛到平稳点。此外,PANDA在$\tilde{\mathcal{O}}(ε^{-1})$次迭代内达到$ε$-平稳点,样本复杂度为$\tilde{\mathcal{O}}(ε^{-3})$,与单策略下层MDP的双层RL的最佳已知速率相匹配。实验表明PANDA优于密切相关基线方法。

英文摘要

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.

2605.26647 2026-05-27 cs.LG cs.AI stat.ML 版本更新

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

更具表达力的前馈层:第一部分。激活的令牌自适应混合

Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong

发表机构 * Peking University(北京大学)

AI总结 提出令牌自适应激活混合(MoA)和可学习激活(LA)方法,通过轻量级输入相关门混合多个激活函数,在理论和实验上证明其比固定激活FFN具有更强的表达能力和更优的缩放行为。

Comments 31 pages

详情
AI中文摘要

前馈网络(FFN)层在基于Transformer的大语言模型(LLMs)中占据了大部分参数和非线性表达能力。尽管从ReLU和GELU发展到门控变体如SwiGLU,大多数FFN设计仍使用单一固定激活函数,对所有令牌应用相同的非线性变换。在这项工作中,我们提出了激活混合(MoA),一种令牌自适应的FFN设计,它使用轻量级输入相关门混合一个激活函数字典,同时共享相同的线性投影。作为输入无关的对应,我们还引入了可学习激活(LA),它为ReLU型和SwiGLU型FFN形成激活函数的线性组合。理论上,我们在固定激活FFN、LA和MoA之间建立了严格的有限宽度表达分离:LA严格包含固定激活FFN,而MoA严格包含LA,额外的表达能力来自于输入相关的非线性混合。实验上,我们通过在不同令牌预算、优化器和学习率调度下,对0.12B到2B参数的密集和MoE语言模型进行广泛的预训练实验来评估MoA。与调整良好的基线相比,MoA始终获得更低的最终损失,并表现出更有利的缩放行为,且参数和计算开销极小。这些结果表明,令牌自适应激活混合是提高LLMs中FFN表达能力的一种简单而有效的机制。

英文摘要

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
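
The token-adaptive mixing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the softmax gate, the choice of ReLU/GELU/SiLU as the activation dictionary, and the names `moa_ffn_token`, `w_in`, `w_out`, and `w_gate` are all assumptions made for illustration. The only properties taken from the abstract are that the gate depends on the input token and that all activations share the same linear projections.

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    # Common tanh approximation of GELU.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def silu(z):
    return z / (1.0 + math.exp(-z))

# Hypothetical activation dictionary; the paper's actual dictionary may differ.
ACTIVATIONS = (relu, gelu, silu)

def matvec(w, x):
    # w is a list of rows, one row per output dimension.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def moa_ffn_token(x, w_in, w_out, w_gate):
    """Apply a mixture-of-activations FFN to a single token vector x."""
    h = matvec(w_in, x)                  # shared up-projection
    gates = softmax(matvec(w_gate, x))   # input-dependent mixing weights (assumed softmax)
    # Each hidden unit gets a gate-weighted combination of the activations.
    mixed = [sum(g * f(hj) for g, f in zip(gates, ACTIVATIONS)) for hj in h]
    return matvec(w_out, mixed), gates   # shared down-projection
```

With `w_gate` fixed to zero the gates become uniform for every token, which gives an input-independent linear combination of activations, the flavour of the LA variant mentioned in the abstract.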

2605.26640 2026-05-27 eess.SY cs.LG cs.SY math.OC stat.ML 版本更新

Sample Complexity of Policy Gradient for Log-Growth Control

对数增长控制的策略梯度样本复杂度

Qiuhua Pan, Yukai Shen, Liwei Zhang, Cailian Chen, Xinping Guan

发表机构 * State Key Laboratory of Submarine Geoscience, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(海底地球科学国家重点实验室,自动化与智能感知学院,上海交通大学) Key Laboratory of System Control and Information Processing, Ministry of Education of China(系统控制与信息处理教育部重点实验室) Shanghai Key Laboratory of Perception and Control in Industrial Network Systems(上海工业网络系统感知与控制重点实验室) Paris Elite Institute of Technology, Shanghai Jiao Tong University(巴黎卓越工程师学院,上海交通大学)

AI总结 针对乘性噪声驱动标量线性系统的对数增长控制问题,利用奇点对称性消除梯度估计发散,证明了策略梯度的样本复杂度。

Comments 43 pages, 4 figures, 2 tables; includes supplementary material

详情
AI中文摘要

我们研究了策略梯度在对数增长控制中的样本复杂度——即从观测到的状态转移中学习一个反馈增益,该增益能够最优稳定一个通过乘性噪声驱动通道的标量线性系统。目标函数 $J(K) = \mathbb{E}[\log|1+BK|]$ 是闭环系统的顶部李雅普诺夫指数。该问题存在一个我们称为尖点障碍的结构性困难:最优增益 $K^*$ 总是将噪声奇点 $b_{\rm sing}(K) = -1/K$ 置于支撑集内部。在这个奇异最优处,策略梯度仅作为柯西主值存在,而非勒贝格积分,且自然的单样本梯度估计量具有无穷方差。因此,标准的一阶随机优化分析在最优处不适用,仅对目标函数进行平滑处理无法解决这一困难。然而,该障碍具有可利用的对称性:柯西核是关于移动极点位移的奇函数,因此将每个观测值与其关于极点的反射配对可以抵消发散部分。这一抵消同时控制了总体曲率、梯度估计量方差以及估计噪声密度时产生的偏差。结合这些界与一个闭式单转移梯度预言,我们证明:当噪声密度已知时,投影小批量策略梯度(初始化于稳定区域的任意紧子集内)的总样本复杂度为 $\tilde{O}(1/\eta)$;当噪声密度需估计时,对于 $C^s$ 噪声密度($s \geq 2$),样本复杂度为 $\tilde{O}(\eta^{-(2s+1)/(2s)})$。

英文摘要

We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/η)$ when the noise density is known and $\tilde{O}(η^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.
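
The odd symmetry mentioned in the abstract can be checked with a short calculation. This is only an illustrative sketch, assuming $K \neq 0$ and writing $b$ for a realization of the noise $B$ and $u$ for its displacement from the pole; it is not the estimator analyzed in the paper. The per-sample derivative is

$$\frac{\partial}{\partial K}\log|1+bK| = \frac{b}{1+bK} = \frac{1}{K}\left(1 - \frac{1}{1+bK}\right).$$

Writing $b = b_{\rm sing}(K) + u$ with $b_{\rm sing}(K) = -1/K$ gives $1 + bK = uK$, so the singular part is

$$\frac{1}{1+bK} = \frac{1}{Ku},$$

which is odd in $u$. Pairing $b$ with its reflection $b' = b_{\rm sing}(K) - u$ therefore gives

$$\frac{1}{1+bK} + \frac{1}{1+b'K} = \frac{1}{Ku} - \frac{1}{Ku} = 0,$$

so the divergent contributions cancel pairwise and only the bounded term $1/K$ survives in each pair.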

2605.26619 2026-05-27 cs.LG 版本更新

PIDM-DP: Physics-Informed Diffusion with Dormand-Prince Integration for Chaotic System Identification and State Reconstruction across Multiple Dynamical Regimes

PIDM-DP: 基于Dormand-Prince积分的物理信息扩散用于跨多种动力学机制的混沌系统辨识与状态重构

Shailendra Dabral

发表机构 * Indian Institute of Technology Indore(印度理工学院印多尔)

AI总结 提出PIDM-DP模型,将5阶Dormand-Prince ODE积分器嵌入扩散模型反向采样,通过物理残差反向传播约束轨迹满足控制方程,在稀疏噪声观测下实现混沌系统状态重构,显著优于无约束扩散和集合卡尔曼滤波。

Comments extended work of my journal paper submission

详情
AI中文摘要

从稀疏、含噪观测中重构混沌动力系统的连续状态轨迹仍然是非线性科学中的一个基本开放问题。我们提出了带有Dormand-Prince积分的物理信息扩散模型(PIDM-DP),该模型将一个完全可微的5阶Dormand-Prince(DP-RK45)ODE积分器直接嵌入去噪扩散概率模型(DDPM)的反向采样循环中。在每个去噪步骤中,通过自动微分反向传播物理残差,约束每个生成的轨迹以5阶精度满足系统的控制方程。一种线性调度的引导机制将物理权重从高噪声水平的零逐渐增加到接近干净数据极限的全值,防止了梯度爆炸,而朴素的物理信息方法在雅可比特征值阶数为$O(10^3)$的刚性系统上会因梯度爆炸而失败。在五个复杂度递增的基准系统(3D Lorenz、3D Rössler、5D超混沌、20D Lorenz-96以及刚性3D Rabinovich-Fabrikant)上,在10%观测密度和加性高斯噪声($σ=0.05$)条件下进行评估,PIDM-DP的重构RMSE比无约束扩散基线提高了高达$15.4$倍,并在集合协方差崩溃的刚性系统上显著优于集合卡尔曼滤波。在Rabinovich-Fabrikant分布外基准测试中,PIDM-DP的RMSE为$0.1097 \pm 0.0269$,而无约束扩散为$0.9443 \pm 0.5288$(差$8.6$倍),EnKF为$0.3561 \pm 0.3040$(差$3.2$倍),配对Wilcoxon检验($N = 30$)的$p<0.001$。通过Rosenstein Lyapunov估计器进行的拓扑验证表明,PIDM-DP保留了混沌不变测度。

英文摘要

Reconstructing continuous state trajectories of chaotic dynamical systems from sparse, noisy observations remains a fundamental open problem in nonlinear science. We introduce the Physics-Informed Diffusion Model with Dormand-Prince Integration (PIDM-DP), which embeds a fully differentiable 5th-order Dormand-Prince (DP-RK45) ODE integrator directly into the reverse sampling loop of a Denoising Diffusion Probabilistic Model (DDPM). At each denoising step, physics residuals are back-propagated via automatic differentiation, constraining every generated trajectory to satisfy the system's governing equations to 5th-order accuracy. A linear-scheduled guidance mechanism that ramps the physics weight from zero at high noise levels to its full value near the clean-data limit prevents the gradient explosions that cause naive physics-informed approaches to fail on stiff systems with Jacobian eigenvalues of order $O(10^3)$. Evaluated across five benchmark systems of increasing complexity (3D Lorenz, 3D Rössler, 5D Hyperchaotic, 20D Lorenz-96, and the stiff 3D Rabinovich-Fabrikant) at 10% observation density with additive Gaussian noise ($σ=0.05$), PIDM-DP achieves reconstruction RMSE improvements of up to $15.4\times$ over an unconstrained diffusion baseline and decisively outperforms the Ensemble Kalman Filter on stiff systems where ensemble covariance collapses. On the Rabinovich-Fabrikant out-of-distribution benchmark, PIDM-DP attains RMSE $0.1097 \pm 0.0269$ versus $0.9443 \pm 0.5288$ (unconstrained diffusion, $8.6\times$ worse) and $0.3561 \pm 0.3040$ (EnKF, $3.2\times$ worse), with $p<0.001$ in paired Wilcoxon tests ($N = 30$). Topological validation via the Rosenstein Lyapunov estimator confirms that PIDM-DP preserves the chaotic invariant measure.

2605.26606 2026-05-27 cs.LG cs.AI 版本更新

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

将你的展开用在关键处:基于组强化学习后训练的展开分配

Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu

发表机构 * Cornell University(康奈尔大学)

AI总结 提出 Pilot-Commit 框架,通过预算感知的展开分配策略,优先将计算资源分配给高信息量的提示,从而在组策略优化中减少采样成本并加速收敛。

详情
AI中文摘要

强化学习(RL)是后训练大型语言模型的主要范式。然而,在在线、在策略设置中,展开生成主导了训练的计算成本。基于组的策略优化方法对每个提示计算多个展开的优势,但它们不加区分地将预算分配给奖励分布崩溃的提示,将昂贵的展开浪费在可忽略的学习信号上。我们证明,基于组的更新在高奖励方差区域最为有效。由于策略在整个训练过程中演变,提示的信息量必须在线估计而非预先计算,但穷举评估每个提示在计算上不可行。我们引入了 Pilot-Commit,一个用于基于组 RL 后训练的预算感知展开分配框架。Pilot-Commit 将提示评估与利用解耦:一个试点阶段使用预算的一部分估计每个提示的信息量,然后将剩余的展开分配给高杠杆提示,同时跳过低信号提示。在多个数学推理基准和从 1.5B 到 14B 参数的模型规模上,Pilot-Commit 以显著更低的采样成本匹配基线准确率,在累积展开中达到目标准确率的速度比 GRPO 快高达 $1.9\times$,比 DAPO 快高达 $4.0\times$。

英文摘要

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.
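
The two-stage allocation described in the abstract can be sketched as follows. This is a rough illustration, not the paper's algorithm: using empirical reward variance as the informativeness score follows the abstract's remark about high-variance regimes, but the threshold `min_variance`, the even split of the remaining budget, and the names `pilot_commit` and `sample_reward` are assumptions.

```python
from statistics import pvariance

def pilot_commit(prompts, sample_reward, total_budget,
                 pilot_per_prompt=4, min_variance=1e-8):
    """Return (rewards, scores) after a pilot stage and a commit stage.

    sample_reward(prompt) is a hypothetical callable that generates one
    rollout for the prompt and returns its scalar reward.
    """
    # Pilot stage: spend a small, fixed number of rollouts on every prompt.
    rewards = {p: [sample_reward(p) for _ in range(pilot_per_prompt)] for p in prompts}
    spent = pilot_per_prompt * len(prompts)

    # Informativeness proxy (assumed): empirical variance of pilot rewards.
    scores = {p: pvariance(r) for p, r in rewards.items()}

    # Commit stage: skip prompts whose rewards have collapsed, and split the
    # remaining budget evenly across the rest (an assumed allocation rule).
    keep = [p for p in prompts if scores[p] > min_variance]
    remaining = max(total_budget - spent, 0)
    if keep:
        per_prompt = remaining // len(keep)
        for p in keep:
            rewards[p].extend(sample_reward(p) for _ in range(per_prompt))
    return rewards, scores
```

A prompt that the policy always solves or always fails has zero reward variance in the pilot stage, so it receives no further rollouts and the saved budget goes to prompts whose outcomes still vary.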

2605.26600 2026-05-27 cs.LG cs.AI 版本更新

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

几何感知对比学习用于少样本自动调制识别

Guanqun Zhao, Yitong Liu, Jiaxuan Fang, Yufei Mao, Hongwen Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出动态一致性对比学习框架,通过虚拟对抗增强和语义一致性损失解决自监督学习中的各向同性增强、频谱不稳定和语义漂移问题,在少样本设置下提升自动调制识别准确率。

详情
AI中文摘要

标准的自动调制识别自监督学习面临无效的各向同性增强、频谱不稳定性和语义漂移等挑战。为解决这些问题,我们提出了动态一致性对比学习,一种几何感知框架,将虚拟对抗增强与语义一致性损失相结合。我们提供的理论分析表明,该策略作为编码器的隐式频谱正则化器,能够实现稳定的流形探索。此外,我们的信号自适应Swin骨干网络采用固定窗口注意力,通过限制注意力局部性提高了结构稳定性,而混合知识融合模块则利用物理先验锚定表示。在RML基准上的实验表明,DyCo-CL在1-shot设置下相比先前方法获得了6.27%的准确率提升。

英文摘要

Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spectral instability, and semantic drift. To address these challenges, we propose Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework that couples Virtual Adversarial Augmentation (VAA) with a semantic consistency loss. We provide a theoretical analysis indicating that this strategy acts as an implicit spectral regularizer for the encoder, enabling stable manifold exploration. Complementing this, our Signal-Adaptive Swin Backbone with fixed-window attention improves structural stability by constraining attention locality, while a Hybrid Knowledge Fusion module anchors representations with physical priors. Experiments on RML benchmarks show that DyCo-CL achieves a 6.27% accuracy gain in 1-shot settings over prior methods.

2605.26589 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

分布漂移下儿童贫血预测的表格机器学习与基础模型的少样本跨国家泛化

Yusuf Brima, Marcellin Atemkeng, Lansana Hassim Kallon, David Niyukuri, Antoine Vacavant, Samuel Saidu, Ding-Geng Chen

发表机构 * Department of Mathematics, Rhodes University, South Africa(数学系,罗德斯大学,南非) National Institute for Theoretical and Computational Sciences (NITheCS), Stellenbosch, 7600, South Africa(理论与计算科学国家研究所(NITheCS),斯泰伦博斯,7600,南非) Interdisciplinary Research Program in Public Health, University of Burundi, Burundi(公共卫生跨学科研究计划,布隆迪大学,布隆迪) Universite Clermont Auvergne, Clermont Auvergne INP, CNRS, Institut Pascal, Clermont–Ferrand, France(克莱蒙奥弗涅大学,克莱蒙奥弗涅INP,CNRS,帕斯卡研究所,克莱蒙费朗,法国) Department of International Public Health, Liverpool School of Tropical Medicine, Liverpool, UK(国际公共卫生系,利物浦热带医学院,利物浦,英国) College of Health Solutions, Arizona State University, Phoenix, USA(健康解决方案学院,亚利桑那州立大学,凤凰城,美国) Department of Statistics, University of Pretoria, Pretoria, South Africa(统计系,比勒陀利亚大学,比勒陀利亚,南非)

AI总结 本研究评估了基于Transformer的表格基础模型TabPFN在跨国家、数据稀缺环境下预测儿童贫血的性能,发现其优于经典监督方法,尤其在低数据场景下表现出更好的区分度和校准能力。

详情
AI中文摘要

儿童贫血影响全球约40%的6-59个月儿童,且由异质性因素引起,限制了模型的泛化能力。我们在跨国家和数据稀缺环境下,评估了基于Transformer的表格基础模型与经典监督方法。我们使用了来自非洲、亚洲、拉丁美洲、高加索和中东16个国家的DHS数据(n=68,856)。比较了逻辑回归、XGBoost、LightGBM和TabPFN v2.6。性能通过AUC-ROC、Brier评分和ECE评估。泛化性通过留一国家法(LOCO)、反向LOCO和少样本设置评估。亚组分析包括性别、年龄、居住地、母亲教育和财富。特征重要性通过SHAP估计。TabPFN在低数据场景(<200样本)中优于经典模型,显示出更高的区分度和更好的校准。在各国中,它实现了最低的Brier评分(0.042)和ECE(0.203)。在全数据设置下,AUC-ROC范围为0.59-0.76,模型间差异较小(≤0.05)。LOCO性能稳定(0.58-0.69),受国家背景驱动。反向LOCO显示出不对称的可转移性。亚组性能一致,无系统性人口统计偏差。SHAP识别出儿童年龄、海拔和年龄别身高Z分数为主要预测因子,其次是财富和母亲教育。儿童贫血预测的性能更多由人群变异驱动而非模型选择。TabPFN在低资源环境中通过改进的区分度和校准提供了优势,突显了基础模型作为数据稀缺全球健康预测的有前景工具。

英文摘要

Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizability. We evaluate a transformer-based tabular foundation model against classical supervised methods under cross-country and data-scarce settings. We used DHS data from 16 countries across Africa, Asia, Latin America, the Caucasus, and the Middle East (n=68,856). We compared Logistic Regression, XGBoost, LightGBM, and TabPFN v2.6. Performance was assessed using AUC-ROC, Brier score, and ECE. Generalization was evaluated using leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. Subgroup analyses included sex, age, residence, maternal education, and wealth. Feature importance was estimated using SHAP. TabPFN outperformed classical models in low-data regimes (<200 samples), showing higher discrimination and better calibration. Across countries, it achieved the lowest Brier score (0.042) and ECE (0.203). Under full-data settings, AUC-ROC ranged from 0.59-0.76 with small between-model differences ($\leq 0.05$). LOCO performance was stable (0.58-0.69), driven by country context. Reverse-LOCO showed asymmetric transferability. Subgroup performance was consistent with no systematic demographic bias. SHAP identified child age, altitude, and height-for-age z-score as dominant predictors, followed by wealth and maternal education. Performance in childhood anemia prediction is driven more by population variation than model choice. TabPFN provides advantages in low-resource settings through improved discrimination and calibration, highlighting foundation models as promising tools for data-scarce global health prediction.
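
For readers unfamiliar with the two calibration metrics named in the abstract, the sketch below shows one common way to compute them for binary predictions. It is illustrative only: the equal-width binning, the default of 10 bins, and the function names are assumptions, and the paper may use a different ECE variant.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_prob - pos_rate)
    return ece
```

Both metrics are 0 for perfectly confident, perfectly correct predictions, and lower is better in each case.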

2605.26585 2026-05-27 cs.LG 版本更新

Near-Optimal Regret in Adversarial Kernel Bandits

对抗性核赌博中的近最优遗憾

Yu-Jie Zhang, Hao Qiu, Jonathan Scarlett, Kevin Jamieson

发表机构 * University of Washington(华盛顿大学) National University of Singapore(新加坡国立大学)

AI总结 针对对抗性核赌博问题,提出基于正则化重要性加权损失估计的指数权重算法,通过显式修正项消除偏差,实现与随机核赌博已知最优率匹配的遗憾界。

详情
AI中文摘要

我们研究对抗性核赌博问题,其中每轮的损失由再生核希尔伯特空间(RKHS)中的任意有界元素诱导。我们提出了一种基于正则化重要性加权损失估计的指数权重算法,并带有一个显式修正项,用于抵消正则化引入的偏差。我们的主要结果将遗憾界限制为 $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$,其中 $d_*(λ)$ 是广泛采用的有效维度概念,用于捕捉核的复杂度。忽略对数因子,这匹配了相关随机核赌博问题中已知的速率。一个显著的应用是 $\mathbb{R}^d$ 上具有平滑参数 $ν$ 的 Matérn$(ν,d)$ 核,此时我们的界特化为 $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$,改进了 Chatterji 等人 [2019] 先前已知的最佳速率,同时去除了他们分析所需的秩一对手假设。此外,该速率与随机核赌博的已知最优速率相同,并且与并发工作中的下界仅相差一个 $\log T$ 因子。

英文摘要

We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regularized importance-weighted loss estimator, together with an explicit correction term that cancels the bias introduced by the regularization. Our main result bounds the regret by $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$, where $d_*(λ)$ is a widely-adopted notion of effective dimension that captures the complexity of the kernel. Up to logarithmic factors, this matches the known rate achieved in the related stochastic kernel bandit problem. A notable application is the Matérn$(ν,d)$ kernel with smoothness parameter $ν$ on $\mathbb{R}^d$, for which our bound specializes to $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$, improving over the best-known prior rate of Chatterji et al. [2019] while simultaneously removing the rank-one adversary assumption required by their analysis. Moreover, this rate is the same as the known optimal rate for stochastic kernel bandits, and also matches a lower bound from concurrent work up to a $\log T$ factor.

2605.26582 2026-05-27 cs.LG cs.AI 版本更新

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

离散扩散中随机性的纠错效应

William Yuan, Sungwon Jeong, Amirali Aghazadeh

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文系统研究离散扩散模型中马尔可夫转移随机性程度对采样效率与质量的权衡,提出离散搅动与重启采样(DCRS)算法,通过交替正向和反向扩散过程注入受控随机性,在低函数评估次数下改善速度-质量权衡。

详情
AI中文摘要

离散扩散模型在文本和图像生成中取得了强劲性能,但其推理仍然缓慢,且必须内在平衡采样效率与样本质量。在这项工作中,我们系统研究了马尔可夫转移中随机性程度如何主导采样权衡。我们表明,高度确定性的转移收敛迅速但遭受误差累积,而更随机的转移收敛更慢但能达到更高的最终样本质量。通过信息论分析,我们识别出潜在机制为一种由对称地在状态间交换质量的冗余转移诱导的纠错效应,并表明这些转移可证明地收缩采样误差。受此分析启发,我们提出离散搅动与重启采样(DCRS),一种新颖的推理算法,通过交替正向和反向扩散过程注入受控随机性。在合成数据集和大规模基准上的实验表明,DCRS在低函数评估次数下改善了速度-质量权衡。在图像数据集上,与标准采样器相比,DCRS在保持竞争性样本质量的同时,实现了高达10倍的采样步数减少;而在语言基准上,我们观察到更细微的行为,取决于损坏过程和采样程序。

英文摘要

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the \emph{degree of stochasticity} in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by \emph{redundant transitions} that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose \emph{Discrete Churn and Restart Sampling} (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the low number of function evaluations regime. On image datasets, DCRS achieves up to a $10\times$ reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

2605.26579 2026-05-27 cs.LG 版本更新

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

Focal Reward: 基于评分标准的强化学习中的平衡奖励

Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团)

AI总结 针对大语言模型在基于多维评分标准的强化学习中奖励失衡的问题,提出Focal Reward方法,通过逆奖励投影机制估计各维度饱和程度并自动重加权,实现细粒度平衡,在18个模型-基准对比中均优于最强静态聚合基线。

Comments Preprint

详情
AI中文摘要

大语言模型中的开放式生成通常需要多维评分标准来充分评估质量并指导强化学习的改进。然而,这种训练范式固有的一个关键困境是不同评分标准维度上的奖励极化不平衡。在此瓶颈下,即使大语言模型在训练后获得相对较高的奖励,它们仍可能在某些维度上表现出严重缺陷,直接导致用户体验下降。为了解决这个问题,我们提出了Focal Reward,一种新颖的目标函数,用于自动平衡基于评分标准的强化学习训练。具体来说,我们首先利用逆奖励投影机制来估计评分标准中每个准则的饱和程度,这构成了校准奖励方向的基础。然后,最终目标函数为每个准则设计了一个自动重新加权的系数,以实现细粒度平衡。跨三个模型规模和六个基准的大量实验表明,我们的Focal Reward方法在所有18个模型-基准比较中均优于最强的静态聚合基线。展开、机制和消融分析进一步表明,这些增益来自于向仍有改进空间的评分标准进行在线、饱和感知的重新分配。

英文摘要

The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubric dimensions. Under this bottleneck, even if LLMs achieve relatively high rewards after training, they may still exhibit severe deficiencies in certain dimensions, leading to a direct deterioration in user experience. To address this problem, we propose Focal Reward, a novel objective to automatically balance the training of reinforcement learning under rubric-based rewards. Specifically, we first leverage an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis to calibrate the reward direction. Then, the final objective is designed with an automatically reweighting coefficient for each criterion to achieve the fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that our Focal Reward method outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.

2605.26577 2026-05-27 eess.SY cs.AI cs.LG cs.SY math.OC 版本更新

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

桥接控制与神经网络验证器 alpha-beta-CROWN:教程

Haoyu Li, Xiangru Zhong, Hao Cheng, Bin Hu, Huan Zhang

发表机构 * Department of Computer Science(计算机科学系) Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本教程提出一个统一框架,通过将控制问题与神经网络验证器 α,β-CROWN 桥接,实现控制器属性的可扩展形式验证。

Comments ACC 2026 Tutorial

详情
AI中文摘要

基于学习的控制器合成方法因其高表达力和强经验性能而受到欢迎。然而,在自动驾驶、机器人技术和电力系统等安全关键场景中,仅凭经验性能是不够的,对控制器的稳定性、安全性等属性进行形式验证是非常可取的。不幸的是,许多先前的验证方法要么依赖于系统或证书的特定结构假设,难以在不同设置间迁移,要么在高维神经网络系统上可扩展性差。在本教程中,我们提出了一个统一框架,旨在通过将控制与最先进的神经网络验证器 $α,\!β$-CROWN(alpha-beta-CROWN)桥接来弥合这一差距。其核心是,$α,\!β$-CROWN 是一个通用的边界引擎,用于表示为计算图的非线性函数:给定一个输入域,它可以产生认证边界和非线性函数的显式线性松弛。这些认证边界本身对于可达性分析等任务很有用,并且它们为执行可满足性检查和优化的更复杂例程提供了基础。更具体地说,许多控制问题归结为验证状态域上的实值不等式(例如,李雅普诺夫理论)。因此,$α,\!β$-CROWN 通过计算紧边界并基于边界递归划分和剪枝子域,实现了这些条件的可扩展验证。得益于 GPU 并行化,该流程在对传统方法具有挑战性的验证和优化问题上展示了卓越的可扩展性。在本教程中,我们讨论了 $α,\!β$-CROWN 的基础知识,并介绍了其在各种控制相关任务中的应用。

英文摘要

Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performance. However, in safety-critical scenarios such as autonomous driving, robotics, and power systems, empirical performance alone is insufficient, and formal verification of controller properties such as stability and safety is highly desirable. Unfortunately, many prior verification approaches are either tied to specific structural assumptions on the system or the certificate, making them difficult to transfer across settings, or suffer from poor scalability on higher-dimensional neural network systems. In this tutorial, we present a unified framework that aims to mitigate this gap via bridging control with the state-of-the-art neural network verifier $α,\!β$-CROWN (alpha-beta-CROWN). At its core, $α,\!β$-CROWN is a general-purpose bounding engine for nonlinear functions represented as computation graphs: given an input domain, it can produce certified bounds and explicit linear relaxation of the nonlinear function. These certified bounds are useful on their own for tasks such as reachability analysis, and they also provide the foundation for more complex routines that perform satisfiability checking and optimization. More specifically, many control problems reduce to verifying real-valued inequalities over a state domain (e.g., Lyapunov theory). Consequently, $α,\!β$-CROWN enables scalable verification of such conditions by computing tight bounds and recursively partitioning and pruning subdomains based on the bounds. Thanks to GPU parallelization, this pipeline demonstrates superior scalability on verification and optimization problems that are challenging for traditional approaches. In this tutorial, we discuss the basics of $α,\!β$-CROWN and introduce its application to various control-related tasks.

2605.26576 2026-05-27 cs.CV cs.LG 版本更新

TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

TrackRef3D: 面向开放世界3D高斯泼溅分割的多视角一致跟踪-标注方法

Yuyang Tan, Renhe Zhang, Hang Zhang, Ao Li, Xin Tan

发表机构 * East China Normal University, Shanghai, China(华东师范大学,上海,中国) Shanghai AI Laboratory(上海人工智能实验室) University of Electronic Science and Technology of China, Chengdu, China(电子科技大学,成都,中国)

AI总结 提出TrackRef3D全自动流水线,通过多视角一致跟踪-标注范式解耦目标发现与语义定位,无需人工标注实现开放世界3D高斯泼溅分割。

详情
AI中文摘要

引用3D高斯泼溅(R3DGS)利用自然语言进行3D目标分割,已成为具身AI的关键能力。然而,现有方法通常依赖昂贵的每场景人工标注和每视图伪掩码生成,存在多视角不一致以及对不同查询特异性的泛化能力差的问题。为此,我们提出TrackRef3D,一种全自动流水线,通过引入多视角一致的跟踪-标注范式,从根本上将目标发现与语义定位解耦,无需人工标注即可实现3D高斯泼溅(3DGS)中的开放世界引用分割。具体而言,我们提出轨迹感知语义共识模块(TSCM),通过同义词聚类和轨迹感知投票聚合跨视图预测,建立规范语义身份,从而确保多视角一致性。此外,我们采用可见性感知描述生成策略以缓解歧义,并提出混合训练策略(HTS),利用多正例对比目标联合优化粗粒度类别语义和细粒度引用线索,确保在不同查询特异性下的鲁棒性。在基准上的大量实验表明,TrackRef3D达到了最先进的性能。

英文摘要

Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.

2605.26571 2026-05-27 cs.LG 版本更新

Separate Aggregation of Split Network for Personalized Federated Learning

分离网络的分组聚合用于个性化联邦学习

Yunseok Kang, Jaeyoung Song

发表机构 * Department of Electronics Engineering, Pusan National University(釜山国立大学电子工程系)

AI总结 提出PGFedSplit框架,采用分离架构和自适应聚合调度,结合本地与服务器生成的表示,解决客户端数据异构下的个性化与全局泛化权衡问题。

详情
AI中文摘要

联邦学习能够在不共享原始数据的情况下进行协作模型训练,但在客户端数据分布异构时性能会大幅下降。单一的全局模型往往无法满足不同客户端的需求,因此个性化联邦学习被探索用于在保持全局泛化的同时提升客户端特定性能。现有的PFL方法通常面临一个基本权衡:更强的全局共享可能削弱本地专业化,而更强的本地适应则可能导致在数据有限、标签不平衡和缺失类别场景下的过拟合。在这项工作中,我们提出了PGFedSplit,一个在严重客户端异构下同时提升个性化和全局泛化的个性化联邦学习框架。PGFedSplit采用分离架构,并根据不同模型组件的角色执行自适应聚合调度,在保持客户端特定适应的同时实现稳定的知识共享。每个客户端进一步利用本地提取的表示和从服务器端高斯统计生成的合成表示的混合,提升了在标签不平衡和缺失类别条件下的鲁棒性。在Fashion MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet上的大量实验表明,与最先进的PFL方法相比,PGFedSplit在高度异构设置下实现了持续改进,具有稳定的收敛和优越的个性化性能。

英文摘要

Federated learning enables collaborative model training without sharing raw data, but its performance can degrade substantially under heterogeneous client data distributions. A single global model often cannot satisfy diverse client requirements, so personalized federated learning (PFL) has been explored to improve client-specific performance while preserving global generalization. Existing PFL methods often face a fundamental tradeoff in which stronger global sharing can undermine local specialization, whereas stronger local adaptation can lead to overfitting under limited data, label imbalance, and missing-class scenarios. In this work, we propose PGFedSplit, a personalized federated learning framework that improves both personalization and global generalization under severe client heterogeneity. PGFedSplit adopts a split architecture and performs adaptive aggregation scheduling tailored to the roles of different model components, enabling stable knowledge sharing while maintaining client-specific adaptation. Each client further leverages a mixture of locally extracted representations and synthetic representations generated from server-side Gaussian statistics, improving robustness under label imbalance and missing-class conditions. Extensive experiments on Fashion-MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate consistent improvements over state-of-the-art PFL methods, with stable convergence and superior personalization in highly heterogeneous settings.

2605.26569 2026-05-27 cs.LG 版本更新

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

分布感知共形预测:一种为时间序列生成高效预测区间的框架

Daniel Schweizer, Peter Kuhn, Jayant Sharma, Shivali Dubey, Malte von Ramin, Christoph Brockt-Haßauer

发表机构 * Fraunhofer Institute for Highspeed Dynamics, Ernst-Mach-Institut, EMI Freiburg(弗劳恩霍夫高速动力研究所,恩斯特-马赫研究所,EMI弗赖堡)

AI总结 提出分布感知共形预测(DCP)框架,通过集成概率预测器与分数无关的共形校准,为时间序列生成有效且高效的预测区间。

Comments submitted to Journal of Machine Learning Research (JMLR)

详情
AI中文摘要

我们提出了分布感知共形预测(DCP),这是一个统一框架,将蒙特卡洛dropout、深度集成和分位数回归等概率预测器与分数无关的共形校准相结合,以生成有效且高效的预测区间。利用数值反演方法构建区间边界,DCP能够适应任意组合的分布生成预测器和非一致性分数。对合成和真实时间序列数据的基准分析表明,DCP能够在不同的不确定性机制下自适应地校准预测区间。关键的是,DCP的模块化设计便于对不同预测器-分数配对进行即插即用实验,并通过新引入的修正Winkler分数进行定量支持,该分数通过显式惩罚欠覆盖来平衡有效性和效率。虽然DCP推广并扩展了现有方法(如共形分位数回归和共形蒙特卡洛),但其模块化设计允许进一步扩展,为在动态环境和高风险应用中推进不确定性量化奠定了基础。

英文摘要

We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to produce valid and efficient prediction intervals. Leveraging a numerical inversion approach to construct interval bounds, DCP accommodates arbitrary combinations of distribution-generating predictors and nonconformity scores. Benchmark analysis on synthetic and real-world time series data demonstrates DCP's ability to adaptively calibrate prediction intervals under varying uncertainty regimes. Crucially, DCP's modular design facilitates plug-and-play experimentation with different predictor-score pairings, quantitatively supported by a newly introduced modified Winkler score that balances validity and efficiency by explicitly penalizing undercoverage. While DCP generalizes and extends existing approaches like Conformalized Quantile Regression and Conformalized Monte Carlo, its modular design allows further extensions, setting a foundation for advancing uncertainty quantification in dynamic environments and high-risk applications.
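For orientation, the sketch below shows Conformalized Quantile Regression, one of the existing approaches the abstract says DCP generalizes, together with the standard Winkler interval score. It is a minimal illustration, not DCP itself: the abstract does not define the modified Winkler score, so the unmodified textbook version is used here, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cqr_offset(
    cal_lo: Sequence[float],
    cal_hi: Sequence[float],
    cal_y: Sequence[float],
    alpha: float,
) -> float:
    """Conformal correction from a held-out calibration set."""
    # Nonconformity score: how far y falls outside [lo, hi] (negative if inside).
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample quantile rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def conformalize(lo: float, hi: float, offset: float) -> Tuple[float, float]:
    """Widen (or shrink) a predicted quantile interval by the offset."""
    return (lo - offset, hi + offset)


def winkler_score(lo: float, hi: float, y: float, alpha: float) -> float:
    """Standard Winkler score: interval width plus a penalty for misses."""
    width = hi - lo
    if y < lo:
        return width + (2.0 / alpha) * (lo - y)
    if y > hi:
        return width + (2.0 / alpha) * (y - hi)
    return width


offset = cqr_offset([0.0] * 9, [2.0] * 9, [1, 1, 1, 1, 2.5, 2.5, 3, 3, 4], alpha=0.5)
print(conformalize(0.0, 2.0, offset))
```

Lower Winkler scores are better: a narrow interval is rewarded only as long as it still covers the observed value, which is the validity versus efficiency balance the abstract refers to.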

2605.26562 2026-05-27 cs.LG 版本更新

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

超越整体模型:深度多变量时间序列预测的系统性组件级基准测试

Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

发表机构 * Shanghai University of Finance and Economics(上海财经大学) Key Laboratory of Interdisciplinary Research of Computation and Economics(计算与经济交叉学科重点实验室)

AI总结 提出TSCOMP基准,通过正交实验分解深度预测方法的核心组件,揭示其有效性并构建性能语料库,实现零样本模型构建,优于手工复杂架构。

Comments accepted by KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

虽然先前在多变量时间序列预测中的研究集中于开发复杂的整体模型,但本工作倡导转向对其影响的细粒度、组件级理解。我们提出TSCOMP,这是第一个大规模基准,系统地将深度预测方法分解为其核心、细粒度的组件——涵盖序列预处理、编码策略、包括特定和大规模时间序列模型的网络架构以及优化方法。通过使用约束正交实验设计和广泛评估,我们进行多视角分析,揭示组件在不同骨干网络、数据特征及其交互中的有效性。除了提供见解外,该基准建立了一个包含超过20,000个模型-数据集评估的细粒度性能语料库,支持自动组件选择的学习,从而在新数据集上实现零样本模型构建。我们的实验表明,尽管简单,但基于语料库的方法始终优于最先进的方法,验证了我们评估设计的合理性,并确认系统性组件选择超越了手工设计的复杂架构。所有代码和性能语料库均可在 https://github.com/SUFE-AILAB/TSCOMP 公开获取。

英文摘要

While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component-level understanding of their impacts. We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep forecasting methods into their core, fine-grained components--spanning series preprocessing, encoding strategies, network architectures including specific and large time-series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi-view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine-grained performance corpus comprising over 20,000 model-dataset evaluations, which supports the learning of automated component selection, enabling zero-shot model construction on new datasets. Our experiments demonstrate that the corpus-driven approach, despite its simplicity, consistently outperforms state-of-the-art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at https://github.com/SUFE-AILAB/TSCOMP.

2605.26559 2026-05-27 cs.LG cs.AI econ.EM 版本更新

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

审计与修复离散选择中表格基础模型的经济有效性

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, Zexin Zhuang

发表机构 * University of California, Berkeley, CA, USA(加州大学伯克利分校) Duke University, Durham, NC, USA(杜克大学) Northeastern University, Boston, MA, USA(东北大学) University of Illinois Urbana-Champaign, IL, USA(伊利诺伊大学厄巴纳-香槟分校) Southern Methodist University, Dallas, TX, USA(南卫理公会大学)

AI总结 提出两阶段适配器,将表格基础模型预测嵌入效用最大化框架,在保证经济一致性的同时提升选择预测精度。

Comments 5 pages, 1 table. Accepted at the FMSD Workshop, ICML 2026

详情
AI中文摘要

表格基础模型在选择预测任务上取得了很高的准确率,但其预测常常违反这些任务所需的经济逻辑:提高价格有时会增加预测需求,隐含的支付意愿估计经常为负或不合理。我们提出了一种两阶段适配器,将基础模型预测嵌入效用最大化框架。在第一阶段,我们估计一个标准选择模型,其参数受经济理论约束。在第二阶段,我们冻结这些参数,并训练一个校正项,将基础模型的预测作为附加信息纳入。结果模型继承了基础模型的精度提升,同时保证了政策扰动下价格-需求的单调关系,并产生可解析计算的权衡指标。在两个交通数据集上,适配器在保持完美经济一致性的同时,相比标准logit模型恢复了高达13个百分点的准确率,这是原始基础模型或传统蒸馏都无法实现的。

英文摘要

Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model's predictions as additional information. The result is a model that inherits the foundation model's accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.
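One way to read the two-stage idea is as a multinomial logit whose price coefficient is frozen at a negative value, plus a correction term built from the foundation model's output. The sketch below is an assumption-laden illustration, not the paper's adapter: it assumes the foundation-model score `fm_scores` is held fixed when prices are perturbed, and all names and coefficient values are hypothetical.

```python
import math
from typing import List, Sequence


def choice_probs(
    prices: Sequence[float],
    times: Sequence[float],
    fm_scores: Sequence[float],
    beta_price: float,
    beta_time: float,
    gamma: float,
) -> List[float]:
    """Logit choice probabilities with a frozen, sign-constrained price effect."""
    if beta_price >= 0.0:
        raise ValueError("beta_price must be negative to respect economic theory")
    # Stage-1 utility (frozen) plus stage-2 correction from the foundation model.
    utils = [
        beta_price * p + beta_time * t + gamma * s
        for p, t, s in zip(prices, times, fm_scores)
    ]
    top = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


def value_of_time(beta_time: float, beta_price: float) -> float:
    """Money a traveler would pay to save one unit of time (ratio of coefficients)."""
    return beta_time / beta_price


base = choice_probs([2.0, 3.0], [10.0, 5.0], [-0.7, -0.7], -0.5, -0.1, 1.0)
raised = choice_probs([2.5, 3.0], [10.0, 5.0], [-0.7, -0.7], -0.5, -0.1, 1.0)
print(base[0], raised[0])
```

Because the price enters only through the frozen negative coefficient, raising an alternative's price can only lower its predicted share in this sketch, and the trade-off measure is available in closed form as a ratio of coefficients.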

2605.26554 2026-05-27 cs.LG cs.AI 版本更新

Linear and Neural Dueling Bandits with Delayed Feedback

线性与神经延迟反馈的对抗性赌博机

Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) The Chinese University of Hong Kong(香港中文大学)

AI总结 针对随机延迟反馈下的上下文对抗性赌博机问题,提出线性(LDB-DF)和神经(NDB-DF)两种算法,通过将逆概率加权(IPW)机制直接融入损失函数实现无偏校正,并给出线性设置下O(d*sqrt(T))的遗憾界和神经设置下的次线性保证。

详情
AI中文摘要

上下文对抗性赌博机构成了基于偏好的决策制定的基石,在推荐系统和大语言模型对齐中有关键应用。然而,标准算法依赖于即时反馈的理想化假设,这一条件在现实场景(如提示优化)中经常被违反。这种设置带来了独特的理论挑战:与线性赌博机不同,对抗性赌博机估计量缺乏闭式解,使得标准加权技术的朴素适应产生偏差。为解决这一问题,我们形式化了具有随机延迟反馈的上下文对抗性赌博机问题,并提出了两种新颖算法:线性延迟反馈对抗性赌博机(LDB-DF)和神经延迟反馈对抗性赌博机(NDB-DF)。我们方法的核心是一种新颖的估计量,它将逆概率加权(IPW)机制直接集成到损失函数中,确保对延迟或缺失反馈的无偏校正。我们提供了全面的理论分析,为线性设置建立了O(d*sqrt(T))的遗憾界,并为神经设置建立了次线性保证。在模拟和真实数据集上的大量实验证明了我们提出方法的有效性。

英文摘要

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed methods.
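The general idea of putting inverse probability weights inside a preference loss can be sketched as follows. This is a generic illustration under assumptions, not the LDB-DF or NDB-DF estimator: it uses a linear logistic preference model, assumes the probability that feedback has arrived is known for each duel, and uses hypothetical names throughout.

```python
import math
from typing import Sequence, Tuple

# (feature difference between the two arms, 1 if the first arm won else 0,
#  whether feedback has arrived, probability that feedback arrives in time)
Sample = Tuple[Sequence[float], int, bool, float]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic_loss(theta: Sequence[float], z: Sequence[float], y: int) -> float:
    """Negative log-likelihood of one duel outcome under a logistic link."""
    score = sum(t * zi for t, zi in zip(theta, z))
    return softplus(-score) if y == 1 else softplus(score)


def ipw_loss(theta: Sequence[float], samples: Sequence[Sample]) -> float:
    """Loss over all duels, reweighting the observed ones by 1 / P(observed)."""
    total = 0.0
    for z, y, observed, prob in samples:
        if observed:
            total += logistic_loss(theta, z, y) / prob
        # Duels whose feedback is still missing contribute nothing.
    return total


duels = [([1.0], 1, True, 0.5), ([1.0], 0, False, 0.5)]
print(ipw_loss([0.0], duels))
```

In expectation over the arrival indicator, each duel contributes `prob * (loss / prob) = loss`, which is the sense in which the weighting corrects for delayed or missing feedback.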

2605.26548 2026-05-27 cs.CR cs.LG 版本更新

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro:语言模型能解决长周期软件安全任务吗?

Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang, Chunqiu Steven Xia, Lingming Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出SEC-bench Pro基准,通过三阶段流程构建包含V8和SpiderMonkey共183个已验证漏洞的测试集,评估前沿语言模型在长周期漏洞狩猎任务中的表现,发现最高成功率仅48.8%。

详情
AI中文摘要

大型语言模型(LLM)现已支持自动化软件安全任务,包括漏洞发现和概念验证(PoC)生成。现有基准因依赖模糊测试工具、目标特定描述或漏洞复现任务,未能真实评估LLM在实际漏洞狩猎场景中的表现。我们提出SEC-bench Pro,一个用于衡量智能体在关键、高复杂度软件系统上进行漏洞狩猎的基准。本工作通过一个三阶段流程(漏洞收集、环境重建和基于oracle的验证)披露带有具体PoC输入的报告,并将修复链接到可复现任务中。我们用V8和SpiderMonkey上的183个已验证漏洞实例化SEC-bench Pro,其中包括一个V8子集,其累计Google漏洞奖励计划奖金超过150万美元。这些实例涵盖浏览器级和运行时级执行条件下的内存安全、沙箱、JIT和竞态条件漏洞。我们的评估表明,使用前沿模型的编码智能体在两个引擎上的成功率均低于40%。开放权重的Kimi-K2.6基线在V8上达到11.7%,而最强前沿配置在V8上达到32.0%,在SpiderMonkey上达到38.8%。ClaudeCode和Codex解决了互补的实例集,它们的双智能体联合在V8上达到37.9%,在SpiderMonkey上达到48.8%。SEC-bench Pro为评估基于LLM的安全智能体提供了稳健的环境,并揭示了长周期漏洞狩猎任务中的局限性。

英文摘要

Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific descriptions, or vulnerability-reproduction tasks. We present SEC-bench Pro, a benchmark for measuring agent bug hunting on critical, high-complexity software systems. This work discloses reports with concrete PoC inputs and links fixes into reproducible tasks through a three-phase pipeline for vulnerability collection, environment reconstruction, and oracle-based validation. We instantiate SEC-bench Pro with 183 validated vulnerabilities across V8 and SpiderMonkey, including a V8 subset with more than $1.5 million in cumulative Google Vulnerability Reward Program awards. These instances span memory-safety, sandbox, JIT, and race-condition bugs under browser-grade and runtime-grade execution conditions. Our evaluation shows that coding agents with frontier models remain below 40% success on both evaluated engines. The open-weight Kimi-K2.6 baseline reaches 11.7% on V8, while the strongest frontier configuration reaches 32.0% on V8 and 38.8% on SpiderMonkey. ClaudeCode and Codex solve complementary instance sets, and their two-agent union reaches 37.9% on V8 and 48.8% on SpiderMonkey. SEC-bench Pro provides robust environments for assessing LLM-based security agents and exposes limitations in long-horizon bug hunting tasks.

2605.26543 2026-05-27 cs.AI cs.LG 版本更新

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

PolyFusionAgent: 用于聚合物性能预测和逆向设计的多模态基础模型与自主AI助手

Manpreet Kaur, Xingying Zhang, Qian Liu

发表机构 * Department of Applied Computer Science, The University of Winnipeg(应用计算机科学系,温尼伯大学) Department of Mechanical Engineering, University of Manitoba(机械工程系,曼尼托巴大学)

AI总结 提出PolyFusionAgent框架,结合多模态聚合物基础模型PolyFusion和工具增强的设计代理PolyAgent,通过对齐序列、拓扑、3D几何和指纹等多模态视图学习共享潜在空间,实现热物理性能预测和化学有效、结构新颖的聚合物逆向设计,并利用文献证据检索闭环设计流程。

Comments 23 pages, 5 figures, 2 tables; Supplementary material included

详情
AI中文摘要

聚合物的发现对于从能量存储到生物医学等领域至关重要,但受到天文数字般的化学设计空间以及结构、性能和先验知识的碎片化表示的阻碍。这种碎片化使得许多AI模型与物理和实验现实脱节,限制了它们支持直接可操作设计决策的能力。在这里,我们介绍PolyFusionAgent,一个交互式框架,将多模态聚合物基础模型(PolyFusion)与工具增强、基于文献的设计代理(PolyAgent)相结合。PolyFusion对齐互补的聚合物视图,包括序列、拓扑、3D几何和指纹,跨越数百万种聚合物,学习一个跨化学和数据体系可迁移的共享潜在空间,改进了热物理性能预测,并实现了超出参考设计空间的化学有效、结构新颖聚合物的性能条件生成。PolyAgent通过将预测和逆向设计与从聚合物文献中检索证据联系起来,在一个工作流中提出、评估和情境化假设,从而闭合设计循环。PolyFusionAgent共同实现了交互式、证据关联的聚合物发现,结合了大规模表示学习、多模态化学知识和可验证的科学推理。

英文摘要

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

2605.26535 2026-05-27 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Recursive Flow Matching

递归流匹配

Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) University of Michigan(密歇根大学)

AI总结 提出递归流匹配(RecFM)框架,通过自一致性约束对齐不同离散化尺度的轨迹,实现高保真单步或少步(2-4步)动态生成,在科学基准上相比领先扩散模拟器加速20倍且提升预测精度。

Comments Project page: https://jhhuangchloe.github.io/RecFM/

详情
AI中文摘要

生成模型已成为解决物理系统和建模复杂时空动态的强大范式。然而,在不产生高计算成本的情况下实现高物理精度仍然是一个基本挑战,因为现有方法面临关键的速度-保真度权衡。在这项工作中,我们引入了递归流匹配(RecFM),一个用于预测复杂时空动态的生成框架。RecFM强制执行自一致性以对齐跨离散化尺度的轨迹,减少离散化误差并改善基于物理任务的各种指标。据我们所知,这是第一种在科学系统中实现高保真单步和少步(2-4步)动态生成的方法,其性能可与最先进的多步求解器相媲美。在具有挑战性的科学基准测试中,RecFM相比领先的扩散模拟器实现了高达20倍的加速,同时提高了预测精度。此外,与普通流匹配相比,RecFM将均方误差降低了超过15%,为实时科学模拟提供了一种可扩展且高效的解决方案。

英文摘要

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20$\times$ speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.
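The notion of self-consistency across discretization scales can be made concrete with a toy example. The sketch below compares one coarse Euler step against two fine steps along the same velocity field and returns the squared gap; it is an assumed, generic illustration, not RecFM's training objective, and the velocity fields and function names are hypothetical.

```python
from typing import Callable

Velocity = Callable[[float, float], float]  # v(x, t)


def euler_step(v: Velocity, x: float, t: float, h: float) -> float:
    """Advance the state by one explicit Euler step of size h."""
    return x + h * v(x, t)


def self_consistency_gap(v: Velocity, x: float, t: float, h: float) -> float:
    """Squared mismatch between one step of size 2h and two steps of size h."""
    coarse = euler_step(v, x, t, 2.0 * h)
    halfway = euler_step(v, x, t, h)
    fine = euler_step(v, halfway, t + h, h)
    return (coarse - fine) ** 2


def decay(x: float, t: float) -> float:
    """Toy dynamics dx/dt = -x."""
    return -x


print(self_consistency_gap(decay, 1.0, 0.0, 0.1))
```

A field whose coarse and fine rollouts agree has a gap near zero, so a single large step would lose little accuracy; driving such a gap down during training is one plausible route to few-step generation.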

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG 版本更新

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

发表机构 * School of Computing, Telkom University(Telkom大学计算机学院) Faculty of Engineering and Technology, School of Computing and Artificial Intelligence(工程与技术学院,计算与人工智能学院)

AI总结 本文提出一种解耦的边缘可部署管道,结合YOLO26-x-obb检测器、确定性编码模块和QLoRA微调的Qwen-2.5-1.5B模型,实现风电叶片缺陷定位与结构化报告生成,在BLEU-4、幻觉率和专家评分上显著优于零样本VLM基线。

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情
AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成;在当前的实践中,这些任务被分开处理,语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道,由三个组件组成,每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器,在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块,将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型,通过量化低秩适应(QLoRA)在947个合成生成的维护报告上进行适配,从该提示生成结构化的JSON报告。检索增强微调(RAFT)进一步将每个建议基于索引的维护程序。五项消融实验,通过BLEU-4、ROUGE-L、幻觉率(HR)和LLM-as-a-Judge评分标准,将该管道与单一视觉-语言模型(VLM)基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10,而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下,QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明,具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
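A deterministic, parameter-free mapping from a detection to a grid reference might look like the sketch below. The token format (row letter plus column number), the 8 by 8 default grid, and the function names are assumptions made for illustration; the abstract does not specify how the Bridge encodes positions.

```python
import string


def grid_token(
    cx: float, cy: float, img_w: float, img_h: float, rows: int = 8, cols: int = 8
) -> str:
    """Map a box center in pixels to a cell label such as 'C3'."""
    if not (0.0 <= cx <= img_w and 0.0 <= cy <= img_h):
        raise ValueError("box center lies outside the image")
    if not (1 <= rows <= 26 and cols >= 1):
        raise ValueError("rows must be 1..26 and cols must be positive")
    # Clamp so that a center exactly on the right or bottom edge stays in range.
    col = min(int(cx / img_w * cols), cols - 1)
    row = min(int(cy / img_h * rows), rows - 1)
    return f"{string.ascii_uppercase[row]}{col + 1}"


def prompt_line(label: str, cx: float, cy: float, img_w: float, img_h: float) -> str:
    """One line of a structured prompt describing a detection."""
    return f"- defect={label}; cell={grid_token(cx, cy, img_w, img_h)}"


print(prompt_line("crack", 1500.0, 400.0, 4000.0, 3000.0))
```

Because the mapping has no learned parameters, the same detection always yields the same token, which keeps the language model's input reproducible across runs.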

2605.26526 2026-05-27 cs.LG cs.CR 版本更新

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

开源权重大语言模型微调防御易受简单攻击

Kevin Kuo, Chhavi Yadav, Virginia Smith

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Simons Institute, UC Berkeley(Simons研究所,加州大学伯克利分校)

AI总结 本文发现针对开源权重大语言模型的防御措施易受abliteration和prefilling等低成本攻击,并提出abliteration-resistant tuning (ART) 方法将攻击成功率降低10%-20%。

Comments main body: 9 pages, 3 figures

详情
AI中文摘要

近期针对开源权重大语言模型(LLMs)的防御措施旨在防止对抗性使用。这些防御措施基于一个假设:新的有害行为是通过微调学习到的,而不是通过越狱模型诱发的。然而,预训练的LLMs已经在许多领域编码了大量有害知识,这引发了一个重要问题:对手能否越狱受保护的模型,在不进行任何微调的情况下实现有害使用?在本文中,我们表明开源权重防御措施容易受到更简单的策略攻击,这些策略虽然众所周知,但尚未针对这些防御措施进行系统评估。具体来说,我们评估了两种低成本攻击——abliteration和prefilling——它们不依赖于基于梯度的优化。在三个有害性评估基准(BeaverTails、HarmBench和AdvBench)上,这些攻击将针对受保护开源权重模型的攻击成功率从低于10%提高到16%-96%的范围。为了缓解这一漏洞,我们引入了abliteration-resistant tuning (ART),它将基于abliteration的目标纳入训练。ART可以叠加到现有防御措施上,并将abliteration、prefilling及其组合的成功率降低10%-20%。这些发现表明,开源权重模型的攻击面比先前描述的要更广,并且对防御措施的评估应包含更多样化的攻击策略,而不仅仅是对抗性微调。

英文摘要

Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet, pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak safeguarded models to achieve harmful usage without fine-tuning at all? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, despite being well known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks--abliteration and prefilling--that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks increase attack success rates against safeguarded open-weight models from below 10% to a range of 16%-96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10%-20%. These findings indicate that the attack surface for open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should incorporate a more diverse set of attack strategies beyond adversarial fine-tuning.

2605.26523 2026-05-27 cs.DC cs.AI cs.LG 版本更新

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit: 通过不确定性引导的自适应分割实现连续音频表示学习

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University(迪肯大学工程学院)

AI总结 提出StreamSplit框架,通过分布式的混合损失和强化学习策略实现边缘设备上的流式对比学习,在降低延迟、带宽和能耗的同时保持高精度。

Comments Accepted at ACM MobiSys 2026

详情
AI中文摘要

大批量对比学习(CL)是现代表示学习的基础,但与边缘设备波动的资源约束根本不相容。这种冲突造成了一个困境:设备上的小批量会降低模型保真度,而将计算卸载到云端则会导致不可接受的延迟和带宽成本。现有解决方案通常采用静态模型压缩,无法适应边缘环境的运行时波动。为弥合这一差距,我们提出了StreamSplit,一种新颖的框架,使得流式对比学习在异构ARM客户端平台上变得实用。StreamSplit解决了环境音频的连续性与CLAP和COLA等模型的离散批量需求之间的冲突。我们引入:(1)一种基于分布的流式框架,将表示质量与本地批量大小解耦,使用易于处理的混合损失在稀疏更新的情况下保持保真度;(2)一种不确定性引导的自适应分割器,使用轻量级强化学习(RL)策略动态划分计算。独特的是,该策略将实时资源监控与嵌入歧义性相结合,以动态优化准确率-延迟权衡。我们在从资源受限的Raspberry Pi 4到高性能Apple M2的多种硬件上评估了StreamSplit。结果表明,与以服务器为中心的基线相比,StreamSplit将每样本延迟降低了高达4.7倍,带宽减少了77.1%,能耗减少了52.3%。关键的是,它保持了与服务器中心模型相差2.2%以内的准确率,证明了自适应分布式学习是现代边缘生态系统的一条可行路径。

英文摘要

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG 版本更新

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出一种保留感兴趣区域的、基于顶点的可变大小皮层表面分块方法(皮层超顶点),并设计可变大小补丁兼容的视觉Transformer(CSV-ViT),在阿尔茨海默病诊断、淀粉样蛋白阳性和tau蛋白阳性三分类任务中优于现有表面模型。

详情
AI中文摘要

确认阿尔茨海默病(AD)通常依赖于正电子发射断层扫描(PET),该方法仍然昂贵且有创,这促使了基于结构MRI的预筛查的使用。在非欧几里得流形,特别是大脑皮层表面上的深度学习,由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习;然而,施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说,许多基于表面的模型对感兴趣区域(ROI)的感知有限,这可能导致非皮层区域(如内侧壁)被包含在内。我们提出了一种皮层表面分块方法,该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点(CSV)。基于这种表示,我们设计了CSV视觉Transformer(CSV-ViT),这是一种可变大小补丁容忍的视觉Transformer,使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI,并通过将AD相关状态分类为三个类别来评估我们的框架:AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中,CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明,所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

英文摘要

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.

2605.26509 2026-05-27 cs.LG math.PR stat.CO 版本更新

SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning

SIKA-GP:利用稀疏诱导核近似加速贝叶斯深度学习中的高斯过程推断

Wenyuan Zhao, Rui Tuo, Chao Tian

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University, College Station, US; Department of Industrial and Systems Engineering, Texas A&M University, College Station, US

AI总结 提出SIKA-GP方法,通过基于二元有序模板基的稀疏诱导核近似,将高斯过程推断的计算复杂度降低至O(log M),并实现高效张量化GPU计算,可自然嵌入贝叶斯神经网络,在视觉和Transformer语言基准上显著加速训练和推断而不牺牲预测性能。

Comments 20 pages, 8 figures; accepted to International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

高斯过程(GPs)为不确定性估计提供了原则性的贝叶斯框架,但其计算复杂度严重限制了在大规模数据集上的可扩展性。我们提出SIKA-GP,该方法使用基于二元有序模板基的稀疏诱导核近似来加速GP推断,对诱导点数量的复杂度依赖仅为${O}(\log M)$。我们的方法从稀疏激活基构建紧凑且表达力强的核表示,从而实现高效的张量化GPU计算,并与现代大规模模型无缝集成。SIKA-GP可以自然地嵌入具有稀疏激活的贝叶斯神经网络(BNNs)中,在训练和推断中均实现显著加速,且不牺牲预测性能。该方法自然地扩展到深度特征学习,解决了深度架构和高维特征表示带来的可扩展性挑战。在视觉和基于Transformer的语言基准上的实验结果表明,我们的方法始终提供快速且准确的GP模型,为可扩展核学习提供了一条原则性路径。

英文摘要

Gaussian processes (GPs) provide a principled Bayesian framework for uncertainty estimation, but their computational complexity severely limits scalability to large datasets. We propose SIKA-GP, which accelerates GP inference using sparse inducing kernel approximations based on a dyadic ordered template basis, incurring only ${O}(\log M)$ complexity dependence on the number of inducing points. Our approach constructs compact and expressive kernel representations from sparsely activated bases, enabling efficient tensorized GPU computation and seamless integration with modern large-scale models. SIKA-GP can be naturally embedded into Bayesian neural networks (BNNs) with sparse activations, yielding significant speedups in both training and inference without sacrificing predictive performance. The method naturally extends to deep feature learning, addressing the scalability challenges introduced by deep architectures and high-dimensional feature representations. Empirical results on vision and transformer-based language benchmarks demonstrate that our approach consistently delivers fast and accurate GP models, providing a principled path toward scalable kernel learning.

2605.26496 2026-05-27 cs.LG cs.AI 版本更新

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Dense2MoE:通过统一剪枝和升级推动设备端LLM的帕累托前沿

Fengfa Li, Hongjin Ji, Yifeng Ding, Lei Ren, Chen Wei

发表机构 * Li Auto The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出Dense2MoE框架,通过层融合升级(LF-UC)统一剪枝和升级,将密集LLM高效转换为设备端MoE模型,在推理延迟与准确性之间取得更优帕累托前沿。

Comments 19 pages

详情
AI中文摘要

混合专家(MoE)架构对于资源受限的设备端部署极具前景,但从头训练这些模型成本高昂。当前方法试图通过将密集模型升级为MoE来缓解这一问题,然而它们常常引入参数冗余,降低推理效率。另一方面,标准层剪枝减少了冗余,但不可避免地损害模型准确性。为解决这一困境,我们提出Dense2MoE,一种通过层融合升级(LF-UC)统一剪枝和升级的新框架。在硬件Roofline理论指导下,Dense2MoE通过剪枝来自冗余层的带宽密集型注意力模块,同时将其多层感知机(MLP)重新用作MoE专家,系统地克服了推理内存墙。这种结构创新保留了模型的核心能力,并通过选择性令牌路由严格限制活跃参数。借助适度的持续预训练预算,Dense2MoE高效地将公开可用的密集LLM转换为设备端就绪的MoE模型。大量实验表明,Dense2MoE显著推进了设备端推理延迟与模型准确性的帕累托前沿,优于密集基线、最先进的压缩方法和标准升级方法。

英文摘要

The Mixture-of-Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense models into MoEs; however, they often introduce parameter redundancy that degrades inference efficiency. Alternatively, standard layer pruning mitigates redundancy but inevitably compromises model accuracy. To resolve this dilemma, we propose Dense2MoE, a novel framework that unifies pruning and upcycling through Layer Fusion-UpCycling (LF-UC). Guided by hardware Roofline theory, Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth-heavy attention modules from redundant layers while repurposing their Multi-Layer Perceptrons (MLPs) into MoE experts. This structural innovation preserves the model's core capabilities and strictly limits active parameters via selective token routing. With a modest continual pre-training budget, Dense2MoE efficiently converts publicly available dense LLMs into on-device-ready MoE models. Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on-device inference latency versus model accuracy, outperforming dense baselines, state-of-the-art compression, and standard upcycling methods.

2605.26494 2026-05-27 cs.AI cs.CL cs.LG 版本更新

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列:小激活释放最大现实智能

MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

发表机构 * MiniMax

AI总结 提出MiniMax-M2系列混合专家语言模型,通过小激活参数实现前沿性能,核心包括智能体驱动数据管道、可扩展强化学习系统Forge及自进化检查点M2.7。

Comments Technical Report. 35 pages, 10 figures, 4 tables

详情
AI中文摘要

我们介绍了MiniMax-M2系列,这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B,每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计,包含三个组成部分:(i) 智能体驱动数据管道,生成大规模、可验证的轨迹,涵盖智能体编码和智能体协作,每个轨迹都基于可执行工作空间和与工件对齐的奖励;(ii) Forge,一个可扩展的智能体原生强化学习系统,适应长程智能体轨迹,并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦;(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7,这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

英文摘要

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint, which takes an early step toward self-evolution by autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

2605.26492 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯,再次?诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

发表机构 * Department of Information Science(信息科学系)

AI总结 研究通过采样20000个故事发现,LLM生成的故事中存在高度重复的词汇(如Elias、灯塔等),这些词汇来自偏好数据而非预训练数据,表明小数据集与强对齐算法的结合可能对多样性产生不成比例的影响。

详情
AI中文摘要

LLM生成的故事是一个流行的用例,但它们显示出非常低的变异性。我们使用五个提示从四个当前模型中采样了总共20,000个故事。我们发现,88.3%的生成故事中出现11个单词,模型之间差异很小。这些单词包括名字(Elias, Mara, Elara)、场景(灯塔)和职业(钟表匠、图书管理员)。这些标记在已发表的文献或预训练数据中并不常见,但在所有当前模型可能使用的偏好数据中却存在。令人惊讶的是,与平均后训练故事相比,这些“灯塔”故事并不常见,后训练故事中很多包含受版权保护的角色或成人内容。这一结果证明了小数据集与强大对齐算法结合可能产生的潜在不成比例影响。

英文摘要

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.

2605.26491 2026-05-27 cs.LG cs.CV 版本更新

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

超越成对偏好:扩散模型的列表级奖励感知对齐

Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

发表机构 * Caltech(加州理工学院) Stanford University(斯坦福大学)

AI总结 提出Diffusion LAIR方法,通过列表级奖励感知优化,利用连续奖励分数和所有候选图像同时优化扩散模型,在文本到图像生成等任务上超越成对偏好基线。

详情
AI中文摘要

偏好优化已成为从人类反馈中进行在线强化学习(RLHF)的一种高效替代方案,用于对齐文本到图像扩散模型。然而,现有方法大多将监督简化为二元成对比较。当训练数据自然包含同一提示的多个候选图像,并且连续奖励分数能提供比单一赢家-输家标签更丰富的信息时,这种成对简化具有局限性。为解决这些局限性,我们提出了Diffusion LAIR,一种用于扩散模型的奖励感知列表级偏好优化方法。对于每个提示,LAIR将一组候选图像的奖励分数转换为居中优势权重,然后在隐式奖励上优化优势加权回归目标,隐式奖励定义为当前模型相对于固定参考模型的去噪损失改进,并带有二次惩罚以正则化隐式奖励的幅度。所得目标同时使用所有候选图像而非选择成对,并通过显式控制隐式奖励的幅度保持保守性。LAIR目标在隐式奖励空间中具有有界闭式最优解,阐明了正则化强度如何控制偏好更新的幅度。实验表明,Diffusion LAIR在SD1.5和SDXL上,在文本到图像生成、组合生成和图像编辑基准测试中均优于强偏好优化基线。

英文摘要

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
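The abstract above does not state the LAIR objective explicitly, so the following is only a minimal sketch of one form that matches its description: rewards are centered within the candidate group, the implicit reward is the reference model's denoising loss minus the current model's, and a quadratic penalty keeps the implicit reward bounded. The objective `-A_i * s_i + (lam / 2) * s_i**2`, the parameter `lam`, and all function names are assumptions made for illustration, not the authors' formulation.

```python
def centered_advantages(rewards):
    """Center reward scores within one prompt's candidate group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def implicit_rewards(ref_losses, model_losses):
    """Implicit reward: denoising-loss improvement over the fixed reference."""
    return [lr - lm for lr, lm in zip(ref_losses, model_losses)]


def listwise_objective(rewards, ref_losses, model_losses, lam):
    """Assumed objective: advantage-weighted term plus quadratic penalty.

    Every candidate contributes, not just a selected winner-loser pair.
    """
    adv = centered_advantages(rewards)
    s = implicit_rewards(ref_losses, model_losses)
    return sum(-a * si + 0.5 * lam * si * si for a, si in zip(adv, s))


def optimal_implicit_rewards(rewards, lam):
    """Minimizer of the assumed objective: s_i = A_i / lam (bounded)."""
    return [a / lam for a in centered_advantages(rewards)]
```

Under this assumed form, a larger `lam` shrinks the optimal implicit rewards toward zero, which is one way to read the abstract's remark that the regularization strength controls the magnitude of the preference update.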

2605.26489 2026-05-27 cs.LG 版本更新

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

奇异分布的稳定性:语言模型预训练两阶段动力学的谱视角

Hongtao Zhang, Wenjie Zhou, Chenxi Jia, Wei Chen, Xueqi Cheng

发表机构 * School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学先进交叉学科学院) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院人工智能安全国家重点实验室) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学) School of Mathematics, Southeast University, Nanjing, China(东南大学数学学院)

AI总结 本文发现语言模型预训练中奇异值谱的早期稳定现象(SoSD),并证明该现象与慢下降阶段同步,通过理论分析揭示了权重范数增长导致SoSD阈值,从而限制后续损失下降速率。

详情
AI中文摘要

大型语言模型预训练通常表现出两阶段轨迹:快速的初始损失下降,随后是长时间的缓慢改善。我们识别出一个潜在的谱现象——奇异分布的稳定性(SoSD),其中迹归一化的奇异值谱早期就稳定下来,即使参数矩阵继续演化。我们证明,SoSD与慢下降阶段之间的同步在不同架构(GPT-2、LLaMA)和设置中广泛存在,包括各种调度(Step-wise、WSD、Cosine Decay)、权重衰减和优化器(AdamW、Muon)。通过分析一个简化的Transformer,我们证明权重范数的增长不可避免地会引发早期的SoSD阈值,之后损失下降速率在理论上受限于奇异分布的变化。我们进一步解释了WSD和Muon等策略通过调节SoSD尺度来影响预训练动态,从而为理解高效预训练动力学提供了谱视角。

英文摘要

Large language model pre-training typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.
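A small sketch of how the trace-normalized singular value spectrum described above might be tracked between checkpoints. It assumes that "trace-normalized" means dividing the singular values by their sum, and it uses an L1 distance as the stability measure; both choices, and the function names, are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np


def singular_distribution(W):
    """Singular values of W, normalized to sum to one (assumed normalization).

    W must be a nonzero 2-D array, otherwise the sum is zero.
    """
    s = np.linalg.svd(W, compute_uv=False)
    return s / s.sum()


def spectrum_shift(W_a, W_b):
    """L1 distance between the normalized spectra of two same-shape matrices."""
    diff = singular_distribution(W_a) - singular_distribution(W_b)
    return float(np.abs(diff).sum())
```

Because the normalization removes overall scale, `spectrum_shift(W, 3 * W)` is zero up to rounding. This illustrates how a weight matrix can keep growing in norm while its singular distribution stays fixed.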

2605.26484 2026-05-27 cs.LG 版本更新

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

Extra-Merge:追踪语言模型预训练中模型合并的秩-1子空间

Wenjie Zhou, Bohan Wang, Hongtao Zhang, Chenxi Jia, Wei Chen, Xueqi Cheng

发表机构 * School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学先进交叉学科学院) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, China(中国科学院人工智能安全国家重点实验室) University of Chinese Academy of Sciences, China(中国科学院大学) Alibaba Group, China(阿里巴巴集团) School of Mathematics, Southeast University, Nanjing, China(东南大学数学学院)

AI总结 本文通过分析预训练后期轨迹发现秩-1子空间现象,提出无需额外训练的Extra-Merge方法,沿该子空间外推以最小化损失,在GPT-2和LLaMA系列上优于标准合并基线。

详情
AI中文摘要

模型合并已成为增强大型语言模型(LLMs)的轻量级范式,但其底层机制仍知之甚少。在这项工作中,我们分析了后期预训练轨迹,并揭示了一个秩-1子空间现象:虽然原始优化步骤剧烈振荡,但连续的合并检查点坍缩到一个稳定的、近似一维的线性流形上。我们通过河谷景观分析从理论上为这一观察提供了依据:平均操作充当了几何低通滤波器,抑制高曲率噪声以揭示最优下降方向。基于这一见解,我们提出了Extra-Merge,一种无需训练的策略,沿该子空间外推以最小化损失,无需额外的梯度更新。在GPT-2和LLaMA系列(124M到2B)上的大量实验表明,Extra-Merge始终优于标准合并基线。值得注意的是,它在Pythia-12B下游任务上取得了一致的零样本准确率提升,并有效推广到Muon优化器(Jordan et al., 2024)。

英文摘要

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer (Jordan et al., 2024).
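The idea in this abstract can be sketched on flattened parameter vectors. The sketch assumes uniform averaging as the merge, a straight-line extrapolation from the last two merged checkpoints, and a user-chosen step `alpha`. The abstract does not say how the paper selects the extrapolation length, so `alpha` and all function names here are hypothetical.

```python
import numpy as np


def merge(checkpoints):
    """Uniform average of a window of flattened checkpoints (assumed merge)."""
    return np.mean(np.stack(checkpoints), axis=0)


def extra_merge(prev_merged, curr_merged, alpha):
    """Extrapolate past the latest merged checkpoint along the merge direction."""
    return curr_merged + alpha * (curr_merged - prev_merged)


def rank1_energy(merged_list):
    """Share of squared singular values captured by the top direction.

    Computed over differences of consecutive merged checkpoints; values near
    1.0 mean the merged trajectory is close to one-dimensional. Requires at
    least one nonzero difference.
    """
    diffs = np.stack([b - a for a, b in zip(merged_list[:-1], merged_list[1:])])
    s = np.linalg.svd(diffs, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))
```

No gradient step appears anywhere in the sketch, which is the sense in which such a strategy is training-free: it only recombines checkpoints that already exist.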

2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

基于随机解耦策略梯度的高效在策略视觉强化学习

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

发表机构 * Yale University(耶鲁大学) Shanghai Jiao Tong University(上海交通大学) University of Sydney(悉尼大学)

AI总结 提出随机解耦策略梯度(SDPG)方法,通过轨迹滚动的随机扰动估计策略梯度,在单GPU上数小时内端到端训练多样化的视觉运动控制策略,显著降低计算和内存开销,并在视觉MuJoCo基准测试中优于基线方法。

详情
AI中文摘要

我们提出了随机解耦策略梯度(SDPG),一种轻量级的视觉强化学习方法,能够在单个NVIDIA RTX 4080 GPU上在数小时内端到端训练多样化的视觉运动控制策略。SDPG通过轨迹滚动的随机扰动估计策略梯度,所需批量渲染环境数量减少几个数量级,并显著降低计算和内存开销。在视觉MuJoCo基准测试中,SDPG在训练时间、内存使用和奖励方面始终优于基线方法。最后,为支持未来研究,我们引入了一套涵盖灵巧操作、具有挑战性的运动控制的逼真视觉机器人基准测试,并在物理硬件上展示了有效的仿真到现实迁移。

英文摘要

We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation and challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.

2605.26477 2026-05-27 cs.LG 版本更新

Variational Inference for Evidential Deep Learning

证据深度学习的变分推断

Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) School of Computing Information Sciences, Saint Francis University(圣弗朗西斯大学计算信息科学学院) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用关键实验室(东南大学),中华人民共和国教育部,中国)

AI总结 针对传统证据深度学习(EDL)中KL惩罚项导致证据过高和参数设置缺乏理论保证的问题,提出基于变分推断的VI-EDL框架,通过推导证据下界(ELBO)抑制证据过度增长,并建立泛化界理论,在视觉和医学数据集上实现最先进的分布外检测、噪声检测和自动驾驶性能。

详情
AI中文摘要

尽管深度神经网络(DNN)取得了显著性能,但它们倾向于产生过度自信的预测。证据深度学习(EDL)通过将预测公式化为类别概率上的狄利克雷分布来显式量化认知不确定性,从而缓解了这一问题。然而,我们发现传统的EDL存在两个基本限制:一个仅抑制负类证据的Kullback-Leibler(KL)惩罚项,导致证据过高,从而降低了模型量化不确定性的能力;以及缺乏设置狄利克雷参数$\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$的理论保证。在本文中,我们提出了一个数学上严谨的框架:变分推断证据深度学习(VI-EDL)。通过从变分推断的角度重新表述证据学习,我们推导出一个证据下界(ELBO),它防止证据过度增长。理论上,我们严格建立了泛化界,并揭示了预测不确定性、特征和网络复杂度如何影响该界,以及为什么设置$\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$可以最小化它。在标准视觉和医学数据集上的大量实验表明,VI-EDL实现了最先进的性能,在分布外检测、噪声检测和自动驾驶场景中表现出色。代码可在https://github.com/seutjw/VI-EDL获取。

英文摘要

While Deep Neural Networks (DNNs) achieve remarkable performance, they tend to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we find that conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that suppresses only the evidence of negative classes, which produces excessively high evidence and thereby weakens the model's ability to quantify uncertainty, and the absence of a theoretical guarantee for setting the Dirichlet parameter $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature, and network complexity affect this bound, and why setting $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$ can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance in out-of-distribution detection, noise detection, and autonomous driving scenarios. The code is available at https://github.com/seutjw/VI-EDL.
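To make the Dirichlet parameterization concrete, the sketch below maps per-class evidence to the quantities commonly used in conventional EDL: the parameters (evidence plus one), the expected class probabilities, and the uncertainty mass `K / S`, where `S` is the sum of the parameters. These are the standard EDL conventions, not the VI-EDL objective itself, and the function name is illustrative.

```python
def dirichlet_summary(evidence):
    """Map non-negative per-class evidence to Dirichlet quantities.

    Returns (alpha, expected_probs, uncertainty) with alpha_k = e_k + 1,
    expected_probs_k = alpha_k / S and uncertainty = K / S, where S is the
    Dirichlet strength (sum of alpha) and K the number of classes.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    expected_probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return alpha, expected_probs, uncertainty
```

With zero evidence for every class the uncertainty is exactly 1 and the expected probabilities are uniform. As evidence grows without bound the uncertainty approaches 0, which is why excessively high evidence undermines uncertainty estimates.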

2605.26468 2026-05-27 cs.LG cs.AI 版本更新

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

扩散检测:用于无监督IC异常检测的生成扩散模型

Yuxuan Yin, Chen He, Todd Jacobs, Jialei He, Boxun Xu, Robert Jin, Peng Li

发表机构 * Department of Electrical and Computer Engineering, University of California Santa Barbara, CA, USA(加州大学圣芭芭拉分校电子与计算机工程系)

AI总结 提出首个结合扩散Transformer的无监督异常检测框架,通过自编码器压缩、结构化令牌序列和噪声预测误差实现晶圆级快速筛选,在16nm IC测试数据上达到最优性能。

Comments 9 pages, 5 figures

详情
AI中文摘要

潜在缺陷筛选面临极低故障率、高维测试数据和缺乏标注异常的挑战。我们提出了首个结合扩散Transformer的无监督异常检测框架。原始测试测量值首先由自编码器压缩,然后重塑为结构化令牌序列,并加入正弦和每设备晶圆位置嵌入。异常分数来自中程扩散时间步上的噪声预测误差,从而无需任何标注缺陷或手动特征工程即可实现快速晶圆级筛选。我们的方法在极端类别不平衡下的工业16nm IC测试数据上达到了最先进的性能,并通过潜在空间重建残差提供可解释的故障定位。

英文摘要

Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We propose the first unsupervised anomaly detection framework incorporating a Diffusion Transformer. Raw test measurements are first compressed by an autoencoder, then reshaped into a structured token sequence enriched with sinusoidal and per-device wafer-position embeddings. Anomaly scores are derived from the noise-prediction error over mid-range diffusion timesteps, enabling fast wafer-scale screening without any labeled defects or manual feature engineering. Our approach achieves state-of-the-art performance on industrial 16nm IC test data under extreme class imbalance, offering interpretable failure localization through latent-space reconstruction residuals.

2605.26459 2026-05-27 cs.LG 版本更新

MuCon: Clipped Muon Updates for LLM Training

MuCon: 用于LLM训练的裁剪Muon更新

Albert Yi

发表机构 * Albert Yi(阿尔伯特·伊)

AI总结 本文提出MuCon优化器,通过奇异值裁剪替代Muon的极分解方向,并研究其近似计算与数值稳定性。

详情
AI中文摘要

Muon风格的优化器采用矩阵值动量或预条件更新 $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$,并将其替换为其规范部分极因子 $\operatorname{Pol}(B) = U V^\top$。这会将每个非零奇异值映射为1。MuCon是本文研究的裁剪Muon变体:它对相同的Muon矩阵应用奇异值裁剪,$D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau> 0$。因此,$\operatorname{MClip}_\tau$ 表示数学裁剪算子,而MuCon表示优化器原语,它将此裁剪方向替代Muon的极方向。本文使用的Muon/MuCon缩放参数化称为 $\text{SpectralP}$:这是一种隐藏矩阵缩放方案,在该方案下应用极Muon或裁剪MuCon方向。映射 $\operatorname{MClip}_\tau$ 是到谱范数球 $\{X : \|X\|_2 \le \tau\}$ 的Frobenius投影:它保持小于或等于 $\tau$ 的奇异值不变,仅修改违反的奇异方向。本文探讨何时可以在不进行完整稠密SVD的情况下近似MuCon裁剪步骤。我们记录了两个精确恒等式,一个极/绝对值公式和一个标量根公式,后者引出了用于裁剪半正定因子的有理牛顿滤波器,并指出了两者共同的数值障碍:接近阈值的奇异值使得符号决策和有理求解变得病态。因此,矩阵函数方法仅在结合稳定的极/平方根原语或裁剪边界附近的显式正则化时才有用。

英文摘要

Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called $\text{SpectralP}$: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
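The two maps defined in this abstract can be written down directly with a dense SVD. That is the reference computation the paper aims to avoid at scale, so the sketch below is a correctness baseline rather than the proposed approximation. The function names `polar_factor` and `mclip` and the rank tolerance are illustrative choices.

```python
import numpy as np


def polar_factor(B):
    """Partial polar factor U V^T: every nonzero singular value becomes 1."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    tol = 1e-12 * s.max() if s.size else 0.0  # illustrative rank cutoff
    keep = s > tol
    return U[:, keep] @ Vt[keep, :]


def mclip(B, tau):
    """MClip_tau(B) = U diag(min(sigma_i, tau)) V^T for tau > 0."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Scaling the columns of U by the clipped values forms U diag(.) cheaply.
    return (U * np.minimum(s, tau)) @ Vt
```

For example, `mclip(np.diag([3.0, 0.5]), 1.0)` clips only the first singular value and returns `diag(1.0, 0.5)`, whereas `polar_factor` would map both values to one.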

2605.26446 2026-05-27 cs.LG cs.AI 版本更新

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

DDGAD:基于扩散的图异常检测中的轨迹动力学

Yuxin Yang, Limei Hu, Feng Chen

发表机构 * College of Artificial Intelligence(人工智能学院) Southwest University(西南大学)

AI总结 提出DDGAD框架,利用扩散正则化和可靠性感知邻域共识下的轨迹动力学区分正常与异常节点,通过三种互补异常信号检测异常。

详情
AI中文摘要

图异常检测(GAD)旨在识别图结构数据中行为或属性显著偏离整体模式的节点或子结构,在金融风险控制、社交网络分析和网络安全等领域具有关键应用。然而,现有的基于GCN的方法存在污染传播的根本问题,即异常节点通过消息传递污染其邻居的表示,导致检测性能下降。本文提出DDGAD,一种新颖的基于扩散的图异常检测框架,利用轨迹动力学区分正常和异常节点。我们的关键洞察是,在扩散正则化和可靠性感知邻域共识的耦合作用下,正常节点表现出一致且稳定的表示轨迹,而异常节点由于全局流形先验与局部污染消息传递之间的方向不一致,表现出不稳定且冲突的动力学。为了减轻污染传播,我们引入了一种分布式的可靠性感知共识细化机制,并定义了三种互补的异常信号:邻居不一致性、可靠性权重和动力学冲突能量。我们进一步对耦合动力学下的正常节点稳定性进行了初步的理论分析。这些信号从局部不一致性、共识可靠性和动力学不稳定性角度共同刻画异常行为。在五个真实世界数据集上的大量实验证明了所提框架的有效性。

英文摘要

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.

2605.26434 2026-05-27 cs.LG cs.AI 版本更新

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

基于重建的脑电图基础模型中的非周期和低频谱偏差

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard, Jeppe Roden Münster, Andreas Peter Juhl Hansen, Takfarinas Medani, Tiantian Feng, Richard Leahy, Shrikanth Narayanan

发表机构 * University of Southern California(美国南加州大学) Aalborg University(奥尔堡大学)

AI总结 研究揭示基于重建预训练的脑电图基础模型存在非周期和低频成分偏差,导致低资源场景下性能不佳,并提出通过辅助损失关注高频振荡结构来改进。

Comments 18 pages, 13 figures, 3 tables

详情
AI中文摘要

脑电图基础模型在大规模无标签脑电图数据上预训练,已成为学习可泛化脑电图表示的有前景方向。尽管在数据丰富场景下表现积极,但在低资源设置中,它们往往无法显著优于完全监督的小型模型。我们对此缺陷提供了机制性解释,将其归因于基于重建的预训练任务与脑电图信号独特的频谱结构之间的根本性不匹配,该结构分解为高功率非周期成分和低功率振荡成分。通过使用受控的合成脑电图输入,我们证明脑电图基础模型嵌入偏向于捕捉脑电图信号的非周期成分,而低估振荡成分,尤其是高频成分。此外,在真实BCI数据集上的线性探针评估进一步揭示,嵌入比任务相关信息更强烈地编码受试者身份,从而强化了主要基于重建目标训练的基础模型嵌入中的低频和非周期成分偏差。这些发现共同阐明了基于重建的脑电图基础模型中的一种失败模式,并激励未来工作纳入明确针对高频振荡结构的辅助损失,作为实现更强大和可泛化的脑电图表示的途径。

英文摘要

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller fully supervised models in low-resource settings. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction-based objectives. Together, these findings elucidate a failure mode in reconstruction-based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.

2605.26429 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

面向大规模分布外检测的结构自适应共形推断

Rongyi Sun, Wenguang Sun, Zinan Zhao

发表机构 * Center for Data Science and School of Mathematical Sciences, Zhejiang University(数据科学中心和数学科学学院,浙江大学)

AI总结 提出结构自适应共形q值(SCQ)和伪分数引导的直推式自动模型选择(P-TAMS),在成对可交换性下实现结构化分布外检测的有限样本错误率控制、功效提升和可解释性增强。

详情
AI中文摘要

本文针对高风险机器学习应用中的结构化分布外(OOD)检测问题。传统共形方法依赖于联合可交换性,难以融入时空或分组结构等辅助信息。为克服这一局限,我们提出结构自适应共形q值(SCQ),这是一种整合个体检验证据与结构模式的显著性指标。我们还开发了伪分数引导的直推式自动模型选择(P-TAMS),将共形化模型选择适应于候选模型工具箱中的结构化OOD检测。SCQ和P-TAMS共同在成对可交换性下形成一个统一框架,提供有限样本错误率控制、改进的功效和增强的可解释性。在模拟和真实数据上的实验表明,所提方法控制了错误发现率,并在多种设置下表现良好。

英文摘要

This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.
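For context, the sketch below shows the conventional baseline that this abstract contrasts against, not SCQ or P-TAMS: conformal p-values computed from a calibration set of in-distribution scores, followed by the Benjamini-Hochberg step-up rule at a target false discovery rate `q`. It assumes higher scores indicate more anomalous points; the function names are illustrative.

```python
def conformal_p_values(calib_scores, test_scores):
    """Conformal p-value per test point (higher score = more anomalous)."""
    n = len(calib_scores)
    return [
        (1 + sum(c >= t for c in calib_scores)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(p_values, q):
    """Indices rejected by the BH step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])
```

This baseline treats every test point identically and ignores spatiotemporal or grouping structure, which is the limitation the abstract's structure-adaptive q-value is meant to address.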

2605.26424 2026-05-27 cs.IR cs.AI cs.LG 版本更新

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Uniboost:基于价值对齐的全局协调实现公平高效的流量分配

Ge Fan, Nan Zhao, Kai Meng, Cong Luo, Yang Fu, Huiping Chu, Jialin Liu, Yuning Jiang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba, Hangzhou, China; Taobao & Tmall Group of Alibaba, Beijing, China; Taobao & Tmall Group of Alibaba

AI总结 提出Uniboost统一流量分配框架,通过后验价值对齐机制和独立线性提升范式,解决耦合分配、分数膨胀和可解释性问题,提升流量分配效率和推荐性能。

Comments accepted by SIGIR 2026

详情
AI中文摘要

随着互联网服务的快速发展,推荐系统已变得不可或缺。特别是混合(重排序)阶段在跨不同业务目标分配流量中起着关键作用。然而,现有方法常受限于耦合的分配方案、分数膨胀和缺乏可解释性。为应对这些挑战,我们提出Uniboost,一个统一的流量分配框架。Uniboost引入后验价值对齐机制,将抽象模型分数校准到具有明确业务语义的锚定指标,显著增强可解释性。此外,它采用独立的线性提升范式来解耦复杂的加权方案,实现每个计划贡献的精确归因。我们通过在线A/B测试和深入数据分析验证了Uniboost的有效性,展示了三个关键发现:1)降低加权分数的整体权重有效减轻了意外的业务干扰,产生更高效的微观流量分配策略;2)事后分析和聚合仪表板提供了直观的宏观洞察,指导整体流量分配机制的设计;3)提出的“有效完成分数”作为易于获取的后验指标,为内容推荐管道提供了可靠的锚点。综合来看,我们的实验表明,Uniboost不仅在微观层面提升了流量分配效率和推荐性能,还为系统迭代提供了宏观指导。因此,这项工作为大规模工业推荐系统提供了一种高效可控的流量调节解决方案。

英文摘要

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.

2605.26423 2026-05-27 cs.LG eess.IV 版本更新

FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis

FM-fMRI:用于静息态到任务态fMRI时间序列合成的事件条件流匹配

Peiyu Duan, Jiyao Wang, Nicha C. Dvornek, Junlin Yang, Ziqi Gao, Lawrence H. Staib, James S. Duncan

发表机构 * Department of Biomedical Engineering(生物医学工程系) Department of Radiology & Biomedical Imaging(放射科与生物医学成像系) Department of Electrical Engineering(电气工程系)

AI总结 提出FM-fMRI模型,利用事件条件流匹配从静息态fMRI和任务事件信息生成任务态fMRI时间序列,在频谱、连接性和分布匹配上优于扩散模型、GAN和VAE,并提升自闭症分类性能。

Comments MICCAI 2026 Early Accepted

详情
AI中文摘要

基于任务的fMRI提供了任务诱发神经动力学的直接读数,但获取成本高且难以大规模采集,这促使从广泛可用的静息态fMRI(rsfMRI)进行静息态到任务态的合成。我们提出FM-fMRI,一种事件条件流匹配模型,它学习一个连续时间条件向量场,从受试者的rsfMRI和任务事件信息生成任务ROI时间序列。该公式支持基于ODE的快速采样和对异构事件调度的灵活条件设置。我们不是优化逐点重建,而是使用互补标准评估生成的信号,这些标准探究时间和频谱结构、受试者和组水平连接组一致性以及分布对齐。在公共人类连接组项目和内部BioPoint自闭症队列上,FM-fMRI在频谱和连接性一致性上达到最强,并在分布级匹配上优于条件扩散模型、生成对抗网络(GAN)和变分自编码器(VAE)基线。此外,我们通过使用我们的方法合成任务fMRI ROI时间序列来扩充BioPoint队列,改进了下游自闭症分类,并在数据有限的临床环境中展示了实用性。代码将在GitHub上提供。

英文摘要

Task-based fMRI provides a direct readout of task-evoked neural dynamics, but it is expensive and difficult to acquire at scale, motivating rest-to-task synthesis from widely available resting-state fMRI (rsfMRI). We propose FM-fMRI, an event-conditioned flow-matching model that learns a continuous-time conditional vector field to generate task ROI time series from a subject's rsfMRI and the task event information. The formulation enables fast ODE-based sampling and flexible conditioning over heterogeneous event schedules. Rather than optimizing for pointwise reconstruction, we evaluate generated signals using complementary criteria that probe temporal and spectral structure, subject- and group-level connectome consistency, and distributional alignment. On the public Human Connectome Project and an internal BioPoint autism cohort, FM-fMRI achieves the strongest spectral and connectivity agreement and improves distribution-level matching over conditional diffusion, generative adversarial network (GAN), and variational autoencoder (VAE) baselines. Furthermore, we augment the BioPoint cohort by synthesizing task-fMRI ROI time series with our method, improving downstream autism classification and demonstrating practical utility in data-limited clinical settings. The code will be available on GitHub.
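As a rough illustration of the flow-matching idea in this abstract, the following minimal sketch builds a training pair on a straight-line probability path and integrates a velocity field with Euler steps. The linear path, the function names `cfm_training_pair` and `euler_sample`, and the toy array shapes are assumptions made for illustration; the abstract does not state which path, network, or ODE solver FM-fMRI uses.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Return a point on the straight path from x0 to x1 and its target velocity.

    Assumption: linear interpolation path x_t = (1 - t) * x0 + t * x1,
    whose velocity is the constant x1 - x0.
    """
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(v_fn, x0, cond, n_steps=20):
    """Integrate dx/dt = v_fn(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_fn(x, k * dt, cond)
    return x
```

In a real model `v_fn` would be a trained network that receives the rsfMRI and event information through `cond`; a model trained by regressing onto `v_target` is then sampled with the same Euler loop.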

2605.26419 2026-05-27 cs.LG 版本更新

Amortized Factor Inference Networks for Posterior Inference

摊销因子推理网络用于后验推理

Joohwan Ko, Justin Domke

发表机构 * Manning College of Information and Computer Sciences(Manning信息与计算机科学学院) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 提出摊销因子推理网络(AFINs),通过编码-合并-解码架构实现跨不同先验、似然和维度的后验推理泛化,在保持后验精度的同时大幅降低测试时计算量。

详情
AI中文摘要

摊销推理承诺快速的测试时贝叶斯推理,但现有方法固有地依赖于固定模型。将摊销扩展到未见过的模型通常需要重新训练或昂贵的测试时微调。在本文中,我们提出:是否可能构建一个能够跨不同先验、似然和维度进行泛化的单一推理网络?我们引入了摊销因子推理网络(AFINs),这是一类基于维度无关模块的编码-合并-解码推理网络,将模型规范及其观测映射到变分后验的参数。实验表明,单个训练好的AFIN在达到与NUTS和几种变分推理方法相当的后验精度的同时,测试时计算量减少了2到4个数量级。代码可在 https://github.com/joohwanko/AFINs 获取。

英文摘要

Amortized inference promises fast test-time Bayesian inference, but existing methods are inherently tied to fixed models. Extending amortization to unseen models typically requires retraining or costly test-time finetuning. In this paper, we ask: is it possible to build a single inference network capable of generalizing across varying priors, likelihoods, and dimensionality? We introduce Amortized Factor Inference Networks (AFINs), a family of encode-merge-decode inference networks built on dimension-independent modules that map a model specification and its observations to the parameters of a variational posterior. Experimentally, a single trained AFIN achieves posterior accuracy comparable to NUTS and several variational inference methods, while requiring 2 to 4 orders of magnitude less test-time compute. Code is available at https://github.com/joohwanko/AFINs.

2605.26414 2026-05-27 cs.AI cs.CL cs.LG 版本更新

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

推理、代码,还是两者兼有?大型语言模型如何处理数学问题的变化

Matthew Kutakh

AI总结 本研究通过对比链式思维推理、单次代码执行和迭代代码执行三种方法在GSM-Symbolic数据集上的表现,发现代码执行并未提升大型语言模型在数学问题变体上的推理鲁棒性。

Comments 6 pages, 4 figures, 2 tables

详情
AI中文摘要

大型语言模型(LLMs)在数学推理基准测试中取得了令人印象深刻的准确性,但当问题被修改为不同的名字或数字等简单变化时,它们的性能会下降。代码执行方法允许模型生成并运行Python代码,而不是用自然语言进行推理,已被提出作为解决方案,但其对推理鲁棒性(即在问题变体中保持准确性的能力)的影响尚未得到系统测试。本研究在GSM-Symbolic数据集的1000个问题上评估了三种方法:使用链式思维(CoT)提示的纯推理、使用程序辅助语言模型(PAL)的单次代码执行,以及使用逐步编码(SBSC)的迭代代码执行。所有三种方法均在配对的原始问题和修改问题上使用Claude Haiku 4.5运行。CoT是最鲁棒的方法,在扰动下准确率下降1.3个百分点,1.8%的问题被破坏。PAL的鲁棒性最差,准确率下降1.7个百分点,3.1%的问题被破坏,SBSC介于两者之间。尽管这些差异在统计上不显著($p = .096$),但方向趋势在所有指标上一致,表明无论是单次还是迭代的代码执行,都没有提高小学水平问题变体的推理鲁棒性。

英文摘要

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust, with an accuracy drop of 1.7 percentage points and 3.1% of problems breaking, and SBSC fell in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
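The two robustness measures quoted in this abstract can be computed from paired per-problem results. The sketch below assumes that "breaking" means a problem answered correctly in its original form and incorrectly in its modified form, counted over all problems; the abstract does not spell out the exact definition, and the function name is hypothetical.

```python
def robustness_metrics(orig_correct, mod_correct):
    """Accuracy drop (percentage points) and share of problems that break.

    orig_correct / mod_correct: equal-length lists of booleans, one entry per
    paired problem (original vs. modified variant).
    Assumption: a problem "breaks" if it is correct originally but wrong after
    modification.
    """
    if len(orig_correct) != len(mod_correct) or not orig_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(orig_correct)
    acc_orig = sum(orig_correct) / n
    acc_mod = sum(mod_correct) / n
    broke = sum(1 for o, m in zip(orig_correct, mod_correct) if o and not m)
    return {
        "accuracy_drop_pp": 100.0 * (acc_orig - acc_mod),
        "broke_pct": 100.0 * broke / n,
    }
```

The two numbers can diverge: a method may break some problems while newly solving others, which leaves the accuracy drop small even though the break rate is not.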

2605.26413 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Confounder Detection via Treatment Intent: A New Observational Study Design

通过治疗意图进行混杂检测:一种新的观察性研究设计

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

发表机构 * UCLA(加州大学洛杉矶分校) ETH Zurich(苏黎世联邦理工学院) Columbia University(哥伦比亚大学)

AI总结 提出一种通过询问治疗决策者比较配对单元来揭示未观测混杂因素的新研究设计,并在ICU数据中验证其有效性。

详情
AI中文摘要

理解干预的效果是科学进步的核心,随机对照试验(RCT)在许多应用领域被视为因果推断的金标准。然而,RCT成本高、耗时长,且常受伦理或实际限制,这促使我们需要能够从观察性数据中得出结论的因果方法。尽管此类数据收集规模日益扩大,但将其用于因果推断常因并非所有影响治疗分配和结果的变量都被观测到而受阻,这一问题称为未观测混杂。在本文中,我们介绍了一种称为通过治疗意图进行混杂检测的新研究设计。其思路是询问做出治疗决策的人类专家,并要求他们比较由原则性匹配策略提出的单元对,目的是引出解释治疗决策为何不同的未观测变量。我们为此类程序提供了理论基础,确定了此类研究设计可能引出未观测混杂因素的条件。基于这些新建立的基础,我们研究了重症监护病房(ICU)中干预的治疗效果。首先,我们展示了强烈表明ICU中收集的电子健康记录(EHR)存在未观测混杂的经验证据。通过使用临床文本笔记作为医生知识的代理并利用自然语言处理,我们在已知真实情况的半合成环境中为我们的方法提供了概念验证。

英文摘要

Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed: an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on these newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians' knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.

2605.26409 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

通过模型的行为几何进行越狱易感性预测与缓解

Hayden Helm, Xiaodong Liu, Weiwei Yang

发表机构 * Microsoft Research(微软研究院)

AI总结 本文通过形式化模型群体的行为几何,利用已评估和防御的模型,实现高效的易感性预测和防御迁移,在79个模型和100个系统配置上,易感性检测AUPRC达0.94且探针减少约98%,防御迁移性能优于同供应商分配。

详情
AI中文摘要

评估和缓解生成系统对越狱攻击的易感性对其安全部署至关重要。由于可部署系统的数量众多,对每种配置进行全面评估和优化是不切实际的。本文形式化了模型群体的行为几何,通过利用先前评估和防御过的模型,支持群体内高效的易感性预测和有效的防御迁移。我们将该框架应用于涵盖24个提供商的79个模型以及单个基础模型的100个系统配置。使用行为几何的简单方法在易感性检测中达到了0.94的AUPRC,与全面评估相比,探针数量减少了约98%。使用行为几何选择从哪个模型迁移优化后的防御,在无额外探针成本的情况下优于同供应商分配(+2%,p = 0.03),且一组三个模型足以覆盖整个群体。结果对超参数选择和评判者具有鲁棒性。

英文摘要

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of $0.94$ for susceptibility detection with $\approx98\%$ fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ($+2\%$, $p = 0.03$) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.

2605.26379 2026-05-27 stat.ML cs.LG 版本更新

When Does LeJEPA Learn a World Model?

LeJEPA 何时学习世界模型?

David Klindt, Yann LeCun, Randall Balestriero

发表机构 * Cold Spring Harbor Laboratory(冷泉港实验室) New York University(纽约大学) Brown University(布朗大学)

AI总结 本文证明 LeJEPA(对齐加高斯正则化)在潜变量服从平稳加性噪声演化的世界中能够线性恢复潜变量(线性可识别性),并指出高斯分布是唯一保证该性质的潜分布,同时验证了近似可识别性和最优规划能力。

详情
AI中文摘要

一种混淆世界真实自由度的表示无法支持可靠的规划或组合泛化。我们证明,在潜变量服从平稳加性噪声演化的一类广泛世界中,LeJEPA(对齐加高斯正则化)能从非线性观测中线性恢复世界的潜变量,这一性质称为线性可识别性。我们的主要结果是:在所有此类世界中,高斯分布是唯一保证该性质的潜分布。正向方向依赖于谱分解,其中每个非线性度都受到对齐的严格惩罚,使得线性映射成为最优;反向方向排除了所有非高斯替代。我们进一步证明了近似可识别性结果,其中保证会优雅地退化,并表明线性正交可识别性能够实现最优潜空间规划。我们通过从二维示例到1024维潜变量的实验验证了理论,包括分布消融和基于像素的机器人控制。我们的理论将经验上成功的配方转化为数学保证,为构建能够可证明恢复世界结构的世界模型提供了基础。

英文摘要

A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations, a property known as linear identifiability, in a broad class of worlds where latents evolve under stationary, additive-noise transitions. Our main result is that among all such worlds, the Gaussian is the unique latent distribution for which this guarantee holds. The forward direction rests on a spectral decomposition in which each degree of nonlinearity is strictly penalized by alignment, making the linear map the optimum; the converse rules out every non-Gaussian alternative. We further prove an approximate identifiability result where the guarantee degrades gracefully, and show that linear, orthogonal identifiability enables optimal latent-space planning. We validate the theory with experiments ranging from 2D examples to 1024-dimensional latents, including distributional ablations and pixel-based robotic control. Our theory turns an empirically successful recipe into a mathematical guarantee, providing the foundation for building World Models that provably recover the structure of the world.

2605.26376 2026-05-27 cs.CV cs.AI cs.LG 版本更新

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

BioFact-MoE:基于生物学因子分解的混合专家模型用于肝细胞癌的视觉-语言预后建模

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

发表机构 * Department of Radiology & Biomedical Imaging, Department of Biomedical Engineering, Department of Electrical Engineering, Department of Statistics & Data Science, Yale University, New Haven, CT, 06510, USA

AI总结 提出BioFact-MoE框架,通过生物学监督的混合专家模型显式分解肝脏和肿瘤因子,在肝细胞癌预后预测中提升准确性和生物学可解释性。

Comments Early accepted at MICCAI 2026

详情
AI中文摘要

肝细胞癌(HCC)具有生物学异质性,由肝功能储备和肿瘤相关肿瘤学因素之间的相互作用塑造;因此,相似的生存结果可能反映根本不同的潜在生物学过程。HCC的预后建模依赖于来自多参数MRI和常规临床实践放射学报告的丰富多模态信息。现有的预后视觉-语言模型(VLM)学习单一的纠缠潜在表示,混合了肝脏和肿瘤相关因素,限制了准确性和生物学可解释性。我们提出BioFact-MoE,一个生物学因子分解的混合专家(MoE)框架,通过残差MoE生存架构中的生物学监督专家显式分解肝脏和肿瘤因素。在N=588名患者的HCC队列(在4,582个3D MRI图像-报告对上预训练)中,BioFact-MoE在所有时间范围内持续优于所有基线的生存预测,实现了12、18和24个月的AUC分别为75.33%、75.85%和73.96%。除了标量风险预测,门控专家权重实现了表型感知的风险分层。通路感知的门控揭示了临床上有意义的治疗相关生存异质性。在保留验证中,肝脏和肿瘤嵌入分别与肝功能标志物和肿瘤负荷标志物显示出选择性关联(p<0.05),无需监督。代码可在https://github.com/jy-639/BioFact-MoE获取。

英文摘要

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.

2605.26373 2026-05-27 cs.LG math.OC stat.ML 版本更新

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

通过算法等价性在隐凸损失上的在线学习:最优遗憾、几何障碍与Bandit反馈

Anas Barakat, Andreas Kontogiannis, Vasilis Pollatos, Ioannis Panageas, Antonios Varvitsiotis

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) National Technical University of Athens(雅典国立技术大学) National and Kapodistrian University of Athens(雅典国立卡波迪斯特里安大学) University of California, Irvine(加州大学尔湾分校) Archimedes, Athena Research Center, Greece(希腊Athena研究中心Archimedes) National University of Singapore, Centre for Quantum Technologies(新加坡国立大学量子技术中心)

AI总结 本文通过更精确的离散时间算法等价性论证,证明在线梯度下降在隐凸损失上达到最优的$\mathcal{O}(\sqrt{T})$遗憾,并澄清了所需几何条件,同时扩展到单点Bandit反馈得到$\mathcal{O}(T^{3/4})$期望遗憾。

Comments 43 pages

详情
AI中文摘要

我们研究具有隐凸损失的对抗性在线学习,即经过非线性重参数化后变为凸的非凸损失。Ghai, Lu和Hazan (2022)证明,在几何和光滑性假设下,此类非凸损失上的在线梯度下降(OGD)近似模拟了具有适当正则化器的底层凸损失上的在线镜像下降(OMD),得到$\mathcal{O}(T^{2/3})$遗憾。他们留下了是否可以在隐凸设置中恢复在线凸优化的最优$\Theta(\sqrt{T})$遗憾的开放问题。我们肯定地回答了这个问题。更具体地,通过更尖锐的离散时间算法等价性论证,我们证明在相同假设下OGD达到$\mathcal{O}(\sqrt{T})$遗憾,匹配对抗性在线凸优化的最坏情况最优速率。我们还解决了Ghai, Lu和Hazan (2022)的另一个开放问题,澄清了这种算法等价性所需的几何条件。我们将对角雅可比充分条件替换为必要且充分的Hessian相容性条件,从而扩展了可允许重参数化的类别。我们用下界补充了紧的遗憾界,表明Hessian相容性假设对OGD是必要的;当该条件不成立时,我们构造一个光滑的重参数化和一个对抗性的隐凸损失序列,使得OGD遭受$\Omega(T)$遗憾。最后,我们将分析扩展到单点Bandit反馈,并证明使用球形平滑的Bandit OGD的$\mathcal{O}(T^{3/4})$期望遗憾界,匹配其在凸损失上的经典速率。

英文摘要

We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding $\mathcal{O}(T^{2/3})$ regret. They left open whether the optimal $\Theta(\sqrt{T})$ regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves $\mathcal{O}(\sqrt{T})$ regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers $\Omega(T)$ regret. Finally, we extend our analysis to one-point bandit feedback and prove an $\mathcal{O}(T^{3/4})$ expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.
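For readers unfamiliar with one-point bandit feedback, the sketch below shows the classical spherical-smoothing gradient estimator combined with projected online gradient descent on a Euclidean ball. It illustrates the textbook bandit OGD scheme on ordinary losses only; it does not reproduce the paper's hidden-convex reparameterization or its analysis, and the function names, the ball-shaped feasible set, and the step sizes are assumptions.

```python
import numpy as np

def one_point_gradient(f, x, delta, rng):
    """One-point estimate (d / delta) * f(x + delta * u) * u, u uniform on the sphere."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / delta) * f(x + delta * u) * u

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def bandit_ogd(losses, x0, eta, delta, radius, seed=0):
    """Projected OGD driven by one function evaluation per round.

    Iterates are kept in the shrunk ball of radius (radius - delta) so that
    every queried point x + delta * u stays inside the original ball.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for f in losses:
        g = one_point_gradient(f, x, delta, rng)
        x = project_ball(x - eta * g, radius - delta)
    return x
```

The estimator is unbiased for the gradient of a smoothed version of the loss, but its magnitude scales with $d/\delta$; trading that variance against the smoothing bias is what leads to the classical $\mathcal{O}(T^{3/4})$ rate mentioned in the abstract.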

2605.26355 2026-05-27 cs.LG cs.CL eess.SP 版本更新

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

能量门控注意力与小波位置编码:Transformer注意力的互补归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher(独立研究者) Athens, Greece(希腊雅典)

AI总结 针对标准注意力缺乏能量显著性和尺度选择性局部性两种互补归纳偏置的问题,提出能量门控注意力(EGA)和莫雷特位置编码(MoPE),两者组合在字符级语言建模上实现超加性性能提升。

Comments 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: https://github.com/AthanasiosZeris/energy-gated-attention

详情
AI中文摘要

标准Transformer注意力计算成对标记相似性,但将所有标记视为同等显著、所有位置视为同等局部,忽略了输入的信息结构。我们识别出标准注意力缺乏两种互补归纳偏置:能量显著性(哪些标记集中了信息能量,通过端到端学习而不需要显式频率分解)和尺度选择性局部性(在每个频率上位置影响的范围,通过Morlet小波编码实现)。我们通过两个简单组件解决这两个问题。能量门控注意力(EGA)通过键标记嵌入的学习能量估计(通过单个线性投影计算)来门控值聚合;它选择关注什么。莫雷特位置编码(MoPE)用学习的高斯窗口小波替换固定的正弦编码,使联合位置-频率定位适应语料库;它指定注意力在每个尺度上操作的位置。在TinyShakespeare上,单独EGA相比标准注意力实现+0.092验证损失改进(相比Phase 1-3基线+0.103);单独MoPE为-0.032(作为独立编码低于基线);但它们的组合实现+0.119——超过各部分之和。这种超加性在两个独立训练运行中观察到,是核心实证发现:显著性和局部性是互补归纳偏置,各自填补对方无法单独填补的空白。消融实验证实,结构化谱先验(Morlet小波门控、尺度初始化头、固定正弦PE)始终不如其无约束学习对应物,而互补学习组件交互产生超加性。所有实验都在小规模(≤6M参数、字符级基准、单种子)进行;更大规模的多种子验证是未来工作最重要的方向。

英文摘要

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
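One possible reading of the EGA description above is sketched here in NumPy for a single attention head: a scalar gate per key token is computed from its embedding by one linear projection and multiplies that token's value vector before aggregation. The sigmoid squashing, the placement of the gate on the values, the causal mask, and all parameter names are assumptions; the authors' repository should be consulted for the actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def energy_gated_attention(x, w_q, w_k, w_v, w_e, b_e=0.0, causal=True):
    """Single-head attention whose values are scaled by a per-token energy gate.

    x: (T, d) token embeddings; w_q, w_k, w_v: (d, d_head) projections;
    w_e: (d,) energy projection (hypothetical name), b_e: scalar bias.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Block attention to future positions.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    # One scalar gate in (0, 1) per key token, from a single linear projection.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_e + b_e)))
    return attn @ (gate[:, None] * v)
```

With the gate saturated at 1 the function reduces to standard scaled dot-product attention, so the gate can only reweight which tokens contribute to the aggregated value.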

2605.26353 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Personalized Generative Models for Contextual Debiasing

用于上下文去偏的个性化生成模型

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

发表机构 * Department of Computer Science, Princeton University(普林斯顿大学计算机科学系) LIX, CNRS, École Polytechnique(巴黎综合理工学院LIX实验室,法国国家科学研究中心)

AI总结 提出DecoupleGen方法,利用个性化文本到图像扩散模型生成罕见上下文图像,作为训练增强以缓解视觉识别中的上下文偏差。

Comments CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

详情
AI中文摘要

不同的视觉模式在世界中出现的频率不同:例如,沙滩球出现在沙滩上比出现在道路上更常见。这些统计数据反映在视觉数据集中,因此训练好的模型更容易在常见场景中识别物体。然而,在道路上识别沙滩球可能比在沙滩上识别更重要。我们研究如何缓解这种差异。由于在现实世界中收集不常见的图像可能很困难,我们探索生成具有较少频繁上下文的图像是否可以作为有效的训练增强。一个关键挑战是引导生成保持在原始数据集分布附近,同时创建具有不常见上下文的多样化图像。我们引入了DecoupleGen方法,该方法个性化文本到图像扩散模型,以促进罕见上下文图像的连贯合成,同时保留原始视觉细节。生成的图像包含语义上有意义的内容,并在视觉上与原始数据集保持一致。我们进一步应用验证约束以确保增强数据的相关性。我们在复杂场景数据集上的物体分类和识别任务中评估了我们的方法。实验表明,我们的方法比先前的方法有一致的改进,并且我们的分析确定了这些改进背后的因素。

英文摘要

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

2605.26350 2026-05-27 cs.LG cs.AI 版本更新

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

当正确示例有害时:重新思考示例在上下文学习中的作用

Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou

发表机构 * Texas A&M University(德州农工大学)

AI总结 本文通过引入任务保持扰动,揭示了正确示例不一定有益甚至可能降低上下文学习准确性的反直觉现象,并提出了上下文证据转移的概念来解释正确性与效用之间的差距。

详情
AI中文摘要

上下文学习(ICL)通常被直觉所驱动,即示例之所以有帮助是因为它们提供了正确的输入-输出对。然而,我们揭示了一个反直觉的现象:正确性并不能保证示例的效用,一些正确的示例甚至可能降低ICL的准确性。为了研究这种正确性-效用差距,我们引入了任务保持扰动,其中仅改变示例输入,而该示例仍然是同一任务的正确实例。具体来说,每个扰动后的示例被赋予由任务映射诱导的目标。该框架涵盖了标签更新扰动(其中任务相关语义发生变化且目标被重新计算)和更严格的目标保持扰动(其中原始目标仍然有效)。我们将由此产生的失败模式形式化为上下文证据转移:任务保持扰动可以改变模型用于上下文推理的有效证据混合,从而将示例正确性与示例效用分离。在情感分类、逻辑推理和数学应用题中,我们发现任务保持扰动的示例会显著降低ICL性能,尤其是对于较小的模型、较难的任务和较高的扰动比例。我们的结果表明,鲁棒的ICL不仅需要评估示例是否正确,还需要评估它们如何影响上下文推理。代码可在 https://github.com/Chenghao-Qiu/Task-Preserving-ICL 获取。

英文摘要

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.

2605.26343 2026-05-27 cs.LG 版本更新

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL:强化学习智能体进行电路发现以实现机械可解释性

Barsat Khadka

发表机构 * The University of Southern Mississippi(南密西西比大学)

AI总结 提出将电路发现转化为强化学习问题,使用PPO策略在GPT-2 small的144个注意力头上进行零消融和对比奖励,成功在训练任务和未见任务上恢复标准电路,验证了强化学习在机械可解释性中的有效性。

详情
AI中文摘要

机械可解释性已经识别出在Transformer语言模型中实现特定行为的小型注意力头集合,但恢复这些电路通常需要为每个新任务定制分析流程。我们将电路发现重新定义为强化学习问题。一个智能体在GPT-2 small的144个注意力头上操作,作为离散动作空间;每个动作触发零消融和对比奖励,该奖励从消融对目标任务的损害中减去其对通用下一个词预测的损害。一个在向量化多任务环境中训练于两个任务(归纳和IOI)的单一PPO策略,在两个训练任务以及一个保留的第三个任务(文档字符串补全)上均达到每轮最优。其偏好的头与现有文献中规范的头一致,恰好符合这些论文在单头消融下识别为因果非冗余的轴;它们识别为冗余的类别被智能体正确降级。在保留任务上,最佳五次规划在评估时未提供任务信号的情况下恢复了最优上限的96%。这些结果表明,基于因果干预的强化学习是识别机械电路单头瓶颈的可行且可迁移的方法,与现有的路径修补方法互补。

英文摘要

Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem. An agent operates over the 144 attention heads of GPT-2 small as a discrete action space; each action triggers a zero-ablation and a contrastive reward that subtracts the ablation's damage to general next-token prediction from its damage to the target task. A single PPO policy, trained on two tasks (induction and IOI) in a vectorised multi-task environment, attains the per-episode oracle on both training tasks and on a held-out third task (docstring completion). Its preferred heads coincide with the canonical heads of established literature on precisely the axes those papers identify as causally non-redundant under single-head ablation; the categories they identify as redundant are correctly de-prioritised by the agent. On the held-out task, best-of-five planning recovers 96% of the oracle ceiling with no task signal supplied at evaluation. These results indicate that reinforcement learning over causal interventions is a viable, transferable substrate for identifying the single-head bottlenecks of mechanistic circuits, complementary to existing path-patching approaches.
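The contrastive reward described above can be written out as a small worked formula. The sketch assumes that "damage" is measured as the increase in loss caused by zero-ablating a head, and that the 144 actions index the 12 layers and 12 heads per layer of GPT-2 small in row-major order; both conventions and the function names are illustrative assumptions, not taken from the paper.

```python
def head_index(layer, head, heads_per_layer=12):
    """Map a (layer, head) pair to a flat action id in [0, 143] for GPT-2 small."""
    return layer * heads_per_layer + head

def contrastive_reward(task_clean, task_ablated, general_clean, general_ablated):
    """Reward = damage to the target task minus damage to general prediction.

    Assumption: damage is the loss increase after zero-ablating the chosen head.
    """
    task_damage = task_ablated - task_clean
    general_damage = general_ablated - general_clean
    return task_damage - general_damage
```

Under this reading, a head whose removal hurts the target task much more than it hurts general next-token prediction earns a high reward, while a head that is important for everything earns a reward near zero.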

2605.26341 2026-05-27 cs.LG stat.ML 版本更新

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

物理信息机器学习的泛化性的PAC-Bayesian视角

Thien V. Nguyen, Amaury Habrard, Benjamin Guedj

发表机构 * Université Jean Monnet Saint-Étienne, CNRS, Institut d’Optique Graduate School, Laboratoire Hubert Curien UMR 5516(圣埃蒂安让·莫内大学、法国国家科学研究中心、光学研究所研究生院、Hubert Curien实验室 UMR 5516) Inria and University College London, France and United Kingdom(法国Inria与英国伦敦大学学院)

AI总结 本文通过PAC-Bayesian框架,针对无界损失下的回归问题,推导了物理信息机器学习的高概率泛化界,并提出了自界感知学习算法,在标准PDE基准上验证了界的非平凡性和更紧性。

详情
AI中文摘要

物理信息机器学习(PIML)将机械知识(通常以偏微分方程(PDE)的形式)整合到数据驱动模型中。尽管经验性能强劲,但其统计泛化性质仍未被充分理解,尤其是在具有无界损失的回归设置中。现有分析依赖于近似或稳定性论证,未能完全捕捉物理结构如何影响有限数据的泛化。在这项工作中,我们为PIML开发了一个PAC-Bayesian框架,在存在无界损失的情况下提供高概率泛化保证。我们采用多任务视角,联合处理数据保真度、PDE残差、初始条件和边界条件,避免了标准联合界方法导致的松散性。我们的分析利用物理信息目标的结构,推导出新的界,其中复杂度与损失的输入梯度范数成比例,揭示了物理正则性与泛化之间的直接联系。我们在Sobolev和Poincaré型假设下实例化该框架,得到两类界,在不同机制下权衡统计复杂性和光滑性。基于这些结果,我们提出了一种自界感知学习算法,直接优化推导界的可处理代理,以及一种在实际设置中估计相关常数的实用程序。在标准PDE基准上的实证评估表明,我们的界是非平凡的,显著比联合界基线更紧,并且可以在训练过程中有效最小化。总体而言,我们的结果为物理信息模型的泛化提供了原则性的统计基础。

英文摘要

Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDEs), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.

2605.26339 2026-05-27 cs.LG cs.CL 版本更新

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W: 通过哈达玛旋转和激活感知缩放实现LLM权重的联合2D码本量化

Preetam Sharma, Kacper Dobek

发表机构 * Independent Research(独立研究) Institute of Computing Science(计算科学研究所) Poznan University of Technology(波兹南技术大学)

AI总结 提出QAM-W方法,通过L2归一化、块哈达玛旋转和2D坐标配对量化,结合激活感知缩放,在约5.5 bpw下使困惑度接近BF16,优于极坐标编码,并在5-6 bpw范围内保持质量。

详情
AI中文摘要

标量后训练量化器丢弃了权重行内的成对坐标结构。我们引入QAM-W(权重正交幅度调制),一种恢复该结构的编解码器:每行经过L2归一化、块哈达玛旋转、配对为2D坐标,并针对在单位圆高斯上训练的单个Lloyd-Max码本进行量化,同时采用激活感知的每通道缩放。在跨越四个家族(1.1B--13B参数)的五种LLM和八种量化配置的跨模型研究中,激活感知变体在约5.5 bpw下,每个模型的WikiText-2困惑度保持在BF16的±0.4%以内,以少32%的权重比特匹配SmoothQuant W8A8质量包络。联合2D编码在相同比特率下,在ΔPPL上优于极坐标(幅度×相位)编码2--15个百分点,且与BF16的配对KL散度在37个(方法,模型)行上以Spearman ρ=0.99跟踪ΔPPL%,与从编解码器失真到KL散度的单调复合界一致。3.5 bpw变体在量化容忍架构上具有竞争力。在严格的4 bpw下,旋转码本前沿方法QTIP优于QAM-W;贡献在于质量保持的5--6 bpw波段。

英文摘要

Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B--13B parameters) and eight quantized configurations, the activation-aware variant at $\approx 5.5$ bpw stays within $\pm 0.4\%$ of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at $32\%$ fewer weight bits. Joint 2D coding outperforms polar (amplitude $\times$ phase) coding by 2--15 pp $\Delta$PPL at equal bitrate, and paired KL against BF16 tracks $\Delta$PPL% at Spearman $\rho = 0.99$ across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5 bpw variant is competitive on quantization-tolerant architectures. At strict 4 bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5--6 bpw band.
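The encoding steps listed in this abstract (L2 normalization, block-Hadamard rotation, pairing into 2D coordinates, nearest-codeword lookup) can be sketched as a toy round trip. Everything beyond those named steps is an assumption: the block size of 8, the rescaling by the square root of the row length so that coordinates have roughly unit variance, the tiny 16-entry codebook fitted by a few k-means (Lloyd) iterations on Gaussian samples, and all function names. Activation-aware scaling is omitted.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def nearest(points, codes):
    """Index of the closest codeword for each 2D point."""
    d2 = ((points[:, None, :] - codes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def lloyd_codebook(n_codes=16, n_samples=4000, iters=10, seed=0):
    """Toy 2D codebook: Lloyd iterations on standard Gaussian samples."""
    rng = np.random.default_rng(seed)
    pts = rng.standard_normal((n_samples, 2))
    codes = pts[:n_codes].copy()
    for _ in range(iters):
        idx = nearest(pts, codes)
        for c in range(n_codes):
            members = pts[idx == c]
            if len(members):
                codes[c] = members.mean(axis=0)
    return codes

def encode_row(row, codes, block=8):
    """Normalize, rotate block-wise, pair coordinates, and look up codewords."""
    norm = np.linalg.norm(row)
    rotated = (row / norm).reshape(-1, block) @ hadamard(block).T
    pairs = rotated.reshape(-1, 2) * np.sqrt(row.size)  # ~unit variance per coordinate
    return norm, nearest(pairs, codes)

def decode_row(norm, idx, codes, block=8):
    """Invert encode_row up to quantization error."""
    rotated = (codes[idx] / np.sqrt(idx.size * 2)).reshape(-1, block)
    return norm * (rotated @ hadamard(block)).reshape(-1)
```

A 16-entry codebook spends 4 bits per coordinate pair, i.e. about 2 bits per weight before the stored norm, which is far coarser than the roughly 5.5 bpw operating point reported in the abstract; the sketch is meant only to show how the steps compose.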

2605.26327 2026-05-27 cs.LG 版本更新

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

重新参数化Shampoo和SOAP用于子空间基更新和BFloat16存储

Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin

发表机构 * Mila & Université de Montréal(Mila与蒙特利尔大学) Microsoft(微软公司) Concordia University & Mila(康科德大学与Mila) University of Central Florida(中央佛罗里达大学)

AI总结 本文通过重新参数化预条件器,在子空间中仅更新部分基向量,结合QR分解支持BFloat16存储,降低了Shampoo类方法的计算和内存开销,并缓解了低精度存储带来的性能下降。

Comments Preprint, work in progress

详情
AI中文摘要

基于Shampoo的方法,如KL-Shampoo和SOAP,在训练神经网络中表现出强大的性能,并依赖于QR分解。由于现有的QR实现需要单精度(FP32)算术且计算成本高,当预条件矩阵较大时,这些方法变得时间和内存密集。此外,使用BFloat16(BFP16)存储以减少内存使用会降低基于Shampoo的方法的性能。我们提出了一种预条件器的重新参数化,支持BFP16存储,并通过将更新的基向量与未改变的基向量结合形成完整基。通过在子空间中通过QR分解仅更新部分基,我们的方法减少了计算开销,同时缓解了BFP16存储导致的性能下降。我们的方法广泛适用于使用QR分解的基于Shampoo的方法,包括KL-Shampoo、SOAP和KL-SOAP。特别是,它改善了SOAP和KL-SOAP在BFP16存储下的性能,使KL-SOAP能够匹配或超过KL-Shampoo。总体而言,我们的方法使基于Shampoo的方法更加内存和时间高效。

英文摘要

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, using BFloat16 (BFP16) storage to reduce memory usage can degrade the performance of Shampoo-based methods. We propose a reparametrization of the preconditioner that supports BFP16 storage and forms a complete basis by combining updated basis vectors with unchanged ones. By updating only part of the basis through QR decomposition in a subspace, our approach reduces computational overhead while mitigating the performance degradation caused by BFP16 storage. Our approach applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo, SOAP, and KL-SOAP. In particular, it improves the performance of SOAP and KL-SOAP under BFP16 storage, enabling KL-SOAP to match or exceed KL-Shampoo. Overall, our approach makes Shampoo-based methods more memory- and time-efficient.

2605.26324 2026-05-27 cs.LG cs.AI cs.NA math.NA 版本更新

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

半群一致性作为学习型物理模拟器的诊断工具

Lennon J. Shikhman

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出归一化半群误差作为评估学习型物理模拟器时间组合和长程推演一致性的诊断指标,在热传导和Burgers动力学实验中验证其与推演退化正相关。

Comments 10 pages, 3 figures, 3 tables. Accepted to the AI4Physics Workshop at the 43rd International Conference on Machine Learning

详情
AI中文摘要

学习型物理模拟器通常通过单步或短时预测误差来评估,但这些指标可能遗漏时间组合和长程推演中的失败。对于自主、状态完备的系统,精确解映射满足半群定律:直接演化 $s+t$ 应与先演化 $s$ 再演化 $t$ 一致。我们提出归一化半群误差作为事后、模型无关的诊断,比较这些直接和组合的学习预测。在带有时间条件ConvNet和FNO基线的一维热传导和Burgers动力学中,半群误差与推演退化正相关,轨迹级Spearman相关系数 $ρ= 0.635$,95%置信区间 $[0.621, 0.649]$。半群正则化效果不一,支持半群一致性主要作为评估诊断而非普遍有益的训练目标。

英文摘要

Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in temporal composition and long-horizon rollout. For autonomous, state-complete systems, exact solution maps satisfy a semigroup law: direct evolution over $s+t$ should agree with evolution over $s$ followed by $t$. We propose normalized semigroup error as a post hoc, model-agnostic diagnostic comparing these direct and composed learned predictions. On one-dimensional heat and Burgers dynamics with time-conditioned ConvNet and FNO baselines, semigroup error is positively associated with rollout degradation, with trajectory-level Spearman correlation $ρ= 0.635$ and $95\%$ CI $[0.621, 0.649]$. Semigroup regularization has mixed effects, supporting semigroup consistency primarily as an evaluation diagnostic rather than a universally beneficial training objective.
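The semigroup check described above can be sketched as follows. This is an illustration under assumptions rather than the paper's exact metric: the normalization (dividing by the norm of the direct prediction) is one plausible choice, and `surrogate_heat` is a hypothetical, deliberately imperfect stand-in for a time-conditioned learned model.

```python
import math

def exact_heat(u_hat, t):
    """Exact heat-equation flow on Fourier coefficients: mode k decays as exp(-k^2 t)."""
    return [c * math.exp(-(k ** 2) * t) for k, c in enumerate(u_hat)]

def surrogate_heat(u_hat, t):
    """Hypothetical imperfect surrogate whose decay rate drifts with t, so it does not compose."""
    return [c * math.exp(-(k ** 2) * t * (1.0 + 0.1 * t)) for k, c in enumerate(u_hat)]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def semigroup_error(step, u, s, t, eps=1e-12):
    """Relative gap between evolving directly over s+t and evolving over s, then t."""
    direct = step(u, s + t)
    composed = step(step(u, s), t)
    diff = [a - b for a, b in zip(direct, composed)]
    return l2(diff) / (l2(direct) + eps)

u0 = [1.0, 0.5, 0.25, 0.125]
err_exact = semigroup_error(exact_heat, u0, 0.3, 0.2)          # near zero up to rounding
err_surrogate = semigroup_error(surrogate_heat, u0, 0.3, 0.2)  # clearly nonzero
```

Because the check only calls the model as a black box `step(u, t)`, it needs no ground-truth trajectories, which is what makes it usable as a post hoc diagnostic.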

2605.26320 2026-05-27 cs.LG cs.CL 版本更新

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

MULTISEISMO: 面向跨模态地震理解的多模态地震数据集与模型

Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville, Scott Vasquez, Sameera Horawalavithana, Karl Pazdernik

发表机构 * Pacific Northwest National Laboratory(太平洋西北国家实验室) Oak Ridge National Laboratory(橡树岭国家实验室) Sandia National Laboratory(桑迪亚国家实验室) North Carolina State University(北卡罗来纳州立大学)

AI总结 针对地震学中多模态数据整合的缺失,构建了包含超过1.6万次地震事件的结构化多模态数据集MultiSeismo,并开发了专用多模态模型SeisModal,在跨模态地震推理任务上取得了优越性能。

详情
AI中文摘要

通用多模态模型(GMMs)在专业科学领域的应用仍然有限,原因是缺乏整合文本和图像之外多种数据模态的综合性领域特定数据集。在地震学中,理解地震现象需要综合时间序列波形数据、地理图像和上下文元数据,而现有地震数据集缺乏这种多模态整合。我们提出了MultiSeismo,一个大规模结构化多模态地震数据集,包含跨越13年(2010年至2023年)来自不同地理区域的超过1.6万次地震事件。每个事件数据整合了全球台网波形记录、烈度图、人口暴露可视化以及标准JSON格式的全面文本描述。此外,我们开发了MISCE,一个基于原始数据的多模态指令集,用于对GMMs进行监督训练和评估,涵盖从基本信息检索到复杂跨模态分析的地震推理任务。我们利用MISCE微调了一个现有的多模态模型(Unified IO 2),并增强了专门的时间序列编码器,从而得到了SeisModal——首个用于综合地震分析的领域特定多模态模型。在MultiSeismo上对最先进的多模态模型进行评估,揭示了显著挑战,特别是通用模型在处理时间序列数据方面的困难,同时证明了SeisModal在地震多模态推理任务上的优越性能。这些结果证明,MultiSeismo为未来地震学多模态研究提供了严格的基准,并验证了我们领域特定架构调整的成功。

英文摘要

The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of time-series waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large-scale structured multimodal seismic dataset comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event record integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set built on top of the raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross-modal analysis. We leverage MISCE to fine-tune an existing multimodal model (Unified IO 2) enhanced with a specialized time-series encoder, which yields SeisModal, the first domain-specific multimodal model for comprehensive seismic analysis. Evaluation of state-of-the-art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general-purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain-specific architectural adaptations.

2605.26315 2026-05-27 cs.LG cs.AI 版本更新

Curriculum Learning for Safety Alignment

用于安全对齐的课程学习

Sandeep Kumar, Virginia Smith, Chhavi Yadav

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Simons Institute, UC Berkeley(Simons研究所,伯克利大学)

AI总结 提出基于课程学习的Staged-Competence框架,通过难度分级的偏好数据和渐进式参考模型更新,提升DPO安全对齐的鲁棒性,在三个模型族上平均降低16%的OOD有害响应率和20%的越狱攻击成功率。

Comments Accepted at the ICML 2026 GlobalSouthML Workshop

详情
AI中文摘要

直接偏好优化(DPO)广泛用于大型语言模型的安全对齐。然而,先前的工作表明它脆弱且表现出较差的分布外(OOD)泛化能力。在本文中,我们研究课程学习是否能提高基于DPO的安全对齐的鲁棒性。我们提出Staged-Competence,一个基于课程的框架,它按难度组织偏好数据,采用基于能力的采样,并在训练过程中逐步更新参考模型。在三个模型族上平均,Staged-Competence将OOD有害响应率降低16%,越狱攻击成功率降低20%,同时保持接近零的过度拒绝,保留通用能力。我们进一步表明,Staged-Competence(1)仅使用75%的训练数据即可达到基线安全性,(2)在安全与不安全响应之间产生更好的分离。Staged-Competence与策略优化损失无关,并可扩展到其他DPO变体和对齐领域。我们的代码和数据可在https://github.com/Sandeep5500/curriculum-learning-for-safety获取。

英文摘要

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.

2605.26289 2026-05-27 cs.LG 版本更新

Stateful Inference for Low-Latency Multi-Agent Tool Calling

面向低延迟多智能体工具调用的有状态推理

Victor Norgren

发表机构 * LayerScale, Inc.(LayerScale公司)

AI总结 提出一种有状态推理架构,通过持久化KV缓存和增量处理,将多智能体工具调用的每轮成本从O(n_t)降至O(Δ_t),在6轮和35轮工作流中分别实现2.1倍和4.2倍的加速。

详情
AI中文摘要

多智能体工具调用正成为基于LLM系统的主要交互模式,但现有推理框架将每次工具调用视为独立请求,从头重新处理整个对话,尽管85-95%的提示与上一轮相同。我们提出一种有状态推理架构,将传统服务的每轮O(n_t)成本转换为仅增量O(Δ_t)成本:持久KV缓存跨轮次存在,仅通过摄入新令牌前进,而基数前缀缓存将其扩展到交错的多智能体流量,提示查找推测解码器加速结构化输出。在针对新颖、完全生成的工作负载的测试中,与vLLM和SGLang相比,参考实现在6轮智能体工作流中每轮快2.1倍,在35轮工作流的中位数轮次中快4.2倍,端到端挂钟时间减半。优势来自有状态重用和推测,而非缓存。

英文摘要

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(Δ_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
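The difference between the $O(n_t)$ and $O(Δ_t)$ per-turn costs can be made concrete by counting prompt tokens processed. The sketch below is a back-of-the-envelope model only: the workload in `deltas` is hypothetical, generated tokens are ignored, and token counts are not wall-clock time, so it does not reproduce the reported speedups.

```python
def prefill_tokens(deltas):
    """Return (stateless_total, stateful_total) prompt tokens processed.

    deltas[t] is the number of new tokens appended to the conversation at turn t.
    """
    context = 0
    stateless = 0
    stateful = 0
    for delta in deltas:
        context += delta
        stateless += context   # O(n_t): the whole conversation is re-processed
        stateful += delta      # O(delta_t): only the new tokens are ingested
    return stateless, stateful

# Hypothetical 6-turn workload: a 2000-token first prompt, then 200 new tokens per turn.
deltas = [2000] + [200] * 5
stateless, stateful = prefill_tokens(deltas)
unchanged_fraction_last_turn = (sum(deltas) - deltas[-1]) / sum(deltas)
```

In this toy workload about 93% of the final prompt is unchanged from the previous turn, which falls inside the 85-95% range quoted in the abstract.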

2605.26288 2026-05-27 stat.ML cs.LG stat.ME 版本更新

Beyond Differences: Doubly Robust Meta-Learners for Ratio-Based Treatment Effects

超越差异:基于比率的治疗效应的双重稳健元学习器

Michael Fuchs, Dominik Kreiss

发表机构 * Actuarial Department(精算部)

AI总结 针对比率型条件平均处理效应(CATE)估计,提出Q-Learner将比率分解为两个优势比的乘积,并推导双重稳健增强版本,在低转化率场景和混杂观测数据中表现优异。

Comments 13+5 pages, 5 figures, 6 tables. Code: https://github.com/michaelfuchs90/ratiobasedcate

详情
AI中文摘要

当治疗效应自然表达为比率时——如在医学、定价和营销中——基于比率的CATE $τ(x) = E[Y|W=1,X=x] / E[Y|W=0,X=x]$ 是合适的估计目标。然而,现有估计器要么施加对数线性参数结构,要么应用通用回归而不对该泛函提供稳健性保证。我们引入了Q-Learner,它将$τ(x)$分解为两个优势比的乘积,将二元结果的比率CATE估计简化为两个倾向性分类任务。我们进一步推导了S/T型和Q型比率学习器的双重稳健增强,并刻画了它们不同的稳健性性质。在七个RCT数据集的基准测试中,Q-Learner在低转化率场景下是最持续有竞争力的方法,其仅基于倾向性的构造规避了伤害基于结果估计器的不平衡回归。在四个观测数据集上,其中倾向性必须估计且混杂无法排除,本文引入的DR学习器明确胜出,使其成为实践者在混杂观测数据中的自然默认选择。

英文摘要

When treatment effects are naturally expressed as ratios -- as in medicine, pricing, and marketing -- the ratio-based CATE $τ(x) = E[Y|W=1,X=x] / E[Y|W=0,X=x]$ is the appropriate estimand. Yet existing estimators either impose a log-linear parametric structure or apply generic regression without robustness guarantees for this functional. We introduce the Q-Learner, which decomposes $τ(x)$ into a product of two odds ratios, reducing ratio-CATE estimation for binary outcomes to two propensity classification tasks. We further derive doubly robust augmentations for both S/T- and Q-style ratio learners and characterize their distinct robustness properties. In benchmarks on seven RCT datasets, the Q-Learner is the most consistently competitive method in low-conversion regimes, where its propensity-only construction sidesteps the imbalanced regression that hurts outcome-based estimators. On four observational datasets, where propensity must be estimated and confounding cannot be ruled out, the DR learners introduced here decisively come out on top, making them practitioners' natural default for confounded observational data.
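One Bayes-rule identity consistent with the "two propensity classification tasks" description is written out below for a binary outcome $Y$ and binary treatment $W$, assuming all conditional probabilities are strictly positive. It is a reconstruction from the abstract, and the paper's exact parametrization may differ (the factors here are odds rather than odds ratios in the strict sense).

```latex
\tau(x)
  = \frac{P(Y=1 \mid W=1, X=x)}{P(Y=1 \mid W=0, X=x)}
  = \underbrace{\frac{P(W=1 \mid Y=1, X=x)}{P(W=0 \mid Y=1, X=x)}}_{\text{treatment odds among responders}}
    \times
    \underbrace{\frac{P(W=0 \mid X=x)}{P(W=1 \mid X=x)}}_{\text{inverse treatment odds overall}}
```

Each step uses $P(Y=1 \mid W=w, X=x) = P(W=w \mid Y=1, X=x)\,P(Y=1 \mid X=x) / P(W=w \mid X=x)$, and the common factor $P(Y=1 \mid X=x)$ cancels. Both remaining factors are outputs of classifiers that predict $W$, one fitted on responders only and one on all units; in an RCT the second factor is known by design.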

2605.26285 2026-05-27 cs.LG cs.NA math.NA 版本更新

Two-Parameter Flows for Learning Population Dynamics of Physical Systems

用于学习物理系统群体动力学的双参数流

Paul Schwerdtner, Tobias Blickhan, Benjamin Peherstorfer

发表机构 * Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012, USA(数学科学学院,纽约大学,251 Mercer Street,纽约,NY 10012,美国)

AI总结 提出双参数流方法,通过从基础分布到每个边际的采样时间传输学习高维概率密度动力学,并利用耦合合成轨迹回归提取物理时间速度,无需轨迹信息即可处理旋转等非梯度动力学。

详情
AI中文摘要

本文解决了在无标签样本且不假设轨迹信息的情况下,学习高维概率密度随时间演化的动力学问题。我们引入了双参数流,仅学习从基础分布到每个边际的采样时间传输,然后通过回归耦合的合成轨迹提取物理时间速度。我们证明了所得的物理时间动力学是唯一的,并且继承了采样时间传输的正则性。由于我们可以利用标准且成熟的条件流匹配技术来学习基础到边际的传输,我们的方法可扩展到高维,避免了每步最优传输耦合,同时允许可解释旋转或循环物理现象的非梯度动力学。

英文摘要

This work addresses the problem of learning the dynamics of high-dimensional probability densities over time using unlabeled samples, without assuming access to trajectory information. We introduce two-parameter flows that learn only sampling-time transports from a base distribution to each marginal and then extract a physics-time velocity by regressing on coupled synthetic trajectories. We prove that the resulting physics-time dynamics are unique and inherit regularity from the sampling-time transports. Because we can build on standard, well-developed conditional flow matching techniques for learning the base-to-marginal transports, our approach scales to high dimensions and avoids per-step optimal-transport couplings, while allowing admissible non-gradient dynamics that can naturally explain rotational or circulating physics phenomena.

2605.26283 2026-05-27 cs.CV cs.LG 版本更新

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

卷积、Transformer、混合模型及视觉语言模型在多病种视网膜筛查中的基准测试

Durjoy Dey, Aymane Ajbar, Yuhong Yan

发表机构 * Department of Computer Science and Software Engineering(计算机科学与软件工程系) Concordia University(康科迪亚大学) Ebovir Biotechnologie Inc.(Ebovir生物技术公司)

AI总结 本研究在RFMiD数据集上对四种模型家族的12种架构进行基准测试,评估其在多病种视网膜筛查中的性能,发现基于注意力的模型(如SwinTiny、CoAtNet0、MaxViTTiny)在二元筛查和多标签分类中表现最佳,视觉语言模型与CNN基线相当但未超越最优Transformer和混合模型。

Comments 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings

详情
AI中文摘要

现代深度学习为自动化视网膜筛查提供了强大工具,但在现实多病种设置和领域偏移下,不同视觉模型家族的比较仍不明确。本研究使用视网膜眼底多病种图像数据集(RFMiD),对四种模型家族(卷积神经网络、视觉Transformer、混合CNN-Transformer骨干网络和视觉语言模型)的12种架构进行基准测试。我们评估两个任务:任何视网膜疾病的二元筛查和28个疾病类别的多标签分类。通过标准化训练、校准和评估协议,我们报告了在特异性接近80%的临床相关操作点下的AUC、F1、精确率、召回率和灵敏度。在RFMiD上,所有架构在二元筛查中表现良好,AUC均高于84%,但基于注意力的模型表现最佳。SwinTiny以及混合模型CoAtNet0和MaxViTTiny在二元筛查中取得最强结果,并在多标签设置中提高了宏F1和微F1。视觉语言模型(包括CLIP ViT-B/16和SigLIP-Base384)与CNN基线相当,但未超越最优Transformer和混合骨干网络。在Messidor-2上对可转诊糖尿病视网膜病变进行外部验证时,AUC范围为66.8%至84.7%,混合模型和Transformer模型再次表现出强劲性能。这些结果为多病种视网膜筛查中的模型选择提供了可重复的参考,并指导未来用于临床部署的自动化筛查工具。

英文摘要

Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.
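Reporting sensitivity at an operating point with specificity near 80% amounts to choosing a score threshold from the negatives and then reading off the true-positive rate. A minimal sketch follows, assuming binary labels and the rule "predict disease when score > threshold"; the function name and the toy scores are illustrative, not from the paper.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.80):
    """Return (threshold, sensitivity, specificity) at the lowest threshold meeting the target."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    for threshold in sorted(set(scores)):
        # Specificity: share of healthy cases correctly scored at or below the threshold.
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            # Sensitivity: share of diseased cases scored above the threshold.
            sensitivity = sum(s > threshold for s in positives) / len(positives)
            return threshold, sensitivity, specificity
    raise ValueError("no threshold reaches the target specificity")

# Toy data: five healthy (label 0) and five diseased (label 1) cases.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9, 0.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
operating_point = sensitivity_at_specificity(scores, labels)
```

Fixing specificity first makes sensitivities comparable across architectures, since each model is evaluated at the same false-positive budget rather than at its own default threshold.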

2605.26282 2026-05-27 cs.LG 版本更新

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

通过扩散策略优化扩展世界模型强化学习

Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu

发表机构 * Dynamic Systems Lab, University College London(伦敦大学学院动态系统实验室) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) Santa Fe Institute(圣塔菲研究所) School of Statistics and Data Science, Shanghai University of Finance and Economics(上海财经大学统计与数据科学学院) Department of Computing, Imperial College London(伦敦帝国理工学院计算系)

AI总结 针对世界模型强化学习中搜索与价值学习之间的结构错位问题,提出基于扩散策略优化的模型基方法MBDPO,统一搜索与策略优化,实现可扩展的策略学习。

详情
AI中文摘要

基于模型的强化学习可以通过使用世界模型在大规模下得到有效支持。然而,在实践中,扩展此类方法仍然受到根本性限制。一个普遍公认的挑战是模型偏差和误差累积,这会降低长期预测的质量。除了这些问题,我们识别出一个更关键但尚未充分探索的瓶颈:现有世界模型方法中搜索与价值学习之间的结构错位。特别是,策略改进通常依赖于由独立的非搜索策略诱导的价值函数,导致训练不一致并最终产生次优学习。为了解决这一限制,我们在世界模型中提出基于模型的扩散策略优化(MBDPO),该框架通过扩散策略表示统一搜索和策略优化,从而释放世界模型在可扩展策略学习中的潜力。我们不在学习到的世界模型上构建显式规划器,而是将策略优化重新表述为潜在世界模型中搜索轨迹上的扩散过程。从这个视角,我们从收集的数据集中提取一个隐式能量函数来锚定策略,使MBDPO能够细化用于策略优化的分数场,同时缓解错位问题。我们在多种设置下评估MBDPO,包括多任务离线预训练、在线学习以及离线到在线微调。在离线场景中,我们进一步通过在大规模数据集上预训练来研究其扩展行为,观察到随着模型容量增加,性能持续单调提升。

英文摘要

Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.

2605.26271 2026-05-27 stat.ML cs.LG econ.EM 版本更新

Learning Nonlinear Factor Models with Unknown Monotone Links from Incomplete and Noisy Data

从不完整和含噪数据中学习具有未知单调链接的非线性因子模型

Yutong Chao, Resat Gökhan, Jalal Etesami, Ali Habibnia

发表机构 * School of Computation, Information and Technology, Technical University of Munich, Germany(计算、信息与技术学院,慕尼黑技术大学,德国) Department of Economics, Virginia Tech, USA(经济系,弗吉尼亚理工学院,美国) Munich Institute of Robotics and Machine Intelligence(慕尼黑机器人与人工智能研究所)

AI总结 研究从含噪和不完整数据中联合恢复低秩因子、载荷和未知单调链接函数的问题,提出投影块坐标下降算法并建立收敛保证。

详情
AI中文摘要

我们研究了一个非线性因子模型,其中观测响应通过未知的单调链接函数依赖于低秩潜在因子。由于严重的非凸性和可识别性问题,这一设置具有挑战性且在很大程度上未被充分探索。链接函数假设位于再生核希尔伯特空间(RKHS)中,从而在保持可识别性的同时实现灵活的非参数建模。我们将问题表述为从可能不完整和含噪的观测中联合恢复低秩因子、载荷和非线性链接函数,并提出一种带有显式正则化的投影块坐标下降(BCD)算法以解决尺度和旋转模糊性。在因子的弱不相干性和标准采样条件下,我们建立了无噪声和有噪声情况下的收敛保证,以及链接函数更新的次线性遗憾界。我们的结果将经典线性因子模型推广到广泛的非线性领域,并为学习非线性潜在结构提供了一个原则性框架。我们通过受控的合成实验评估了所提出的方法,显示出有希望的性能。

英文摘要

We study a nonlinear factor model in which observed responses depend on low-rank latent factors through an unknown monotone link function. This setting is challenging and largely underexplored due to severe nonconvexity and identifiability issues. The link function is assumed to lie in a reproducing kernel Hilbert space (RKHS), enabling flexible nonparametric modeling while preserving identifiability. We formulate the problem as the joint recovery of the low-rank factors, loadings, and the nonlinear link function from possibly incomplete and noisy observations and propose a projected block coordinate descent (BCD) algorithm with explicit regularization to address scale and rotational ambiguities. Under mild incoherence of factors and standard sampling conditions, we establish convergence guarantees in both noiseless and noisy regimes, along with sublinear regret bounds for the link-function updates. Our results extend classical linear factor models to a broad nonlinear regime and provide a principled framework for learning nonlinear latent structures. We evaluate the proposed approach using controlled synthetic experiments, indicating promising performance.

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV 版本更新

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力:视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

发表机构 * Technical University of Munich(慕尼黑技术大学) Tensordyne

AI总结 针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题,提出基于Jensen偏差的在线逐注意力分数校正方法,在INT2量化下恢复接近BF16的视频质量,且内存减半。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

详情
AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算,但随着视频变长,该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力,但降低了视频质量。我们表明,这种降低的一个关键驱动因素是注意力权重的系统性偏差:由于softmax注意力中指数的凸性,量化噪声膨胀了缓存键的贡献,我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正,在期望中消除此偏差,该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似,额外的计算开销可忽略不计,且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化,我们的校正恢复了因激进量化而损失的大部分质量,达到接近BF16的视频质量,并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
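The Jensen bias can be illustrated with a simple noise model. The sketch below assumes independent uniform rounding noise of width `step` on every key coordinate and a query `q` that already includes any softmax scaling; these are assumptions for illustration, and the paper's correction may be derived differently. Under this model $E[\exp(q \cdot ε)] \ge 1$, and a second-order expansion gives $\log E[\exp(q \cdot ε)] \approx \|q\|^2 \cdot \mathrm{step}^2 / 24$, which depends only on the query norm and the quantization step size.

```python
import math

def exact_inflation(q, step):
    """E[exp(q . eps)] for eps_i ~ Uniform(-step/2, step/2), independent per coordinate."""
    out = 1.0
    for qi in q:
        a = qi * step / 2.0
        out *= math.sinh(a) / a if a != 0.0 else 1.0
    return out

def taylor_log_correction(q, step):
    """Second-order estimate of log E[exp(q . eps)]: Var(q . eps) / 2 = ||q||^2 step^2 / 24."""
    return sum(qi * qi for qi in q) * step * step / 24.0

q = [0.8, -1.2, 0.5, 1.5]      # hypothetical query vector
step = 0.5                     # hypothetical quantization step size
inflation = exact_inflation(q, step)          # > 1: quantized keys gain attention mass
correction = taylor_log_correction(q, step)   # subtract this from the attention logit
corrected = inflation * math.exp(-correction) # residual factor after correction, close to 1
```

Subtracting `correction` from the logits of cached (quantized) keys only, and leaving the unquantized current chunk untouched, is what removes the expected bias in this toy model.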

2605.26248 2026-05-27 cs.LG cs.AI cs.NE 版本更新

Unified Neural Scaling Laws

统一神经缩放定律

Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish

发表机构 * Mila, University of Montreal(蒙特利尔大学Mila实验室) Google DeepMind(谷歌DeepMind)

AI总结 提出一种统一神经缩放定律(UNSL)函数形式,能够准确建模和预测深度神经网络在多个维度(模型参数、训练数据量、训练步数、推理步数、计算量及超参数)同时变化时的缩放行为,适用于多种架构和任务,并在大规模视觉、语言、数学和强化学习任务中实现更精确的缩放行为外推。

详情
AI中文摘要

我们提出了一种函数形式(称为统一神经缩放定律(UNSL)),该形式能够准确建模和预测深度神经网络在多个维度(即评估指标如何随模型参数数量、训练数据集大小、训练步数、推理步数、计算量以及各种超参数同时变化)同时变化时的缩放行为,适用于多种架构以及各种上游和下游任务中的每个任务。这些任务包括大规模视觉、语言、数学和强化学习。与其他神经缩放的函数形式相比,该函数形式在该任务集上产生的缩放行为外推结果显著更准确。

英文摘要

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.

2605.26246 2026-05-27 cs.LG 版本更新

The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

LLM蒸馏中的桥园困境:为什么混合硬标签和软标签有效

Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang, Jin-Wen Wu, Longtao Huang, Qianqian Xu, Qingming Huang

发表机构 * School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学计算机科学与技术学院) Alibaba Group, Hangzhou, China(阿里巴巴集团) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院人工智能安全国家重点实验室) Beijing Academy of Artificial Intelligence, Beijing, China(北京人工智能研究院) Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences, Beijing, China(中国科学院大数据挖掘与知识管理重点实验室)

AI总结 针对大语言模型知识蒸馏中硬标签与软标签的混合使用,提出桥园分解理论解释其降低暴露偏差的机制,并开发自适应混合监督方法,在多个模型上实现性能提升和9.7倍训练成本降低。

Comments Accepted at ICML 2026

详情
AI中文摘要

知识蒸馏(KD)将知识从大型教师模型转移到较小的学生模型。在语言建模中,学生模型要么在从教师模型采样的标记(硬标签)上训练,要么在教师模型的完整下一个标记分布(软标签)上训练。尽管软标签看起来严格更丰富,但我们发现混合硬标签和软标签始终能产生更好的结果。关键的是,我们表明这种增益不能通过训练期间更接近教师匹配来解释。相反,它来自于减少暴露偏差,即训练和推理分布之间的不匹配。为了解释这一现象,我们引入了桥园分解理论,该理论将生成步骤分为两类:桥(Bridge),其中下一个标记必须精确;园(Garden),其中下一个标记可以灵活。我们表明,仅硬标签的KD在桥中通过避免风险偏差表现出色,而仅软标签的KD在园中保持多样性。混合策略处理两种情况,从而减少整个序列中的暴露偏差。在该理论的指导下,我们开发了一系列桥园混合监督方法,自适应地平衡硬标签和软标签。在包含七个教师-学生对(包括Qwen、Llama、Gemma和DeepSeek)的主要套件以及推理和编码基准测试中,我们的方法优于基于散度和基于策略的KD基线,同时将训练成本降低了9.7倍,实现了高效的模型压缩。代码可在https://github.com/ghwang-s/bridge_garden_hybrid_kd_release获取。

英文摘要

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or on the teacher's full next-token distribution (soft labels). Although soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at https://github.com/ghwang-s/bridge_garden_hybrid_kd_release.
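A per-token mix of hard-label and soft-label supervision can be sketched as below. This is a generic illustration, not the released Bridge-Garden code: the forward KL direction, the function name `hybrid_kd_loss`, and the fixed mixing weight `alpha` are all assumptions, whereas the paper's methods are described as balancing the two terms adaptively.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hybrid_kd_loss(student_logits, teacher_probs, sampled_token, alpha):
    """alpha * cross-entropy on a teacher-sampled token + (1 - alpha) * KL(teacher || student)."""
    p = softmax(student_logits)
    hard = -math.log(p[sampled_token])
    soft = sum(t * math.log(t / s) for t, s in zip(teacher_probs, p) if t > 0.0)
    return alpha * hard + (1.0 - alpha) * soft

teacher_probs = softmax([2.0, 0.5, -1.0])
student_logits = [2.0, 0.5, -1.0]            # a student that already matches the teacher
loss_soft_only = hybrid_kd_loss(student_logits, teacher_probs, 0, alpha=0.0)
loss_hard_only = hybrid_kd_loss(student_logits, teacher_probs, 0, alpha=1.0)
```

Even for a student that matches the teacher exactly, the hard term stays positive, so it keeps pushing probability toward the sampled token. In the abstract's terms that suits a Bridge step, where the next token must be exact, while the soft term preserves the teacher's spread on a Garden step.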

2605.26243 2026-05-27 cs.LG 版本更新

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

可证明通信高效且隐私保护的联邦图神经网络

Zhishuai Guo, Wenhan Wu, Chen Chen, Lei Zhang, Olivera Kotevska, Ravi K Madduri

发表机构 * Northern Illinois University(北伊利诺伊大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) University of Central Florida(中央佛罗里达大学) Oak Ridge National Laboratory(橡树岭国家实验室) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出CE-FedGNN框架,通过稀疏交换聚合节点表示和移动平均估计器处理跨客户端依赖,结合度量差分隐私实现通信高效与隐私保护,并证明收敛速率和隐私保证。

详情
AI中文摘要

图神经网络(GNN)在关系数据上取得了强性能,但现实世界的图通常分布在多个组织之间,由于隐私和政策约束,这些组织无法共享原始数据。现有的联邦GNN方法要么忽略跨客户端链接导致精度下降,要么需要频繁的嵌入交换,带来巨大的通信和隐私成本。我们提出了CE-FedGNN,一个通信高效且隐私保护的联邦GNN框架,用于学习此类耦合图。我们的方法避免共享原始数据或每轮嵌入,而是通过稀疏交换聚合的节点表示。为了处理跨客户端依赖和过时性,我们引入了一个移动平均估计器,持续跟踪节点表示并使其能够在多轮中稳定重用。为了为发布的表示提供正式的隐私保证,我们采用了度量差分隐私(metric-DP)框架,该框架根据学习嵌入空间中的距离而非最坏情况输入扰动来衡量隐私。这在标准差分隐私变得过于保守的噪声水平下提供了有意义的保证。我们建立了以$O(1/\sqrt{T})$速率收敛到稳定点,通信复杂度为$O(T^{3/4})$。此外,我们在公共队列威胁模型下通过Rényi差分隐私组合推导了$(\varepsilon,\delta)$-度量差分隐私保证。在合成银行间反洗钱基准和引文网络上的实验表明,CE-FedGNN在显著降低通信的同时保持了强性能,并在隐私保护噪声下保持鲁棒性。

英文摘要

Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN methods either ignore cross-client links, leading to degraded accuracy, or require frequent embedding exchanges, incurring substantial communication and privacy costs. We propose CE-FedGNN, a communication-efficient and privacy-preserving federated GNN framework for learning over such coupled graphs. Our approach avoids sharing raw data or per-round embeddings by infrequently exchanging aggregated node representations. To handle cross-client dependency and staleness, we introduce a moving-average estimator that continuously tracks node representations and enables their stable reuse across rounds. To provide formal privacy guarantees for the released representations, we adopt the metric differential privacy (metric-DP) framework, which measures privacy with respect to distances in the learned embedding space rather than worst-case input perturbations. This yields meaningful guarantees at noise levels where standard differential privacy becomes overly conservative. We establish convergence to a stationary point at a rate of $O(1/\sqrt{T})$ with $O(T^{3/4})$ communication complexity. In addition, we derive $(\varepsilon,δ)$-metric-DP guarantees via Rényi differential privacy composition under a public-cohort threat model. Experiments on synthetic interbank anti-money laundering benchmarks and citation networks demonstrate that CE-FedGNN achieves strong performance while significantly reducing communication and maintaining robustness under privacy-preserving noise.

2605.26222 2026-05-27 cs.LG stat.ML 版本更新

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

从隐私到泛化:DP-SGD的线性最大信息界

Christoph H. Lampert, Hossein Zakerinia

发表机构 * Institute of Science and Technology Austria (ISTA)(奥地利科学与技术研究所)

AI总结 本文证明了DP-SGD的近似最大信息量具有与数据集大小成线性关系的有限样本界,并基于此推导出PAC-Bayes泛化界和DP-SGD训练模型的显式泛化界。

Comments 22 pages

详情
AI中文摘要

理解泛化与隐私之间的关系仍然是现代机器学习理论中的一个核心挑战,特别是对于通过差分隐私随机梯度下降(DP-SGD)变体训练的深度网络。在这项工作中,我们通过证明DP-SGD的近似最大信息量的有限样本界,该界展现出与(Dwork et al, 2015)关于$ε$-差分隐私算法的经典结果相当的缩放性质,即最多与数据集大小成线性关系,从而在这个长期存在的开放问题上取得了进展。根据我们的结果,我们得到了一个通用的PAC-Bayes泛化界,其中所需的先验分布可以由DP-SGD学习,以及一个针对DP-SGD训练模型本身的泛化界,其复杂度项完全显式且由优化超参数控制。

英文摘要

Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with the classic result of Dwork et al. (2015) for $ε$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
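For readers unfamiliar with the algorithm being analyzed, a single DP-SGD update in its commonly used form (per-example gradient clipping followed by Gaussian noise) can be sketched as below. The parameter names are illustrative, and nothing here reproduces the max-information bound itself; the clipping norm, noise multiplier, and learning rate are the kind of optimization hyperparameters the abstract says control the bound.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        factor = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
        for i, v in enumerate(g):
            summed[i] += v * factor            # each example contributes at most clip_norm
    sigma = noise_multiplier * clip_norm       # noise scale tied to the clipping bound
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]               # the first gradient has norm 5 and gets clipped
# noise_multiplier=0.0 disables the noise so the arithmetic is easy to follow (no privacy).
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```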

2605.26192 2026-05-27 cs.LG cs.AI q-bio.BM 版本更新

Co-folding model guided by structural proteomics

结构蛋白质组学引导的共折叠模型

Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner

发表机构 * Protai Bio(Protai生物)

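AI summary: Proposes AIMS-Fold, a framework that combines XL-MS and HDX-MS experimental data with a diffusion model to guide protein-complex conformation generation at inference time, improving prediction accuracy on induced-proximity targets.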

详情
AI中文摘要

蛋白质结构生成模型擅长从序列预测单个蛋白质的静态结构,但通常无法捕捉蛋白质复合物的正确构象状态,这对蛋白质设计和诱导接近模式(如抗体和PROTACs)至关重要。虽然交联质谱(XL-MS)和氢氘交换质谱(HDX-MS)等结构蛋白质组学技术提供了有价值的空间和动态信息,但将这些稀疏、异质的测量整合到这些模型中仍然是一个开放的挑战。在这里,我们通过将结构蛋白质组学数据与预训练扩散模型学到的丰富生物物理先验相结合来弥合这一差距。我们引入了AIMS-Fold,一个推理时引导扩散框架,它使用源自XL-MS空间约束和HDX-MS溶剂可及性轮廓的可微物理势能主动引导生成采样轨迹。我们证明这些结构方法各自提高了预测准确性,并且它们的整合产生了协同改进。关键的是,通过利用这些实验约束,AIMS-Fold在具有挑战性的诱导接近靶标上比纯计算、无引导的最先进模型(如Boltz-2)实现了更高的准确性。这确立了我们的框架作为诱导接近药物基于结构的药物设计的强大整合计算方法。评估代码将在发表后公开。

英文摘要

Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, which is critical for protein design and for induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure-based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
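
One common way to turn a cross-link into a differentiable potential is a flat-bottom penalty on the distance between the two linked residues. The sketch below assumes that form; the function name, the distance cutoff `d_max`, and the stiffness `k` are hypothetical, and the abstract does not state which potential AIMS-Fold actually uses.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]

def crosslink_restraint(xi: Vec, xj: Vec, d_max: float, k: float = 1.0) -> Tuple[float, Vec]:
    """Flat-bottom penalty: zero while |xi - xj| <= d_max, quadratic beyond it.

    Returns the energy and its gradient with respect to `xi`.
    """
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(c * c for c in diff))
    excess = dist - d_max
    if excess <= 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    energy = k * excess * excess
    grad_xi = tuple(2.0 * k * excess * c / dist for c in diff)
    return energy, grad_xi

# Two residues 30 units apart under a hypothetical 25-unit cross-linker limit.
energy, grad = crosslink_restraint((30.0, 0.0, 0.0), (0.0, 0.0, 0.0), d_max=25.0)
satisfied_energy, _ = crosslink_restraint((10.0, 0.0, 0.0), (0.0, 0.0, 0.0), d_max=25.0)
```

During guided sampling, a gradient like `grad` would be subtracted (with some step size) from the coordinates at each denoising step, so that violated restraints pull the two residues together while satisfied ones exert no force.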

2605.26191 2026-05-27 cs.LG cs.AI 版本更新

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

从流式时间序列建模时滞系统的动态混合

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

Affiliations * SANKEN, The University of Osaka, Japan

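AI summary: Proposes DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems, summarizes past regimes in a fixed-length representation, and uses a tensor of Markov parameters to capture dynamics and delays, adapting quickly to environmental changes while reducing memory usage.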

Comments Accepted by IJCAI 2026

详情
AI中文摘要

本研究解决了具有清晰输入输出关系的时间序列数据流中的自适应建模问题。该问题具有挑战性,因为环境因素或输入延迟变化导致的快速系统变化(状态转移)会降低模型性能,并且在使用多个小模型处理每种时间序列模式时,需要在准确性、鲁棒性和内存使用之间进行权衡。为了解决这些问题,本文提出了一种在线框架/方法,将流式时间序列视为时滞系统的动态混合。该框架通过使用固定长度表示来总结过去的状态,该表示同时捕捉系统动态和输入输出延迟,从而保持模型跟踪的鲁棒性并减少内存使用。具体来说,该方法利用系统的马尔可夫参数序列构建一个摘要系统张量,同时捕捉动态行为和延迟特征。如有必要,张量分解算法从张量中提取相关的过去模型,并帮助选择最适合当前状态的系统。该方法能够快速适应环境变化,并且计算效率高。在真实数据集上的测试表明,DelayMix始终优于其他方法,实现了卓越的预测准确性和更快的延迟适应,特别是对于高度非平稳的数据。

英文摘要

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and a trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. The framework maintains robust model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, it constructs a summary system tensor from the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. When necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. This method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.
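
To see why a Markov parameter series can encode both dynamics and delay in a fixed-length vector, consider the sketch below. It assumes a scalar system x[t+1] = a*x[t] + b*u[t-delay], y[t] = c*x[t]; the function names are illustrative, and the paper's tensor construction and decomposition are not reproduced.

```python
from typing import List

def markov_parameters(a: float, b: float, c: float, delay: int, length: int) -> List[float]:
    """Impulse response of x[t+1] = a*x[t] + b*u[t-delay], y[t] = c*x[t]."""
    u = [1.0] + [0.0] * (length - 1)  # unit impulse at t = 0
    x, response = 0.0, []
    for t in range(length):
        response.append(c * x)
        delayed_input = u[t - delay] if t - delay >= 0 else 0.0
        x = a * x + b * delayed_input
    return response

def estimate_delay(response: List[float], tol: float = 1e-12) -> int:
    """Read the input-output delay off the leading zeros of the response."""
    first = next(i for i, h in enumerate(response) if abs(h) > tol)
    return first - 1  # one extra step is the state update itself

h = markov_parameters(a=0.5, b=1.0, c=2.0, delay=2, length=6)
d = estimate_delay(h)
```

The leading zeros of `h` carry the delay, and the ratio between consecutive nonzero entries carries the dynamics, so one fixed-length vector summarizes both.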

2605.26190 2026-05-27 cs.LG cs.AI eess.SP 版本更新

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

HRVConformer:基于心率信号的新生儿缺氧缺血性脑病分类

Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody

Affiliations * University College Cork; INFANT Research Centre; Department of Electrical & Electronic Engineering; School of Engineering and Architecture; Pediatrics and Child Health

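AI summary: Proposes HRVConformer, a hybrid convolution-Transformer deep learning architecture that classifies neonatal hypoxic-ischemic encephalopathy end to end from raw heart rate signals, reaching 83.23% AUC and 74.56% accuracy on the test set and outperforming Transformer, ResNet50, and other baselines.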

Comments Paper submitted to Journal of Engineering Applications of Artificial Intelligence

详情
AI中文摘要

本文提出了HRVConformer,一种新颖的深度学习架构,用于使用瞬时心率(HR)信号对缺氧缺血性脑病(HIE)进行分类。与依赖手工特征的常规方法不同,HRVConformer以端到端方式直接处理原始HR信号,通过混合卷积-Transformer框架捕获局部和长距离依赖关系。通过集成用于局部特征提取的卷积层和用于全局上下文建模的基于Transformer的注意力机制,该架构有效增强了信号表示和分类性能。该模型使用监督学习在包含1,573个一小时时段的大型HR数据集上训练,其中包括259个专家标注的一小时时段和大量弱标注数据。一个314小时的验证集提供了稳健的性能估计,而一个独立的215小时专家标注数据集被保留用于最终测试。使用改进的Pan-Tompkins算法从心电图(ECG)记录中提取HR信号,该算法显著提高了信号质量和数据可用性。实验结果表明,HRVConformer在测试集上实现了83.23%的AUC和74.56%的准确率。这些结果超越了Transformer、ResNet50和全卷积网络基线,突显了集成卷积和Transformer组件用于基于HR的HIE分类的优势。所提出的方法为使用HR信号实现更准确和自动化的HIE评估提供了有希望的一步。代码可在https://github.com/syu-kylin/HRVConformer获取。

英文摘要

This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimate, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and an accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50, and fully convolutional network baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu-kylin/HRVConformer.

2605.26184 2026-05-27 cs.LG cs.AI 版本更新

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC: 面向混合SFT-RL后训练的噪声感知自适应混合

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

Affiliations * Shanghai Jiao Tong University; Shanghai Maritime University

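AI summary: Proposes GAC, a noise-aware controller that adapts the mixing weight from online estimates of gradient variance and of the disagreement between the two training signals, improving hybrid post-training performance.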

Comments 15 pages, 3 figures, 22 tables

详情
AI中文摘要

混合后训练通常结合监督微调和强化学习,但固定的混合调度无法适应两种信号相对噪声随时间变化的情况。我们提出GAC,一种噪声感知控制器,通过在线估计梯度方差和两个训练信号之间的不一致性,推导出自适应混合权重。该方法在重用现有训练张量的同时,增加了平滑、先验指导和有界更新。在数学、代码、科学和逻辑基准上的实验表明,与强固定和基于规则的基线相比,GAC持续改进混合后训练,在更大模型规模下获得更大收益,且训练开销小于1%。

英文摘要

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.

2605.26178 2026-05-27 cs.MA cs.LG 版本更新

ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy

ATOM: 通过核-电子层次结构实例化预算可控的多智能体协作

Xinkui Zhao, Sai Liu, Yifan Zhang, Qingyu Ma, Zewen Lin, Naibo Wang, Guanjie Cheng, Chang Liu, Yueshen Xu

发表机构 * Zhejiang University(浙江大学) Ningbo Global Innovation Center, Zhejiang University(浙江大学宁波全球创新中心) Zhejiang Key Laboratory of Digital-Intelligence Service Technology(浙江省数字智能服务技术重点实验室) Xidian University(西安电子科技大学)

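AI summary: Proposes ATOM, a framework that uses a nucleus-electron hierarchy and task-driven reinforcement learning to generate budget-controllable collaboration graphs, improving token efficiency by up to 30% while maintaining performance.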

详情
AI中文摘要

基于大型语言模型的多智能体系统依赖优化的协作拓扑来平衡性能和通信成本。然而,当前方法难以处理固有的稳定性-可扩展性权衡,并且常常使计算预算与查询难度不匹配。我们提出ATOM,一个自适应框架,通过新颖的任务驱动强化学习范式生成预算可控的协作图。受原子结构启发,ATOM采用核-电子层次结构:它维护一个稳定的、离线学习的协作骨干(核),同时在推理过程中动态激活查询条件智能体(电子)。关键的是,一种复杂度感知的预算策略通过估计查询难度来严格调控电子实例化,从而使资源消耗与任务需求对齐。在六个不同基准上的广泛实验表明,ATOM实现了最先进的性能,同时与强基线相比,token效率提升了高达30%。

英文摘要

Large Language Model (LLM)-based multi-agent systems rely on optimized collaboration topologies to balance performance and communication costs. However, current methods struggle with the inherent stability-extensibility trade-off and often misalign computational budgets with query difficulty. We propose ATOM, an adaptive framework that generates budget-controllable collaboration graphs via a novel task-driven reinforcement learning paradigm. Inspired by atomic structures, ATOM employs a nucleus-electron hierarchy: it maintains a stable, offline-learned collaboration backbone (the nucleus) while dynamically activating query-conditioned agents (electrons) during inference. Crucially, a complexity-aware budgeting strategy aligns resource consumption with task demands by estimating query difficulty to strictly regulate electron instantiation. Extensive experiments across six diverse benchmarks demonstrate that ATOM achieves state-of-the-art performance while improving token efficiency by up to 30% compared to strong baselines.

2605.26175 2026-05-27 cs.LG cs.AI 版本更新

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant:为低比特LLM量化塑造激活分布

Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院) Ant Group(蚂蚁集团) College of Computer Science and Technology, Zhejiang University of Technology(浙江工业大学计算机科学与技术学院) China Mobile (Zhejiang) Research & Innovation Institute(中国移动(浙江)研究院) Alibaba Cloud Computing(阿里云计算) State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD&CG国家重点实验室)

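AI summary: Addresses the mismatch between activation distributions and the quantizer in low-bit activation quantization with an information-theoretic analysis and a training-free Peak Suppression Orthogonal Transformation (PSOT), which significantly improves quantization accuracy.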

详情
AI中文摘要

低比特激活量化仍然是高效大语言模型(LLM)部署的主要瓶颈。难点不仅在于激活值包含异常值,还在于其分布通常与低比特均匀量化器不匹配。现有的训练后量化(PTQ)方法抑制峰值、平衡通道或最小化重建误差,但很少明确说明什么样的激活分布实际上易于离散化。因此,激活值可能在数值上更平滑,但仍会产生较大的量化误差,因为量化范围仍然很宽,或者大多数值坍缩到均值附近的几个水平。我们将激活变换重新表述为面向量化器的分布设计,并从信息论角度分析量化误差。我们的分析表明,有利于量化的激活值应同时具有较小的数值范围和在该范围内的足够分散性。在此分析指导下,我们提出InfoQuant,一种无需训练的方法,采用峰值抑制正交变换(PSOT)将激活值塑造成更有利于量化的分布。我们进一步引入自适应异常值标记选择,以提高PSOT在优化过程中的鲁棒性。在多个LLM家族中,InfoQuant始终优于先前的PTQ和端到端训练基线。在W4A4KV4下,它平均保留了97%的浮点精度,并将LLaMA-2 13B的性能差距较先前最先进方法缩小了42%。代码可在[https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)获取。

英文摘要

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at https://github.com/LLIKKE/InfoQuant.
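
The two quantities the analysis singles out, numerical range and dispersion across quantization levels, can be observed on a toy vector. The sketch below uses a normalized 4x4 Hadamard matrix as a stand-in orthogonal transform and a symmetric 4-bit uniform quantizer; neither is claimed to be PSOT or the quantizer used in the paper.

```python
from typing import List, Tuple

# Normalized 4x4 Hadamard matrix: orthogonal, so it preserves vector norms.
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1))]

def rotate(x: List[float]) -> List[float]:
    """Apply the stand-in orthogonal transform."""
    return [sum(h * v for h, v in zip(row, x)) for row in H4]

def quantize_levels(x: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric uniform quantizer: integer levels plus the step size."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) for v in x], scale

x = [0.1, -0.2, 0.15, 8.0]  # three small activations and one outlier
y = rotate(x)
levels_x, step_x = quantize_levels(x)
levels_y, step_y = quantize_levels(y)
range_x = max(abs(v) for v in x)
range_y = max(abs(v) for v in y)
```

Before the rotation the outlier sets the range and the three small values all land on level 0; after it the largest magnitude roughly halves, so the step size shrinks and the values occupy more distinct levels.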

2605.26172 2026-05-27 cs.LG 版本更新

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

ARBITER:测试时采样中的推理轨迹盆地与多数投票失败

Meng Cai, Lars Kulik, Farhana Choudhury

发表机构 * School of Computing and Information Systems(计算与信息系统学院) University of Melbourne(墨尔本大学)

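AI summary: Finds that reasoning trajectories from test-time sampling concentrate into a few "reasoning basins", so majority voting selects the most stable basin rather than the most accurate one, and proposes ARBITER, which corrects the consensus with conservative additive evidence and recovers part of the correctness available in the sample pool.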

Comments Preprint. 34 pages, 2 figures

详情
AI中文摘要

当语言模型使用测试时采样时,它们会生成多个推理轨迹并通过多数投票选择答案。我们证明这些轨迹并非独立:对于给定问题,它们会聚集成少数几个簇,即推理盆地,每个盆地由归一化的最终答案和达到该答案的解决方案定义。因此,多数投票选择的是最稳定的盆地而非最准确的盆地,这导致错误多数失败,即正确答案存在但被否决。我们提出ARBITER,一种模型无关的方法,仅使用基础模型自身的采样输出、隐藏状态和派生证据来建模盆地之间的交互。大多数直接纠正策略失败;ARBITER则在共识之上使用保守的加性证据。在其最简单的无参数形式中,ARBITER-Δ将同模型证据添加到多数先验中,而ARBITER-Enc则通过来自完整解决方案的隐藏状态的有界残差信号增强这一过程。在GSM8K上使用Qwen3-4B,K=24个样本的共识达到约94%中段,而同池top-2 oracle达到约96%中段。ARBITER在不使用外部信息的情况下恢复了这些案例的一个子集。在三个模型系列和三个数学基准上,它带来了一致的提升,且没有净负例;例如,在Llama-3.1-8B MMLU-HS-Math上,它将准确率从约78%中段提高到约82%中段,恢复了约22%的可用oracle余量,表明该余量可以从样本池本身部分恢复。

英文摘要

When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
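
The wrong-majority failure described above is easy to reproduce on a toy sample pool. The sketch below groups sampled answers into basins by a normalized final answer, takes the majority vote, and checks a same-pool top-2 oracle; the normalization rule and function names are assumptions for illustration, and ARBITER's evidence model is not implemented.

```python
from collections import Counter
from typing import List

def normalize(answer: str) -> str:
    """Map surface variants of a final answer to one canonical string."""
    text = answer.strip().lower().replace(",", "").rstrip(".")
    try:
        return repr(float(text))
    except ValueError:
        return text

def basins(answers: List[str]) -> Counter:
    """Count how many sampled trajectories end in each normalized answer."""
    return Counter(normalize(a) for a in answers)

def majority_vote(answers: List[str]) -> str:
    return basins(answers).most_common(1)[0][0]

def top2_oracle_hit(answers: List[str], gold: str) -> bool:
    """True if the gold answer is one of the two largest basins."""
    top2 = [basin for basin, _ in basins(answers).most_common(2)]
    return normalize(gold) in top2

samples = ["42", "42.0", " 42 ", "41", "41", "41", "41", "1,000"]
voted = majority_vote(samples)
recoverable = top2_oracle_hit(samples, gold="42")
```

Here the vote lands on the largest basin even though the correct answer sits in the second-largest one, which is exactly the headroom a same-pool oracle measures.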

2605.26171 2026-05-27 cs.LG 版本更新

When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection

当规则违反罕见时:用于逻辑异常检测的嵌合体训练

Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado

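AI summary: For logical anomaly detection where rule-violating samples are scarce, proposes chimera training, which generates supervision through operand-level counterfactual construction at the feature level and improves rule-level anomaly detection.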

Comments 9+30 pages, 4+4 figures, under review

详情
AI中文摘要

许多实际异常不仅仅是罕见的输入,而是语义约束的违反:对象以结构化方式共现,动作蕴含前提条件,事件满足时间或关系规律。我们研究这种设置下的异常检测,其中约束以学习到的视觉概念上的逻辑规则形式给出,但训练期间真实规则违反罕见或缺失。我们提出一种神经规则评估器,将每个约束编译成有向无环图,并为其内部逻辑运算符学习特征感知的子树MLP门。每个门将子特征和边级否定映射到父表示和规则满足概率,并通过基于真实概念标签的精确布尔传播获得中间监督。关键困难在于同图像训练数据通常无法提供信息性真值配置的充分覆盖,并允许捷径解。为解决此问题,我们引入嵌合体训练:在特征级别进行操作数级反事实构造。我们不混合输入图像,而是连接来自不同样本的子树特征;每个操作数保留其来源样本的硬真值标签,并通过将节点的逻辑运算符应用于这些继承标签来获得嵌合体目标。这提供了监督逻辑反例,而无需真实异常图像。在CLEVRER、OpenImages和VidOR上,所得到的评估器在规则级异常AUROC上优于独立事件和同图像语义训练基线,特别是对于组合和关系规则。该方法产生标量异常分数和规则级归因。

英文摘要

Many practical anomalies are not merely rare inputs, but violations of semantic constraints: objects co-occur in structured ways, actions imply preconditions, and events satisfy temporal or relational regularities. We study anomaly detection in this setting, where constraints are given as logical rules over learned visual concepts, but real rule violations are rare or absent during training. We propose a neural rule evaluator that compiles each constraint into a directed acyclic graph and learns feature-aware subtree MLP gates for its internal logical operators. Each gate maps child features and edge-level negations to a parent representation and a rule-satisfaction probability, with intermediate supervision obtained from exact Boolean propagation over ground-truth concept labels. The key difficulty is that same-image training data often provide insufficient coverage of informative truth configurations and also allow shortcut solutions. To address this, we introduce chimera training: an operand-level counterfactual construction at the feature level. Instead of mixing input images, we concatenate subtree features from different samples; each operand keeps the hard truth label of the sample it came from, and the chimera target is obtained by applying the node's logical operator to those inherited labels. This supplies supervised logical counterexamples without requiring real anomalous images. Across CLEVRER, OpenImages, and VidOR, the resulting evaluator improves rule-level anomaly AUROC over independent-events and same-image semantic-training baselines, especially for compositional and relational rules. The method yields both scalar anomaly scores and rule-level attributions.
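
The chimera construction can be sketched in a few lines. The example assumes binary AND/OR gates, list-valued subtree features, and hypothetical names (`Operand`, `make_chimera`); the learned MLP gates and the paper's exact feature layout are not shown.

```python
from typing import Callable, Dict, List, Tuple

Operand = Tuple[List[float], bool]  # (subtree feature, ground-truth label)

OPS: Dict[str, Callable[[bool, bool], bool]] = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
}

def make_chimera(
    left: Operand,
    right: Operand,
    op: str,
    negate: Tuple[bool, bool] = (False, False),
) -> Tuple[List[float], bool]:
    """Join operands taken from different samples and derive the gate target.

    Features are concatenated; each operand keeps the label of the sample it
    came from, and the target is the logical operator applied to those labels
    after any edge-level negation.
    """
    (feat_l, label_l), (feat_r, label_r) = left, right
    label_l = (not label_l) if negate[0] else label_l
    label_r = (not label_r) if negate[1] else label_r
    return feat_l + feat_r, OPS[op](label_l, label_r)

from_sample_a: Operand = ([0.1, 0.2], True)   # operand that holds in sample A
from_sample_b: Operand = ([0.9], False)       # operand that fails in sample B
features, target = make_chimera(from_sample_a, from_sample_b, "and")
```

Because the two operands never co-occurred in a real image, the pair supplies a truth configuration (true, false) that same-image data might rarely contain.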

2605.26168 2026-05-27 cs.OS cs.LG 版本更新

LearnedCache: An eBPF-Integrated Perceptron-Based Eviction Policy for the Linux Page Cache

LearnedCache: 一种基于eBPF集成的感知器的Linux页面缓存驱逐策略

Zejia Qi

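AI summary: Proposes LearnedCache, a Linux page cache eviction policy built on eBPF and a single-layer perceptron and trained on real kernel data, which improves insertion rate over FIFO by up to 10% on specific workloads.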

Comments 11 pages, 12 figures, 4 listings. Policies and harnesses: https://github.com/JayAndJef/cache_ext_lc . Model and visualizations: https://github.com/JayAndJef/learnedcache

详情
AI中文摘要

Linux是数字时代的基础,占据了云和移动操作系统市场的大部分份额。任何运行Linux的设备都使用Linux页面缓存,这是操作系统和应用程序性能的核心支柱,旨在减少不必要的磁盘访问。许多页面缓存驱逐策略已被开发,但仍受限于启发式方法的僵化。近年来,AI驱动工具的兴起,加上Linux设备工作负载的日益多样化,为机器学习驱动的缓存驱逐策略奠定了基础。该领域已有有前景的研究,但仅限于CDN等用户空间应用。我们开发了LearnedCache,一种基于eBPF集成的单层感知器的Linux页面缓存驱逐策略,使用来自多样化工作负载的真实内核数据进行训练。我们展示了多个线性模型在建模页面重用时间上的中位AUC接近80%,然后进一步将这些模型嵌入Linux内核以进行实时性能评估。通过对每个工作负载与FIFO基线进行50次配对试验的统计测试,LearnedCache表明,在代表性经验工作负载下,机器学习驱动的缓存驱逐策略在Linux内核中是可行的,并且能够在特定工作负载下以统计显著的优势超越传统FIFO,插入率(缓存命中率的频率调整派生指标)提升高达10%,同时开销极小。

英文摘要

Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar of OS and application performance that reduces extraneous disk access. Many page cache eviction policies have been developed but remain bound by the rigidity of heuristics. The rise of AI-driven tools in recent years, together with the ever-increasing variety of workloads for Linux devices, sets the stage for machine-learning-driven cache eviction policies. Promising research exists, but only for user-space applications such as CDNs. We develop LearnedCache, an eBPF-integrated single-layer perceptron-based cache eviction policy for the Linux page cache, trained on real kernel data from diverse workloads. We demonstrate median AUCs of nearly 80% over multiple linear models modeling page reuse time, then take a step further by embedding these models inside the Linux kernel for real-time performance evaluation. Through statistical testing over 50 paired trials against a FIFO baseline for each workload, LearnedCache shows that machine-learning-derived cache eviction policies are practical in the Linux kernel under representative empirical workloads and can surpass conventional FIFO by statistically significant margins of up to 10% in insertion rate, a frequency-adjusted derivation of cache hit rate, in specific workloads while incurring minimal overhead.
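
A single-layer perceptron used as an eviction scorer amounts to one dot product per page. The sketch below is written in Python for readability and assumes integer weights, since eBPF programs cannot use floating-point arithmetic; the feature set, weights, and function names are hypothetical and are not taken from the LearnedCache repositories.

```python
from typing import Dict, List

# Hypothetical integer weights for the features
# [access_count, ticks_since_last_access, is_sequential].
WEIGHTS: List[int] = [3, -2, 1]
BIAS: int = -4

def reuse_score(features: List[int]) -> int:
    """Linear score: higher means the page is predicted to be reused sooner."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS

def pick_victim(pages: Dict[int, List[int]]) -> int:
    """Evict the page with the lowest predicted reuse score."""
    return min(pages, key=lambda page_id: reuse_score(pages[page_id]))

cached = {
    1: [5, 1, 0],   # hot, recently touched
    2: [1, 10, 0],  # cold, idle for a long time
    3: [2, 2, 1],   # lukewarm, part of a sequential scan
}
victim = pick_victim(cached)
```

A model this small keeps the per-eviction cost to a handful of integer multiply-adds, which is one reason a linear scorer is a natural fit for an in-kernel policy.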

2605.26167 2026-05-27 cs.LG cs.AI math.DS math.RA 版本更新

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

通过监督投影流形学习进行李群嵌入的神经动力学规划

Tianwei Wang, Bryan Chen, Qian Zuo, Qiyue Xia, Xin Li, Wei Pang

发表机构 * School of Informatics(信息学院) School of Mathematics(数学学院) University of Edinburgh(爱丁堡大学) School of Computer Science(计算机科学学院) School of MACS(MACS学院) Beijing Institute of Technology(北京理工大学) Heriot-Watt University(赫瑞-瓦特大学)

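AI summary: Proposes Lie group embedded dynamical neural networks (LieEDNN), which obtain learnable and stable dynamics through gradient descent and metric projection on manifolds, address the incompatibility of Lie groups with neural-network addition and the evolution of dynamics in a nonlinear representation space, and are validated on an SE(3) telescopic manipulator.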

Comments Preprint. Under review

详情
AI中文摘要

我们提出了李群嵌入动力神经网络(LieEDNN)以及基于梯度下降和光滑流形上度量投影的相应学习算法,其中我们将李群视为流形几何连续对称性的内在表示。因此,我们在底层流形上实现了可学习且稳定的动力学,适用于一般李群,并且能够利用李群(如SO(3)和SE(3))强大的表示能力来解决机器人、图形和控制等领域的实际工程问题。两个核心挑战是:(i)一般李群与加法运算不兼容,而加法是神经网络交互所必需的。(ii)动力学在特殊代数的非线性表示空间中演化,而非正常的欧几里得空间,这违反了常见神经常微分方程的范式。为了解决这两个挑战,我们首先引入李代数上的伴随李群作用,它诱导出一个线性映射并转移到权重矩阵的分块结构,使得加法可以在李代数上作为向量空间进行运算。然后我们将李代数和伴随作用参数化为线性变换,从而使架构与神经网络感知器对齐。明确地说,这种嵌入表现为权重上的分块流形约束,我们开发了学习算法,以确保时间神经网络动力学的平衡态具有稳定性保证。我们在特定李群SE(3)上进行了实验,应用场景为伸缩机械臂。

英文摘要

We propose Lie group embedded dynamical neural networks (LieEDNN) and corresponding learning algorithms based on gradient descent and metric projection on smooth manifolds, where we treat the Lie group as an intrinsic representation of the continuous symmetry of the manifold geometry. We thereby achieve learnable and stable dynamics on the underlying manifold for general Lie groups, and can use the representation capability of Lie groups such as SO(3) and SE(3) to solve real-world engineering problems in areas such as robotics, graphics, and control. Two core challenges arise: (i) general Lie groups are incompatible with addition, which neural network interactions require; (ii) the dynamics evolve in the nonlinear representation space of special algebra rather than in ordinary Euclidean space, which violates the paradigm of common neural ODEs. To address these challenges, we first introduce the adjoint Lie group action on the Lie algebra, which induces a linear mapping and transfers to the block-wise structure of the weight matrices, so that addition can operate on the Lie algebra as a vector space. We then parameterize the Lie algebra and the adjoint action as linear transformations so that the architecture aligns with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on the weights, and we develop algorithms that learn the equilibrium of the temporal neural network dynamics with stability guarantees. Experiments are conducted on the specific Lie group SE(3), with telescopic manipulators as the application scenario.
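
The claim that the adjoint action is a linear map on the Lie algebra can be checked numerically in the simplest case. The sketch below uses SO(3), where the adjoint action of a rotation R on the algebra element hat(w) is R hat(w) R^T = hat(R w); the helper names are illustrative, and the SE(3) case and the paper's weight parameterization are not covered.

```python
import math
from typing import List

Matrix = List[List[float]]

def hat(w: List[float]) -> Matrix:
    """Map a vector in R^3 to the corresponding skew-symmetric matrix in so(3)."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]

def rot_z(theta: float) -> Matrix:
    """Rotation about the z axis, an element of SO(3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def adjoint(rotation: Matrix, w: List[float]) -> List[float]:
    """Adjoint action in vector coordinates: for SO(3) it is just R applied to w."""
    return [sum(rotation[i][k] * w[k] for k in range(3)) for i in range(3)]

R = rot_z(0.7)
w = [1.0, 2.0, 3.0]
conjugated = matmul(matmul(R, hat(w)), transpose(R))  # R hat(w) R^T
via_adjoint = hat(adjoint(R, w))                      # hat(R w)
max_gap = max(abs(conjugated[i][j] - via_adjoint[i][j]) for i in range(3) for j in range(3))
```

Because the action is an ordinary matrix-vector product in algebra coordinates, sums and scalar multiples behave as in any vector space, which is what allows addition inside the network.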

2605.26166 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures

增强物联网自主在线入侵检测:平衡学习、可靠伪标签与轻量级架构

Hanzala Afzaal, Danish Memon, Chouhdary Bilal Raza, Muhammad Khurram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS)(电气工程与计算机科学学院) National University of Sciences and Technology (NUST)(国立科学与技术大学)

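AI summary: Targets four weaknesses of AOC-IDS (class imbalance, unreliable pseudo-labels, limited generalization, and computational overhead). XGBoost-BalSamp reaches 95.45% accuracy on UNSW-NB15, and the combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) reaches a best-run accuracy of 90.88% with 55% fewer parameters.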

Comments 9 pages, 5 figures; Code available at https://github.com/danishmemon847/AOC-IDS-Pipeline

详情
AI中文摘要

物联网设备的快速普及迫切需求能够处理动态和不断演变的网络威胁的自适应、资源高效的入侵检测系统。本文研究了AOC-IDS,一种发表于IEEE INFOCOM 2024的最先进的自主在线IDS,它采用具有簇排斥对比损失的自动编码器和自主高斯决策模块。我们首先在UNSW-NB15基准上成功复现了AOC-IDS,达到了89.39%的准确率,与发表的89.19%高度一致。然后我们识别了四个关键局限性:类不平衡、不可靠的伪标签生成、有限的泛化能力以及物联网部署的计算开销,并针对每个问题提出了改进方法。我们的XGBoost-BalSamp方法在UNSW-NB15上达到了95.45%的准确率,比基线提高了6.26%。我们的组合深度学习方法(PseudoFilter、MixupAug和LiteAE)实现了最佳运行准确率90.88%(F1:91.45%),超过了原论文,同时将模型参数减少了55%。这些结果表明,对AOC-IDS的针对性改进在提高实际物联网边缘设备可部署性的同时,实现了持续的准确率提升。

英文摘要

The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC-IDS, a state-of-the-art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian-based decision module. We first successfully replicate AOC-IDS on the UNSW-NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo-label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost-BalSamp method achieves 95.45% accuracy on UNSW-NB15, a gain of 6.26 percentage points over the published baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best-run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%. These results demonstrate that targeted improvements to AOC-IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.

2605.26163 2026-05-27 cs.IT cs.LG math.IT math.OC 版本更新

Adversarial Water-Filling: Theory, Algorithms and Foundation Model

对抗性注水:理论、算法与基础模型

Xindi Tong, Chee Wei Tan, H. Vincent Poor

发表机构 * College of Computing and Data Science (CCDS), Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院(CCDS),新加坡) Princeton University, United States(普林斯顿大学,美国)

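AI summary: For competitive resource allocation over frequency and space, formulates the Adversarial Water-Filling (AWF) problem with accompanying theory and algorithms, and develops a wireless foundation model that learns the AWF search dynamics, improving runtime by more than an order of magnitude.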

Comments Submitted to IEEE Journal of Selected Topics in Signal Processing

详情
AI中文摘要

频率和空间上的竞争资源分配问题可以表述为发射功率与最坏情况干扰之间的极小极大交互。这种公式自然出现在多运营商低地球轨道(LEO)卫星频谱共享中,其中竞争星座的传输实时干扰。在高斯信道下,AWF在非退化活动信道上强凸-凹,而离散星座通常产生非凸的汞/注水公式。在本文中,我们针对这些实际情况提出了对抗性注水(AWF)问题及其相应的理论和算法。此外,我们为AWF开发了一个无线基础模型来学习AWF搜索动力学。该架构包含置换不变的信道表示、具有稀疏消息传递的约束感知图神经网络(GNN),以及捕获AWF最优性隐含的低维水位线的全局潜在变量。通过学习的投影外梯度迭代,该模型近似于汞/注水下约束极小极大问题的平稳解。我们进一步证明,在局部正则性和收缩性条件下,学习的AWF动力学在正则平稳点附近局部线性收敛。实验表明,在未见过的不同问题规模、不同约束和多个离散星座上具有经验泛化能力,同时与迭代基线相比实现了超过一个数量级的运行时间改进。相关代码可在https://github.com/convexsoft/AWF找到。

英文摘要

Competitive resource allocation problems over frequency and space can be formulated as a minimax interaction between transmit power and worst-case interference. This formulation arises naturally in multi-operator low Earth orbit (LEO) satellite spectrum sharing, where transmissions from competing constellations interfere in real time. In this paper we propose the Adversarial Water-Filling (AWF) problem, with corresponding theory and algorithms, for these practical settings. Under Gaussian channels, AWF is strongly convex-concave on nondegenerate active channels, whereas discrete constellations yield generally nonconvex mercury/water-filling formulations. In addition, we develop a wireless foundation model that learns the AWF search dynamics. The architecture incorporates permutation-invariant channel representations, a constraint-aware graph neural network (GNN) with sparse message passing, and global latent variables capturing the low-dimensional water level implied by AWF optimality. Through learned projected extragradient iterations, the model approximates stationary solutions of the constrained minimax problem arising under mercury/water-filling. We further show that, under local regularity and contractivity conditions, the learned AWF dynamics converge locally linearly around regular stationary points. Experiments demonstrate empirical generalization across unseen problem sizes, different constraints, and multiple discrete constellations, while achieving runtime improvements of more than one order of magnitude over iterative baselines. The related code can be found at https://github.com/convexsoft/AWF.
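
As background for the "water level" mentioned above, the sketch below solves the classical single-player water-filling problem for Gaussian channels by bisection on the water level. It is the textbook allocation p_i = max(0, mu - n_i) only: the adversarial interferer, the mercury/water-filling variant, and the learned solver from the paper are not modeled, and the function name is an illustrative choice.

```python
from typing import List

def water_fill(noise: List[float], total_power: float, iters: int = 100) -> List[float]:
    """Allocate power p_i = max(0, mu - noise_i) so that the powers sum to the budget."""
    lo, hi = min(noise), max(noise) + total_power  # the water level mu lies in [lo, hi]
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        used = sum(max(0.0, mu - n) for n in noise)
        if used > total_power:
            hi = mu
        else:
            lo = mu
    mu = 0.5 * (lo + hi)
    return [max(0.0, mu - n) for n in noise]

# Three channels with increasing noise levels and a power budget of 3.
powers = water_fill([1.0, 2.0, 5.0], total_power=3.0)
```

In the adversarial setting the noise levels are themselves chosen by an opponent, so this inner allocation becomes one side of a minimax problem rather than a standalone solution.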

2605.26162 2026-05-27 cs.LG cs.AI 版本更新

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

基于推送的异步联邦学习:一种偏差校正聚合方法

Jiahui Bai, Hai Dong, A. K. Qin

Affiliations * School of Computer Technologies, RMIT University; School of Science, Computing and Engineering Technologies, Swinburne University of Technology

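AI summary: Proposes PushCen-ADFL, a framework that uses average-preserving push-sum mixing and lightweight centroid regularization in a shared centroid representation space to address communication overhead, aggregation bias, and model drift in asynchronous decentralized federated learning.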

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix

详情
AI中文摘要

异步去中心化联邦学习(ADFL)消除了中央协调和全局同步,使其在大规模和异构系统中具有吸引力。然而,频繁的点对点通信、有向拓扑上的异步更新以及非独立同分布数据共同导致了过高的通信开销、有偏聚合和严重的模型漂移。我们提出了PushCen-ADFL,一种通信高效的ADFL框架,能够在非对称通信和延迟客户端参与下实现稳定训练。PushCen-ADFL在共享中心表示空间中耦合了通信、聚合和局部稳定化,形成了压缩与优化之间的闭环。客户端交换中心形式的消息,应用平均保持的推-求和混合来校正聚合偏差,并使用锚定在同一中心空间的轻量级中心正则化来减轻异构性和陈旧性下的漂移。一个有界、发送者去重的缓冲区进一步提高了在异步到达不规则情况下的鲁棒性。在视觉数据集上的实验表明,PushCen-ADFL在数据异构性下将准确率提高了最多6%,同时将每次推送的通信成本降低了80%以上,实现了良好的准确率-通信权衡。

英文摘要

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation, and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
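
The bias that push-sum corrects shows up on any directed graph where nodes have unequal out-degrees. The sketch below runs the classical synchronous push-sum protocol on scalar values over a hypothetical three-node graph; the centroid-form messages, asynchrony, and buffering of PushCen-ADFL are not modeled.

```python
from typing import Dict, List, Tuple

def push_sum_round(
    x: List[float], w: List[float], out_neighbors: Dict[int, List[int]]
) -> Tuple[List[float], List[float]]:
    """Each node splits its value and its weight equally among its out-neighbors."""
    new_x = [0.0] * len(x)
    new_w = [0.0] * len(w)
    for sender, targets in out_neighbors.items():
        share = 1.0 / len(targets)
        for target in targets:
            new_x[target] += share * x[sender]
            new_w[target] += share * w[sender]
    return new_x, new_w

# Directed, strongly connected graph with self-loops and unequal out-degrees.
graph = {0: [0, 1, 2], 1: [1, 2], 2: [2, 0]}
values = [3.0, 6.0, 9.0]   # true average is 6.0
weights = [1.0, 1.0, 1.0]
for _ in range(200):
    values, weights = push_sum_round(values, weights, graph)
estimates = [v / wt for v, wt in zip(values, weights)]
```

The raw `values` settle on different numbers at each node because the mixing is not doubly stochastic, while the ratio of value to weight recovers the true average everywhere, which is the "average-preserving" property.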

2605.26161 2026-05-27 cs.LG cs.AI 版本更新

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit: 时间序列基础模型中的数据污染审计

Hongkai Li, Shifeng Xie, Lefei Shen, Zhuo Li, Mouxiang Chen, Xiaobin Zhang, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University(浙江大学) Télécom Paris(巴黎高等电信学院) State Street Technology (Zhejiang) Ltd.(State Street Technology(浙江)有限公司) Datadog

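AI summary: To audit pretraining data contamination in time series foundation models (TSFMs), proposes TSFMAudit, a method based on probe adaptation dynamics that flags contaminated datasets by detecting unusually fast loss reduction with small backbone movement after fine-tuning.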

Comments 22 pages, 7 figures, 9 tables

详情
AI中文摘要

时间序列基础模型(TSFMs)越来越多地在大型语料库上进行预训练,这引发了评估数据集可能在预训练期间被暴露从而导致过于乐观的性能估计的担忧。在时间序列中审计此类污染具有挑战性,因为信号是连续且异质的,并且通常缺乏语料库文档。据我们所知,这是第一个研究TSFMs预训练污染审计的工作。我们形式化了TSFMs的预训练污染审计问题,并提出了TSFMAudit,一种基于探针适应动力学的方法。我们的关键直觉是,污染表现为异常高效的适应:在微调探针后,受污染的数据集往往表现出更快的损失减少和更小的骨干网络移动。我们在6个TSFMs和187个数据集上评估了TSFMAudit,使用文档化的训练来源证据作为监督,并与从LLM文献中改编的10个竞争基线进行了比较。

英文摘要

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and often lack corpus documentation. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

2605.26159 2026-05-27 cs.NI cs.CR cs.LG 版本更新

Device Context Protocol: A Compact, Safety-First Architecture for LLM-Driven Control of Constrained Devices

设备上下文协议:一种紧凑、安全优先的架构,用于LLM驱动的受限设备控制

Dongxu Yang

发表机构 * DeepLethe

AI总结 针对LLM控制受限设备的安全问题,提出设备上下文协议(DCP),通过极小的帧开销、协议层安全原语和主机端桥接,在保持低资源占用的同时有效防御幻觉和提示注入攻击。

Comments 15 pages, 5 figures. Reference implementation, Python package (pip install pydcp), and reproduction scripts at https://github.com/device-context-protocol/dcp

详情
AI中文摘要

大型语言模型越来越多地通过模型上下文协议(MCP)作为外部工具的编排器,但MCP是为具有兆字节内存的软件服务构建的,并未覆盖主导物理设备长尾的微控制器。近期工作(IoT-MCP)将MCP移植到边缘网关,峰值内存为74 KB;这仍然排除了最小的商用MCU,并且关键的是,没有解决将不可靠调用者(可能产生幻觉或受到提示注入的LLM)直接控制物理硬件的安全问题。我们提出设备上下文协议(DCP):一个典型帧小于50字节(6字节头部+CBOR负载+可选的16字节HMAC),一个清单模式,其中能力范围、范围和类型检查、试运行评估以及单位即类型是协议层原语,以及一个主机端桥接,在设备收到任何字节之前拒绝格式错误或幻觉调用。参考固件在ESP32上占用27.6 KB闪存/0.6 KB RAM;Python桥接、ESP32固件和语言无关的一致性套件采用MIT许可证并公开。一项实证研究——由来自四个供应商(DeepSeek、阿里巴巴、智谱、MiniMax)的五个LLM针对六类对抗性提示生成的675次工具调用,其中注入类别实例化了AgentDojo的攻击模板——显示DCP拒绝了100%的能力提升尝试和78%的提示注入尝试,而原始MCP和IoT-MCP为0-1%,在固件占用空间小三个数量级的情况下匹配了结构良好的OpenAPI 3模式的表达能力。我们将DCP定位为MCP(正朝着企业SaaS连接发展)与其未覆盖的物理设备之间缺失的一层。

英文摘要

Large language models are increasingly used as orchestrators of external tools via the Model Context Protocol (MCP), but MCP is built for software services with megabytes of memory and does not descend to the microcontrollers that dominate the long tail of physical devices. Recent work (IoT-MCP) ports MCP to edge gateways at 74 KB peak memory; this still excludes the smallest commodity MCUs and, critically, does not address the safety problem of giving an unreliable caller (an LLM that may hallucinate or be prompt-injected) direct control of physical hardware. We present the Device Context Protocol (DCP): a sub-50-byte typical frame (6-byte header + CBOR payload + optional 16-byte HMAC), a manifest schema in which capability scoping, range and type checks, dry-run evaluation, and units-as-types are protocol-layer primitives, and a host-side Bridge that rejects malformed or hallucinated calls before any byte reaches the device. Reference firmware measures 27.6 KB flash / 0.6 KB RAM on ESP32; the Python Bridge, ESP32 firmware, and a language-neutral conformance suite are MIT-licensed and public. An empirical study -- 675 tool calls produced by five LLMs across four vendors (DeepSeek, Alibaba, Zhipu, MiniMax) against six categories of adversarial prompts, with the injection category instantiating AgentDojo's attack templates -- shows DCP rejects 100% of capability-escalation attempts and 78% of prompt-injection attempts, versus 0--1% for Raw MCP and IoT-MCP, matching the expressiveness of a well-formed OpenAPI 3 schema at three orders of magnitude less firmware footprint. We position DCP as the missing layer between MCP (which is moving toward enterprise SaaS connectivity) and the physical devices it does not reach.
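The host-side checks described above (capability scoping, range and type checks, units-as-types) can be sketched as follows. This is a minimal illustration under stated assumptions, not the `pydcp` API: `ParamSpec`, `MANIFEST`, `check_call`, and the example capabilities are all hypothetical names, and the real manifest schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    """Hypothetical manifest entry: expected type, unit, and allowed range."""
    type_: type
    unit: str
    lo: float
    hi: float


# Hypothetical manifest; capability and parameter names are illustrative only.
MANIFEST = {
    "set_brightness": {"level": ParamSpec(int, "percent", 0, 100)},
    "set_temperature": {"target": ParamSpec(float, "celsius", 5.0, 30.0)},
}


def check_call(manifest, granted, name, args):
    """Validate a tool call on the host before any byte is sent to a device.

    args maps parameter name -> (value, unit). Returns (ok, reason).
    """
    if name not in manifest:
        return False, "unknown capability"        # e.g. a hallucinated tool
    if name not in granted:
        return False, "capability not granted"    # capability scoping
    spec = manifest[name]
    if set(args) != set(spec):
        return False, "unexpected or missing parameters"
    for key, (value, unit) in args.items():
        p = spec[key]
        if type(value) is not p.type_:
            return False, f"{key}: wrong type"
        if unit != p.unit:
            return False, f"{key}: wrong unit"    # units treated as types
        if not (p.lo <= value <= p.hi):
            return False, f"{key}: out of range"
    return True, "ok"


granted = {"set_brightness"}
ok_call = check_call(MANIFEST, granted, "set_brightness", {"level": (50, "percent")})
too_high = check_call(MANIFEST, granted, "set_brightness", {"level": (500, "percent")})
hallucinated = check_call(MANIFEST, granted, "open_valve", {})
escalation = check_call(MANIFEST, granted, "set_temperature", {"target": (21.0, "celsius")})
```

Rejecting on the host keeps the device-side firmware small, which matches the abstract's point that malformed calls never reach the device.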

2605.26158 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Furina: 碎片化不确定性驱动的拒绝不稳定攻击

Tongxi Wu, Jian Zhang, Yang Gao

发表机构 * School of Intelligence Science and Technology(智能科学与技术学院) State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Nanjing University(南京大学)

AI总结 通过揭示大语言模型安全行为存在不稳定区域,提出多指标诊断框架并开发Furina攻击方法,利用碎片化场景提示诱导不确定性放大,实现高效越狱。

Comments This work is accepted as a regular paper at ICML 2026

详情
AI中文摘要

大语言模型和多模态大语言模型的安全对齐通常被认为是一种近二值阈值机制。我们通过揭示安全行为受不稳定区域支配来挑战这一假设,在该区域中,小的扰动会引发随机的拒绝决策而非确定性结果。我们开发了一个结合外部和内部信号的多指标诊断框架来表征这种不稳定性。通过系统实验,我们识别出一个特征性的诊断标志:处于不稳定区域的输入表现出更高的输出不确定性,同时内部安全激活降低,这种解耦现象解释了为什么基于检测的防御无法抵御复杂攻击。基于此框架,我们提出了Furina,一种越狱攻击,它通过碎片化、场景锚定的提示故意诱导这种特征,无需针对模型的优化。Furina在HarmBench上优于强单轮和多轮基线,并在MM-SafetyBench上取得了有竞争力的结果,表明不确定性放大为理解安全漏洞提供了一种有原则且可迁移的机制。代码见:https://github.com/0xCavaliers/Furina_Jailbreak。

英文摘要

Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.

2605.26155 2026-05-27 cs.RO cs.AI cs.LG 版本更新

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

自适应引导何时有帮助?部分可观测条件下自动驾驶的信念感知特权蒸馏

Mehmet Haklidir

发表机构 * TUBITAK BILGEM Artificial Intelligence Institute(土耳其TUBITAK BILGEM人工智能研究所)

AI总结 本文提出信念感知GSAC(BA-GSAC),通过集成分歧动态调节蒸馏系数,系统研究自适应引导在部分可观测自动驾驶中的有效性,发现严重遮挡下系数过早崩溃,并揭示可观测性盲区问题。

Comments 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)

详情
AI中文摘要

引导软演员-评论家(GSAC)将来自特权全状态教师的知识蒸馏给部分可观测的学生,用于自动驾驶,但使用固定的蒸馏系数λ,而不考虑智能体的不确定性。我们提出信念感知GSAC(BA-GSAC),通过集成分歧调节λ,并将其作为系统实证研究的测试平台,探究:自适应引导何时真正有帮助?在Highway-Env上评估五种策略(固定λ∈{0.01, 0.1}、自适应、线性衰减和普通SAC)在三个POMDP难度级别下,我们发现初步的单种子运行表明在轻度和中度部分可观测性下有收益,但在严重遮挡下(所有方法使用3个种子评估),自适应系数在大约3K步内坍缩到λ_min。我们将其归因于可观测性盲区现象:由于集成预测部分观测,即使在严重遮挡下也能达到低分歧,建模了可见部分但无法检测缺失部分。我们诊断了根本原因并提出了架构修复(使用引导演员的特权访问在完整状态预测上训练集成);虽然此处未验证,但我们表明即使存在当前限制,预热阶段也提供了可测量的稳定性(CV=13.3% vs. 常数λ=0.01的29.8%)。实际上,简单的确定性线性衰减计划在所有指标上实现了最佳的严重POMDP性能(均值116.5,CV=8.9%),表明稳定性收益来自调度效应而非集成。这些发现为设计不确定性感知的师生框架提供了实用指导,并强调了集成预测目标是一个重要的设计选择。

英文摘要

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
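The two scheduling ideas compared in the abstract can be contrasted with a small sketch. The abstract does not give the exact mapping from ensemble disagreement to lambda, so the clipped linear form in `adaptive_lambda` is an assumption, as are the function names and the endpoint values 0.1 and 0.01 (borrowed from the fixed settings the abstract lists).

```python
def adaptive_lambda(predictions, lam_min=0.01, lam_max=0.1, scale=1.0):
    """Map ensemble disagreement to a distillation coefficient.

    Assumed form: lambda grows linearly with the standard deviation of the
    ensemble members' predictions and is clipped to [lam_min, lam_max].
    """
    mean = sum(predictions) / len(predictions)
    var = sum((p - mean) ** 2 for p in predictions) / len(predictions)
    disagreement = var ** 0.5
    return min(lam_max, max(lam_min, lam_min + scale * disagreement))


def linear_decay_lambda(step, total_steps, lam_start=0.1, lam_end=0.01):
    """Deterministic linear decay from lam_start to lam_end."""
    frac = min(1.0, max(0.0, step / total_steps))
    return lam_start + frac * (lam_end - lam_start)


# If all ensemble members agree (for example because they only model the
# visible part of the scene), disagreement is zero and lambda sits at lam_min.
collapsed = adaptive_lambda([0.5, 0.5, 0.5])
uncertain = adaptive_lambda([0.0, 1.0])
midpoint = linear_decay_lambda(50, 100)
```

The first call illustrates the "observability blindness" failure mode in miniature: agreement among ensemble members says nothing about what is occluded, so the coefficient collapses regardless of how much information is missing.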

2605.26147 2026-05-27 cs.LG 版本更新

Neural Bayesian Sequential Routing

神经贝叶斯序列路由

Yongchao Huang

AI总结 提出神经贝叶斯序列路由(NBSR)框架,将神经推理建模为有向无环图上的主动证据累积,通过狄利克雷-分类共轭框架实现不确定性量化、早期退出和资源理性推理。

Comments 71 pages

详情
AI中文摘要

人类决策是序列化的且具有不确定性意识,然而标准神经网络通常依赖静态、密集的前向计算,对证据获取、不确定性演化或何时停止计算的可视性有限。我们引入了神经贝叶斯序列路由(NBSR),这是一个将神经推理建模为层次化有向无环图(DAG)上的主动证据累积的框架。在狄利克雷-分类共轭框架内,神经专家查询一个持久的全局知识预言机以提取正证据向量,这些向量作为伪计数,通过精确共轭加法更新狄利克雷信念状态。结合Gumbel-Softmax直通估计器,该更新实现了硬性、路径依赖的路由,同时保留用于端到端训练的代理梯度。由此产生的狄利克雷精度和熵为不确定性量化、基于熵的早期退出、分布外(OOD)弃权以及成本感知的证据获取提供了机制。我们证明,在严格正证据提取下,总狄利克雷精度沿任何有效轨迹单调增加,边际预测方差有界,形式化了序列“假设锐化”;在理想容量和优化假设下,终端狄利克雷期望恢复贝叶斯最优条件分布。在视觉分类、结构化医学诊断、语言建模、部分可观测控制以及成本感知贝叶斯实验设计上的实证评估表明,NBSR在提供透明的路由轨迹、路径依赖的证据归因、不确定性感知的决策控制以及资源理性推理的同时,实现了具有竞争力的预测性能。总体而言,NBSR为可解释、模块化和资源理性的智能体AI提供了一个数学上坚实的框架。

英文摘要

Human decision-making is sequential and uncertainty-aware, yet standard neural networks often rely on static, dense forward computation with limited visibility into evidence acquisition, uncertainty evolution, or when computation should stop. We introduce Neural Bayesian Sequential Routing (NBSR), a framework that models neural inference as active evidence accumulation over a hierarchical Directed Acyclic Graph (DAG). Within a Dirichlet-Categorical conjugate framework, neural experts query a persistent global knowledge oracle to extract positive evidence vectors, which act as pseudo-counts and update a Dirichlet belief state by exact conjugate addition. Coupled with a Gumbel-Softmax Straight-Through estimator, this update enables hard, path-dependent routing while preserving surrogate gradients for end-to-end training. The resulting Dirichlet precision and entropy provide mechanisms for uncertainty quantification, entropy-based early exiting, OOD abstention, and cost-aware evidence acquisition. We prove that, under strictly positive evidence extraction, total Dirichlet precision increases monotonically along any valid trajectory and marginal predictive variance is bounded, formalizing sequential "hypothesis sharpening"; under idealized capacity and optimization assumptions, the terminal Dirichlet expectation recovers the Bayes-optimal conditional distribution. Empirical evaluations across visual categorization, structured medical diagnosis, language modeling, partially observable control, and cost-aware Bayesian experimental design show that NBSR achieves competitive predictive performance while providing transparent routing traces, path-dependent evidence attribution, uncertainty-aware decision control, and resource-rational inference. Overall, NBSR offers a mathematically grounded framework for interpretable, modular, and resource-rational agentic AI.
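The conjugate update at the core of this description is simple enough to sketch. The code below assumes a flat sequence of evidence vectors rather than a DAG of neural experts, and it uses the entropy of the predictive mean as the early-exit signal, which is only one plausible reading of "entropy-based early exiting". All function names and the threshold value are illustrative.

```python
import math


def dirichlet_update(alpha, evidence):
    """Exact conjugate addition: strictly positive pseudo-counts are added."""
    if any(e <= 0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    return [a + e for a, e in zip(alpha, evidence)]


def predictive_mean(alpha):
    """Expected class probabilities under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def accumulate(alpha, evidence_stream, entropy_threshold):
    """Add evidence step by step and exit early once entropy is low enough."""
    precision_trace = [sum(alpha)]
    for evidence in evidence_stream:
        alpha = dirichlet_update(alpha, evidence)
        precision_trace.append(sum(alpha))
        if entropy(predictive_mean(alpha)) < entropy_threshold:
            break  # confident enough: skip the remaining evidence sources
    return alpha, precision_trace


prior = [1.0, 1.0, 1.0]
stream = [[5.0, 0.1, 0.1]] * 3
posterior, trace = accumulate(prior, stream, entropy_threshold=0.7)
```

Since every evidence entry is strictly positive, the total precision `sum(alpha)` can only grow, which is the monotonicity property the abstract states.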

2605.26135 2026-05-27 cs.LG 版本更新

SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection

SilIF:基于轮廓增强的隔离森林用于无监督交易欺诈检测

Venkatakrishnan Gopalakrishnan

发表机构 * Independent Researcher(独立研究员)

AI总结 提出SilIF方法,通过添加基于轮廓得分的层次增强隔离森林,在IEEE-CIS欺诈检测基准上平均AUC-PR提升0.0080,并在五个种子中均优于原始隔离森林。

Comments 5 pages, 1 figure, 5 tables. Code: https://github.com/venkat15vk/silif-anomaly-detection

详情
AI中文摘要

无监督异常检测广泛应用于标签稀缺的交易欺诈检测中。隔离森林(IF)因其可扩展性和易于部署而成为最流行的经典方法之一。我们提出了SilIF,一种隔离森林的增强方法,它在森林树诱导的表示空间中添加了一个基于轮廓得分的计算层。对于每个点,我们提取每棵树路径长度的向量,将这些“指纹”聚类成结构组,并计算轮廓得分,衡量该点与其分配组的匹配程度相对于最近替代组。轮廓信号通过单个超参数alpha与基础IF得分结合。在IEEE-CIS欺诈检测基准(约59万笔交易,3.5%欺诈)上,alpha=1.0的SilIF在五个种子上平均AUC-PR比普通隔离森林提高0.0080,且SilIF在所有五个种子上获胜(配对t检验p=0.046)。我们还在合成信用卡数据集(Sparkov)上报告了结果,其中轮廓增强并未优于普通IF,并描述了区分两种结果的条件。本文提出了SilIF作为隔离森林的一种可调、易于部署的增强方法,并诚实报告了其何时有效何时无效。代码见https://github.com/venkat15vk/silif-anomaly-detection。

英文摘要

Unsupervised anomaly detection is widely used in transaction fraud detection where labels are scarce. Isolation Forest (IF) is among the most popular classical methods due to its scalability and ease of deployment. We propose SilIF, an augmentation of Isolation Forest that adds a silhouette-based scoring layer computed in a representation space induced by the trees of the forest. For each point, we extract a vector of per-tree path lengths, cluster these "fingerprints" into structural groups, and compute a silhouette score that measures how well the point fits its assigned group versus the nearest alternative. The silhouette signal is combined with the base IF score via a single hyperparameter alpha. On the IEEE-CIS Fraud Detection benchmark (~590K transactions, 3.5% fraud), SilIF with alpha=1.0 improves over plain Isolation Forest by +0.0080 AUC-PR on average across five seeds, with SilIF winning on all five seeds (paired t-test p=0.046). We also report results on a synthetic credit-card dataset (Sparkov) where the silhouette augmentation does not improve over plain IF, and we characterize the conditions that distinguish the two outcomes. The paper presents SilIF as a tunable, easy-to-deploy enhancement to Isolation Forest with honest reporting of when it helps and when it does not. Code at https://github.com/venkat15vk/silif-anomaly-detection.
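The silhouette step on per-tree path-length "fingerprints" can be sketched in plain Python. This sketch assumes the fingerprints and cluster labels are already available (in practice they would come from a fitted forest and a clustering step), and the additive rule in `combined_score` is only a placeholder: the abstract says the silhouette signal is combined with the IF score through alpha but does not give the formula.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def silhouette(idx, fingerprints, labels):
    """Silhouette of one point: (b - a) / max(a, b).

    a = mean distance to other points in its own cluster,
    b = smallest mean distance to the points of any other cluster.
    """
    own = labels[idx]
    by_cluster = {}
    for j, (fp, lab) in enumerate(zip(fingerprints, labels)):
        if j != idx:
            by_cluster.setdefault(lab, []).append(math.dist(fingerprints[idx], fp))
    others = [_mean(d) for lab, d in by_cluster.items() if lab != own]
    if own not in by_cluster or not others:
        return 0.0  # singleton cluster or only one cluster: undefined, use 0
    a = _mean(by_cluster[own])
    b = min(others)
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


def combined_score(if_score, sil, alpha=1.0):
    """Placeholder combination (assumption): a poor fit raises the score."""
    return if_score + alpha * (1.0 - sil) / 2.0


# Toy fingerprints: two path lengths per point, three points in cluster 0.
fingerprints = [(1, 1), (1, 2), (4, 5), (8, 8), (8, 9)]
labels = [0, 0, 0, 1, 1]
s_tight = silhouette(0, fingerprints, labels)
s_misfit = silhouette(2, fingerprints, labels)
```

The point at `(4, 5)` sits between the two groups, so its silhouette is much lower than that of a point deep inside its cluster.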

2605.26133 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露:成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology(北陆先端科学技术大学院大学)

AI总结 本文首次统一综述了大型语言模型中的预训练数据暴露问题,涵盖成员推断和数据污染,形式化定义了暴露级别,回顾了攻击与防御方法,并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情
AI中文摘要

大型语言模型(LLMs)已成为NLP中的主导范式,推动了研究和工业的发展。随着模型规模和预训练数据的增长,由于训练数据集的规模和不可见性,对预训练数据暴露(PDE)的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要,涉及两个关键领域:数据污染和成员推断。尽管概念上相关,但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE,回顾了攻击和防御方法,综合了实证发现,并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

2605.26132 2026-05-27 cs.CL cs.LG 版本更新

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

自验证蒸馏:你的语言模型秘密地就是它自己的合成数据管道

Tony Lee, Percy Liang

发表机构 * Stanford University(斯坦福大学)

AI总结 提出自验证蒸馏算法,让大语言模型仅用无标注种子问题,通过自生成、自验证和自训练提升推理能力,在数学、科学和编程任务上取得显著提升。

详情
AI中文摘要

经过后训练的大语言模型能否仅使用无标注提示,在没有外部教师或工具反馈的情况下进一步提升自己?我们在三个推理领域(数学、科学和编程)中研究这一设置,仅从没有真实解的无标注种子问题开始。我们提出自验证蒸馏,一种简单的后训练精炼算法,其中模型生成这些种子问题的候选解,使用基于提示的自验证进行过滤,并在由此产生的自策展数据集上进行训练。受UQ基准使用多个验证器筛选困难未解问题候选答案的启发,我们将这种基于验证的过滤思想应用于自训练:模型通过三级级联的循环一致性、事实性和正确性检查来过滤自己生成的解,仅当解通过所有阶段且获得一致判断时才被接受。我们发现,在训练数据构建过程中采样更多候选生成并使用更大的验证预算,可以产生更高质量的自策展数据,进而得到更好的推理模型。然后,我们使用自验证蒸馏训练多个规模的Qwen3模型,并在所有三个领域获得收益。对于Qwen3-4B,我们的方法在数学(AIME26和HMMT)上将聚合保留pass@1提升了+16.7个百分点,在科学(GPQA Diamond和HLE)上提升了+11.1个百分点,在编程(LCBv5和LCBv6)上提升了+8.3个百分点,这些收益也扩展到0.6B和8B模型。与我们的仅测试时基线(UQ-TTC)相比,后者通过在推理时花费额外计算来提升性能,自验证蒸馏在大多数设置下实现了更好的性能,同时仅在测试时进行一次推理调用。

英文摘要

Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
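The acceptance rule of the three-stage cascade (pass every stage, with unanimous judge votes) can be sketched as a filter. In the paper the judges are prompts to the model itself; here they are replaced by hypothetical stub functions that read precomputed flags, so the stage names, the `votes_per_stage` parameter, and the candidate format are all assumptions.

```python
def passes_cascade(solution, stages, votes_per_stage=3):
    """Accept a solution only if every stage votes unanimously in favor."""
    for judge in stages:
        votes = [judge(solution) for _ in range(votes_per_stage)]
        if not all(votes):
            return False  # one dissenting vote at any stage rejects
    return True


def self_curate(candidates, stages, votes_per_stage=3):
    """Keep only the candidates that survive the whole cascade."""
    return [c for c in candidates if passes_cascade(c, stages, votes_per_stage)]


# Stub judges standing in for prompt-based checks (hypothetical).
def cycle_consistent(s):
    return s["cycle"]


def factual(s):
    return s["facts"]


def correct(s):
    return s["correct"]


candidates = [
    {"id": "a", "cycle": True, "facts": True, "correct": True},
    {"id": "b", "cycle": True, "facts": False, "correct": True},
    {"id": "c", "cycle": False, "facts": True, "correct": True},
]
kept = self_curate(candidates, [cycle_consistent, factual, correct])
```

Ordering the stages from cheapest to most expensive lets the cascade reject most bad candidates before the costlier checks run.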

2605.26130 2026-05-27 cs.LG physics.ao-ph 版本更新

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR: 基于潜在一致性扩散的千米级大气超分辨率基础模型

Somnath Luitel, Manmeet Singh, Joshua Durkee, Abdullah Al Fahad, Naveen Sudharsan, Prabhjot Singh, Cenlin He, Harsh Kamath, Zong-Liang Yang, Krishnagopal Halder, Sandeep Juneja, Parthasarathi Mukhopadhyay, Saptarishi Dhanuka, Amit Kumar Srivastava

发表机构 * Department of Earth, Environmental, and Atmospheric Sciences, Western Kentucky University, Bowling Green, KY, USA(西肯塔基大学地球、环境与大气科学系) NASA Goddard Space Flight Center, Greenbelt, MD, USA(NASA戈达德太空飞行中心) The University of Texas at Austin, Austin, TX, USA(德克萨斯大学奥斯汀分校) NSF National Center for Atmospheric Research, Boulder, CO, USA(美国国家大气研究中心) Leibniz Centre for Agricultural Landscape Research (ZALF), Berlin, Germany(莱布尼茨农业景观研究中心(ZALF)) Ashoka University, Sonipat, India(阿育王大学)

AI总结 提出AirCast-SR基础模型,利用潜在一致性扩散框架将全球AI天气预报从0.25度降尺度至1公里分辨率,实现零偏差和跨区域零样本迁移。

Comments Somnath Luitel and Manmeet Singh are equal-contribution co-first authors, with Manmeet Singh (manmeet.singh@wku.edu) as corresponding author

详情
AI中文摘要

千米尺度的业务天气预报对于传统数值天气预报(NWP)模型而言仍然计算成本过高,限制了需要精细时空细节的能源、农业和灾害管理等应用对预报的获取。本文介绍AirCast-SR,一种用于大气超分辨率的基础模型,将全球AI天气预报从0.25度(约28公里)降尺度至1公里水平分辨率,时间分辨率为每小时,同时生成八个耦合地表变量的67小时预报。EarthMind-SR采用三维U-Net,在潜在一致性模型(LCM)扩散框架内进行条件化,使用基于图块(patch)的样本在美国本土(CONUS)上训练,以GraphCast预报为输入,NOAA的校准记录分析(AORC)为目标。该模型在所有变量和预报时效上实现接近零偏差,其径向功率谱密度分析表明,在10公里至100公里波长范围内,精细大气结构得以保留,而较粗模型在此范围内会损失谱功率。我们通过涵盖冬季、夏季和春季的三个CONUS案例研究验证了EarthMind-SR,并利用独立地面站观测数据,在无需任何重新训练或微调的情况下,展示了在印度和德国上的零样本全球迁移能力。作为一个开放权重的基础模型,EarthMind-SR为千米级AI天气预报建立了新范式,并为区域微调、蒸馏以及气候服务和灾害预报中的下游应用提供了平台。

英文摘要

Operational weather prediction at kilometer scales remains computationally prohibitive for traditional numerical weather prediction (NWP) models, limiting forecast access for applications in energy, agriculture, and disaster management that require fine-grained spatiotemporal detail. Here we introduce AirCast-SR, a foundation model for atmospheric super-resolution that downscales global AI weather forecasts from 0.25 degree (~28 km) to 1 km horizontal resolution at hourly temporal resolution, producing 67-hour forecasts of eight coupled surface variables simultaneously. EarthMind-SR employs a three-dimensional U-Net conditioned within a Latent Consistency Model (LCM) diffusion framework, trained on patch-based samples over the contiguous United States (CONUS) using GraphCast forecasts as input and NOAA's Analysis of Record for Calibration (AORC) as the target. The model achieves near-zero bias across all variables and lead times, and its radial power spectral density analysis demonstrates preservation of fine-scale atmospheric structure at wavelengths of 10 km to 100 km where coarser models lose spectral power. We validate EarthMind-SR across three CONUS case studies spanning winter, summer, and spring seasons, and demonstrate zero-shot global transferability over India and Germany using independent surface station observations without any retraining or fine-tuning. As an open-weights foundation model, EarthMind-SR establishes a new paradigm for kilometer-scale AI weather prediction and provides a platform for regional fine-tuning, distillation, and downstream applications in climate services and hazard forecasting.

2605.26128 2026-05-27 cs.LG cs.SE 版本更新

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

约束税:小语言模型结构化输出中有效性与正确性的权衡度量

Jaideep Ray

发表机构 * ACM(美国计算机协会)

AI总结 本文提出“约束税”测量协议,通过实验证明硬输出约束会显著降低小语言模型的答案准确性和可执行准确性,并建议生产系统应分别报告模式有效性、答案准确性、可执行准确性和错误有效模式率。

详情
AI中文摘要

生产级LLM系统越来越需要机器可读的输出:JSON对象、类型化轨迹、正则表达式约束字段和工具调用模式。本文针对设备端和低成本小语言模型(SLM)部署,其中低于3B参数的模型因隐私、延迟和通用硬件而具有吸引力,但在解决任务时满足模式的能力有限。通常的工程假设是硬输出约束能提高可靠性而不改变底层答案。我们证明这一假设对小模型不安全。我们引入“约束税”,一种测量协议,用于在固定模型、固定任务分布和固定问题实例下,隔离由结构化输出约束引起的答案和可执行准确性损失。在Qwen2.5-0.5B、Qwen2.5-1.5B和SmolLM2-1.7B的15,000次通用GPU生成中,硬答案模式解码将模式有效性从61.5%提高到100.0%,但将答案准确性从19.7%降低到11.0%,并将错误有效模式输出从49.5%增加到88.9%。最强的工业类比是确定性日历工具调用任务:Qwen2.5-1.5B在仅提示JSON下达到91.5%的可执行准确性,但在相同硬工具调用模式下仅为48.0%,而两种模式都是100.0%模式有效。错误是语义性的,而非结构性的。我们还表明,3B边界仍然支付直接模式税,并且延迟包装支持一种建设性设计模式:自由推理,延迟约束。实际结论是直接的:生产系统应分别报告模式有效性、答案准确性、可执行准确性和错误有效模式率。

英文摘要

Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while solving tasks. The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce the constraint tax, a measurement protocol for isolating the answer and executable-accuracy loss caused by structured-output constraints at fixed model, fixed task distribution, and fixed problem instances. Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural. We also show that the 3B boundary still pays a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
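The recommendation to report the metrics separately can be illustrated with a short tally. The record format and the choice of denominator (all generations) are assumptions; the paper may define the rates differently, and executable accuracy is omitted here because it needs a task-specific executor.

```python
def structured_output_report(records):
    """Tally structured-output metrics over all generations.

    records: list of (schema_valid, answer_correct) boolean pairs.
    Every rate uses the total number of generations as its denominator
    (an assumption made for this sketch).
    """
    n = len(records)
    valid = sum(1 for v, _ in records if v)
    correct = sum(1 for _, c in records if c)
    wrong_valid = sum(1 for v, c in records if v and not c)
    return {
        "schema_validity": valid / n,
        "answer_accuracy": correct / n,
        "wrong_valid_schema_rate": wrong_valid / n,
    }


# Toy data: constrained decoding makes every output parse, yet most are wrong.
constrained = [(True, True), (True, False), (True, False), (True, False)]
report = structured_output_report(constrained)
```

A dashboard that shows only `schema_validity` would report 100% here, while three of four outputs are well-formed but wrong, which is exactly the gap the abstract warns about.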

2605.26127 2026-05-27 physics.med-ph cs.LG eess.IV 版本更新

Rapid online deep artifact suppression for real-time spiral bSSFP CMR with blipped-CAIPI simultaneous multi-slice imaging at 1.5 T

1.5 T 下使用 blipped-CAIPI 同步多层成像的实时螺旋 bSSFP CMR 的快速在线深度伪影抑制

Julius Åkesson, Iulius Dragonu, Einar Heiberg, Tina Yao, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Vivek Muthurangu, Jennifer Steeden

发表机构 * Centre for Translational Cardiovascular Imaging, Institute of Cardiovascular Science, University College London(转化心血管成像中心,心血管科学研究所,伦敦大学学院) Clinical Physiology, Department of Clinical Sciences Lund, Lund University, Skåne University Hospital(临床生理学,隆德临床科学系,隆德大学,斯科讷大学医院) Department of Biomedical Engineering, Faculty of Engineering, Lund University(生物医学工程系,工程学院,隆德大学) Research & Collaborations GBI, Siemens Healthcare Ltd(研究与合作GBI,西门子医疗有限公司)

AI总结 针对实时同步多层 bSSFP 心脏磁共振成像中采集和重建时间长的问题,提出基于 3D U-Net 的深度伪影抑制方法,实现快速在线重建,显著缩短采集和重建时间,同时保持诊断图像质量。

详情
AI中文摘要

目的:实时(RT)bSSFP MRI 可实现快速自由呼吸心血管成像,但功能评估需要 10-16 层,导致扫描时间延长。同步多层(SMS)成像可减少采集时间,但与非笛卡尔轨迹结合时,依赖迭代重建,阻碍了在线使用。本研究探索深度伪影抑制以促进 RT-SMS 的快速在线重建。 方法:在 1.5 T 下实现了一种同时采集两层的螺旋 bSSFP SMS RT 序列。重建使用 k 空间层分离,随后在图像空间中使用 3D U-Net 进行深度伪影抑制。对十名健康志愿者进行成像。比较深度伪影抑制和压缩感知(CS)重建的 RT-SMS 图像质量和重建时间。比较深度伪影抑制的 RT-SMS 与参考标准屏气(BH)成像的左心室(LV)和右心室(RV)舒张末期容积(EDV)和收缩末期容积(ESV)以及 LV 质量(LVM)。 结果:RT-SMS 采集比 BH 成像快约 13 倍(15 秒 vs 3 分 15 秒)。使用深度伪影抑制的 RT-SMS 重建比 CS 快约 50 倍(30 秒 vs 24 分 55 秒)。深度伪影抑制在定量和定性图像质量上始终优于 CS(p<0.001)。BH 与深度伪影抑制的 RT-SMS 之间的功能一致性良好(LVEDV:-7.5 +/- 6.8 ml,LVESV:-0.9 +/- 4.2 ml,RVEDV:-6.4 +/- 8.4 ml,RVESV:0.2 +/- 10.7 ml,LVM:-10.3 +/- 11.0 g)。 结论:RT-SMS bSSFP CMR 的在线深度伪影抑制重建实现了自由呼吸短轴覆盖,同时大幅减少了采集和重建时间,并保持了诊断图像质量。

英文摘要

Purpose: Real-time (RT) bSSFP MRI enables fast free-breathing cardiovascular imaging but requires 10-16 slices for functional assessment, resulting in prolonged scan times. Simultaneous multi-slice (SMS) imaging can reduce acquisition time but when combined with non-Cartesian trajectories, it relies on iterative reconstructions that preclude online use. This study investigates deep artifact suppression to facilitate rapid, online reconstruction of RT-SMS. Methods: A spiral bSSFP SMS RT sequence with two simultaneously acquired slices was implemented at 1.5 T. Reconstruction used slice separation in k-space, followed by deep artifact suppression in image space using a 3D U-Net. Ten healthy volunteers were imaged. RT-SMS image quality and reconstruction time were compared between deep artifact suppression and compressed sensing (CS) reconstructions. Left (LV) and right (RV) ventricular volumes at end diastole (EDV) and end systole (ESV) and LV mass (LVM) were compared between RT-SMS with deep artifact suppression and reference-standard breath-hold (BH) imaging. Results: The RT-SMS acquisition was ~13x faster than BH imaging (15 s vs 3 min 15 s). RT-SMS reconstruction using deep artifact suppression was ~50x faster than CS (30 s vs 24 min 55 s). Deep artifact suppression consistently outperformed CS in quantitative and qualitative image quality (p<0.001). Functional agreement between BH and RT-SMS with deep artifact suppression was good (LVEDV: -7.5 +/- 6.8 ml, LVESV: -0.9 +/- 4.2 ml, RVEDV: -6.4 +/- 8.4 ml, RVESV: 0.2 +/- 10.7 ml, LVM: -10.3 +/- 11.0 g). Conclusion: Online deep artifact suppression reconstruction for RT-SMS bSSFP CMR enables free-breathing short-axis coverage with a substantial reduction in acquisition and reconstruction time while maintaining diagnostic image quality.

2605.25678 2026-05-27 stat.ML cs.DS cs.LG math.ST stat.TH 版本更新

PAC Learning with Bandit Feedback: Sharp Sample Complexity in the Realizable Setting

带强盗反馈的PAC学习:可实现设置下的精确样本复杂度

Steve Hanneke, Qinglin Meng, Shay Moran, Amirreza Shaeiri

发表机构 * Technion – Israel Institute of Technology and Google Research(以色列理工学院和谷歌研究院)

AI总结 本文研究可实现设置下带强盗反馈的多类PAC学习问题,通过定义新的组合维度(强盗DS维度)并基于ListCascade算法,给出了最优样本复杂度的精确刻画(至多对数因子)。

Comments 18 pages

详情
AI中文摘要

我们研究了可实现设置下带强盗反馈的多类PAC学习问题。在该框架中,存在一个实例空间$\mathcal{X}$和标签空间$\mathcal{Y}$上的未知数据分布,与经典多类PAC学习相同,但学习器无法观察到独立同分布训练样本的标签。相反,在每一轮中,它接收一个无标签实例,预测其标签,并接收仅指示预测是否正确的强盗反馈。尽管有此限制,目标仍与经典PAC学习相同。我们对该问题的最优样本复杂度给出了一个一般性刻画,对于每个概念类至多相差对数因子。该刻画基于一个新的组合维度,称为强盗$\mathrm{DS}$维度,通过我们称为伪盒子的广义组合结构定义。这些结构扩展了$\mathrm{DS}$维度所依赖的伪立方体,允许每个坐标有不同数量的邻居。与通过计数伪立方体中坐标数量来刻画完全信息设置的$\mathrm{DS}$维度不同,强盗$\mathrm{DS}$维度聚合了各坐标的邻居数量,从而得到样本复杂度与邻居总数成比例的刻画。我们还提出了一种通用的学习算法,称为ListCascade,实现了上界,该算法将强盗学习与列表学习联系起来,可能具有独立意义。

英文摘要

We study the problem of multiclass PAC learning with bandit feedback in the realizable setting. In this framework, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$, as in classical multiclass PAC learning, but the learner does not observe the labels of the i.i.d. training examples. Instead, in each round, it receives an unlabeled instance, predicts its label, and receives bandit feedback indicating only whether the prediction is correct. Despite this restriction, the goal remains the same as in classical PAC learning. We provide a general characterization of the optimal sample complexity of this problem, sharp for every concept class up to logarithmic factors. Our characterization is based on a new combinatorial dimension, termed the bandit $\mathrm{DS}$ dimension, defined via generalized combinatorial structures we call pseudo-boxes. These extend the pseudo-cubes underlying the $\mathrm{DS}$ dimension by allowing a different number of neighbors in each coordinate. In contrast to the $\mathrm{DS}$ dimension, which governs the full-information setting by counting the number of coordinates in the pseudo-cube, the bandit $\mathrm{DS}$ dimension aggregates the number of neighbors across coordinates, leading to a characterization in which the sample complexity scales with the total number of neighbors. We also propose a general learning algorithm achieving the upper bound, based on an algorithmic principle called ListCascade, which connects bandit learning to list learning and may be of independent interest.
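The interaction protocol (predict a label, learn only whether it was right) can be made concrete with a naive learner for a finite hypothesis class. This is not the ListCascade algorithm from the paper; it is a simple version-space elimination baseline, and the toy hypothesis class is invented for illustration.

```python
def bandit_learn(hypotheses, instances, feedback):
    """Version-space elimination under bandit feedback (realizable case).

    hypotheses: list of callables x -> label.
    feedback(x, y_hat) -> bool, True iff y_hat is the true label of x.
    The learner never sees the true label, only the correctness bit.
    """
    version_space = list(hypotheses)
    for x in instances:
        y_hat = version_space[0](x)       # predict with any survivor
        was_correct = feedback(x, y_hat)
        # Keep hypotheses consistent with the one bit of feedback:
        # they must agree with y_hat iff y_hat was correct.
        version_space = [h for h in version_space if (h(x) == y_hat) == was_correct]
    return version_space


# Toy class over labels {0, 1, 2}: h_k(x) = (x + k) mod 3.
hypotheses = [lambda x, k=k: (x + k) % 3 for k in range(3)]
target = hypotheses[2]


def feedback(x, y_hat):
    return target(x) == y_hat


survivors = bandit_learn(hypotheses, [0, 1, 2, 3], feedback)
```

A wrong guess rules out only the hypotheses that predicted that one label, whereas full-information feedback would rule out every hypothesis that disagrees with the revealed label. This is the intuition behind a sample complexity that scales with the number of alternative labels ("neighbors") rather than only the number of coordinates.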

2605.25629 2026-05-27 cs.CL cs.LG 版本更新

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

当分布内增益失效:评估偏好转移下的弱到强奖励模型

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen

发表机构 * National University of Singapore(新加坡国立大学) VinUniversity Nanyang Technological University(南洋理工大学)

AI总结 研究弱到强偏好学习在零样本分布转移下的表现,发现弱监督微调会导致强模型偏向源域特征,提出表示锚定正则化方法以改善跨分布迁移。

Comments Code: https://anonymous.4open.science/r/w2s_reward_ood-682F

详情
AI中文摘要

弱到强(W2S)泛化是一种有前景的可扩展监督框架,然而现有评估通常在同分布训练-测试条件下进行。因此,我们研究零样本分布转移下的W2S偏好学习,发现基于弱偏好标签训练的强学生模型在分布内表现成功,但无法跨偏好数据集迁移。我们提供了证据表明存在一种表示失败模式:弱监督微调可能将强模型拉向源域特征,而不是保持广泛可迁移的偏好表示。为了缓解这一问题,我们提出表示锚定(Anchor),一种简单而有效的正则化方法,在微调过程中约束强模型预训练表示空间的过度漂移,同时允许任务相关的适应。在多个偏好领域、数据集和模型家族中,Anchor一致地改进了分布外迁移,同时保持了具有竞争力的分布内性能。综合来看,我们的评估协议、迁移感知指标和方法揭示了当前W2S奖励建模中隐藏的脆弱性,并为实现更稳健的偏好迁移提供了实用路径。

英文摘要

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.

2605.25353 2026-05-27 cs.LG cs.CV physics.comp-ph 版本更新

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

PDEInvBench:面向PDE逆问题的神经网络综合数据集与设计空间探索

Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

发表机构 * Department of Computer Science, UC Berkeley(加州大学伯克利分校计算机科学系) UC Berkeley(加州大学伯克利分校) Departments of Computer Science and Chemical Engineering, UC Berkeley(加州大学伯克利分校计算机科学系与化学工程系) LBNL(劳伦斯伯克利国家实验室)

AI总结 提出PDEInvBench基准数据集,通过数值模拟涵盖多种PDE,并沿优化、表示和缩放三个维度系统探索神经网络设计空间,发现两阶段训练、PDE导数输入和初始条件多样性等实用见解。

Comments 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

偏微分方程(PDE)中的逆问题涉及从观测到的时空解场估计系统的物理参数。神经网络因其对函数到函数空间变换的建模能力,非常适合PDE参数估计。虽然现有的面向PDE的机器学习方法基准主要关注正问题,但尚无针对PDE逆问题(即从解场映射到潜在物理参数)的类似综合研究和基准数据集。我们通过引入PDEInvBench填补了这一空白,这是一个全面的基准数据集,包含含时与不含时PDE的数值模拟,覆盖广泛的物理行为和参数。我们的数据集包括评估划分,用于评估在分布内和多种分布外设置下的性能。利用我们的基准数据集,我们沿三个关键维度全面探索了神经网络在PDE逆问题中的设计空间:(1)优化过程,分析监督、自监督和测试时训练目标对性能的作用;(2)问题表示,研究具有不同归纳偏置的架构选择和各种条件策略的价值;(3)缩放,针对模型和数据大小进行。我们的实验揭示了几个实用见解:1)神经网络在两阶段训练过程中表现最佳:先用PDE参数进行初始监督,然后使用PDE残差进行测试时微调;2)将PDE导数作为输入特征始终能提高精度;3)增加训练数据中初始条件的多样性比扩大PDE参数范围带来更大的性能提升。我们公开了数据集和代码库。

英文摘要

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.
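
To make the "PDE residual" objective concrete, here is a toy sketch that is not taken from the benchmark's codebase. It assumes a 1D heat equation u_t = nu * u_xx on a periodic grid and recovers the scalar nu by minimizing the squared residual computed from finite-difference derivatives. In the paper the residual is used to fine-tune a network at test time; here a single parameter stands in for the network so that the minimizer has a closed form. All names and grid sizes are illustrative.

```python
import math

def heat_solution(nu, k, x, t):
    # Exact solution of u_t = nu * u_xx for a single sine mode
    return math.exp(-nu * k * k * t) * math.sin(k * x)

def estimate_nu_from_residual(u, dx, dt):
    # u[n][i]: field at time step n and grid point i (periodic in x).
    # Minimizes sum (u_t - nu * u_xx)^2 over nu, which gives nu = <u_t, u_xx> / <u_xx, u_xx>.
    num = den = 0.0
    nx = len(u[0])
    for n in range(1, len(u) - 1):
        for i in range(nx):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * dt)  # central difference in time
            u_xx = (u[n][(i + 1) % nx] - 2 * u[n][i] + u[n][i - 1]) / dx ** 2  # periodic in x
            num += u_t * u_xx
            den += u_xx * u_xx
    return num / den

nx, nt, dt = 64, 21, 0.01
dx = 2 * math.pi / nx
field = [[heat_solution(0.1, 1, i * dx, n * dt) for i in range(nx)] for n in range(nt)]
nu_hat = estimate_nu_from_residual(field, dx, dt)  # close to the true value 0.1
```

The derivatives `u_t` and `u_xx` computed inside the loop are the same kind of quantity the abstract refers to when it reports that PDE derivatives help as input features.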

2605.24071 2026-05-27 cs.LG cs.AI 版本更新

Not All Transitions Matter: Evidence from PPO

并非所有转移都重要:来自PPO的证据

Ajhesh Basnet

发表机构 * Department of Artificial Intelligence and Data Science(人工智能与数据科学系) KPR Institute of Engineering and Technology(KPR工程科技研究院)

AI总结 本文提出在PPO训练中随机丢弃一定比例的轨迹转移,以打破重复梯度结构,稳定训练,并在多个环境中验证了效果。

Comments 19 pages, 5 figures. Accepted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

Journal ref
Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing
AI中文摘要

以同策略(on-policy)方式训练强化学习智能体意味着每次更新都要收集新的经验,而这些经验隐藏着一个问题。轨迹中的每个状态都是前一个状态的直接产物,由智能体自身的动作因果地串联起来。因此,连续的转移从来不是真正独立的。它们携带重叠的信息,网络接收到的梯度信号远比批次大小所暗示的更为重复。相同的方向被反复强化,价值网络在策略变化时难以跟上,训练会以仅凭奖励曲线很难察觉的方式悄然变得不稳定。本文探讨能否直接去除这种冗余。我们表明,在恰当的阶段从轨迹中随机丢弃固定比例的转移(以保持奖励信号完整),足以打破重复的梯度结构并稳定训练。改动很小:只增加一个采样步骤,不引入新组件,不修改核心算法,并且适用于任何PPO实现。在五个难度递增的环境(CartPole-v1、Acrobot-v1、LunarLander-v2、HalfCheetah-v5和Hopper-v5)中,该方法在奖励上与标准PPO持平,同时在KL散度、策略熵和价值估计方面产生更一致的训练动态。丢弃25%的转移是最佳平衡点:足以破坏冗余,又不至于使批次过于稀薄。

英文摘要

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
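
The abstract says transitions are dropped "at the right stage so the reward signal stays intact" without naming the stage. A minimal sketch under one assumed reading: advantages are computed with GAE on the full, unbroken rollout, and only afterwards is a random 25% of transitions removed from the batch used for the gradient update. The function names and the choice of GAE are assumptions, not details confirmed by the source.

```python
import random

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Standard GAE over the full rollout (single episode, no resets inside it)
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def drop_transitions(num_transitions, drop_fraction=0.25, rng=None):
    # Returns the sorted indices of transitions kept for the policy/value update
    rng = rng or random.Random(0)
    keep = num_transitions - int(round(drop_fraction * num_transitions))
    return sorted(rng.sample(range(num_transitions), keep))

# Usage: compute advantages first, then subsample what the optimizer sees
rewards, values = [1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]
adv = gae_advantages(rewards, values, last_value=0.0)
kept = drop_transitions(len(rewards))
batch = [(t, adv[t]) for t in kept]
```

Because the advantages already encode every reward in the rollout, dropping rows afterwards thins the batch without truncating any return.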

2605.24042 2026-05-27 cs.LG cs.AI 版本更新

Hidden-State Privacy Has an Empty Middle

隐藏状态隐私存在“空白中间地带”

Alexander Okezue Bell

发表机构 * Stanford University(斯坦福大学)

AI总结 通过理论下界和实验表明,高斯释放机制在隐藏状态隐私中无法同时实现中等效用和中等隐私,即存在“空白中间地带”;对角逆Fisher机制是唯一的极小极大最优对角机制,但位于隐私/效用边界上,由此主张转向架构或释放的协同设计。

Comments 74 pages, 61 figures

详情
AI中文摘要

在我们针对单层隐藏状态隐私测试的$1{,}536$个高斯释放协方差中,没有一个能在自适应检索攻击者面前同时实现中等效用和中等隐私。我们证明了一个与之互补的Fisher球下界:每个Fisher效用为$O(1)$的满秩高斯释放都存在一个方向,其马氏信号随隐藏宽度线性增长,这排除了该类中一致的高斯安全性,并与经验上的“空白中间地带”相吻合。对角逆Fisher释放$\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$是在一阶KL预算$\mathcal{K}$下唯一的极小极大最优对角机制,也是在由32个模型-层组合构成的网格的每个点上最坏攻击者top-1 $\le 0.001$的唯一释放,但它位于隐私/效用的边界上,而没有填补中间地带。一种在欧几里得检索下达到$13\times$帕累托缩减的广义特征机制,在自适应马氏攻击者下崩溃为$100\%$ top-1;全轨迹序列反演器可恢复$94\%$的干净GPT-2前缀,但在$\Sigma_{\mathrm{diag}}$下为$0\%$。从头训练的分离记忆Transformer在90M规模下达到$G_{\mathrm{Mah}} \in [20, 33]$,并在固定token数的语言建模损失代价下,从30M到1B相对同等预算的GPT基线保持$6$--$24\times$的优势;预训练模型最高仅为9.3。这些结果将隐藏状态释放问题从高斯类内的机制设计重新定位为架构或释放的协同设计。

英文摘要

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
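
The diagonal inverse-Fisher formula in the abstract can be evaluated directly. The sketch below computes the per-coordinate variances (2K/d)/F_ii and adds Gaussian noise to a hidden vector. The helper `first_order_kl` assumes the budget is defined as one half of tr(F Sigma) for diagonal F; that definition is an assumption made here for illustration and is not stated in the abstract. Under it, each coordinate consumes K/d of the budget.

```python
import random

def diag_inverse_fisher_variances(fisher_diag, kl_budget):
    # Sigma_ii = (2K/d) / F_ii, following the formula quoted in the abstract
    d = len(fisher_diag)
    return [(2.0 * kl_budget / d) / f for f in fisher_diag]

def first_order_kl(fisher_diag, variances):
    # Assumed budget definition: 0.5 * tr(F Sigma) with diagonal F and Sigma
    return 0.5 * sum(f * v for f, v in zip(fisher_diag, variances))

def release(hidden, variances, rng=None):
    # Adds independent zero-mean Gaussian noise to each hidden coordinate
    rng = rng or random.Random(0)
    return [h + rng.gauss(0.0, v ** 0.5) for h, v in zip(hidden, variances)]

fisher = [0.5, 2.0, 4.0, 8.0]  # hypothetical diagonal Fisher entries
variances = diag_inverse_fisher_variances(fisher, kl_budget=0.3)
noisy = release([1.0, -1.0, 0.5, 0.0], variances)
```

Coordinates with small Fisher information receive large noise and sensitive coordinates receive little, which is the sense in which the mechanism spends its budget evenly.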

2605.24038 2026-05-27 physics.space-ph astro-ph.EP astro-ph.IM cs.LG 版本更新

Aurora Hunter: A Two-Stage Framework for Probabilistic Visibility Forecasting

极光猎人:一种用于概率可见性预测的两阶段框架

Zongyuan Ge, Chenwaner Zhang, Haoyang Li, Hantai Zhang, Wei Zhou, Wenxin Gu, Zhaoming Wang

发表机构 * College of Physics and Optoelectronic Engineering, Ocean University of China, Qingdao 266100, China(中国海洋大学物理与光电工程学院,青岛266100,中国) School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650504, People's Republic of China(云南民族大学数学与计算机科学学院,昆明650504,中华人民共和国) Engineering Research Center of Advanced Marine Physical Instruments and Equipment, Ministry of Education, Qingdao 266100, China(教育部先进海洋物理仪器与设备工程研究中心,青岛266100,中国) Qingdao Key Laboratory of Optics and Optoelectronics, Qingdao 266100, China(青岛市光学与光电子学重点实验室,青岛266100,中国)

AI总结 提出Aurora Hunter两阶段级联模型,分别预测极光发生概率和观测条件概率,实现高精度极光可见性预测。

AI中文摘要

预测北极光可见性对于空间天气研究和极光旅游具有重要意义。某个地点和夜晚的可见性取决于两个不同因素:(1)极光是否实际发生,由太阳风-磁层耦合驱动;(2)观测条件是否允许肉眼检测,主要是云层覆盖和月照。我们提出了Aurora Hunter,一个两阶段级联模型,将这两个因素解耦。第一阶段使用XGBoost基于51个物理驱动特征预测P(发生),这些特征在联合的Tromso+Kiruna数据(约16,600小时样本,2015-2023年)上训练,标签来自Tromso AI全天相机图像分类器。第二阶段使用逻辑回归基于21个云层覆盖和月照特征预测P(晴朗观测|发生),仅在极光发生时段训练。级联模型P(可见)=P(发生)*P(晴朗|发生)在Tromso测试集(2019-2020年)上达到ROC-AUC 0.937,在独立Kiruna数据(2024年)上达到0.905,比单阶段基线提高了0.087。留出的Skibotn数据(2022-2025年)验证了跨站点泛化能力。SHAP识别出Kp×夜侧相互作用、MLT位置和极光椭圆距离为主要预测因子(合计39%)。原型:https://aurora-hunter.onrender.com。

英文摘要

Forecasting aurora borealis visibility matters for space weather research and aurora tourism. Visibility at a site and night depends on two distinct factors: (1) whether aurora is physically occurring, driven by solar wind-magnetosphere coupling, and (2) whether observing conditions allow naked-eye detection, mainly cloud cover and lunar illumination. We present Aurora Hunter, a two-stage cascade that decouples these factors. Stage 1 predicts P(occurring) with XGBoost using 51 physics-driven features trained on joint Tromso+Kiruna data (about 16,600 hourly samples, 2015-2023) with labels from the Tromso AI all-sky image classifier. Stage 2 predicts P(clear observation given occurring) with logistic regression using 21 cloud-cover and lunar-illumination features trained only on aurora-occurring hours. The cascade P(visible)=P(occurring)*P(clear|occurring) reaches ROC-AUC 0.937 (Tromso test, 2019-2020) and 0.905 (independent Kiruna, 2024), improving a single-stage baseline by +0.087. Held-out Skibotn data (2022-2025) confirm cross-site generalization. SHAP identifies the Kp x nightside interaction, MLT position, and auroral oval distance as dominant predictors (39% combined). Prototype: https://aurora-hunter.onrender.com.
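
The cascade rule P(visible) = P(occurring) * P(clear | occurring) and the reported ROC-AUC metric are simple enough to write out. The sketch below is illustrative only: the stage outputs are passed in as plain numbers rather than produced by XGBoost or logistic regression, and the pairwise AUC is the textbook rank definition, not the project's evaluation code.

```python
def visible_probability(p_occurring, p_clear_given_occurring):
    # Combines the two stage outputs with the cascade rule from the abstract
    for p in (p_occurring, p_clear_given_occurring):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_occurring * p_clear_given_occurring

def roc_auc(scores, labels):
    # Probability that a random positive outranks a random negative (ties count 0.5)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A strong aurora under heavy cloud, say `visible_probability(0.9, 0.1)`, scores low, which is the situation a single-stage model has to learn implicitly.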

2605.24001 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Diff-Instruct with Diffused Reward: 迈向有原则的一步生成器强化学习

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

发表机构 * Purdue University(普渡大学) hi-lab, Xiaohongshu Inc.(小红书公司 hi-lab)

AI总结 针对一步生成器强化学习中奖励优化与生成动力学不匹配的问题,提出基于积分KL最小化的无数据轨迹级对齐框架DIDR,通过扩散奖励分数和代理估计器实现奖励驱动的校正,在一步SDXL和6B DiT骨干网络上取得帕累托优势。

Comments author list correction

AI中文摘要

近期一步文本到图像生成的进展实现了实时合成,具有显著的效率和质量。先前用于一步生成器的强化学习方法将图像空间奖励优化与扩散噪声空间分布匹配相结合。这种范式由于终端奖励优化与底层生成动力学之间的不匹配带来了挑战。结果,优化倾向于利用随机自由度,通常以牺牲图像保真度为代价来提高奖励。为了解决这个问题,我们提出了Diff-Instruct with Diffused Reward (DIDR),一个从积分KL最小化推导出的无数据轨迹级对齐框架。DIDR将RLHF最优的奖励倾斜干净图像分布沿扩散轨迹传播到所有噪声水平。我们证明该目标与干净图像RLHF具有相同的最小化器,同时自然诱导出扩散奖励分数(DRS),它作为对参考分数函数的奖励驱动校正。为了使其实用,我们进一步引入了扩散奖励代理(DRP),一种基于可微短步去噪的DRS高效估计器。大量实验表明,DIDR持续帕累托主导现有的一步SDXL基线。此外,当迁移到6B DiT骨干网络(Z-Image)时,DIDR在偏好对齐上超越了其50步教师模型,同时仅需单步生成。

英文摘要

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

2605.23991 2026-05-27 physics.ao-ph astro-ph.EP cs.LG 版本更新

Quantification of atmospheric carbon dioxide from the Geostationary Operational Environmental Satellite (GOES East)

从地球静止业务环境卫星(GOES East)量化大气二氧化碳

Aaron Sonabend-W, Sean Campbell, John Platt, Christopher Van Arsdale, Anna M. Michalak

发表机构 * Google Research(谷歌研究院) Google Ads(谷歌广告) Carnegie Institution for Science(卡内基科学研究所)

AI总结 利用GOES-East卫星的高时空分辨率数据,通过物理引导的神经网络DeepXCO2估算干空气柱CO2摩尔分数,并验证其捕捉真实XCO2变异性的能力。

Comments 28 pages, 9 figures, 1 table

AI中文摘要

以支持在本地到全球尺度上独立核查CO2通量所需的分辨率、精度和准确度来追踪温室气体,这一需求日益迫切。然而,当前一代天基传感器在空间和时间上仅能提供稀疏的观测。这一挑战激发了人们的兴趣:能否利用原本为其他用途开发的现有任务数据来推断全球温室气体的变异。地球静止业务环境卫星(GOES-East)自2017年起运行,其搭载的先进基线成像仪(ABI)从地球静止轨道以10分钟间隔、16个光谱通道、约2 km²的空间分辨率覆盖西半球大部分地区。在此,我们利用这种高空间覆盖和高时间重访能力,开发了DeepXCO2,一种单像素、物理引导的神经网络,用于估算干空气柱CO2摩尔分数(XCO2)。DeepXCO2的输入包括GOES-East的16个光谱波段的时间序列、ECMWF ERA5低对流层气象数据、MODIS地表反射率、太阳和卫星观测几何以及年积日。该网络在时空匹配的GOES-East与OCO-2/OCO-3观测数据上训练。与留出的一年OCO-2和OCO-3观测以及TCCON网络观测相比,DeepXCO2能够捕捉到符合实际的XCO2变异性。我们还给出了案例研究,展示如何利用DeepXCO2观测城市上空的XCO2增强以及农业区上空的XCO2下降。总体而言,虽然由GOES-East反演的XCO2精度永远无法与专用仪器相媲美,但连续的地理覆盖、10分钟的时间频率和多年记录这一前所未有的组合,使我们有可能观测到目前从太空无法看到的大气CO2变异特征。

英文摘要

There is a growing urgency to track greenhouse gasses with the resolution, precision and accuracy needed to support independent verification of $CO_2$ fluxes at local to global scales. The current generation of space-based sensors, however, only provides sparse observations in space and time. This challenge has fueled interest in the potential use of data from existing missions originally developed for other applications to infer global greenhouse gas variability. The Advanced Baseline Imager (ABI) onboard the Geostationary Operational Environmental Satellite (GOES-East), operational since 2017, provides full coverage of much of the western hemisphere at 10-minute intervals from geostationary orbit across 16 spectral channels at an approximately 2 km$^2$ spatial resolution. Here, we leverage this high spatial coverage and temporal revisit to develop Deep$XCO_2$, a single-pixel, physics-guided neural network to estimate dry-air column $CO_2$ mole fraction ($XCO_2$). Deep$XCO_2$ employs a time series of GOES-East's 16 spectral bands, ECMWF ERA5 lower tropospheric meteorology, MODIS surface reflectance, solar and satellite viewing geometry, and day of year. The network was trained on collocated GOES-East and OCO-2/OCO-3 observations. Deep$XCO_2$ is able to capture realistic $XCO_2$ variability when compared against a held-out year of OCO-2 and OCO-3 observations, and against observations from the TCCON network. We also present case studies illustrating the use of Deep$XCO_2$ to observe $XCO_2$ enhancements over urban areas and drawdown over agricultural regions. Overall, while the precision of GOES-East derived $XCO_2$ can never rival that of dedicated instruments, the unprecedented combination of contiguous geographic coverage, 10-minute temporal frequency, and multi-year record offers the potential to observe aspects of atmospheric $CO_2$ variability currently unseen from space.

2605.22774 2026-05-27 cs.LG cs.AI cs.HC 版本更新

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

CogAdapt: 通过导联适应将临床心电图基础模型迁移至可穿戴认知负荷评估

Amir Mousavi, Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

发表机构 * Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio(计算机科学系,人工智能、网络与计算学院,德克萨斯大学圣安东尼奥分校) Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio(教育心理学系,教育与人类发展学院,德克萨斯大学圣安东尼奥分校)

AI总结 提出CogAdapt框架,通过可学习适配器LeadBridge将3导联可穿戴信号转换为12导联表示,并结合渐进微调策略ProFine,实现临床心电图基础模型向可穿戴认知负荷评估的迁移,在跨受试者验证中显著优于从头训练的基线模型。

Comments 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

AI中文摘要

实时认知负荷评估对于自适应人机交互至关重要,但由于标记数据有限和跨受试者泛化能力差,仍然具有挑战性。最近在数百万临床记录上预训练的心电图基础模型提供了丰富的表示,但由于传感器配置不匹配和任务差异,无法直接应用于可穿戴设备。在本文中,我们提出了CogAdapt,一个将临床心电图基础模型适应于可穿戴认知负荷评估的框架。CogAdapt引入了LeadBridge,一个可学习的适配器,将3导联可穿戴信号转换为解剖学一致的12导联表示,以及ProFine,一种渐进微调策略,逐步解冻编码器层同时防止灾难性遗忘。在两个公共数据集(CLARE和CL-Drive)上的留一受试者交叉验证评估表明,CogAdapt显著优于从头训练的基线,宏F1分数分别达到0.626和0.768。这些结果证明了基础模型适应用于从可穿戴传感器进行与受试者无关的认知负荷评估的前景。

英文摘要

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

2605.22162 2026-05-27 astro-ph.IM astro-ph.SR cs.LG 版本更新

Spectra as Language: Large Language Models for Scalable Stellar Parameter and Abundance Inference

光谱即语言:用于可扩展恒星参数和丰度推断的大型语言模型

Hai-Ling Lu, Yu-Yang Li, Yin-Bi Li, Cun-Shi Wang, A-Li Luo, Jun-Chao Liang, Shuo Li

发表机构 * National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China(中国科学院国家天文台) University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学) School of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学天文与空间科学学院) University of Chinese Academy of Sciences, Nanjing 211135, China(中国科学院大学南京校区)

AI总结 提出两阶段大型语言模型框架,将恒星光谱视为序列信号,实现有效温度、表面重力、金属丰度及约20种化学元素丰度的准确估计,并展示随数据量增加性能系统提升的可扩展性。

AI中文摘要

恒星光谱编码了恒星物理性质和化学成分的关键信息。准确的恒星参数测定对于解决星系和恒星演化等重大问题至关重要。大规模光谱巡天积累了前所未有的光谱数据。传统的特征提取或模型拟合方法难以处理高维、大规模数据集,泛化能力有限且计算效率低。大型语言模型的最新进展在自然语言处理、DNA/RNA序列分析以及蛋白质/化学解析等任务中展示了强大的泛化能力和特征学习能力。恒星光谱是连续的序列信号,使得语言模型可以迁移到恒星光谱学。在此,我们提出一个两阶段大型语言模型框架用于恒星参数推断,实现了有效温度、表面重力、金属丰度以及约20种化学元素丰度的准确估计。缩放律分析显示,随着数据增加,性能系统性地提升,为即将到来的大规模巡天提供了一个可扩展的框架。

英文摘要

Stellar spectra encode key information on the physical properties and chemical compositions of stars. Accurate stellar parameter determination is essential for addressing major questions such as galaxy and stellar evolution. Large-scale spectroscopic surveys have accumulated unprecedented spectral data. Traditional feature extraction or model-fitting approaches struggle with high-dimensional, massive datasets, limited generalization, and computational inefficiency. Recent advances in large language models demonstrate strong generalization and feature-learning in tasks like natural language processing, DNA/RNA sequence analysis, and protein/chemical parsing. Stellar spectra are continuous sequential signals, enabling the transfer of language models to stellar spectroscopy. Here, we propose a two-stage large language model framework for stellar parameter inference, achieving accurate estimation of effective temperature, surface gravity, metallicity, and abundances of ~20 chemical elements. Scaling-law analyses show systematic performance improvements with increasing data, providing a scalable framework for forthcoming large-scale surveys.

2605.20988 2026-05-27 cs.LG cs.AI 版本更新

A Sharper Picture of Generalization in Transformers

Transformer 泛化能力的更清晰图景

Paul Lintilhac, Sair Shaikh

发表机构 * Thayer School of Engineering, Dartmouth College(达特茅斯学院塞耶工程学院)

AI总结 本文通过PAC-Bayes理论研究Transformer在布尔域上的泛化行为,证明稀疏低阶频谱可实现低锐度构造并得到非平凡的泛化界,解释了思维链为何能改善高阶目标函数的泛化。

Comments 10 pages, 9 figures, 41 pages of supplementary material

详情
AI中文摘要

我们从目标函数的傅里叶谱角度研究Transformer在布尔域上的泛化行为。与先前基于Rademacher复杂度推导泛化界的工作(Edelman等人,2022;Trauger & Tosh,2024)不同,我们探讨了通过PAC-Bayes理论获得泛化界的可行性。我们证明,集中在低阶分量上的稀疏谱能够实现具有良好泛化性质的低锐度构造。我们的思路是证明存在实现任何稀疏度不超过上下文长度的布尔函数的平坦极小值,然后将PAC-Bayes界应用于一个理想化的低锐度学习器,从而得到一个非平凡的泛化界。我们利用这一点正式解释了为什么思维链能改善高阶目标函数的泛化,并展示了我们界中的复杂度参数可以通过性质测试高效估计。我们通过实验评估了预测,并进行了机制可解释性研究,以支持我们的理论构造在真实Transformer中的现实性。

英文摘要

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
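
The abstract's notions of spectral sparsity and degree can be checked by brute force on small inputs. The sketch below uses the standard Fourier expansion over {-1, +1}^n, where the coefficient on a subset S is the average of f(x) times the product of x_i for i in S. It illustrates the definitions only and is unrelated to the paper's PAC-Bayes construction; the two example functions are chosen here for illustration.

```python
from itertools import combinations, product

def fourier_coefficients(f, n):
    # f maps a tuple in {-1, +1}^n to a real number.
    # The coefficient on subset S is E_x[f(x) * prod_{i in S} x_i] under the uniform distribution.
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total = 0
            for x in points:
                chi = 1
                for i in subset:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[subset] = total / len(points)
    return coeffs

def sparsity_and_degree(coeffs, tol=1e-12):
    # Sparsity: number of nonzero coefficients. Degree: size of the largest subset in the support.
    support = [s for s, c in coeffs.items() if abs(c) > tol]
    return len(support), max((len(s) for s in support), default=0)

parity3 = lambda x: x[0] * x[1] * x[2]
majority3 = lambda x: 1 if sum(x) > 0 else -1
```

Parity on three bits has a single coefficient at the top degree, which makes it sparse but high-degree; this is the kind of target for which the abstract argues that chain-of-thought helps.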

2605.20291 2026-05-27 cs.LG 版本更新

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel: 通过重要性-多样性数据选择实现Web智能体的域外泛化

Fatemeh Pesaran Zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim

发表机构 * Seoul National University(首尔大学) McGill University(麦吉尔大学) Mila -- Quebec AI Institute(Mila魁北克人工智能研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席)

AI总结 提出Weasel方法,通过优化平衡单步重要性与状态、网站、交互模式成对多样性的目标,选择固定预算的轨迹子集,结合目标中心AXTree剪枝和风格一致理由替换,提升Web智能体离线训练的域外泛化性能并降低训练成本。

Comments ICML 2026. Code is released at https://github.com/fatemehpesaran310/weasel

详情
AI中文摘要

大型语言模型(LLMs)使得Web智能体能够通过多步浏览器交互遵循自然语言目标。然而,在特定轨迹和领域上微调的智能体通常难以泛化到域外,且离线训练可能因噪声、冗余轨迹和长可访问性树(AXTree)状态而计算效率低下。为了解决这两个问题,我们提出了Weasel,一种用于Web智能体离线训练的轨迹选择方法。Weasel通过优化一个平衡状态、网站和交互模式上的单步重要性与成对多样性的目标,选择固定预算的轨迹步骤子集,并使用贪心算法高效求解。我们进一步通过目标中心AXTree剪枝(仅保留真实动作目标周围的内容)提高效率,并通过用模型生成的、风格一致的理由替换专家轨迹,缓解推理原生模型的风格不匹配问题。在AgentTrek和NNetNav训练数据集上,以及在WebArena、WorkArena和MiniWob中的评估,以及使用Qwen2.5-7B、Gemma3-4B和Qwen3-8B的实验表明,Weasel在降低训练成本的同时提高了域外性能,相比标准微调实现了约9.7-12.5倍的训练加速。我们在https://github.com/fatemehpesaran310/weasel提供代码。

英文摘要

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
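
A greedy selection that trades unary importance against pairwise redundancy can be sketched in a few lines. This is a generic illustration of the idea described in the abstract, not the Weasel objective itself: the scoring rule (importance minus weighted similarity to already selected steps), the parameter `diversity_weight`, and the toy inputs are all assumptions.

```python
def greedy_select(importance, similarity, budget, diversity_weight=1.0):
    # importance[i]: unary score of step i (higher is better).
    # similarity[i][j]: pairwise similarity in [0, 1]; similar pairs are penalized.
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < budget:
        def gain(i):
            # Marginal gain of adding step i given what is already selected
            redundancy = sum(similarity[i][j] for j in selected)
            return importance[i] - diversity_weight * redundancy
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Steps 0 and 1 are near-duplicates; step 2 is less important but different
importance = [1.0, 0.9, 0.5]
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
chosen = greedy_select(importance, similarity, budget=2)
```

With a budget of two, the second pick skips the near-duplicate step 1 in favor of step 2, which is the behavior a diversity term is meant to produce.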

2605.20255 2026-05-27 cs.LG cs.AI cs.HC cs.RO 版本更新

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

行人行为不确定性下安全自动驾驶的多智能体强化学习

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

发表机构 * University of Bern(伯尔尼大学) AI4I, The Italian Institute of Artificial Intelligence(AI4I,意大利人工智能研究所)

AI总结 本文使用多智能体近端策略优化(MAPPO)联合训练自动驾驶汽车和12个行人,通过隐藏的行人特质模拟乱穿马路行为,相比固定策略基线显著降低了碰撞率,并揭示了速度差异指标可用于检测未预期的乱穿马路行为。

Comments Accepted to ICRA 2026 Workshop "8th Workshop on Long-term Human Motion Prediction"

详情
AI中文摘要

自动驾驶汽车(SDC)的仿真测试通常依赖脚本化行人模型,这些模型无法捕捉真实过街行为的异质性和不确定性,限制了安全评估的真实性,尤其是对于由车辆无法观察到的潜在人格特质支配的乱穿马路行为。我们假设,通过多智能体强化学习(MARL)联合训练行人和SDC,相比针对固定行人策略训练,能产生更真实的交互场景,并且可预测与不可预测过街行为之间的差距可以直接从轨迹中测量。我们使用多智能体近端策略优化(MAPPO)联合训练一个SDC和12个行人:行人移动遵循脚本化的Dijkstra路径规划,而RL策略控制高层的前进/等待决策,乱穿马路概率取决于每个行人在回合开始时采样并隐藏于SDC的特质。在500回合评估中,联合训练的SDC达到78%的目标完成率,碰撞率为14%,而最佳基于规则的基线分别为35%和33%。速度差异指标显示,在近距离(0-3米)范围内,SDC在乱穿马路者附近比在人行横道使用者附近快2.65米/秒,表明乱穿马路遭遇未被预期。乱穿马路占过街事件的13%,但占碰撞的62%,并且联合训练相比单智能体RL减少了30%的碰撞,因为行人学会了在SDC高速接近时等待。

英文摘要

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.
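
The speed differential metric is described as the difference in SDC speed near jaywalkers versus near crosswalk users at close range (0-3 m). A minimal sketch of that computation follows, assuming each logged sample is a tuple of distance, SDC speed, and a jaywalker flag. The record layout and the sample values are hypothetical and are not data from the paper.

```python
def speed_differential(encounters, max_distance=3.0):
    # encounters: (distance_m, sdc_speed_mps, is_jaywalker) tuples, one per logged sample
    near_jay = [v for d, v, jay in encounters if d <= max_distance and jay]
    near_cross = [v for d, v, jay in encounters if d <= max_distance and not jay]
    if not near_jay or not near_cross:
        raise ValueError("need samples from both groups inside the distance band")
    return sum(near_jay) / len(near_jay) - sum(near_cross) / len(near_cross)

# Hypothetical log: the last sample is outside the 0-3 m band and is ignored
log = [(1.0, 5.0, True), (2.0, 6.0, True), (2.5, 3.0, False), (10.0, 9.0, False)]
gap = speed_differential(log)
```

A positive value means the vehicle was moving faster near jaywalkers than near crosswalk users, which the abstract interprets as a failure to anticipate the crossing.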

2605.19969 2026-05-27 cs.LG 版本更新

Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

你的邻居知道:利用局部邻居进行去中心化学习中的后门检测

Sayan Biswas, Antoine Boutet, Davide Frey, Romaric Gaudel, Rachid Guerraoui, Maxime Jacovella, Anne-Marie Kermarrec, Dimitri Lerévérend, François Taïani, Martijn de Vos

发表机构 * EPFL(洛桑联邦理工学院) Inria, INSA Lyon, CITI(法国国家信息与自动化研究所、里昂国立应用科学学院、CITI实验室) Univ. Rennes, Inria, CNRS, IRISA(雷恩大学、法国国家信息与自动化研究所、法国国家科学研究中心、IRISA)

AI总结 提出Argus框架,通过局部邻居协作分析模型更新并利用结构相似性度量区分真实后门与数据异构性导致的误报,实现去中心化学习中的后门检测,并提供理论收敛保证。

Comments 34 pages, 10 figures

详情
AI中文摘要

去中心化学习(DL)是一种新兴的机器学习范式,其中节点在没有中央服务器的情况下协作训练模型。然而,DL的协作性质使其容易受到后门攻击,即模型被训练为在标准输入上表现正常,而在遇到带有特定触发器的数据时执行隐藏的恶意行为。DL中的后门攻击仍未得到充分研究,现有防御措施常常忽视DL的约束。我们引入了Argus,一种原生于DL的新型后门检测框架,它既不需要中央协调器,也不需要预先知道触发器。在Argus中,诚实节点本地分析接收到的模型更新以识别潜在的后门触发器。然后,节点集体与邻居共享其触发器,并使用结构相似性度量将真实后门与数据异构性引起的误报区分开。一个关键见解是,假阳性触发器在不同参与者之间表现出不一致性,而真阳性触发器则呈现一致的模式。未通过此协作测试的模型更新被拒绝,持续恶意的发送者最终被驱逐。我们首次为特定于DL的后门检测机制提供了理论收敛保证,表明以高概率过滤可疑模型更新可保持与标准DL相当的收敛速度。我们在三个标准数据集上实现了Argus,并针对三个最先进的基线进行了评估。在各种设置下,与无防御相比,Argus将攻击成功率降低了多达90个百分点,同时将模型效用保持在全知神谕的5个百分点以内。此外,随着数据异构性的增加,Argus相对于基线的有效性也有所提高。

英文摘要

Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus, honest nodes locally analyze received model updates to identify potential backdoor triggers. Nodes then collectively share their triggers with their neighbors and use a structural similarity metric to separate true backdoors from false alarms induced by data heterogeneity. A key insight is that false positive triggers exhibit inconsistencies across participants while true positive ones show consistent patterns. Model updates that fail this collaborative test are rejected, and persistently malicious senders are eventually evicted. We provide the first theoretical convergence guarantees for a DL-specific backdoor detection mechanism, showing that filtering out suspicious model updates with high probability preserves a convergence rate comparable to standard DL. We implement and evaluate Argus on three standard datasets and against three state-of-the-art baselines. Across settings, Argus reduces attack success rates by up to 90 points compared to no defense, while preserving model utility within 5 percentage points of an omniscient oracle. Furthermore, the effectiveness of Argus compared to baselines improves as data heterogeneity increases.

2605.19052 2026-05-27 stat.ML cs.LG 版本更新

Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming

可证明数据驱动的混合整数线性规划拉格朗日松弛

Tung Quoc Le, Anh Tuan Nguyen, Viet Anh Nguyen

发表机构 * Université Grenoble Alpes, LJK, CNRS, Grenoble INP(格勒诺布尔阿尔卑斯大学,LJK,CNRS,格勒诺布尔INP) Carnegie Mellon University, Machine Learning Department(卡内基梅隆大学,机器学习系) Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management(香港中文大学,系统工程与工程管理系)

AI总结 针对混合整数线性规划的拉格朗日松弛,通过数据驱动算法设计框架,理论分析了学习乘子的泛化界和极小极大下界,并证明带平均的随机梯度上升和热启动学习均达到极小极大最优速率。

Comments Accepted to ICML 2026

AI中文摘要

拉格朗日松弛(LR)是求解大规模混合整数线性规划(MILP)的强大技术,特别是那些具有可分解结构的问题,如车辆路径或机组组合问题。通过松弛耦合约束,LR能够并行求解子问题,并且通常比标准线性规划松弛产生更紧的对偶界,这对于高效的分支定界剪枝至关重要。虽然最近的实证工作显示出使用机器学习预测这些乘子的有希望的结果,但对此类方法的理论理解仍然是一个开放问题。在这项工作中,我们通过数据驱动算法设计的视角分析学习LR的问题来弥合这一差距,即在一个问题实例分布上的统计学习问题。我们的贡献如下:首先,我们推导出学习乘子的泛化界为$\mathcal{O}(s^{1.5}/\sqrt{N})$,其中$s$是耦合约束的数量,$N$是样本量。其次,我们给出了极小极大下界$\Omega(s/\sqrt{N})$,证明对$s$的线性依赖不可避免。第三,我们通过证明带有平均的随机梯度上升(SGA)达到了极小极大最优速率$\Theta(s/\sqrt{N})$,建设性地缩小了这一理论差距。最后,我们将框架扩展到学习热启动设置,证明其达到了快速的极小极大最优速率$\Theta(s/N)$,并确立了相对于直接乘子预测的理论优势。

英文摘要

Lagrangian Relaxation (LR) is a powerful technique for solving large-scale Mixed Integer Linear Programming (MILP), particularly those with decomposable structures, such as vehicle routing or unit commitment problems. By relaxing the coupling constraints, LR enables parallel subproblem solving and often yields tighter dual bounds than standard linear programming relaxations, which is crucial for efficient branch-and-bound pruning. While recent empirical work has shown promising results using machine learning to predict these multipliers, a theoretical understanding of such methods remains an open question. In this work, we bridge this gap by analyzing the problem of learning LR through the lens of Data-driven Algorithm Design, i.e., a statistical learning problem over a distribution of problem instances. Our contributions are as follows: first, we derive a generalization bound of $\mathcal{O}(s^{1.5}/\sqrt{N})$ for the learned multipliers, where $s$ is the number of coupling constraints and $N$ is the sample size. Second, we provide a minimax lower-bound of $\Omega(s/\sqrt{N})$, proving that a linear dependency is unavoidable. Third, we constructively close this theoretical gap by proving that Stochastic Gradient Ascent (SGA) with averaging achieves the minimax optimal rate $\Theta(s/\sqrt{N})$. Finally, we extend our framework to the learning-to-warm-start setting, proving that it achieves a fast, minimax-optimal rate of $\Theta(s/N)$ and establishing a theoretical advantage over direct multiplier prediction.
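
To show what "stochastic gradient ascent with averaging" on Lagrange multipliers looks like, here is a toy sketch under stated assumptions: a minimization problem min c.x subject to A x >= b with binary x, where the coupling constraints are relaxed with multipliers lam >= 0. The relaxed problem then decomposes per variable. The step-size schedule, the instance format, and the function names are illustrative and are not the algorithm analyzed in the paper.

```python
import random

def lagrangian_bound(c, A, b, lam):
    # min over x in {0,1}^n of c.x + lam.(b - A x), solved variable by variable.
    # Returns the dual (lower) bound and a supergradient b - A x*(lam).
    n, s = len(c), len(b)
    reduced = [c[i] - sum(lam[k] * A[k][i] for k in range(s)) for i in range(n)]
    x = [1 if r < 0 else 0 for r in reduced]
    bound = sum(lam[k] * b[k] for k in range(s)) + sum(r for r in reduced if r < 0)
    grad = [b[k] - sum(A[k][i] * x[i] for i in range(n)) for k in range(s)]
    return bound, grad

def sga_with_averaging(instances, s, steps=2000, step_size=0.05, seed=0):
    # One sampled instance per step, projection onto lam >= 0, returns the averaged iterate
    rng = random.Random(seed)
    lam = [0.0] * s
    avg = [0.0] * s
    for t in range(1, steps + 1):
        c, A, b = rng.choice(instances)
        _, grad = lagrangian_bound(c, A, b, lam)
        lam = [max(0.0, l + step_size / t ** 0.5 * g) for l, g in zip(lam, grad)]
        avg = [a + (l - a) / t for a, l in zip(avg, lam)]
    return avg

# Toy instance: pick at least two of three items at minimum cost (optimum is 3)
instance = ([1, 2, 3], [[1, 1, 1]], [2])
lam_avg = sga_with_averaging([instance], s=1)
bound_at_avg, _ = lagrangian_bound(*instance, lam_avg)
```

On this instance the averaged multiplier yields a bound close to, and never above, the optimal cost of 3, which illustrates the weak-duality property that makes the bound useful for pruning.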

2605.18468 2026-05-27 stat.ML cs.LG 版本更新

Shallow ReLU$^s$ Networks in $L^p$-Type and Sobolev Spaces: Approximation and Path-Norm Controlled Generalization

浅层ReLU$^s$网络在$L^p$型空间和Sobolev空间中的逼近与路径范数控制的泛化

Weizhao Li, Fanghui Liu, Lei Shi

发表机构 * School of Mathematical Sciences Fudan University(复旦大学数学学院) School of Mathematical Sciences Institute of Natural Sciences and MOE-LSC Shanghai Jiao Tong University(上海交通大学数学学院、自然科学研究院和教育部低维量子体系科学重点实验室) School of Mathematical Sciences Shanghai Key Laboratory for Contemporary Applied Mathematics Fudan University(复旦大学上海当代应用数学重点实验室数学学院)

AI总结 本文研究浅层ReLU$^s$网络在$L^p$型空间和Sobolev空间中的逼近能力,并通过$\ell_1$路径范数控制实现非参数回归的极小极大最优泛化误差。

Comments 42 pages, 1 figure. Update Theorem 2 and fix some typos. Authors are listed in alphabetical order and contributed equally

详情
AI中文摘要

本文研究浅层ReLU$^s$网络($\sigma_s(t)=\max\{0,t\}^s$)的逼近性质及其在$\ell_1$路径范数控制下的泛化行为。对于$L^p$型积分空间$\widetilde{\mathcal{F}}_{p,\tau_d,s}$($1\le p\le2$),球谐分析给出了浅层网络的逼近界。特别地,当$\tau_d$为均匀测度且$1\le p<2$时,逼近率为:当$1\le p\le p^*$时为$O\!\left(m^{-\frac{p(2s+2d+1)-2d}{2dp}}\right)$,当$p^*<p<2$时为$O\!\left(m^{-\frac{p(4s+3d-1)-2d+2}{4dp}}\right)$,其中$p^*=\frac{2d+2}{d+3}$。通过嵌入到谱Barron空间,得到了Sobolev空间$W^{\alpha,p}$($1\le p<2$)的逼近界。对于亚高斯噪声下的非参数回归,路径范数正则化的浅层ReLU$^s$网络在$\mathscr{B}_s$上达到极小极大最优速率$O\!\left(n^{-\frac{d+2s+1}{2d+2s+1}}\log n\right)$,在$W^{\alpha,\infty}$上达到$O\!\left(n^{-\frac{2\alpha}{2\alpha+d}}\log n\right)$,且下界匹配至对数因子。

英文摘要

This paper studies approximation by shallow ReLU$^s$ networks, $\sigma_s(t)=\max\{0,t\}^s$, together with their generalization behavior under $\ell_1$ path-norm control. For the $L^p$-type integral spaces $\widetilde{\mathcal{F}}_{p,\tau_d,s}$, $1\le p\le2$, spherical harmonic analysis yields approximation bounds for shallow networks. In particular, when $\tau_d$ is the uniform measure and $1\le p<2$, the approximation rate is $O\!\left(m^{-\frac{p(2s+2d+1)-2d}{2dp}}\right)$ for $1\le p\le p^*$ and $O\!\left(m^{-\frac{p(4s+3d-1)-2d+2}{4dp}}\right)$ for $p^*<p<2$, where $p^*=\frac{2d+2}{d+3}$. Approximation bounds for Sobolev spaces $W^{\alpha,p}$, $1\le p<2$, are obtained through embeddings into spectral Barron spaces. For nonparametric regression with sub-Gaussian noise, path-norm-regularized shallow ReLU$^s$ networks achieve minimax-optimal rates $O\!\left(n^{-\frac{d+2s+1}{2d+2s+1}}\log n\right)$ over $\mathscr{B}_s$ and $O\!\left(n^{-\frac{2\alpha}{2\alpha+d}}\log n\right)$ over $W^{\alpha,\infty}$, with matching lower bounds up to logarithmic factors.
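
To make the objects in this abstract concrete, here is a minimal sketch of a shallow ReLU$^s$ network $f(x)=\sum_k a_k\,\sigma_s(\langle w_k,x\rangle+b_k)$ and one common form of an $\ell_1$ path norm, $\sum_k |a_k|\,(\|w_k\|_1+|b_k|)^s$. The exact norm used in the paper is not stated in the abstract, so this particular definition, the parameter layout `(a, w, b)`, and the function names are assumptions for illustration; the sketch also assumes $s\ge 1$.

```python
def relu_s(t, s):
    """sigma_s(t) = max(0, t)^s, intended for s >= 1."""
    return max(0.0, t) ** s

def shallow_net(x, params, s):
    """Evaluate f(x) = sum_k a_k * sigma_s(<w_k, x> + b_k); params is a list of (a, w, b)."""
    return sum(
        a * relu_s(sum(wi * xi for wi, xi in zip(w, x)) + b, s)
        for a, w, b in params
    )

def path_norm(params, s):
    """Assumed l1 path norm: sum_k |a_k| * (||w_k||_1 + |b_k|)^s."""
    return sum(abs(a) * (sum(abs(wi) for wi in w) + abs(b)) ** s for a, w, b in params)

params = [(2.0, [1.0, -1.0], 0.5), (-1.0, [0.5, 0.5], 0.0)]
value = shallow_net([1.0, 0.0], params, s=2)
norm = path_norm(params, s=2)
```

Under this definition, $|f(x)|$ is bounded by the path norm whenever $\|x\|_\infty\le 1$, which is the kind of capacity control that path-norm regularization exploits.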

2605.17036 2026-05-27 cs.AI cs.LG cs.MA cs.SY eess.SY 版本更新

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主AI代理在供应链管理中的可靠性与有效性

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

发表机构 * Harvard University(哈佛大学) MIT/Purdue(麻省理工学院/普渡大学) MIT(麻省理工学院) Harvard University/Kempner Institute(哈佛大学/肯普纳研究所) Georgia Tech(佐治亚理工学院)

AI总结 本文通过MIT啤酒游戏研究多级供应链中的自主生成式AI代理,发现模型能力是性能主导因素,但平均性能掩盖可靠性风险,并引入代理牛鞭效应,提出基于GRPO的后训练框架以提高可靠性。

详情
AI中文摘要

本文使用MIT啤酒游戏研究多级供应链中的自主生成式AI代理。我们确定了影响性能的四个推理时杠杆:模型选择、策略和护栏、集中数据共享以及提示工程。模型能力是主导因素:开箱即用的推理模型超越人类水平性能,优化后的推理模型相对于人类团队将成本降低高达67%。然而,强劲的平均性能掩盖了显著的可靠性风险。我们引入了代理牛鞭效应:自主多级系统中运行间决策不稳定性的放大。其中一个核心组成部分是决策牛鞭效应,即由随机代理决策而非客户需求变化产生的订单变异性部分。我们表明,即使需求路径固定,决策不稳定性也可以在固定时间点跨设施以及同一设施内随时间放大。重复采样(一种自然的测试时补救措施)未能显著减少这种不稳定性,这表明可靠性需要改变底层决策策略,而不仅仅是平均模型输出。为解决这一限制,我们提出了一种基于组相对策略优化(GRPO)的强化学习后训练框架,该框架使用系统级供应链奖励训练共享的基础LLM。后训练显著减少了尾部事件,抑制了代理牛鞭效应,并提高了自主供应链代理的可靠性。

英文摘要

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.
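
The abstract does not give formulas for its metrics, so the following is only a sketch of two related quantities: the classical bullwhip ratio (order variance over demand variance) and a simple run-to-run variance of orders across repeated runs on the same demand path. The second function is a hypothetical stand-in for the idea behind "decision bullwhip", not the paper's definition.

```python
from statistics import mean, pvariance

def bullwhip_ratio(orders, demand):
    """Classical bullwhip measure: Var(orders) / Var(demand) over one run."""
    return pvariance(orders) / pvariance(demand)

def run_to_run_variance(order_runs):
    """Average, over periods, of the variance of orders across runs.

    order_runs[r][t] is the order placed in period t of run r; all runs are
    assumed to share the same fixed demand path, so any spread comes from
    the agent's own stochastic decisions.
    """
    return mean(pvariance(period) for period in zip(*order_runs))

ratio = bullwhip_ratio([4.0, 4.0, 12.0, 12.0], [4.0, 4.0, 8.0, 8.0])
instability = run_to_run_variance([[4.0, 6.0], [8.0, 6.0]])
```

A ratio above 1 means orders fluctuate more than demand; a nonzero `instability` under a fixed demand path indicates variability that cannot be attributed to customers.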

2605.16457 2026-05-27 cs.LG cs.AI cs.CV 版本更新

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能交叉学科项目) Department of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系)

AI总结 提出可辨识的令牌对应关系(ITC)方法,通过将下一帧预测建模为结构化分配问题,解决基于令牌的Transformer世界模型在长程推演中的时间不一致性,在四个基准上达到最先进性能。

详情
AI中文摘要

基于令牌的Transformer世界模型在视觉强化学习中表现出色,但常在长程推演中出现时间不一致性,包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题,而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系(ITC),这是一种用于基于令牌的Transformer世界模型的解码步骤,将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题:每个下一帧令牌要么通过从上一帧复制令牌来解释,要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变,可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数,显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

英文摘要

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.04880 2026-05-27 cs.LG cs.AI 版本更新

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和均值公式

Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka

发表机构 * Bar Ilan University(巴伊兰大学)

AI总结 针对无限时域非回合制任务中的平均奖励强化学习,提出一种修正的调和均值算子,解决SMDP中奖励和持续时间非平稳时的奖励率计算问题,并证明其理论性质及有效性。

详情
Journal ref
https://alaworkshop2026.github.io/papers/ALA2026_paper_57.pdf
AI中文摘要

最近的研究重新激发并增强了对无限时域、非回合制(持续)任务中未折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程(SMDP)尤其引人关注。在SMDP中,离散动作随机产生奖励和持续时间,目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比率来逼近这一目标。然而,当奖励和持续时间(在无限时域中)非平稳时,这种方法可能不正确。本文提出一种新颖的修正调和均值算子,即使在上述条件下也能正确计算奖励率。这产生了可以与SMDP一起工作的无模型学习算法,同时保持对随时间变化的非平稳奖励和持续时间分布的鲁棒性。我们证明了修正调和均值算子的理论性质,并通过实验与现有算法相比展示了其有效性。

英文摘要

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
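
The link between reward rates and harmonic means can be seen from a standard identity: the reward-weighted harmonic mean of per-step rates $r_i/\tau_i$ equals the trajectory rate $\sum_i r_i / \sum_i \tau_i$, whereas the plain arithmetic mean of per-step rates generally does not. The sketch below checks this on two steps. It illustrates only this textbook identity (with nonzero rewards assumed), not the paper's modified operator, whose definition is not given in the abstract.

```python
def reward_rate(rewards, durations):
    """Trajectory-level average reward rate: total reward over total time."""
    return sum(rewards) / sum(durations)

def mean_of_rates(rewards, durations):
    """Arithmetic mean of per-step rates r_i / tau_i (generally a different number)."""
    return sum(r / d for r, d in zip(rewards, durations)) / len(rewards)

def weighted_harmonic_mean(rates, weights):
    """Weighted harmonic mean: sum(w) / sum(w / x)."""
    return sum(weights) / sum(w / x for w, x in zip(weights, rates))

rewards, durations = [1.0, 4.0], [1.0, 2.0]
rates = [r / d for r, d in zip(rewards, durations)]
true_rate = reward_rate(rewards, durations)
naive = mean_of_rates(rewards, durations)
harmonic = weighted_harmonic_mean(rates, rewards)
```

In this example `harmonic` matches `true_rate`, while `naive` underestimates it because the short, low-rate step is over-weighted.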

2605.02207 2026-05-27 cs.CV cs.AI cs.LG 版本更新

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo:面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

发表机构 * Department of Computer Science, Old Dominion University, VA, USA(美国弗吉尼亚州欧道明大学计算机科学系)

AI总结 提出MultiSense-Pneumo多模态原型系统,整合症状、咳嗽音频、语音和胸片,通过可解释的后期融合实现肺炎筛查与分诊支持。

详情
AI中文摘要

肺炎仍然是全球发病率和死亡率的主要原因,尤其是在低资源环境中,那里缺乏影像学、实验室检测和专科护理。临床评估依赖于异质性证据,包括症状、呼吸模式、口头描述和胸部影像,使得一线筛查本质上是多模态的。然而,许多现有的计算方法仍然是单模态的,并且主要关注放射影像。在这项工作中,我们提出了MultiSense-Pneumo,一个面向肺炎筛查和分诊支持的多模态研究原型,它整合了结构化症状描述符、咳嗽音频、口语和胸部X光片。该系统结合了确定性症状分诊、基于LightGBM的声学分类、使用ResNet-18的域对抗放射影像分析、基于Transformer的语音识别以及可解释的后期融合算子。每个模态被转换为归一化的关注信号,并聚合为统一的筛查估计。融合权重是手动指定的,被视为启发式、可解释的参数,而不是学习或临床优化的值。MultiSense-Pneumo的设计考虑了在标准笔记本电脑级硬件上的离线执行,但并未作为经过部署验证或临床验证的诊断系统呈现。实验结果表明,在合成域偏移下,放射影像路径具有强大的组件级性能,同时也突出了重要的局限性,特别是咳嗽声学的异常类别召回率降低以及缺乏配对的端到端多模态患者评估。因此,MultiSense-Pneumo旨在作为筛查和分诊研究的框架和组件级原型。

英文摘要

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
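
A late-fusion step with hand-specified weights can be sketched as a weighted average of per-modality concern signals in $[0,1]$. The modality names, the weights, and the choice to renormalize over the modalities that are present are all hypothetical; the abstract states only that the weights are heuristic and hand-specified.

```python
def late_fusion(signals, weights):
    """Weighted average of normalized concern signals.

    signals maps modality -> value in [0, 1], or None when that modality is
    missing; weights are renormalized over the modalities that are present.
    """
    present = {m: v for m, v in signals.items() if v is not None}
    if not present:
        raise ValueError("at least one modality is required")
    total = sum(weights[m] for m in present)
    return sum(weights[m] * v for m, v in present.items()) / total

# Hypothetical, hand-specified weights (not values from the paper).
weights = {"symptoms": 0.25, "cough": 0.25, "xray": 0.5}
full = late_fusion({"symptoms": 1.0, "cough": 0.0, "xray": 0.5}, weights)
no_xray = late_fusion({"symptoms": 1.0, "cough": 0.0, "xray": None}, weights)
```

Keeping the weights explicit makes the contribution of each modality easy to inspect, at the cost of not being tuned to data.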

2410.18915 2026-05-27 cs.DS cs.LG 版本更新

Testing Support Size More Efficiently Than Learning Histograms

比学习直方图更高效地测试支撑大小

Renato Ferreira Pinto, Nathaniel Harms

发表机构 * Columbia University, USA(哥伦比亚大学,美国) University of British Columbia, Canada(不列颠哥伦比亚大学,加拿大) University of Waterloo(滑铁卢大学) EPFL(洛桑联邦理工学院)

AI总结 针对未知概率分布p的支撑大小测试问题,提出一种基于切比雪夫多项式的方法,仅需O(n/(ε log n) log(1/ε))个样本,优于学习直方图的Θ(n/(ε^2 log n))样本,并给出支撑大小的更大下界。

Comments 42 pages. This is the TheoretiCS journal version

详情
Journal ref
TheoretiCS, Volume 5 (May 21, 2026) theoretics:16717
AI中文摘要

考虑关于未知概率分布$p$的两个问题: 1. 需要多少来自$p$的样本才能测试$p$是否支撑在$n$个元素上?具体来说,给定来自$p$的样本,判断它是否支撑在至多$n$个元素上,或者它在总变差距离上“$\epsilon$-远离”支撑在$n$个元素上。 2. 给定来自$p$的$m$个样本,我们能对其支撑大小产生的最大下界是多少? 问题(1)的最佳已知上界使用了一种学习分布$p$的直方图的通用算法,该算法需要$\Theta(\tfrac{n}{\epsilon^2 \log n})$个样本。我们表明,测试可以比学习直方图更高效,仅需$O(\tfrac{n}{\epsilon\log n} \log(1/\epsilon))$个样本,几乎匹配最佳已知下界$\Omega(\tfrac{n}{\epsilon\log n})$。该算法还为问题(2)提供了更好的解决方案,比先前工作产生更大的支撑大小下界。证明依赖于对切比雪夫多项式近似在其设计范围之外的分析,本文旨在作为切比雪夫多项式方法的易于理解的自包含阐述。

英文摘要

Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\epsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\epsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\epsilon\log n} \log(1/\epsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\epsilon\log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.

2503.22823 2026-05-27 quant-ph cs.IT cs.LG math.IT 版本更新

Quantum Doeblin Coefficients: Interpretations and Applications

量子Doeblin系数:解释与应用

Ian George, Christoph Hirche, Theshani Nuradha, Mark M. Wilde

发表机构 * Centre for Quantum Technologies, National University of Singapore, Singapore 117543, Singapore(量子技术中心,新加坡国立大学) School of Electrical and Computer Engineering, Cornell University, Ithaca, New York 14850, USA(电气与计算机工程学院,康奈尔大学)

AI总结 本文定义并研究了量子Doeblin系数,提供了多种解释(如最小单态分数、排除值等),并展示了其在量子机器学习、误差缓解、量子假设检验和时变信道等领域的应用。

Comments v3: 108 pages, 5 figures, added some summary tables, added proof of reducing to classical Doeblin on classical channels, and another multiplicativity result; v2: 104 pages, 5 figures, expanded the application section on mixing, indistinguishability, and decoupling times; v1: 88 pages, 2 figures

详情
Journal ref
Quantum 10, 2115 (2026)
AI中文摘要

在经典信息论中,经典信道的Doeblin系数提供了信道全变差收缩系数的可有效计算的上界,从而导致了所谓的强数据处理不等式。在这里,我们研究量子Doeblin系数作为经典概念的推广。特别地,我们定义了各种新的量子Doeblin系数,其中一种具有几个理想性质,包括级联性和可乘性,此外还能有效计算。我们还发展了两种量子Doeblin系数的各种解释,包括作为最小单态分数、排除值、反向最大互信息和oveloH信息、反向鲁棒性以及假设检验反向互信息和oveloH信息的表示。我们将量子Doeblin系数解释为纠缠辅助或非辅助的排除值特别有吸引力,表明它们与通过使用信道在状态排除任务中能达到的最佳可能错误概率成正比。我们还概述了量子Doeblin系数的各种应用,范围从对使用参数化量子电路的量子机器学习算法的限制(噪声诱导的贫瘠高原)、对误差缓解协议的限制、对噪声量子假设检验的样本复杂性的限制,以及对时变信道的混合性、可区分性和解耦时间的限制。所有这些应用都利用了量子Doeblin系数出现在信道各种迹距离收缩系数的上界中这一事实。此外,在所有这些应用中,我们使用Doeblin系数的分析在通用性和可有效计算性方面,对先前文献的贡献提供了各种改进。

英文摘要

In classical information theory, the Doeblin coefficient of a classical channel provides an efficiently computable upper bound on the total-variation contraction coefficient of the channel, leading to what is known as a strong data-processing inequality. Here, we investigate quantum Doeblin coefficients as a generalization of the classical concept. In particular, we define various new quantum Doeblin coefficients, one of which has several desirable properties, including concatenation and multiplicativity, in addition to being efficiently computable. We also develop various interpretations of two of the quantum Doeblin coefficients, including representations as minimal singlet fractions, exclusion values, reverse max-mutual and oveloH informations, reverse robustnesses, and hypothesis testing reverse mutual and oveloH informations. Our interpretations of quantum Doeblin coefficients as either entanglement-assisted or unassisted exclusion values are particularly appealing, indicating that they are proportional to the best possible error probabilities one could achieve in state-exclusion tasks by making use of the channel. We also outline various applications of quantum Doeblin coefficients, ranging from limitations on quantum machine learning algorithms that use parameterized quantum circuits (noise-induced barren plateaus), on error mitigation protocols, on the sample complexity of noisy quantum hypothesis testing, and on mixing, distinguishability, and decoupling times of time-varying channels. All of these applications make use of the fact that quantum Doeblin coefficients appear in upper bounds on various trace-distance contraction coefficients of a channel. Furthermore, in all of these applications, our analysis using Doeblin coefficients provides improvements of various kinds over contributions from prior literature, both in terms of generality and being efficiently computable.
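
For the classical case mentioned in the first sentence, the Doeblin coefficient of a channel with transition matrix $P(y|x)$ is $\alpha=\sum_y \min_x P(y|x)$, and the total-variation contraction (Dobrushin) coefficient $\eta_{\mathrm{TV}}=\max_{x,x'}\mathrm{TV}\big(P(\cdot|x),P(\cdot|x')\big)$ satisfies $\eta_{\mathrm{TV}}\le 1-\alpha$. The sketch below computes both for a small channel; the example matrix and function names are made up for illustration, and nothing here covers the quantum generalizations studied in the paper.

```python
def doeblin_coefficient(channel):
    """alpha = sum over outputs y of min over inputs x of P(y|x); channel[x][y] = P(y|x)."""
    return sum(min(column) for column in zip(*channel))

def tv_contraction(channel):
    """Dobrushin coefficient: largest total-variation distance between two output rows."""
    return max(
        0.5 * sum(abs(p - q) for p, q in zip(row_a, row_b))
        for row_a in channel
        for row_b in channel
    )

channel = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
    [0.0, 0.5, 0.5],
]
alpha = doeblin_coefficient(channel)
eta = tv_contraction(channel)
```

Computing `alpha` needs one pass over the columns, while `eta` compares all pairs of rows, which is one reason the Doeblin bound is attractive as an efficiently computable proxy.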

2605.18866 2026-05-27 cs.LG cs.AI 版本更新

FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

FLUIDSPLAT: 通过高斯原语从稀疏传感器重建物理场

Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Hong Kong University of Science and Technology(香港科技大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出FLUIDSPLAT模型,利用高斯原语作为空间显式中间表示,从稀疏传感器数据重建流场,理论分析了表示能力与观测数的关系,并在多个基准上实现误差降低11-28%。

Comments 24 pages, 5 figures,preprint

详情
AI中文摘要

从稀疏表面安装的传感器重建连续流场是空气动力学设计、流动控制和数字孪生仪器的核心。现有的神经方法通常将传感器读数编码为隐式潜在代码,空间可解释性差,且关于表示能力应如何随观测数量扩展的正式指导有限。受3D高斯泼溅启发,我们引入FLUIDSPLAT,一种传感器条件模型,预测K个各向异性高斯原语,形成单位划分支架,即流场的空间显式且可解释的中间表示。对于理想化的高斯原语估计器,我们证明了对于具有Sobolev光滑度s的场,逼近率为$O(K^{-s/d})$;结合N个含噪声观测,得到偏差$O(K^{-2s/d})$和方差$O(σ^{2}K/N)$的平方风险分解。平衡两者得到$K^{*}\!\sim\!(N/σ^{2})^{d/(2s+d)}$:在稀疏传感下原语数量不能自由增长,揭示了方差瓶颈,促使用状态条件残差解码器补充支架。在涵盖2D和3D的四个基准(圆柱绕流、AirfRANS、FlowBench LDC-3D和PhySense-Car 3D)上,FLUIDSPLAT相比多个强基线实现了11-28%的误差降低。

英文摘要

Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin instrumentation. Existing neural methods for this task typically encode sensor readings into implicit latent codes with little spatial interpretability and limited formal guidance on how representational capacity should scale with observation count. Inspired by 3D Gaussian Splatting, we introduce FLUIDSPLAT, a sensor-conditioned model that predicts $K$ anisotropic Gaussian primitives forming a partition-of-unity scaffold, a spatially explicit and interpretable intermediate representation of the flow. For an idealized Gaussian primitive estimator, we prove an $O(K^{-s/d})$ approximation rate for fields with Sobolev smoothness $s$; incorporating $N$ noisy observations yields a squared-risk decomposition with bias $O(K^{-2s/d})$ and variance $O(\sigma^{2}K/N)$. Balancing the two yields $K^{*}\!\sim\!(N/\sigma^{2})^{d/(2s+d)}$: primitive count cannot grow freely under sparse sensing, revealing a variance bottleneck that motivates complementing the scaffold with a state-conditioned residual decoder. Across four benchmarks spanning 2D and 3D (cylinder flow, AirfRANS, FlowBench LDC-3D, and PhySense-Car 3D), FLUIDSPLAT achieves 11-28% error reduction over several strong baselines.
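
The balancing step behind $K^{*}$ can be written out explicitly. The following derivation is a sketch that ignores constants and uses only the two rates quoted in the abstract; the final risk rate is a consequence of those rates rather than a figure reported there.

```latex
\begin{align*}
\underbrace{K^{-2s/d}}_{\text{squared bias}} \asymp
\underbrace{\frac{\sigma^{2}K}{N}}_{\text{variance}}
&\;\Longrightarrow\; K^{(2s+d)/d} \asymp \frac{N}{\sigma^{2}} \\
&\;\Longrightarrow\; K^{*} \asymp \left(\frac{N}{\sigma^{2}}\right)^{d/(2s+d)}, \\
\text{risk at } K^{*} \asymp (K^{*})^{-2s/d}
&\;=\; \left(\frac{\sigma^{2}}{N}\right)^{2s/(2s+d)} .
\end{align*}
```

Since $d/(2s+d)<1$, the optimal number of primitives grows sublinearly in $N$, which is the variance bottleneck the abstract refers to.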

2605.18592 2026-05-27 cs.LG cs.AI cs.CL 版本更新

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) Adobe Inc.(Adobe公司) Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 提出AMARIS系统,通过持久化评估记忆存储纵向训练证据来改进评分标准,在科学、医学、指令遵循和创意写作任务上优于静态、局部自适应和无记忆基线方法。

Comments Preprint. Under review

详情
AI中文摘要

基于评分标准的奖励塑形为通过强化学习(RL)微调大语言模型(LLMs)提供了可解释且可编辑的奖励信号,但现有的自适应评分标准方法通常从局部证据(如当前批次或实例级比较)更新标准。这种局部视角丢弃了训练过程中产生的诊断信息,使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS,一种记忆增强的评分标准改进系统,它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中,然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下,在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进,例如在GPQA-Diamond上比最强基线高出+2.8分,在IFBench上高出+2.2分,同时分析表明记忆减少了振荡性的评分标准编辑,并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行,相对于同步评分标准更新减少了阻塞延迟。

英文摘要

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

2605.17482 2026-05-27 cs.CL cs.LG 版本更新

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

RSD:一种用于学习向量块的局部三角剖分审计原语

Seungmin Jin

发表机构 * HSE University(俄罗斯高等经济大学)

AI总结 提出RSD(关系语义分解)作为局部三角剖分审计方法,通过拟合单纯形成员关系和坐标极点,结合关系解码器和坐标残差,实现学习向量块的可解释性审计。

Comments 8 pages, 1 figure. Revised version with clarified scope, experiments, and limitations

详情
AI中文摘要

局部XAI审计将有限的学习向量块与弱侧信号进行比较。基线方法如最近邻查找、低秩坐标模型和关系分解揭示了审计的不同部分。我们引入关系语义分解(简称RSD),作为学习向量块的局部三角剖分审计。给定坐标X和一个声明的有界弱亲和代理A,RSD拟合单纯形成员关系S和坐标极点C。它在关系解码器中重用S来解码A,并报告坐标残差R=X-SC。这产生了一个范围限定的审计单元:所选块、代理、解码器类和损失预算的兼容性,以及组件质量和残差读数。合成控制检查单纯形重构、代理解码和固定S残差分解。定理陈述、月份和狗/狼块说明了为什么低代理损失应结合组件质量、残差读数和块大小来解读。

英文摘要

Local XAI audits compare a finite block of learned vectors with a weak side signal. Baselines such as nearest-neighbor lookup, low-rank coordinate models, and relation factorization expose different parts of this audit. We introduce Relational Semantic Decomposition, abbreviated as RSD, as a local triangulation audit for learned vector blocks. Given coordinates X and a declared bounded weak affinity proxy A, RSD fits simplex memberships S and coordinate poles C. It reuses S in a relation decoder for A and reports the coordinate residual R=X-SC. This yields a scoped audit unit: compatibility for the chosen block, proxy, decoder class, and loss budget, plus component mass and residual readouts. Synthetic controls check simplex reconstruction, proxy decoding, and fixed-S residual decomposition. The theorem-statement, month, and dog/wolf blocks illustrate why low proxy loss should be read with component mass, residual readouts, and block size.

2605.15216 2026-05-27 cs.AR cs.LG 版本更新

Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations

可扩展、节能的模拟循环计算的软硬件协同设计

Arthur Fyon, Julien Brandoit, Loris Mendolia, Damien Ernst, Jean-Michel Redouté, Guillaume Drion

发表机构 * University of Liège(列日大学)

AI总结 通过软硬件协同设计,利用双稳态记忆循环单元(BMRU)的离散输出抑制噪声,实现了超低功耗的模拟循环神经网络硬件。

Comments This work has been the subject of two patent applications (Numbers: EP26175243.0 and EP26175248.9)

详情
AI中文摘要

始终在线的AI应用,从环境传感器到生物医学植入物,都需要超低功耗。模拟电路提供了一条亚微瓦级推理的路径,但现有的模拟实现仅限于前馈架构:由于时间反馈中的噪声累积,将其扩展到循环动态被认为是不切实际的。我们证明,通过软硬件协同设计可以克服这一障碍。具体来说,我们发现双稳态记忆循环单元(BMRU)——一类具有离散值输出和迟滞动力学的循环神经网络(RNN)——允许一种超低功耗的电流模式模拟实现,我们从第一性原理设计了该实现。由此产生的电路在每个学习参数和电路元件之间建立了一一对应关系。离散输出在每个单元边界处将模拟噪声抑制至少20倍,打破了阻止模拟循环的噪声累积。我们重新制定了BMRU,使其在固定阈值下进行第一象限操作,从而在保持表达能力和可训练性的同时实现了直接对应。在180纳米互补金属氧化物半导体(CMOS)中的晶体管级模拟显示,软件预测与电路级行为之间几乎完美一致,因此软件模型以低计算成本充当物理硬件的高保真模拟器。我们利用这种保真度进行大规模噪声免疫和功率缩放分析:添加循环的功率成本与状态维度线性缩放,而主导总功率的前馈层则二次缩放,这意味着相对于前馈骨干网络,循环是以线性边际成本添加的。端到端的关键词识别在RNN核心处实现了亚微瓦级推理。

英文摘要

Always-on AI applications, from environmental sensors to biomedical implants, require ultra-low power consumption. Analog circuits offer a path to sub-microwatt inference, yet existing analog implementations are limited to feedforward architectures: extending them to recurrent dynamics has been considered impractical due to noise accumulation through temporal feedback. We demonstrate that this barrier can be overcome through hardware-software co-design. Specifically, we identify that Bistable Memory Recurrent Units (BMRUs), a class of Recurrent Neural Networks (RNNs) with discrete-valued outputs and hysteretic dynamics, admit an ultra-low power current-mode analog implementation which we design from first principles. The resulting circuit establishes a one-to-one correspondence between each learned parameter and a circuit element. The discrete outputs suppress analog noise by at least 20-fold at each cell boundary, breaking the noise accumulation that prevents analog recurrence. We reformulate BMRUs for first-quadrant operation with fixed thresholds, enabling the direct correspondence while preserving expressivity and trainability. Transistor-level simulations in 180 nm Complementary Metal-Oxide-Semiconductor (CMOS) show near-perfect agreement between software predictions and circuit-level behavior, with the software model thereby serving as a high-fidelity simulator of the physical hardware at low computational cost. We leverage this fidelity to conduct large-scale noise immunity and power scaling analyses: the power cost of adding recurrence scales linearly with state dimension, while the feedforward layers dominating total power scale quadratically, meaning recurrence is added at linear marginal cost relative to the feedforward backbone. End-to-end keyword spotting achieves sub-microwatt inference at the RNN core.

2604.27019 2026-05-27 cs.LG cs.CL cs.CR 版本更新

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

动态对抗微调重组拒绝几何结构

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Inner Mongolia University of Technology(内蒙古工业大学) Tsinghua University(清华大学) Shandong University(山东大学)

AI总结 研究动态对抗微调如何改变安全对齐语言模型中拒绝行为的因果控制载体(低维子空间),发现R2D2沿鲁棒性-效用前沿重组几何结构但未建立自适应鲁棒性。

详情
AI中文摘要

安全对齐的语言模型必须拒绝有害请求而不广泛过度拒绝,但尚不清楚动态对抗微调如何改变拒绝控制载体:Kullback--Leibler (KL)约束方向或因果调节拒绝而不引起大规模安全提示分布偏移的小子空间。我们研究了一个7B骨干模型在监督微调(SFT)和鲁棒拒绝动态防御(R2D2)下的表现,将HarmBench、StrongREJECT和XSTest评估与五点几何测量、因果干预和稀疏自适应压力测试对齐。R2D2在早期检查点将固定源HarmBench攻击成功率降至零;然而,这些检查点也表现出最大的XSTest拒绝率并未能通过良性效用审计。后期检查点部分恢复了面向效用的行为,同时重新打开了攻击成功率,自适应GCG攻击成功率在第250步升至0.415,第500步升至0.613。内部地,R2D2在第100步之前保留了一个后期层的可接受拒绝控制载体,然后将最佳可接受载体迁移到早期层;SFT迁移更早但鲁棒性较差。有效秩保持在1.24附近,SFT表现出更大的主角漂移,这反对将维度扩展和漂移幅度作为充分解释。因果干预支持一个低维但效用耦合的载体。这些结果支持R2D2沿鲁棒性-效用前沿的几何重组解释,但未建立自适应鲁棒性。

英文摘要

Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback--Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness--utility frontier, without establishing adaptive robustness.

2605.15522 2026-05-27 math.OC cs.LG 版本更新

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

无界梯度的随机非光滑凸优化

Dmitry Kovalev

发表机构 * Yandex Research(Yandex研究院)

AI总结 针对梯度范数受最优性间隙仿射函数约束的广义Lipschitz函数类,证明AdamW带裁剪更新在随机非光滑凸优化中优于SGD和AdaGrad,并建立其指数加权梯度累积的关键作用及推广到广义光滑和拟凸设置。

详情
AI中文摘要

现有的一阶非光滑优化理论大多建立在目标函数梯度一致有界这一限制性假设上。我们引入了一类更现实的广义Lipschitz函数,其中梯度范数受最优性间隙的仿射函数约束。然后我们提出一个自然的问题:什么算法能在解决凸随机广义Lipschitz优化问题时达到最好的全局收敛速度?为此,我们对几种现有算法进行了新的收敛性分析,发现带有裁剪更新的AdamW在理论上优于其他流行的随机优化方法,如SGD和AdaGrad。此外,我们的分析确立了AdamW的指数加权梯度累积(而非简单平均)的关键作用。我们进一步证明裁剪AdamW具有普适性,并在流行的广义光滑性假设下获得改进的收敛速度,分析了带对角和矩阵预条件子的裁剪AdamW的收敛性,并将结果推广到拟凸设置。

英文摘要

Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates provably outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
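
The abstract does not specify how the update is clipped, so the sketch below shows one plausible reading: a standard AdamW step whose preconditioned update $\hat m/(\sqrt{\hat v}+\epsilon)$ is clipped elementwise to $[-c, c]$ before the decoupled weight decay is applied. The clipping rule, the hyperparameter values, and the function name are assumptions for illustration and may differ from the variant analyzed in the paper.

```python
import math

def clipped_adamw(grad, x0, steps, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, weight_decay=0.0, clip=1.0):
    """AdamW with an elementwise-clipped update (assumed clipping rule)."""
    x = list(x0)
    m = [0.0] * len(x)  # exponentially weighted first moment
    v = [0.0] * len(x)  # exponentially weighted second moment
    for t in range(1, steps + 1):
        g = grad(x)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)          # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            step = m_hat / (math.sqrt(v_hat) + eps)
            step = max(-clip, min(clip, step))         # clip the update, not the gradient
            x[i] -= lr * (step + weight_decay * x[i])  # decoupled weight decay
    return x

# f(x) = (x - 3)^2 has a gradient that is unbounded in x.
x_final = clipped_adamw(lambda p: [2.0 * (p[0] - 3.0)], [10.0], steps=2000)
```

Clipping the preconditioned update rather than the raw gradient keeps each step bounded by `lr * clip` regardless of how large the gradient is.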

2602.13770 2026-05-27 eess.IV cs.LG 版本更新

NeuroMambaLLM: Dynamic Graph Learning of fMRI Functional Connectivity in Autistic Brains Using Mamba and Language Model Reasoning

NeuroMambaLLM:使用Mamba和语言模型推理的自闭症大脑fMRI功能连接的动态图学习

Yasaman Torabi, Parsa Razmara, Hamed Ajorlou, Bardia Baraeinejad

发表机构 * Department of Electrical and Computer Engineering, McMaster University(麦克马斯特大学电气与计算机工程系) Department of Biomedical Engineering, University of Southern California(南加州大学生物医学工程系) Department of Electrical and Computer Engineering, University of Rochester(罗切斯特大学电气与计算机工程系) BIOSEN Group(BIOSEN集团)

AI总结 提出NeuroMambaLLM框架,结合动态潜在图学习、选择性状态空间时序建模与冻结的大语言模型,通过低秩适应实现fMRI动态功能连接的诊断分类与临床文本报告生成。

详情
AI中文摘要

大型语言模型(LLMs)在多模态领域展现了强大的语义推理能力。然而,它们与基于图的脑连接模型的集成仍然有限。此外,大多数现有的fMRI分析方法依赖于静态功能连接(FC)表示,这掩盖了对神经发育障碍(如自闭症)至关重要的瞬时神经动态。最近的状态空间方法(包括Mamba)有效地建模了时间结构,但通常作为独立的特征提取器使用,缺乏显式的高层推理。我们提出了NeuroMambaLLM,一个端到端框架,将动态潜在图学习和选择性状态空间时序建模与LLMs相结合。该方法从原始血氧水平依赖(BOLD)时间序列中动态学习功能连接,用自适应潜在连接取代固定相关图,同时抑制运动相关伪影并捕获长程时间依赖。生成的动态大脑表示被投影到LLM模型的嵌入空间中,其中基础语言模型保持冻结,并训练轻量级低秩适应(LoRA)模块以实现参数高效的对齐。这种设计使LLM能够执行诊断分类和基于语言的推理,从而分析动态fMRI模式并生成具有临床意义的文本报告。

英文摘要

Large Language Models (LLMs) have demonstrated strong semantic reasoning across multimodal domains. However, their integration with graph-based models of brain connectivity remains limited. In addition, most existing fMRI analysis methods rely on static Functional Connectivity (FC) representations, which obscure transient neural dynamics critical for neurodevelopmental disorders such as autism. Recent state-space approaches, including Mamba, model temporal structure efficiently, but are typically used as standalone feature extractors without explicit high-level reasoning. We propose NeuroMambaLLM, an end-to-end framework that integrates dynamic latent graph learning and selective state-space temporal modelling with LLMs. The proposed method learns the functional connectivity dynamically from raw Blood-Oxygen-Level-Dependent (BOLD) time series, replacing fixed correlation graphs with adaptive latent connectivity while suppressing motion-related artifacts and capturing long-range temporal dependencies. The resulting dynamic brain representations are projected into the embedding space of an LLM model, where the base language model remains frozen and lightweight low-rank adaptation (LoRA) modules are trained for parameter-efficient alignment. This design enables the LLM to perform both diagnostic classification and language-based reasoning, allowing it to analyze dynamic fMRI patterns and generate clinically meaningful textual reports.

2511.19289 2026-05-27 quant-ph cs.IT cs.LG math.IT version update

Performance Guarantees for Quantum Neural Estimation of Entropies

熵的量子神经估计的性能保证

Sreejith Sreekumar, Ziv Goldfeld, Mark M. Wilde

Affiliations * Laboratoire Des Signaux Et Systèmes (L2S), CNRS, CentraleSupélec, University of Paris-Saclay; School of Electrical and Computer Engineering, Cornell University

AI summary Studies quantum neural estimators (QNEs) of measured (Rényi) relative entropies, deriving non-asymptotic error risk bounds and exponential tail bounds, together with a copy-complexity analysis whose dependence on the accuracy ε is minimax optimal.

Comments 43 pages

Details
Journal ref
Quantum 10, 2113 (2026)
AI Chinese abstract

估计量子熵和散度是量子物理、信息论和机器学习中的一个重要问题。利用混合经典-量子架构的量子神经估计器(QNE)最近成为估计这些度量的一种有吸引力的计算框架。这种估计器将经典神经网络与参数化量子电路相结合,其部署通常需要繁琐地调整控制样本大小、网络架构和电路拓扑的超参数。本文首次以非渐近误差风险界的形式,对测量(Rényi)相对熵的QNE进行了形式化保证研究。我们进一步建立了指数尾界,表明误差是次高斯的,因此尖锐地集中在真实值附近。对于维度为$d$且具有有界Thompson度量的密度算子对的一个适当子类,我们的理论建立了QNE的副本复杂度为$O(|Θ(\mathcal{U})|d/ε^2)$,其中量子电路参数集为$Θ(\mathcal{U})$,该复杂度对精度$ε$具有极小极大最优依赖。此外,如果密度算子对是置换不变的,我们将上述维度依赖改进为$O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$。我们的理论旨在促进测量相对熵的QNE的原则性实现,并指导实践中的超参数调优。

English abstract

Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rényi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|Θ(\mathcal{U})|d/ε^2)$ for QNE with a quantum circuit parameter set $Θ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ε$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.

2605.14151 2026-05-27 math.OC cs.LG version update

Stochastic global optimization of continuous functions via random walks on Grassmannians

通过Grassmann流形上的随机游走实现连续函数的随机全局优化

Kartik Gupta, Stephen D. Miller, Pradeep Ravikumar, Ramarathnam Venkatesan

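AI summary Proposes a global optimization method based on random walks on Grassmannian manifolds: it samples random low-dimensional subspaces and solves the restricted problems with a black-box optimizer, giving convergence guarantees for nonconvex, nonsmooth objectives that depend only on the geometric distribution of restricted minima, together with a blind-spot robustness property.

Comments 21 pages

Details
AI Chinese abstract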

我们提出了一种基于Grassmann流形上随机游走的随机全局优化方法。为了最小化连续目标函数 $\ell:\mathbb{R}^d\rightarrow\mathbb{R}$,该方法反复随机采样 $k$ 维线性子空间(其中 $k\ll d$),使用任意黑盒优化器求解这些子空间上的低维限制问题,并更新迭代点(该迭代点单调地优于前一个迭代点)。与依赖凸性、光滑性、Lipschitz界或Polyak-Lojasiewicz型条件的经典优化分析不同,我们的收敛保证仅取决于通过 $\mathbb{R}^d$ 中给定点的 $k$ 维子空间上限制极小值的几何分布。我们确定了一个间隙参数——类似于随机游走的谱间隙——它控制迭代点接近全局最小值的速率。最后,我们论证了相同的分析产生了一种盲点鲁棒性:损失函数中足够窄且深的凹陷($\ell$ 向下尖峰的小测度区域)对算法轨迹的影响有限,因为它们不太可能被随机子空间采样遇到。

English abstract

We introduce a stochastic global optimization method based on random walks on Grassmannian manifolds. To minimize a continuous objective $\ell:\mathbb{R}^d\rightarrow\mathbb{R}$, the method repeatedly samples random $k$-dimensional linear subspaces (with $k\ll d$), solves the resulting low-dimensional restrictions of these problems to these subspaces using an arbitrary black-box optimizer, and updates the iterate (which monotonically improves upon the previous iterate). Unlike classical optimization analyses that rely on convexity, smoothness, Lipschitz bounds, or Polyak-Lojasiewicz-type conditions, our convergence guarantees depend only on the geometric distribution of restricted minima across the $k$-dimensional subspaces passing through a given point in $\mathbb{R}^d$. We identify a gap parameter -- an analogue of a spectral gap for random walks -- that controls the rate at which the iterates approach the global minimum value. Finally, we argue that the same analysis yields a blind-spot robustness property: sufficiently narrow, deep dips of the loss function (small-measure regions where $\ell$ spikes downward) have limited influence on the algorithm's trajectory, since they are unlikely to be encountered by random subspace sampling.
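The loop described in this abstract can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `random_subspace`, `inner_solver`, and `subspace_descent` are hypothetical, the subspace is drawn by QR-factorizing a Gaussian matrix, and plain random search stands in for the "arbitrary black-box optimizer".

```python
import numpy as np

def random_subspace(d, k, rng):
    # Orthonormal basis (d x k) of a random k-dimensional subspace,
    # obtained from the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q

def inner_solver(g, k, rng, n_trials=200, radius=1.0):
    # Stand-in black-box optimizer (assumption): random search over the
    # k subspace coefficients, starting from c = 0 (the current iterate).
    best_c = np.zeros(k)
    best_v = g(best_c)
    for _ in range(n_trials):
        c = radius * rng.standard_normal(k)
        v = g(c)
        if v < best_v:
            best_c, best_v = c, v
    return best_c, best_v

def subspace_descent(loss, x0, k, n_iters, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    history = [loss(x)]
    for _ in range(n_iters):
        basis = random_subspace(x.size, k, rng)
        # Restrict the objective to the affine subspace through x.
        c, v = inner_solver(lambda c: loss(x + basis @ c), k, rng)
        # c = 0 is always a candidate, so the loss never increases.
        x = x + basis @ c
        history.append(v)
    return x, history
```

Because the current iterate is always among the candidates, the recorded losses are non-increasing by construction, which mirrors the monotone-improvement property stated in the abstract; nothing here reproduces the paper's convergence analysis.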

2605.13779 2026-05-27 cs.LG cs.AI cs.DC version update

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT:用于训练和服务数百万LLM的托管基础设施

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

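Affiliations * Mind Lab

AI summary Presents MinT, a system that manages LoRA adapters for efficient training and online serving over large base models, supporting policy catalogs at the million scale.

Comments 30 pages, technical report

Details
AI Chinese abstract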

我们提出MindLab Toolkit (MinT),一个用于低秩适配(LoRA)后训练和在线服务的托管基础设施系统。MinT针对这样一种场景:在少量昂贵的基模型部署上产生许多训练好的策略。MinT不是将每个策略实现为合并的完整检查点,而是保持基模型驻留,并通过采样(rollout)、更新、导出、评估、服务和回滚等阶段移动导出的LoRA适配器修订版,将分布式训练、服务、调度和数据移动隐藏在服务接口后面。MinT沿三个维度扩展此路径。Scale Up将LoRA RL扩展到前沿规模的密集和MoE架构,包括MLA和DSA注意力路径,训练和服务已验证超过1T总参数。Scale Down仅移动导出的LoRA适配器,在秩1设置中可小于基模型大小的1%;适配器仅移交将测量步骤在4B密集模型上减少18.3倍,在30B MoE上减少2.85倍,而并发多策略GRPO将挂钟时间缩短1.77倍和1.45倍,且不提高峰值内存。Scale Out将持久策略可寻址性与CPU/GPU工作集分离:张量并行部署支持10^6规模的可寻址目录(通过100K测量单引擎扫描)和集群规模的千适配器活动波,冷加载作为计划的服务工作处理,打包的MoE LoRA张量将实时引擎加载提高8.5-8.7倍。因此,MinT管理百万规模的LoRA策略目录,同时在共享的1T级基模型上训练和服务选定的适配器修订版。

English abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.

2605.12827 2026-05-27 cs.CR cs.AI cs.LG version update

GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?

GraphIP-Bench:窃取图神经网络有多难,我们能阻止吗?

Kaixiang Zhao, Bolin Shen, Yuyang Dai, Shayok Chakraborty, Yushun Dong

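Affiliations * University of Notre Dame; Florida State University; University of California, Berkeley

AI summary Proposes GraphIP-Bench, a unified benchmark integrating 12 extraction attacks and 12 defenses, to evaluate how hard it is to steal a graph neural network and how effective defenses are; finds that stealing is easy at medium query budgets, that most defenses do not change this, and that heterophilic graphs are harder to steal.

Comments Under review

Details
AI Chinese abstract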

作为云服务部署的图神经网络(GNN)可能通过模型提取攻击被窃取,这种攻击从查询响应中训练替代模型以复制目标行为,而越来越多的所有权防御试图防止或追踪此类窃取。本文提出两个问题:窃取GNN有多难,我们能阻止吗?先前的工作无法回答这两个问题,因为实验使用了不一致的数据集、威胁模型和指标。我们引入GraphIP-Bench,一个统一的基准,在单一黑盒协议下评估双方。GraphIP-Bench集成了十二种提取攻击、十二种防御(涵盖水印、输出扰动和查询模式检测)、十个公共图(涵盖同质、异质和大规模场景)、三种GNN骨干网络和三种图学习任务。它报告了在共享划分、查询和预算下的保真度、任务效用、所有权验证和计算成本。我们进一步增加了一个联合攻击与防御赛道,对每个受防御目标运行每种攻击,并测量结果替代模型上的水印验证,揭示了防御在提取后保留了多少保护。实证结果清晰:在中等查询预算下窃取GNN很容易,大多数防御并未改变这一点;几种水印在受保护模型上可靠验证,但在提取的替代模型上失去了大部分验证信号,暴露了单模型评估忽略的差距;异配图系统性地更难窃取,而目标与替代模型之间的跨架构不匹配减少了但并未阻止提取。我们发布了GraphIP-Bench,附带可复现的脚本和配置,并将攻击和防御集成到PyGIP库中。代码:https://github.com/LabRAI/GraphIP-Bench。库:https://labrai.github.io/PyGIP/index.html。

English abstract

Graph neural networks (GNNs) deployed as cloud services can be stolen through model-extraction attacks, which train a surrogate from query responses to reproduce the target's behavior, and a growing line of ownership defenses tries to prevent or trace such theft. This paper asks two questions: how hard is it to steal a GNN, and can we stop it? Prior work cannot answer either, because experiments use inconsistent datasets, threat models, and metrics. We introduce GraphIP-Bench, a unified benchmark that evaluates both sides under a single black-box protocol. GraphIP-Bench integrates twelve extraction attacks, twelve defenses spanning watermarking, output perturbation, and query-pattern detection, ten public graphs covering homophilic, heterophilic, and large-scale regimes, three GNN backbones, and three graph-learning tasks. It reports fidelity, task utility, ownership verification, and computational cost on shared splits, queries, and budgets. We further add a joint attack-and-defense track that runs every attack on every defended target and measures watermark verification on the resulting surrogate, exposing how much protection a defense retains after extraction. The empirical picture is clear: stealing a GNN is easy at medium query budgets and most defenses do not change this; several watermarks verify reliably on the protected model but lose most of their verification signal on the extracted surrogate, exposing a gap that single-model evaluations miss; and heterophilic graphs are systematically harder to steal, while a cross-architecture mismatch between target and surrogate reduces but does not prevent extraction. We release GraphIP-Bench with reproducible scripts and configurations, and integrate the attacks and defenses into the PyGIP library. Code: https://github.com/LabRAI/GraphIP-Bench. Library: https://labrai.github.io/PyGIP/index.html.
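To make the reported quantities concrete, here is a small Python sketch of two of them: fidelity and task utility. It assumes fidelity means label agreement between surrogate and target on the same query nodes, which is a common convention but may differ from the benchmark's exact definition; the function names are hypothetical and unrelated to the PyGIP API.

```python
import numpy as np

def fidelity(target_pred, surrogate_pred):
    # Fraction of queries where the surrogate reproduces the target's label
    # (assumed definition; the benchmark may use a different one).
    return float(np.mean(np.asarray(target_pred) == np.asarray(surrogate_pred)))

def accuracy(pred, labels):
    # Task utility: fraction of predictions matching the ground truth.
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))

# Toy predictions on five query nodes (made-up numbers for illustration).
labels = [0, 1, 1, 1, 0]
target = [0, 1, 1, 2, 0]
surrogate = [0, 1, 2, 2, 0]
print(fidelity(target, surrogate), accuracy(target, labels), accuracy(surrogate, labels))
```

The two numbers answer different questions: a surrogate can copy the target's mistakes (high fidelity) while having lower accuracy than the target, which is why the abstract lists fidelity and task utility separately.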

2605.06152 2026-05-27 cs.LG cs.CL math.OC stat.ML version update

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Grokking 还是 Glitching?低精度如何驱动 Slingshot 损失尖峰

Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

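Affiliations * Tsinghua University; The University of Tokyo

AI summary Shows that Slingshot loss spikes in deep neural network training are caused by Numerical Feature Inflation (NFI), a mechanism arising from floating-point precision limits, and explains the rapid parameter-norm growth and vanishing gradients that accompany them.

Comments 28 pages, 13 figures; ICML 2026 Workshop on High-dimensional Learning Dynamics (Spotlight)

Details
AI Chinese abstract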

深度神经网络在无正则化的长期训练中会出现周期性的损失尖峰,这种现象被称为“Slingshot 机制”。现有工作通常将其归因于内在的优化动力学,但其触发机制仍不清楚。本文证明这种现象是浮点算术精度限制的结果。当训练进入高置信度阶段时,正确类别的 logit 与其他 logit 之间的差异可能超过吸收误差阈值。然后在反向传播中,正确类别的梯度被精确舍入为零,而错误类别的梯度保持非零。这打破了跨类别的梯度零和约束,并在分类器层的参数更新中引入了系统性漂移。我们证明这种漂移与特征形成正反馈循环,导致全局分类器均值和全局特征均值呈指数增长。我们将这种机制称为数值特征膨胀(NFI)。该机制解释了 Slingshot 尖峰前的快速范数增长、随后梯度的重新出现以及由此产生的损失尖峰。我们进一步表明,NFI 并不等同于观察到的损失尖峰:在更实际的任务中,部分吸收可能不会产生可见的尖峰,但它仍然可以打破零和约束并驱动参数范数的快速增长。我们的结果将 Slingshot 重新解释为有限精度训练的一种数值动力学,并为训练后期异常参数增长和 logit 发散提供了可检验的解释。

English abstract

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
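The absorption effect described in this abstract can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the logit gap of 20 and the four-class setup are arbitrary choices, and `softmax_ce_grad` is a hypothetical helper that computes the softmax cross-entropy gradient with respect to the logits (softmax minus one-hot).

```python
import numpy as np

def softmax_ce_grad(logits, label, dtype):
    # Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot(label),
    # computed entirely in the requested floating-point precision.
    z = np.asarray(logits, dtype=dtype)
    e = np.exp(z - z.max())
    p = e / e.sum(dtype=dtype)
    onehot = np.zeros_like(p)
    onehot[label] = 1
    return p - onehot

logits = [20.0, 0.0, 0.0, 0.0]  # correct class 0 leads by a large margin
g32 = softmax_ce_grad(logits, 0, np.float32)
g64 = softmax_ce_grad(logits, 0, np.float64)

# float32: the small exponentials are absorbed when added to 1.0, so the
# correct-class probability is exactly 1 and its gradient is exactly 0,
# while the incorrect-class gradients stay positive.
print("float32:", g32, "sum =", g32.sum())
# float64: the correct-class gradient is negative and the sum is ~0.
print("float64:", g64, "sum =", g64.sum())
```

In float32 the per-class gradients no longer sum to zero, which is the broken zero-sum constraint the abstract refers to; the feedback loop that follows from this drift is the paper's contribution and is not modeled here.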

2509.26469 2026-05-27 cs.LG version update

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

DiVeQ: 使用重参数化技巧的可微分向量量化

Mohammad Hassan Vali, Tom Bäckström, Arno Solin

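Affiliations * ELLIS Institute Finland & Department of Computer Science, Aalto University, Finland; Department of Information and Communications Engineering, Aalto University, Finland

AI summary Proposes DiVeQ, which uses the reparameterization trick to treat quantization as adding an error vector that mimics quantization distortion, so the forward pass stays hard while gradients flow; also introduces a space-filling variant, SF-DiVeQ, that reduces quantization error and uses the full codebook, improving reconstruction and sample quality on VQ-VAE, VQGAN, and DAC tasks.

Details
AI Chinese abstract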

向量量化在深度模型中很常见,但其硬分配会阻止梯度传播并阻碍端到端训练。我们提出DiVeQ,将量化视为添加一个模拟量化失真的误差向量,保持前向传播为硬量化的同时让梯度流动。我们还提出一种空间填充变体(SF-DiVeQ),将输入分配到由码字间连线构成的曲线上,从而减少量化误差并充分利用码本。两种方法均无需辅助损失或温度调度即可实现端到端训练。在VQ-VAE图像压缩、VQGAN图像生成和DAC语音编码任务中,我们的方法在不同数据集上相比其他量化方法提高了重建质量和样本质量。

English abstract

Vector quantization is common in deep models, yet its hard assignments block gradients and hinder end-to-end training. We propose DiVeQ, which treats quantization as adding an error vector that mimics the quantization distortion, keeping the forward pass hard while letting gradients flow. We also present a space-filling variant (SF-DiVeQ) that assigns input to a curve constructed by the lines connecting codewords, resulting in less quantization error and full codebook usage. Both methods train end-to-end without requiring auxiliary losses or temperature schedules. In VQ-VAE image compression, VQGAN image generation, and DAC speech coding tasks across various data sets, our proposed methods improve reconstruction and sample quality over alternative quantization approaches.

2605.08455 2026-05-27 cs.LG cs.PL cs.SE version update

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

CUDABeaver:基于LLM的自动化CUDA调试基准测试

Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

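Affiliations * University of Minnesota

AI summary Proposes the CUDABEAVER benchmark, which evaluates LLMs' ability to repair CUDA code with the protocol-conditional metric pass@k(M,C,A), and shows how performance-loss tolerance affects measured success.

Comments 25 pages, 5 figures

Details
AI Chinese abstract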

调试CUDA程序长期以来一直具有挑战性,因为故障通常源于硬件行为、编译器决策、内存层次结构和异步执行之间微妙的交互。更重要的是,随着GPU在科学计算、机器学习、图形和系统工作负载中的快速扩展,CUDA调试变得比以往任何时候都更具挑战性。当前对基于LLM的CUDA编程的评估大多忽略了这一场景:模型可以通过退化性修复通过正确性测试,将CUDA代码简化为更安全但更慢的程序,从而放弃原始优化结构。我们引入了CUDABEAVER,一个从基于LLM的CUDA生成过程中产生的真实失败工作空间中进行CUDA调试的基准。每个任务提供损坏的候选代码、原生构建/测试命令、原始错误证据以及一个可编辑文件。CUDABEAVER评估修复程序是否真正修复了失败的CUDA代码,还是仅仅找到了一个更慢的通过测试的替代方案,并按故障类别、调试轨迹、停滞模式和性能保持情况报告结果。我们进一步提出了pass@k(M,C,A),一种协议条件的CUDA调试指标,通过明确修复程序M、语料库C和协议轴A。使用该指标在213个任务和七个前沿LLM上,我们表明协议感知评估提供了更真实的CUDA调试能力视图:当性能损失容忍度高时,修复程序看起来更强,但即使是一个微小的更严格的性能要求也能显著降低测量成功率,分数变化高达40个百分点。

English abstract

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
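The sensitivity to performance tolerance can be illustrated with a short Python sketch. The abstract does not give the formula for pass@k(M,C,A), so this sketch assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), and treats a single hypothetical protocol axis, `max_slowdown`, as the success criterion; the function names and the toy attempt data are also assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    # Standard unbiased pass@k estimator: probability that at least one of
    # k samples drawn from n attempts (c of them successful) succeeds.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def protocol_pass_at_k(attempts, k, max_slowdown):
    # attempts: list of (tests_passed, slowdown) pairs for one task, where
    # slowdown is repaired runtime divided by reference runtime.
    # A repair counts only if it passes the tests AND stays within the
    # tolerated slowdown (hypothetical protocol axis).
    ok = sum(1 for passed, slowdown in attempts if passed and slowdown <= max_slowdown)
    return pass_at_k(len(attempts), ok, k)

# Toy data: three attempts pass the tests, but two are noticeably slower.
attempts = [(True, 1.05), (True, 3.0), (False, 1.0), (True, 1.5)]
print(protocol_pass_at_k(attempts, k=1, max_slowdown=5.0))  # lenient protocol
print(protocol_pass_at_k(attempts, k=1, max_slowdown=1.1))  # strict protocol
```

On this toy task the same set of attempts scores very differently under the lenient and strict settings, which is the kind of protocol dependence the abstract describes.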

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP version update

PHALAR: Phasors for Learned Musical Audio Representations

PHALAR:用于学习音乐音频表示的相量

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

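Affiliations * Department of Computer Science, Sapienza University of Rome, Italy; Moises Systems, Inc.; Paradigma, Inc.

AI summary Proposes PHALAR, a contrastive framework that uses Learned Spectral Pooling and a complex-valued head to obtain pitch and phase equivariance; on stem retrieval it needs fewer than 50% of the parameters, trains 7x faster, improves relative accuracy by up to about 70%, and captures robust musical structure.

Comments Accepted at ICML 2026

Details
AI Chinese abstract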

茎检索,即匹配缺失茎到给定音频子混音的任务,是一个关键挑战,目前受限于丢弃时间信息的模型。我们引入PHALAR,一个对比框架,在参数少于50%且训练加速7倍的情况下,相对于现有技术实现了高达约70%的相对准确率提升。通过利用学习谱池化层和复值头,PHALAR强制施加音高等变和相位等变偏差。PHALAR在MoisesDB、Slakh和ChocoChorales上建立了新的检索最优结果,与人类一致性判断的相关性显著高于语义基线。最后,零样本节拍跟踪和线性和弦探测证实PHALAR捕获了超越检索任务的鲁棒音乐结构。

English abstract

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE version update

Tool Calling is Linearly Readable and Steerable in Language Models

语言模型中的工具调用是线性可读且可引导的

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

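Affiliations * University College London; Holistic AI; Imperial College London

AI summary Finds linear directions inside language models that correspond to tool selection; intervening along these directions switches the tool call and flags likely errors in advance, validated across multiple models and benchmarks.

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

Details
AI Chinese abstract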

当工具调用代理选错工具时,失败在执行之前是不可见的:邮件被发送,会议被错过。随着代理承担重要行动,一次糟糕的工具调用可能造成实际损害。目前我们无法在模型内部查看并在错误发生前捕捉它;本文表明我们可以做到。在模型内部,工具的选择由激活空间中的单个方向承载,每对工具对应一个方向。在生成过程中添加该方向会切换模型选择的工具。在涵盖 Gemma 3、Qwen 3、Qwen 2.5 和 Llama 3.1(270M 到 27B)的 12 个指令微调模型和 6 个基础模型上,这在 4B+ 指令微调模型上对 15 个工具的合成基准达到 83-100% 的准确率,在真实 API 基准 τ-bench airline 上达到 77-94%。随后的 JSON 参数自动适应新工具的模式,因此仅翻转名称就足够了。相同的每工具方向还能在错误发生前标记潜在错误:模型在两个工具之间不确定的查询失败率比确定的高 21 倍(Gemma 3 27B)。这不仅仅是主题注入:相同幅度的随机向量给出 0% 的切换率,而在单个领域(共享一个主题的 14 个航空工具)内的探针仍然能在五个 4B-14B 模型上以 top-1 61-89% 的准确率读取模型将调用的工具。即使是基础模型在能够输出工具之前内部已经携带了正确的工具:从模型内部状态读取所选工具(余弦读出)在 BFCL 上恢复 61-82% 的准确率,而基础生成仅为 2-10%,这表明预训练形成了表示,而指令微调后来将其连接到输出。我们的结果涵盖单轮、固定菜单设置;在多轮代理循环中,相同的干预不太稳定(匹配基线的增益或损失高达 30 个百分点,没有一致的方向)。

English abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).
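The read-and-steer idea can be sketched on synthetic vectors with NumPy. Everything below is an assumption made for illustration: the abstract does not say how the per-pair direction is estimated, so a difference of class means is used, the "activations" are random vectors rather than real model states, and the function names are hypothetical.

```python
import numpy as np

def tool_direction(acts_a, acts_b):
    # Difference of class means: points from "tool A" activations toward
    # "tool B" activations (assumed estimator, one direction per tool pair).
    return acts_b.mean(axis=0) - acts_a.mean(axis=0)

def cosine_readout(h, prototypes):
    # Index of the tool prototype with the highest cosine similarity to h.
    sims = prototypes @ h / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(h))
    return int(np.argmax(sims))

def steer(h, direction, alpha=1.0):
    # Add the direction to a hidden state; alpha scales the intervention.
    return h + alpha * direction

# Synthetic stand-ins for hidden states collected on two tools' queries.
rng = np.random.default_rng(0)
mu = rng.standard_normal((2, 64))
acts_a = mu[0] + 0.1 * rng.standard_normal((50, 64))
acts_b = mu[1] + 0.1 * rng.standard_normal((50, 64))
prototypes = np.stack([acts_a.mean(axis=0), acts_b.mean(axis=0)])

h = acts_a[0]
direction = tool_direction(acts_a, acts_b)
print(cosine_readout(h, prototypes), cosine_readout(steer(h, direction), prototypes))
```

On real models the intervention would be applied inside the forward pass during generation; this toy version only shows that adding the pair direction moves a state from one prototype's neighborhood to the other's.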

2605.07632 2026-05-27 cs.CL cs.AI cs.LG version update

Post-training makes large language models less human-like

后训练使大型语言模型更不像人类

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

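Affiliations * Helmholtz Munich; Massachusetts Institute of Technology; University of Tübingen; University of Oxford; Stanford

AI summary Introduces the Psych-201 dataset and finds that post-training (the stage that turns base models into useful assistants) consistently reduces alignment with human behavior, that this misalignment widens in newer model generations, and that persona induction does not improve individual-level predictions.

Details
AI Chinese abstract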

大型语言模型(LLMs)越来越多地被用作人类参与者的替代品,但目前尚不清楚哪些模型最能捕捉人类行为及其原因。为了解决这个问题,我们引入了Psych-201,这是一个新颖的数据集,使我们能够大规模测量行为对齐。我们发现,后训练——将基础模型转化为有用助手的阶段——在模型家族、规模和目标上一致地降低了与人类行为的对齐度。此外,这种错位在新模型世代中扩大,即使基础模型继续改进。最后,我们发现人物诱导——一种通过将模型条件化为参与者特定信息来引发类人行为的流行技术——并不能改善个体层面的预测。综合来看,我们的结果表明,当前用于将LLMs转化为有用助手的那些过程也使得它们成为人类行为的不太准确的模型。

English abstract

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

2511.22882 2026-05-27 cs.LG math.PR version update

Normalizing Flows on Quotient Manifolds via Boundary Quotients

通过边界商在商流形上的归一化流

William Ghanem, Benjamin Cai

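Affiliations * The University of Texas at Austin

AI summary Proposes the boundary-quotient framework for learning densities on manifolds that arise as boundary quotients of simpler domains, constructs normalizing flows on quotient manifolds under discrete group actions, instantiates the construction on genus-g surfaces, and evaluates it on lens spaces.

Details
AI Chinese abstract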

我们引入了边界商,并提出了一个框架,用于在作为更简单域边界商的流形上学习密度。我们展示了该框架可用于构造商流形 $N/G$ 上的归一化流,其中离散群 $G$ 作用在 $N$ 上。我们为亏格 $g$ 曲面 $\Sigma_g$ 实例化了这一构造。当 $G$ 有限时,我们展示了其对对称感知学习的适用性;我们在三维球面的循环商上进行了演示。在透镜空间上的实验表明,简单的商前 RealNVP 模型可以在评估成本大幅降低的同时取得强劲的结果。

English abstract

We introduce boundary quotients and present a framework for learning densities on manifolds that arise as boundary quotients of simpler domains. We show that this framework can be used to construct normalizing flows on quotient manifolds $N/G$, where a discrete group $G$ acts on $N$. We instantiate this construction for genus-$g$ surfaces $Σ_g$. When $G$ is finite, we show applicability to symmetry aware learning; we demonstrate this on cyclic quotients of the 3-sphere. Experiments on lens spaces show that simple pre-quotient RealNVP models can achieve strong results while being substantially cheaper to evaluate.

2603.23985 2026-05-27 cs.LG version update

Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

精简你的大语言模型:通过融合任务特定重要性分数的维度级全局剪枝

Jimyung Hong, Jaehyung Kim

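Affiliations * Yonsei University

AI summary Proposes DIET, a training-free, dimension-wise structured pruning method that builds a global mask by majority voting over cross-task activation magnitudes, retaining task awareness while avoiding costly training, and substantially improves post-pruning accuracy on Gemma-2 models.

Comments 14 pages, 10 figures. Code available at https://github.com/Jimmy145123/DIET

Details
AI Chinese abstract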

大型语言模型(LLMs)展现了卓越的能力,但其庞大的规模给实际部署带来了重大挑战。结构化剪枝通过移除整个维度或层提供了一种有前景的解决方案,然而现有方法面临关键权衡:任务无关方法无法适应任务特定需求,而任务感知方法需要昂贵的训练来学习任务适应性。我们提出DIET(通过融合任务重要性分数进行维度级全局剪枝),一种无需训练的结构化剪枝方法,结合了维度级粒度与任务感知选择。DIET仅使用每个任务100个样本跨任务分析激活幅度,然后应用多数投票构建单个全局掩码。DIET不需要预计算或训练的高成本。在Gemma-2 2B和9B模型上的七个零样本基准测试实验证明了DIET的有效性;例如,在Gemma-2 2B上20%稀疏度下,与先前最先进的结构化剪枝方法相比,DIET实现了近10%的平均准确率提升。这一优势在不同稀疏度和模型规模下持续存在,使DIET成为结构化LLM剪枝的实用且稳健的选择。

English abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET incurs no large pre-computation or training cost. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET improves average accuracy by nearly 10% over previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.
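The profile-then-vote procedure can be sketched as follows in Python with NumPy. This is a sketch under stated assumptions rather than the released DIET code: per-dimension importance is taken to be the mean absolute activation, each task keeps its top (1 - sparsity) fraction of dimensions, and a dimension survives globally if a strict majority of tasks keep it. The function names are hypothetical.

```python
import numpy as np

def task_keep_mask(acts, sparsity):
    # acts: (n_samples, d) activations collected for one task.
    # Score each dimension by mean |activation| (assumed importance score)
    # and keep the highest-scoring (1 - sparsity) fraction.
    score = np.abs(acts).mean(axis=0)
    n_keep = int(round(score.size * (1.0 - sparsity)))
    mask = np.zeros(score.size, dtype=bool)
    mask[np.argsort(score)[::-1][:n_keep]] = True
    return mask

def global_mask(task_acts, sparsity):
    # Majority vote across tasks: keep a dimension if more than half of the
    # per-task masks keep it. The resulting global sparsity can deviate
    # from the per-task target.
    masks = np.stack([task_keep_mask(a, sparsity) for a in task_acts])
    votes = masks.sum(axis=0)
    return votes * 2 > len(task_acts)
```

A usage pattern would be to collect a small batch of activations per task (the abstract mentions 100 samples), call `global_mask`, and drop the masked-out hidden dimensions everywhere; how ties and exact sparsity targets are handled in the paper is not stated in the abstract.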

2604.22774 2026-05-27 cs.CY cs.AI cs.CV cs.LG version update

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

当VLM“修正”学生:多行手写数学OCR评估中的过度修正识别与惩罚

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

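Affiliations * Electronics and Telecommunications Research Institute

AI summary Addresses VLM over-correction in the evaluation of multi-line handwritten math OCR by proposing PINK, an LLM-based semantic evaluation metric that penalizes over-correction and aligns better with human judgment than BLEU on the FERMAT dataset.

Details
AI Chinese abstract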

手写数学的准确转录对于教育AI系统至关重要,但当前基准未能正确评估这一能力。大多数先前研究关注单行表达式,并依赖BLEU等词汇指标,无法评估跨多行学生解决方案的语义推理。本文首次系统研究多行手写数学光学字符识别(OCR),揭示了视觉语言模型(VLM)的一个关键失败模式:过度修正。这些模型往往“修正”错误,而非忠实地转录学生作品,从而隐藏了教育评估旨在检测的错误。为解决此问题,我们提出PINK(基于惩罚的INK分数),一种语义评估指标,利用大语言模型(LLM)进行基于评分标准的评分,并明确惩罚过度修正。我们在FERMAT数据集上对15个最先进的VLM进行全面评估,发现与BLEU相比出现显著的排名反转:GPT-4o等模型因激进的过度修正受到严重惩罚,而Gemini 2.5 Flash成为最忠实的转录者。此外,人类专家研究表明,PINK与人类判断的一致性显著更高(55.0%偏好,而BLEU为39.5%),为教育场景中的手写数学OCR提供了更可靠的评估框架。

English abstract

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

2603.13381 2026-05-27 cs.LG cs.AI version update

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

注意力投影中的非线性:非线性查询的情况

Marko Karbevski

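Affiliations * Simplicity Technologies

AI summary Proposes replacing the attention query projection W_Q with a nonlinear residual implemented by a bottleneck MLP, and reports performance gains on GPT-3 small style models.

Comments Accepted at the ICLR 2026 GRaM workshop: https://openreview.net/forum?id=pwdnneFiNZ#discussion

Details
AI Chinese abstract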

最近的代数分析表明,在仅解码器和仅编码器Transformer中,查询投影$W_Q$可以设置为恒等映射而不会显著降低性能。这是因为注意力仅通过乘积$XW_Q, XW_K, XW_V$依赖于$X$,允许基变换被相邻层吸收并通过网络传播。我们将$W_Q \in \R^{d \times d}$替换为非线性残差形式$Q(X) = X + f_θ(X)$,其中$f_θ$是一个瓶颈MLP,具有$d^2 + O(d)$个参数。恒等项将非线性锚定到已知良好的先验。在GPT-3小规模风格模型上的实验显示,与基线相比持续改进(验证对数损失降低$2.40\%$,困惑度降低$6.81\%$),轻松优于参数增加12.5%的非嵌入参数模型。这些结果激励在更大规模和多模态上的研究。

English abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \R^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_θ(X)$, where $f_θ$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
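A minimal NumPy sketch of the residual query map Q(X) = X + f_θ(X) follows. Several details are assumptions, since the abstract fixes only the form and the parameter count: the bottleneck width is set to d/2 so that the two weight matrices total d^2 entries, tanh is used as the nonlinearity, and the output matrix is zero-initialized so that Q starts as the identity. The function names are hypothetical.

```python
import numpy as np

def init_params(d, rng):
    r = d // 2  # bottleneck width: d*r + r*d = d^2 weights (assumed choice)
    return {
        "W1": rng.standard_normal((d, r)) / np.sqrt(d),
        "b1": np.zeros(r),
        "W2": np.zeros((r, d)),  # zero init (assumption): Q(X) = X at start
        "b2": np.zeros(d),
    }

def query(X, p):
    # Q(X) = X + f_theta(X), with f_theta a two-layer bottleneck MLP.
    h = np.tanh(X @ p["W1"] + p["b1"])  # nonlinearity is an assumption
    return X + h @ p["W2"] + p["b2"]

def n_params(p):
    # Total parameter count: d^2 weights plus d/2 + d biases, i.e. d^2 + O(d).
    return sum(v.size for v in p.values())
```

With d = 64 this gives 4096 weights and 96 biases, matching the stated d^2 + O(d) budget, so the nonlinear query uses roughly the same number of parameters as the d x d matrix W_Q it replaces.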

2512.05794 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Mechanistic Interpretability of Antibody Language Models Using SAEs

使用 SAE 对抗体语言模型的机制可解释性研究

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

Affiliations: Department of Statistics, University of Oxford, UK; Reticular, San Francisco, USA; EECS, MIT, Cambridge MA, USA; Leyden Laboratories BV, Leiden, The Netherlands

AI summary: Applies TopK and Ordered sparse autoencoders (SAEs) to autoregressive antibody language models. TopK SAEs reveal biologically meaningful latent features but do not guarantee causal control over generation; Ordered SAEs impose a hierarchy that reliably identifies steerable features, at the cost of more complex activation patterns.

Comments v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section

详情
AI中文摘要

稀疏自编码器(SAE)是一种机制可解释性技术,已被用于揭示大型蛋白质语言模型中学到的概念。在此,我们采用 TopK 和 Ordered SAE 来研究自回归抗体语言模型,并引导其生成。我们表明,TopK SAE 可以揭示有生物学意义的潜在特征,但高特征-概念相关性并不能保证对生成的因果控制。相比之下,Ordered SAE 施加了层次结构,能够可靠地识别可操控特征,但代价是激活模式更复杂且可解释性较低。这些发现推进了领域特异性蛋白质语言模型的机制可解释性,并表明,虽然 TopK SAE 足以将潜在特征映射到概念,但在需要精确生成引导时,Ordered SAE 更可取。

英文摘要

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
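
For readers unfamiliar with TopK SAEs, the sketch below shows the generic mechanism: keep the k largest latent activations, reconstruct from decoder directions, and steer by adding a scaled decoder direction to an activation. It is an illustration under assumptions, not the paper's code: the pre-activations are taken as given (the encoder's linear map is omitted), and the toy decoder, `strength`, and function names are hypothetical.

```python
def topk_sae_encode(pre_acts, k):
    """Apply ReLU, then keep the k largest latent activations and zero the rest."""
    relu = [max(0.0, a) for a in pre_acts]
    keep = set(sorted(range(len(relu)), key=lambda i: relu[i], reverse=True)[:k])
    return [relu[i] if i in keep else 0.0 for i in range(len(relu))]

def decode(latents, decoder, bias):
    """Reconstruct the activation as bias + sum_i latents[i] * decoder[i]."""
    out = list(bias)
    for z, direction in zip(latents, decoder):
        if z != 0.0:
            out = [o + z * w for o, w in zip(out, direction)]
    return out

def steer(activation, decoder, feature, strength):
    """Add a scaled decoder direction for one feature to a model activation."""
    return [a + strength * w for a, w in zip(activation, decoder[feature])]

# Toy example: 4 latent features, 2-dimensional model activations.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
latents = topk_sae_encode([0.2, -1.0, 3.0, 1.5], k=2)
recon = decode(latents, decoder, bias=[0.0, 0.0])
steered = steer([1.0, 1.0], decoder, feature=3, strength=2.0)
```

A feature can correlate strongly with a concept in `latents` and still fail to change generation when passed to `steer`, which is the gap between correlation and causal control that the abstract reports for TopK SAEs.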

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA 版本更新

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

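Affiliations: Zhejiang University; Tencent

AI summary: Introduces Chat2Workflow, a benchmark for evaluating how well large language models generate executable visual workflows from natural language, together with an agentic baseline that improves performance.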

Comments Work in progress

详情
AI中文摘要

目前,可执行的可视化工作流已成为实际工业部署中的主流范式,提供了强大的可靠性和可控性。然而,在当前实践中,此类工作流几乎完全通过手动工程构建:开发人员必须仔细设计工作流,为每个步骤编写提示,并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程,我们引入了Chat2Workflow,一个直接从自然语言生成可执行可视化工作流的基准,并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建,每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台(如Dify和Coze)上。实验结果表明,尽管最先进的语言模型通常能捕捉高层次意图,但在生成正确、稳定且可执行的工作流方面仍存在困难,尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升,但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2604.18751 2026-05-27 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

超越系数:非线性时间序列模型中可解释因果发现的预测必要性检验

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

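Affiliations: Lucy Family Institute for Data & Society; University of Notre Dame; Department of Political Science

AI summary: Addresses the misreading of causal scores from nonlinear time-series models as regression coefficients, and proposes a forecast-necessity testing framework based on edge ablation and forecast comparison to assess whether a candidate causal relationship is actually needed for accurate prediction.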

详情
AI中文摘要

非线性机器学习模型越来越多地用于发现时间序列数据中的因果关系,但其输出的解释仍不明确。特别是,正则化神经自回归模型产生的因果分数常被视为回归系数的类比,导致误导性的统计显著性声明。在本文中,我们认为非线性时间序列模型中的因果相关性应通过预测必要性而非系数大小来评估,并提出了一种实用的评估程序。我们提出了一个基于系统边消融和预测比较的可解释评估框架,用于测试候选因果关系是否对准确预测是必要的。以神经加性向量自回归作为案例研究模型,我们将该框架应用于一个关于民主发展的真实世界案例研究,该案例将面板数据(139个国家的民主指标)建模为多元时间序列。我们表明,具有相似因果分数的关系由于冗余、时间持久性和特定制度效应,其预测必要性可能差异巨大。我们的结果展示了预测必要性检验如何支持应用AI系统中更可靠的因果推理,并为在高风险领域解释非线性时间序列模型提供实用指导。

英文摘要

Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
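
The edge-ablation idea can be illustrated with a toy additive forecaster. This is a sketch under assumptions, not the authors' procedure: the data and coefficients are made up, the names `predict` and `forecast_necessity` are hypothetical, and the ablated model is not refit (the abstract does not say whether the paper retrains after removing an edge).

```python
def mse(pred, target):
    """Mean squared forecast error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def forecast_necessity(predict, edges, inputs, targets):
    """Return, per edge, the relative MSE increase when that edge is ablated."""
    base = mse([predict(row, edges) for row in inputs], targets)
    scores = {}
    for edge in edges:
        kept = [e for e in edges if e != edge]
        ablated = mse([predict(row, kept) for row in inputs], targets)
        scores[edge] = (ablated - base) / base if base > 0 else float("inf")
    return scores

# Two edges with the same "causal score" (coefficient 0.5) in a toy additive model.
COEF = {"y_lag": 0.5, "x_lag": 0.5}

def predict(row, edges):
    """One-step forecast that sums the contributions of the active edges."""
    return sum(COEF[e] * row[e] for e in edges)

inputs = [{"y_lag": 1.0, "x_lag": 0.1}, {"y_lag": 2.0, "x_lag": 0.1},
          {"y_lag": 3.0, "x_lag": 0.1}, {"y_lag": 4.0, "x_lag": 0.1}]
targets = [0.60, 1.00, 1.60, 2.00]  # made-up observations with small noise

scores = forecast_necessity(predict, ["y_lag", "x_lag"], inputs, targets)
```

Both edges carry the same coefficient, yet removing `y_lag` inflates the forecast error far more than removing `x_lag`, because `x_lag` barely varies in this toy data. That is the kind of divergence between score magnitude and predictive necessity the abstract describes.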

2604.11467 2026-05-27 cs.AI cs.HC cs.LG 版本更新

From Attribution to Action: A Human-Centered Application of Activation Steering

从归因到行动:激活导向的人本应用

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

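Affiliations: Fraunhofer Heinrich-Hertz-Institut; Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data

AI summary: Proposes an interactive workflow that combines SAE-based attribution with activation steering. Expert interviews (N=8) indicate that it supports a shift from inspection to intervention, and they surface debugging strategies such as component suppression along with potential risks.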

详情
AI中文摘要

可解释人工智能(XAI)方法揭示了哪些特征影响模型预测,但为实践者基于这些解释采取行动提供了有限的手段。通过XAI识别出的组件的激活导向为可操作的解释提供了一条路径,但其实际效用仍未得到充分研究。我们引入了一个交互式工作流,将基于SAE的归因与激活导向相结合,用于视觉模型中概念使用的实例级分析,并实现为一个基于网页的工具。基于此工作流,我们进行了半结构化专家访谈(N=8),在CLIP上执行调试任务,以调查实践者如何推理、信任和应用激活导向。我们发现,导向使得从检查转向基于干预的假设检验(8/8参与者),大多数参与者将信任建立在观察到的模型响应上,而非仅仅解释的合理性(6/8)。参与者采用了系统性的调试策略,其中组件抑制占主导(7/8),并指出了包括涟漪效应和实例级修正的有限泛化在内的风险。总体而言,激活导向使可解释性更具可操作性,同时为安全有效使用提出了重要考虑。

英文摘要

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

2505.23606 2026-05-27 cs.LG cs.CV 版本更新

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit: 通过统一离散扩散模型解放超越文本到图像的生成

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

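Affiliations: M-E-AGI-Lab

AI summary: Proposes Muddit, a unified discrete diffusion transformer that combines the strong visual priors of a pretrained text-to-image backbone with a lightweight text decoder for fast, parallel generation across text and image modalities, matching or outperforming significantly larger autoregressive models in quality and efficiency.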

Comments Accepted to ICLR 2026. Codes and Supplementary Material: https://github.com/M-E-AGI-Lab/Muddit

详情
AI中文摘要

统一生成模型旨在单一架构和解码范式下处理跨模态的多种任务——如文本生成、图像生成和视觉-语言推理。自回归统一模型因顺序解码导致推理缓慢,而非自回归统一模型因预训练骨干有限导致泛化能力弱。我们引入第二代Meissonic:Muddit,一种统一离散扩散Transformer,能够在文本和图像模态上实现快速并行生成。与先前从头训练的统一扩散模型不同,Muddit将来自预训练文本到图像骨干的强视觉先验与轻量文本解码器集成,从而在统一架构下实现灵活且高质量的多模态生成。实验结果表明,Muddit在质量和效率上均达到或优于显著更大的自回归模型。该工作凸显了纯离散扩散在配备强视觉先验时,作为统一生成的可扩展且有效骨干的潜力。

英文摘要

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

2604.11056 2026-05-27 cs.LG cs.AI 版本更新

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

事后信用可驻留之处:RLVR中令牌更新的有符号容量视角

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

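Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Huawei Technologies Ltd.

AI summary: Uses conditional mutual information to show that token entropy upper-bounds token-level hindsight credit in RLVR, introduces a four-quadrant decomposition to separate update directions, and designs HAPO, which performs capacity-guided advantage reallocation and is competitive among entropy-aware baselines on mathematical reasoning benchmarks.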

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)提升了大语言模型(LLMs)的推理能力,但稀疏的结果奖励使得令牌级信用分配变得困难。我们将令牌级信用视为从行为策略到事后后验的奖励条件偏移。在自回归RLVR中,这种偏移可以通过条件互信息(CMI)表示,这表明令牌熵限制了可能的事后信用上限。然而,熵指示的是容量而非更新方向,因此我们引入了四象限分解,根据奖励极性和令牌熵来分离更新。受控干预表明,这两个因素共同塑造了令牌更新。持续的推理增益集中在有符号的高熵象限,而低熵更新则迅速饱和。基于此分析,我们提出了事后感知策略优化(HAPO),这是对GRPO的一种符号保持修改,执行容量引导的优势重分配。在两个模型设置的数学推理基准上的实验表明,HAPO在熵感知基线中取得了有竞争力的性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
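
The two axes of the Four Quadrant Decomposition, reward polarity and token entropy, can be made concrete as below. The entropy bound follows from the standard fact that mutual information between a token and the reward cannot exceed the token's own entropy. The threshold value, the quadrant labels, and the function names are assumptions for illustration; the abstract does not specify how the paper splits high from low entropy.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def quadrant(reward_positive, entropy, threshold):
    """Label a token update by reward polarity and entropy level."""
    polarity = "positive" if reward_positive else "negative"
    level = "high-entropy" if entropy >= threshold else "low-entropy"
    return f"{polarity}/{level}"

THRESHOLD = 0.5  # hypothetical cut-off in nats

uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])   # ln(4): the token was a real choice
confident = token_entropy([0.97, 0.01, 0.01, 0.01])   # near-deterministic token

label_a = quadrant(True, uncertain, THRESHOLD)
label_b = quadrant(False, confident, THRESHOLD)
```

A near-deterministic token has little entropy, so under the bound it has little capacity to carry hindsight credit regardless of the reward sign. This matches the abstract's observation that low-entropy updates saturate quickly.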

2509.21882 2026-05-27 cs.LG cs.AI 版本更新

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

立场:具有可验证奖励的强化学习的隐藏成本与测量缺口

Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

Affiliations: Stanford University; UC Berkeley; The University of Tokyo; RIKEN AIP; Waseda University; Georgia Tech; Northwestern University; UCLA; UNC Chapel Hill; Yale University; University of Waterloo; Independent Researcher; CUHK; UT Southwestern Medical Center; National University of Singapore; UIUC; Amazon AWS AI

AI summary: Argues that reported gains from reinforcement learning with verifiable rewards (RLVR) are often overstated because of confounds such as budget mismatch, attempt inflation, and benchmark data contamination, and proposes a minimum standard: budget-matched saturation curves, calibration tracking, judge-robustness stress tests, and contamination screening.

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)是一种实用、可扩展的方法,用于在数学、代码和其他结构化任务上改进大语言模型。然而,我们认为许多头条RLVR收益尚未得到充分验证,因为报告常常将策略改进与三个混淆因素混为一谈:(i) RLVR与基线评估之间的预算不匹配,(ii) 尝试膨胀和校准漂移,将弃权转化为自信答案,以及(iii) 基准数据污染。通过预算匹配的复现和部分提示污染探测,我们发现一旦预算、提示和数据集版本匹配,并且将受污染集视为记忆探测而非推理证据,几个被广泛引用的差距会大幅缩小或消失。这并不意味着RLVR无效,而是表明当前的测量常常夸大能力收益并掩盖可靠性成本。因此,我们为RLVR训练和评估提出了一个紧凑的、考虑成本的最低标准:带有方差、校准和弃权跟踪的预算匹配饱和曲线,当使用LLM评判者时的评判者鲁棒性压力测试,以及明确的污染筛查。有了这些控制,RLVR在可验证领域仍然有效且可部署,但如果没有这些控制,推理收益应被视为暂定的。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluations, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) benchmark data contamination. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, a judge-robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.
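
One way to see why budget mismatch matters is to compare models at the same sampling budget k using the standard unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below is an illustration only: the sample counts are made up, and the abstract does not state which estimator or budgets the paper's reproductions use.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up counts for a single problem: 16 samples per model.
baseline_correct, rlvr_correct, n = 4, 8, 16

matched = (pass_at_k(n, baseline_correct, 1), pass_at_k(n, rlvr_correct, 1))
mismatched = (pass_at_k(n, baseline_correct, 8), pass_at_k(n, rlvr_correct, 1))
```

At a matched budget (k=1 for both) the hypothetical RLVR model leads 0.50 to 0.25, but giving only the baseline eight attempts reverses the ordering. The same effect in the other direction can inflate a reported RLVR gain, which is why saturation curves over k are more informative than a single point.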

2604.09157 2026-05-27 physics.med-ph cs.LG 版本更新

A Fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation

一种用于混合蒙特卡罗放疗计算的快速通用能量转移变换器

Chi-Hieu Pham, Didier Benoit, Vincent Bourbonne, Ulrike Schick, Dimitris Visvikis, Julien Bert

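Affiliations: LaTIM, INSERM-UMR1101, University of Brest

AI summary: Proposes Energy-Shifting, a deep learning framework that accelerates Monte Carlo dose calculation by synthesizing polyenergetic dose distributions from simple monoenergetic inputs, and designs the TransUNetSE3D architecture, which combines Transformer blocks with Residual Squeeze-and-Excitation modules and reaches a Gamma Passing Rate above 98% while keeping real-time speed.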

Comments 13 pages, 6 figures, 6 tables

详情
AI中文摘要

我们引入了一种名为能量转移的新型学习框架,用于加速蒙特卡罗(MC)剂量计算。该方法利用深度学习,在相同射束配置下,直接从简单的单能输入合成高度复杂的多能剂量分布。与依赖于噪声低计数剂量图(会损害射束轮廓完整性)的传统去噪技术不同,我们的方法通过将高保真解剖纹理和源特定射束相似性集成到模型的输入空间中,在未见数据集上实现了优越的跨域泛化。此外,我们提出了一种名为TransUNetSE3D的新型3D架构,其中包含用于全局上下文的Transformer模块和用于自适应通道特征重新校准的残差挤压激励(SE)模块。这些模块的层次表示与主要剂量图参数一起融合到网络的潜在空间中,从而实现物理感知重建。这种混合设计在空间精度和结构保持方面均优于现有的基于UNet和Transformer的基准,同时保持了实时使用所需的执行速度。我们提出的流程在治疗计划系统(TPS)框架内,使用6MV TrueBeam直线加速器(LINAC)进行前列腺放疗评估,与MC参考相比,伽马通过率超过98%(3%/3mm)。这些结果为自适应放疗中的快速体积剂量测定提供了稳健的解决方案。

英文摘要

We introduce a novel learning framework for accelerated Monte Carlo (MC) dose calculation termed Energy-Shifting. This approach leverages deep learning to synthesize highly complex polyenergetic dose distributions directly from simple monoenergetic inputs under identical beam configurations. Unlike conventional denoising techniques, which rely on noisy low-count dose maps that compromise beam profile integrity, our method achieves superior cross-domain generalization on unseen datasets by integrating high-fidelity anatomical textures and source-specific beam similarity into the model's input space. Furthermore, we propose a novel 3D architecture termed TransUNetSE3D, featuring Transformer blocks for global context and Residual Squeeze-and-Excitation (SE) modules for adaptive channel-wise feature recalibration. Hierarchical representations of these blocks are fused into the network's latent space alongside the primary dose-map parameters, allowing physics-aware reconstruction. This hybrid design outperforms existing UNet and Transformer-based benchmarks in both spatial precision and structural preservation, while maintaining the execution speed necessary for real-time use. Our proposed pipeline achieves a Gamma Passing Rate exceeding 98% (3%/3mm) compared to the MC reference, evaluated within the framework of a treatment planning system (TPS) using a 6MV TrueBeam linear accelerator (LINAC) for prostate radiotherapy. These results offer a robust solution for fast volumetric dosimetry in adaptive radiotherapy.
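
The "3%/3mm" figure refers to the gamma index, which combines a dose-difference criterion with a distance-to-agreement criterion; a point passes when its gamma value is at most 1. Below is a simplified 1D sketch with global normalisation. It is not the paper's evaluation code: the 10% low-dose cutoff, the absence of interpolation between grid points, and the function name are assumptions, and clinical tools operate on 3D dose grids.

```python
import math

def gamma_pass_rate(ref, evl, spacing_mm, dose_tol=0.03, dta_mm=3.0, cutoff=0.10):
    """Percentage of reference points with gamma <= 1 (1D, global normalisation)."""
    d_max = max(ref)
    if d_max <= 0:
        raise ValueError("reference dose must contain a positive value")
    dose_crit = dose_tol * d_max              # absolute dose criterion (3% of max)
    passed = total = 0
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff * d_max:            # skip low-dose points (assumed cutoff)
            continue
        total += 1
        # Search every evaluated point for the smallest combined distance/dose deviation.
        gamma_sq = min(
            ((j - i) * spacing_mm / dta_mm) ** 2 + ((d_ev - d_ref) / dose_crit) ** 2
            for j, d_ev in enumerate(evl)
        )
        if math.sqrt(gamma_sq) <= 1.0:
            passed += 1
    return 100.0 * passed / total

reference = [0.0, 1.0, 2.0, 1.0, 0.0]         # toy dose profile on a 1 mm grid
perfect = gamma_pass_rate(reference, reference, spacing_mm=1.0)
perturbed = gamma_pass_rate(reference, [0.0, 1.0, 2.2, 1.0, 0.0], spacing_mm=1.0)
```

In the toy profile a 10% overshoot at the peak fails that single point, so two of the three points above the cutoff pass.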

2604.08999 2026-05-27 cs.CL cs.AI cs.LG 版本更新

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA: 面向复杂表格问答的自适应语义树推理架构

Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang

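Affiliations: Zhejiang University

AI summary: Proposes ASTRA, which uses AdaSTR to reconstruct tables into Logical Semantic Trees and DuTR, a dual-mode reasoning framework that combines tree-search-based textual navigation with symbolic code execution, and reports state-of-the-art performance on complex table question answering.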

Comments ACL 2026 Main

详情
AI中文摘要

表格序列化仍然是大型语言模型(LLMs)在复杂表格问答中的关键瓶颈,受到结构忽视、表示差距和推理不透明等挑战的阻碍。现有的序列化方法无法捕获显式层次结构且缺乏模式灵活性,而当前的基于树的方法则存在语义适应性有限的问题。为了解决这些限制,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块:AdaSTR和DuTR。首先,我们引入AdaSTR,它利用LLMs的全局语义意识将表格重构为逻辑语义树。这种序列化显式建模了层次依赖关系,并采用自适应机制根据表格规模优化构建策略。其次,基于此结构,我们提出了DuTR,一种双模式推理框架,集成了基于树搜索的文本导航以实现语言对齐,以及符号代码执行以实现精确验证。在复杂表格基准上的实验表明,我们的方法达到了最先进的性能。

英文摘要

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

2604.08819 2026-05-27 cs.CV cs.AI cs.LG cs.MM 版本更新

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen: 用于可解释内容审核的敏感场景图

Fatih Cagatay Akyon, Alptekin Temizel

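Affiliations: Graduate School of Informatics, METU; Ultralytics, Inc.

AI summary: Introduces the SenBen benchmark and a compact student model that uses multi-task training and vocabulary-balancing strategies to provide spatial grounding and interpretability for sensitive content, outperforming most evaluated VLMs on scene graph generation.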

Comments Accepted at CVPRW 2026

详情
AI中文摘要

内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法解释检测到了什么敏感行为、涉及谁或发生在哪里。我们引入了敏感基准(SenBen),这是第一个用于敏感内容的大规模场景图基准,包含来自157部电影的13,999帧,标注了Visual Genome风格的场景图(25个对象类别、28个属性,包括情感状态如疼痛、恐惧、攻击性和痛苦,14个谓词)以及跨5个类别的16个敏感标签。我们通过多任务配方将前沿VLM蒸馏成一个紧凑的241M学生模型,该配方通过基于后缀的对象身份、词汇感知召回(VAR)损失和解耦的Query2Label标签头(带非对称损失)解决自回归场景图生成中的词汇不平衡问题,在SenBen召回率上比标准交叉熵训练提高了+6.4个百分点。在基于场景图的指标上,我们的学生模型优于除Gemini模型外的所有评估VLM和所有商业安全API,同时在所有模型中实现了最高的对象检测和字幕生成分数,推理速度提升7.6倍,GPU内存减少16倍。

英文摘要

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.

2603.11394 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

别听我的!多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

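Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary: Introduces the "stick-or-switch" (SoS) framework, which evaluates LLM reliability in multi-turn conversations by partitioning a question-answer space into sequential presentations. It finds a conversation tax that reduces accuracy and abstention against incorrect suggestions by an average of up to 30%, and it observes blind switching.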

详情
AI中文摘要

大型语言模型(LLM)在静态基准测试中表现出色,但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要,因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里,我们引入了“坚持或切换”(SoS)框架,该框架将问答空间划分为多个顺序呈现,以模拟两种以安全为中心的行为:坚持(即坚持正确的答案选择或拒绝错误的建议)和灵活性(即在引入正确建议时切换到该建议)。在三个临床基准测试中评估了17个LLM,我们观察到普遍存在的对话税,其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%,在某些模型中达到65%。我们还观察到盲目切换,即模型从初始拒绝转向错误和正确建议的比率几乎相同,达到50%。最后,我们表明,增加模型规模可以缓解其中一些对话效率低下的问题,但会加剧其他问题,例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明,静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together, our findings demonstrate that the general proficiency captured by static benchmarks does not translate to multi-turn dialogues.

2512.21602 2026-05-27 cs.LG cs.CV 版本更新

An Empirical Study of Machine Learning Robustness and Scalability for Imbalanced Tabular Clinical Data in Emergency and Critical Care

机器学习在急诊和重症监护中不平衡表格临床数据的鲁棒性与可扩展性实证研究

Yusuf Brima, Marcellin Atemkeng

Affiliations: Computer Vision Group, Institute of Cognitive Science, Osnabrück University; Department of Mathematics, Rhodes University; National Institute for Theoretical and Computational Sciences (NITheCS)

AI summary: Evaluates six model families on imbalanced clinical tabular data from the MIMIC-IV-ED and eICU databases, finding that tree-based models scale best while tabular foundation models offer a new trade-off between performance and efficiency.

详情
AI中文摘要

每年,数百万患者通过急诊科和重症监护室,临床医生必须在时间压力和不确定性下做出高风险决策。机器学习可以支持恶化预测、分诊和罕见关键结局的预测,但临床数据通常严重不平衡,使模型偏向多数类并降低预测性能。因此,为不平衡的临床表格数据开发鲁棒且高效的模型仍然是一个重要挑战。 我们在MIMIC-IV-ED和eICU数据库的不平衡表格数据上评估了六类模型:决策树、随机森林、XGBoost、TabNet、TabICL和TabPFN v2.6。可训练模型通过贝叶斯超参数调优进行优化,而基础模型在其预训练推理模式下进行评估,无需任务特定的重新加权。模型使用Macro F1分数、对递增不平衡的鲁棒性以及跨七个临床预测任务的计算可扩展性进行评估。 结果在不同数据集上有所不同。在MIMIC-IV-ED上,TabPFN v2.6和TabICL获得了最强的平均Macro F1排名,XGBoost保持竞争力。在eICU上,XGBoost始终表现最佳,其次是其他基于树的方法,而基础模型达到中等性能。在两个数据集中,TabNet在递增不平衡下显示出最大的性能下降和最高的计算成本。训练时间分析表明,基于树的方法随数据集大小扩展最有利,而基础模型提供了较低的每任务适应成本。 这些发现表明,没有单一模型族在所有临床环境中占主导地位。然而,表格基础模型正在缩小与强经典基线的性能差距,同时提供独特的效率-性能权衡,这可能有利于资源受限的临床环境。

英文摘要

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.
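
The choice of Macro F1 matters on imbalanced data because it averages per-class F1 scores without weighting by class frequency. The minimal sketch below uses made-up labels (not data from MIMIC-IV-ED or eICU) to show how a classifier that always predicts the majority class can look strong on accuracy while scoring poorly on Macro F1; the function name is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (F1 is 0 for a class never predicted or present)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up example: 9 negative cases, 1 rare critical outcome.
y_true = [0] * 9 + [1]
y_pred = [0] * 10                      # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)       # (18/19 + 0) / 2
```

Here accuracy is 0.9 while Macro F1 is about 0.47, because the rare class contributes an F1 of zero to the average.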

2604.07190 2026-05-27 cs.CY cs.AI cs.LG 版本更新

The ATOM Report: Measuring the Open Language Model Ecosystem

ATOM报告:衡量开放语言模型生态系统

Nathan Lambert, Florian Brand

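Affiliations: Interconnects AI

AI summary: Analyzes downloads, derivative models, inference market share, and performance metrics for about 1,500 mainline open language models (such as Alibaba's Qwen, DeepSeek, and Meta's Llama), documenting that Chinese models overtook U.S. models in the summer of 2025 and have continued to widen the gap.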

Comments 23 pages, 17 figures

详情
AI中文摘要

我们呈现了领先开放语言模型及其构建者的全面采用快照,重点关注来自阿里巴巴的Qwen、DeepSeek、Meta的Llama等约1500个主流开放模型,这些模型构成了对研究人员、企业家和政策顾问至关重要的生态系统基础。我们记录了一个明显趋势:中国模型在2025年夏季超越其美国对应模型,随后扩大了与西方模型的差距。我们研究了Hugging Face下载量和模型衍生品、推理市场份额、性能指标等多种因素,以全面描绘该生态系统。

英文摘要

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba's Qwen, DeepSeek, and Meta's Llama, which are the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend where Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap over their western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics, and more to build a comprehensive picture of the ecosystem.

2604.00993 2026-05-27 astro-ph.IM astro-ph.EP cs.LG cs.RO 版本更新

Focal plane wavefront control with model-based reinforcement learning

基于模型的强化学习进行焦平面波前控制

Jalo Nousiainen, Iremsu Taskin, Markus Kasper, Gilles Orban De Xivry, Olivier Absil

发表机构 * European Southern Observatory (ESO)(欧洲南天文学观测站) STAR Institute, Université de Liège(利根大学STAR研究所)

AI总结 提出基于模型的强化学习算法PO4NCPA,通过顺序相位分集自动校正动态和静态非共路像差,实现高对比度成像中的焦平面波前控制。

Comments 13 pages, 11 figures accepted by A&A

详情
Journal ref
A&A 709, A267 (2026)
AI中文摘要

直接成像潜在宜居系外行星是极大望远镜高对比度成像仪器的主要科学目标之一。大多数此类系外行星轨道靠近其主星,其观测受到快速移动的大气散斑和准静态非共路像差(NCPA)的限制。传统的NCPA校正方法通常使用机械镜面探针,这会在操作期间影响性能。本文提出了基于机器学习的NCPA控制方法,通过利用顺序相位分集自动检测和校正动态及静态NCPA误差。我们将先前用于自适应光学的强化学习工作扩展到焦平面控制。一种新的基于模型的RL算法——NCPA策略优化(PO4NCPA),将焦平面图像解释为输入数据,并通过顺序相位分集确定相位校正,从而在没有先验系统知识的情况下优化非日冕和日冕后点扩散函数。此外,我们通过在受水汽诱导视宁度(动态NCPA)影响的地基望远镜和红外成像仪上数值模拟静态NCPA误差,证明了该方法的有效性。模拟表明,PO4NCPA能够稳健地补偿静态和动态NCPA。在静态情况下,它实现了使用日冕仪时近最优的焦平面光抑制,以及无日冕仪时近最优的斯特列尔比。在动态NCPA情况下,它在这些指标上与结合1步延迟积分器的模态最小二乘重构性能相当。该方法对ELT光瞳、矢量涡旋日冕仪以及在光子和背景噪声下仍然有效。PO4NCPA是无模型的,可直接应用于标准成像以及任何日冕仪。其亚毫秒级的推理时间和性能也使其适用于高对比度成像之外的大气湍流实时低阶校正。

英文摘要

The direct imaging of potentially habitable exoplanets is a prime science case for high-contrast imaging instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for AO to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamic NCPAs, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.

2604.04948 2026-05-27 cs.IR cs.AI cs.LG 版本更新

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

从PDF到RAG就绪:评估面向特定领域问答的文档转换框架

José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

发表机构 * Faculty of Engineering, University of Porto(葡萄牙波尔图大学工程学院) Department of Business Administration, University of Maia(马亚大学商业管理系) LIACC—Artificial Intelligence and Computer Science Laboratory, University of Porto(葡萄牙波尔图大学人工智能与计算机科学实验室) Department of Communication Sciences and Information Technologies, University of Maia(马亚大学通讯科学与信息科技系) School of Health, Polytechnic of Porto(波尔图理工学院健康学院) School of Technology and Management, Polytechnic Institute of Maia(马亚理工学院技术与管理学院)

AI总结 通过系统比较四种开源PDF转Markdown框架的21种流水线配置,发现文档预处理质量(尤其是层次化分块和元数据增强)对RAG系统问答准确率的影响远超转换工具本身,最佳配置(Docling+层次化分块+图像描述)达到94.1%准确率,超越人工整理。

Comments 27 pages, 3 figures, 7 tables

详情
Journal ref
Applied Sciences 16 (2026) 5069
AI中文摘要

检索增强生成(RAG)系统严重依赖文档预处理的质量,然而尚无先前研究通过评估PDF处理框架对下游问答准确性的影响来填补这一空白。我们通过系统比较四种开源PDF到Markdown转换框架——Docling、MinerU、Marker和DeepSeek OCR——在21种流水线配置下的表现,这些配置在转换工具、清洗变换、分块策略和元数据增强方面有所变化。评估使用了一个包含36份葡萄牙语行政文档(1706页,约49.2万词)的语料库上的50个问题基准,每个配置通过LLM作为裁判进行超过50次独立运行的评分。通过Wilcoxon符号秩检验和Cohen's d效应量评估统计显著性。两个基线界定了结果范围:朴素的PDFLoader(86.2%)和人工整理的Markdown(91.3%)。采用层次化分块和图像描述的Docling实现了最高的自动准确率(94.1±1.6%),甚至超越了人工整理。按问题类型分析显示,依赖表格的问题导致了最大的准确率差异,在基本分块和层次化分块之间存在33个百分点的差距。元数据增强和层次感知分块对准确率的贡献超过了转换框架本身。探索性的GraphRAG实现表现不如基本RAG(82%对比94.1%)。这些发现表明,数据准备质量是RAG系统性能的主导因素。

英文摘要

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 +/- 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
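The abstract credits hierarchy-aware chunking and metadata enrichment for most of the accuracy gain. The sketch below illustrates one way such a splitter could work: it cuts Markdown at headings and attaches the heading path to every chunk. The function name `hierarchical_chunks` and the `path`/`text` fields are assumptions made for illustration, not the pipeline evaluated in the paper.

```python
import re

def hierarchical_chunks(markdown_text):
    """Split Markdown at headings and tag each chunk with its heading path."""
    path = []      # current heading stack, e.g. ["Regulation", "Fees"]
    body = []      # lines collected under the current heading
    chunks = []

    def flush():
        # Emit the text gathered so far under the heading path that owns it.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push the new one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path next to the chunk text lets a retriever match a table row against the section it belongs to, which is the kind of context a flat splitter discards.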

2602.02192 2026-05-27 cs.LG cs.DC 版本更新

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

ECHO-2: 一种面向经济高效强化学习的大规模分布式推演框架

Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi

发表机构 * The University of Hong Kong(香港大学) Fudan University(复旦大学) Gradient University of Edinburgh(爱丁堡大学) Soochow University(苏州大学) Technical University of Darmstadt(达姆施塔特技术大学) University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 提出ECHO-2分布式强化学习框架,通过重叠推演生成、传播与训练,结合对等辅助流水线广播和成本感知异构工作节点激活,在保持奖励性能的同时显著提升成本效率。

Comments 24 pages, 7 figures

详情
AI中文摘要

强化学习(RL)是大语言模型(LLM)后训练的关键阶段,涉及推演生成、奖励评估和集中学习之间的反复交互。分布式推演执行提供了利用更具成本效益的推理资源的机会,但引入了广域协调和策略传播方面的挑战。我们提出了ECHO-2,一个用于后训练的分布式RL框架,使用远程推理工作节点且传播延迟不可忽略。ECHO-2将集中学习与分布式推演相结合,将有界策略过时性视为用户可控参数,使得推演生成、传播和训练能够重叠。我们引入了一个基于重叠的容量模型,关联训练时间、传播延迟和推演吞吐量,得出了一个维持学习器利用率的实用配置规则。为了缓解传播瓶颈并降低成本,ECHO-2采用了对等辅助流水线广播和成本感知的异构工作节点激活。在真实广域网带宽条件下,对4B到32B参数规模的LLM进行GRPO后训练的实验表明,ECHO-2在保持与强基线相当的RL奖励的同时,显著提高了成本效率。

英文摘要

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of LLMs ranging from 4B to 32B parameters under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

2512.01678 2026-05-27 cs.LG cs.DC cs.PL 版本更新

Morphling: Fast, Fused, and Flexible GNN Training at Scale

Morphling: 快速、融合且灵活的图神经网络规模化训练

Anubhab, Rupesh Nasre

发表机构 * IIT Madras(印度理工学院马德拉斯学院)

AI总结 提出Morphling领域特定代码合成器,通过架构感知的原语和运行时稀疏感知执行引擎,在CPU、GPU和分布式环境下显著提升GNN训练吞吐量并降低内存消耗。

详情
AI中文摘要

图神经网络(GNN)通过融合不规则、内存受限的图遍历与规则、计算密集型密集矩阵运算,带来了根本性的硬件挑战。虽然PyTorch Geometric(PyG)和Deep Graph Library(DGL)等框架优先考虑高级可用性,但它们未能解决这些不同的执行特性。因此,它们依赖通用内核,导致缓存局部性差、内存移动过多以及大量中间分配。为了解决这些限制,我们提出了Morphling,一个旨在弥合这一差距的领域特定代码合成器。Morphling将高级GNN规范编译为可移植的、后端特化的实现,针对OpenMP、CUDA和MPI。它通过实例化一个针对每个执行环境定制的优化、架构感知原语库来实现这一点。Morphling还包含一个运行时稀疏感知执行引擎,该引擎使用输入特征统计动态选择密集或稀疏执行路径,减少对零值条目的不必要计算。我们在涵盖不同图结构、特征维度和稀疏程度的11个真实世界数据集上评估了Morphling。与PyG和DGL相比,Morphling在CPU上平均提高每轮训练吞吐量20倍,在GPU上提高19倍,在分布式设置中提高6倍,峰值加速达到66倍。Morphling的内存高效布局进一步将峰值内存消耗降低多达15倍,使得在商用硬件上进行大规模GNN训练成为可能。这些发现表明,专门的、架构感知的代码合成为跨不同并行和分布式平台的高性能GNN执行提供了一条有效且可扩展的路径。

英文摘要

Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
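The runtime dispatch described above can be pictured with a small sketch: measure the fraction of zero entries in the input features, then take a sparse or dense path for the feature transform. The 0.5 threshold, the function names, and the pure-Python kernels are assumptions for illustration only; Morphling's actual selection rule and generated kernels are not given in the abstract.

```python
def zero_fraction(features):
    """Share of entries that are exactly zero (0.0 for an empty matrix)."""
    total = sum(len(row) for row in features)
    zeros = sum(1 for row in features for value in row if value == 0)
    return zeros / total if total else 0.0

def to_sparse(features):
    """Keep only nonzero entries as (column, value) pairs per row."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in features]

def transform(features, weights, sparse_threshold=0.5):
    """Compute features @ weights, picking a dense or sparse path at run time."""
    out_dim = len(weights[0])
    if zero_fraction(features) >= sparse_threshold:
        rows = to_sparse(features)          # skip zero-valued entries
        path = "sparse"
    else:
        rows = [list(enumerate(row)) for row in features]
        path = "dense"
    result = [
        [sum(v * weights[j][k] for j, v in row) for k in range(out_dim)]
        for row in rows
    ]
    return path, result
```

Both paths return the same product; only the amount of work spent on zero entries differs.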

2603.23994 2026-05-27 cs.LG cs.AI 版本更新

Understanding the Challenges in Iterative Generative Optimization with LLMs

理解大语言模型迭代生成优化中的挑战

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

发表机构 * Google DeepMind(谷歌DeepMind) CNRS(国家科学研究中心) Stanford University(斯坦福大学) Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软) AWS(亚马逊AWS) Netflix Research(Netflix研究) Microsoft Research(微软研究院)

AI总结 本文通过案例研究,揭示了在基于大语言模型的迭代生成优化中,起始工件、信用分配和批处理等隐藏设计选择对优化成败的决定性影响,并指出缺乏跨领域的通用学习循环设置方法是生产化和采用的主要障碍。

Comments 39 pages, 17 figures

详情
AI中文摘要

生成优化利用大型语言模型(LLMs)通过执行反馈迭代改进工件(如代码、工作流或提示)。这是一种构建自我改进代理的有前途的方法,但在实践中仍然脆弱:尽管有活跃的研究,只有9%的调查代理使用了任何自动优化。我们认为这种脆弱性是因为,为了建立学习循环,工程师必须做出“隐藏”的设计选择:优化器可以编辑什么,以及在每次更新时提供什么“正确”的学习证据?我们调查了影响大多数应用的三个因素:起始工件、执行轨迹的信用跨度,以及将试错批处理为学习证据。通过在MLAgentBench、Atari和BigBench Extra Hard中的案例研究,我们发现这些设计决策可以决定生成优化是否成功,然而它们在先前的工作中很少被明确说明。不同的起始工件决定了在MLAgentBench中哪些解决方案是可达到的,截断的轨迹仍然可以改进Atari代理,而更大的小批量并不会单调地改善BBEH上的泛化。我们得出结论,缺乏一种简单、通用的跨领域设置学习循环的方法是生产化和采用的主要障碍。我们为做出这些选择提供了实用指导。

英文摘要

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: What can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

2603.17685 2026-05-27 cs.LG 版本更新

Flow Matching Policy Optimization with Mirror Descent and Entropy Constraints

基于镜像下降和熵约束的流匹配策略优化

Ting Gao, Stavros Orfanoudakis, Nan Lin, Winnie Daamen, Serge Hoogendoorn, Elvin Isufi

AI总结 针对在线强化学习中策略表达性与探索-利用平衡的挑战,提出基于ODE流匹配的框架FMER,通过免模拟策略优化和可计算熵目标,结合动态温度调节,在稀疏奖励任务中取得优越性能。

详情
AI中文摘要

平衡策略表达性与探索-利用权衡是在线强化学习(RL)中的核心挑战。虽然基于随机微分方程(SDE)的扩散策略可以表示复杂的多模态动作分布,但它们存在两个关键限制:其随机逆过程使熵难以处理(需要启发式探索),并且通过长去噪链计算策略梯度既昂贵又不稳定。在这项工作中,我们表明基于ODE的流匹配通过实现免模拟策略优化和可处理的熵计算,从本质上解决了这些问题。基于此,我们引入了基于镜像下降和熵约束的流匹配策略优化(FMER)。我们的框架以三种方式利用这一见解。首先,我们从理论上证明,最小化优势加权条件流匹配损失可以作为策略镜像下降的免模拟替代。这引导速度场朝向高价值区域,同时完全避免通过ODE求解器进行反向传播。其次,我们推导了一个解析熵目标,该目标校正了由$\tanh$变换(将无界潜在空间映射到有界动作)引起的密度失真,从而促进了有原则的最大熵优化。最后,我们基于有效样本量动态调整镜像下降温度,以在训练期间强制执行稳健的信任区域。实验评估表明,FMER在具有挑战性的稀疏奖励FrankaKitchen环境中实现了优越的性能,同时在标准密集奖励MuJoCo基准测试中保持了有竞争力的结果。

英文摘要

Balancing policy expressiveness with the exploration-exploitation trade-off is a core challenge in online Reinforcement Learning (RL). While Stochastic Differential Equation (SDE)-based diffusion policies can represent complex, multimodal action distributions, they suffer from two critical limitations: their stochastic reverse processes render entropy intractable (necessitating heuristic exploration), and computing policy gradients through long denoising chains is expensive and unstable. In this work, we show that ODE-based flow matching inherently resolves these issues by enabling both simulation-free policy optimization and tractable entropy computation. Building on this, we introduce Flow Matching Policy Optimization with Mirror Descent and Entropy Constraints (FMER). Our framework exploits this insight in three ways. First, we theoretically establish that minimizing an advantage-weighted conditional flow matching loss acts as a simulation-free surrogate for policy mirror descent. This steers the velocity field toward high-value regions while entirely avoiding backpropagation through the ODE solver. Second, we derive an analytic entropy objective that corrects for the density distortion caused by the $\tanh$ transformation (mapping an unbounded latent space to bounded actions), thereby facilitating principled maximum-entropy optimization. Finally, we dynamically tune the mirror descent temperature based on the effective sample size to enforce a robust trust region during training. Empirical evaluations demonstrate that FMER achieves superior performance on the challenging sparse-reward FrankaKitchen environment, while maintaining competitive results across standard dense-reward MuJoCo benchmarks.
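Two quantities named in the abstract have standard closed forms that a short sketch can make concrete: the change-of-variables correction for a $\tanh$-squashed action, and the effective sample size of a set of importance weights. The diagonal Gaussian latent density below is an assumption chosen for illustration (in FMER the latent density comes from the flow), and the function names are hypothetical.

```python
import math

def squashed_log_prob(u, mean, std):
    """Log-density of a = tanh(u), where u ~ Normal(mean, std) per dimension."""
    log_p = 0.0
    for u_i, m_i, s_i in zip(u, mean, std):
        # Diagonal Gaussian log-density of the latent sample (illustrative choice).
        log_p += (-0.5 * ((u_i - m_i) / s_i) ** 2
                  - math.log(s_i) - 0.5 * math.log(2 * math.pi))
        # Change of variables: subtract log|da/du| = log(1 - tanh(u)^2).
        log_p -= math.log(1.0 - math.tanh(u_i) ** 2)
    return log_p

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)
```

The direct `log(1 - tanh(u)^2)` form is fine for moderate `u`; for large `|u|` it loses precision, so practical implementations usually switch to a softplus-based expression. An effective sample size close to the batch size means the weights are nearly uniform, while a value near 1 means a single sample dominates, which is the signal a temperature schedule could react to.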

2603.11790 2026-05-27 cs.LG 版本更新

Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

通过无监督对称群发现实现解缠表示学习

Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

发表机构 * Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique, ISIR(索邦大学,国家科学研究中心,智能系统与机器人研究所,ISIR)

AI总结 提出一种具身智能体通过与环境的无监督交互自主发现动作空间的群结构的方法,证明了在最小假设下真实对称群分解的可识别性,并推导出两种算法以学习线性对称基解缠表示。

详情
AI中文摘要

基于对称性的解缠表示学习利用环境变换的群结构来揭示潜在的变化因素。先前的基于对称性的解缠方法需要对称群结构的强先验知识,或对子群性质做出限制性假设。在这项工作中,我们通过提出一种方法消除了这些约束,该方法使具身智能体通过与环境的无监督交互自主发现其动作空间的群结构。我们证明了在最小假设下真实对称群分解的可识别性,并推导出两种算法:一种用于从交互数据中发现群分解,另一种用于在不假设特定子群性质的情况下学习线性对称基解缠(LSBD)表示。我们的方法在三个表现出不同群分解的环境中得到了验证,其性能优于现有的LSBD方法。

英文摘要

Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

2601.10566 2026-05-27 cs.CL cs.LG 版本更新

Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

基于激活签名的表示感知遗忘:从抑制到实体签名擦除

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque

发表机构 * Computer Science and Engineering, BRAC University, Dhaka, Bangladesh(布拉克大学计算机科学与工程系,达卡,孟加拉国) Boston University, Boston, MA, USA(波士顿大学,波士顿,马萨诸塞州,美国)

AI总结 提出ERUF框架,通过挖掘实体特异性激活签名并抑制对应方向,实现表示层面的遗忘,同时保持表面抑制、内部衰减和效用保留。

Comments 16 pages, 4 figures

详情
AI中文摘要

实体级遗忘通常通过模型输出评估:是否停止命名目标、拒绝查询或改变真值比分布。然而,这些输出级测试无法显示主体的内部表示是否被衰减。我们引入实体表示遗忘框架(ERUF),这是一个表示感知框架,挖掘主体特定的激活签名,抑制相应的激活方向,并将行为蒸馏到LoRA参数中。在评估的基线中,ERUF是唯一同时实现表面级抑制、内部衰减和效用保留的方法。在TOFU forget10上,ERUF达到FQ=0.99和MU=0.62,匹配报告的神谕效用,同时接近神谕遗忘质量。在大多数标准基础模型设置中,ERUF保持低泄漏和低内部目标激活,SMR在0.00%至1.10%之间,EL10低于0.06,效用漂移低于3%。在Llama-3.1-8B上,对抗性实体恢复从63.89%降至20.15%,而名称无关恢复减少72.7%至77.4%。联合表面/内部诊断进一步揭示了推理优先模型中仅靠表面指标无法发现的尺度依赖行为。我们将这些结果解释为表示层面衰减的操作性证据,而非不可逆删除的正式保证。

英文摘要

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the behavior into LoRA parameters. Among evaluated baselines, ERUF is the only method that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves FQ = 0.99 and MU = 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery decreases by 72.7% to 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.

2603.16654 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic:迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

发表机构 * The University of Tokyo(东京大学) Yale University(耶鲁大学) Stanford University(斯坦福大学) Xiaomi EV(小米EV) Soongsil University(崇实大学)

AI总结 针对大语言模型在多跳问答中中间步骤推理失败难以诊断的问题,提出Omanic基准,通过分解为单跳子问题并分析步骤级错误,揭示后期跳数瓶颈、事实知识下限和错误传播,微调后提升多个推理基准性能。

详情
AI中文摘要

仅从最终答案评估大语言模型(LLM)的推理能力可能会掩盖中间步骤的失败,尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题,我们引入了Omanic,一个开放域4跳问答基准,它不仅用于衡量最终答案的准确性,还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例(OmanicSynth)和967个经专家审核的人工标注评估示例(OmanicBench),每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明,Omanic具有挑战性,而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准,平均提升7.41分,验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据,在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

2603.15500 2026-05-27 cs.AI cs.LG 版本更新

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

不确定性下通过策略信息分配理解LLM中的推理

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang

发表机构 * Microsoft Research(微软研究院) KAIST(韩国科学技术院) Seoul National University(首尔国立大学)

AI总结 本文提出一个信息论框架,将推理分解为程序推进和认知外化(不确定性标记级外化),证明零散外化能在无显式错误触发时恢复收敛,并通过实验表明小规模SFT即可调控该能力,从而将推理重新定义为不确定性下的策略信息分配。

详情
AI中文摘要

LLM 经常表现出“啊哈”时刻,例如在“Wait”等标记后进行自我修正,但其潜在机制仍不清楚。标准 LLM 主要通过无声发散崩溃,即轨迹偏离正确答案但仍保持局部连贯,因此没有显式错误触发反应性自我修正。我们引入一个信息论框架,将推理分解为程序推进和认知外化(不确定性的标记级外化),并证明零散外化能在没有显式错误触发的情况下恢复向正确答案的收敛。实验上,一个最小的怀疑线索即可恢复失败的轨迹,小规模 SFT 足以灌输或抑制这种能力,这表明强推理更少依赖于非凡的内在机制,而更多依赖于外化不确定性的语言习惯。我们的框架将推理重新定义为不确定性下的策略信息分配,为理解和推进 LLM 推理提供了新视角。

英文摘要

LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct answer yet remain locally coherent, so no explicit error triggers reactive self-correction. We introduce an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization, the token-level externalization of uncertainty, and prove that sporadic verbalization restores convergence toward the correct answer even without explicit error triggers. Empirically, a minimal doubt cue recovers failed trajectories, and small-scale SFT suffices to instill or suppress this capability, suggesting that strong reasoning hinges less on an extraordinary inner mechanism than on the linguistic habit of externalizing uncertainty. Our framework recasts reasoning as strategic information allocation under uncertainty, offering a new lens for understanding and advancing LLM reasoning.

2603.13282 2026-05-27 cs.LG cs.AI 版本更新

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

FedTreeLoRA:协调联邦LoRA微调中的统计异质性与功能异质性

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

发表机构 * University of Florida, Gainesville, FL 32611(佛罗里达大学) Middle Tennessee State University Murfreesboro, TN 37132(中田纳西州立大学)

AI总结 针对联邦LoRA微调中统计异质性与功能异质性正交耦合的问题,提出树结构聚合框架FedTreeLoRA,通过逐层对齐实现泛化与个性化的有效平衡。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于低秩自适应(LoRA)的联邦学习(FL)已成为隐私保护的大语言模型微调的标准方法。然而,现有的个性化方法主要在一种限制性的平面模型假设下运行:它们处理客户端的\textit{统计异质性},但将模型视为一个整体块,忽略了跨LLM层的\textit{功能异质性}。我们认为这两个维度——统计(水平)异质性和功能(垂直)异质性——在来源上是正交的,但在交互中是耦合的,这意味着参数共享的最优深度在功能上依赖于客户端的相似性。为了解决这个问题,我们提出了\textbf{FedTreeLoRA},一个采用树结构聚合进行细粒度逐层对齐的框架。通过动态构建聚合层次结构,FedTreeLoRA允许客户端在浅层“树干”上共享广泛共识,同时在深层“树枝”上逐步特化。在NLU和NLG基准上的实验表明,FedTreeLoRA通过有效协调泛化与个性化,显著优于现有最先进方法。

英文摘要

Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing personalized methods predominantly operate under a restrictive Flat-Model Assumption: they address client-side \textit{statistical heterogeneity} but treat the model as a monolithic block, ignoring the \textit{functional heterogeneity} across LLM layers. We argue that these two dimensions, statistical (horizontal) and functional (vertical), are \textit{orthogonal in source yet coupled in interaction}, implying that the optimal depth of parameter sharing is functionally dependent on client similarity. To address this, we propose \textbf{FedTreeLoRA}, a framework employing tree-structured aggregation for fine-grained, layer-wise alignment. By dynamically constructing an aggregation hierarchy, FedTreeLoRA allows clients to share broad consensus on shallow 'trunks' while progressively specializing on deep 'branches'. Experiments on NLU and NLG benchmarks demonstrate that FedTreeLoRA significantly outperforms state-of-the-art methods by effectively reconciling generalization and personalization.

2603.08413 2026-05-27 cs.LG cs.AI 版本更新

Geometrically Constrained Outlier Synthesis

几何约束异常合成

Daniil Karzanov, Marcin Detyniecki

发表机构 * AXA AI Research(AXA人工智能研究) EPFL, Lausanne, Switzerland(瑞士洛桑联邦理工学院) Polish Academy of Science, IBS PAN, Warsaw, Poland(波兰科学院,IBS PAN,华沙,波兰)

AI总结 提出几何约束异常合成(GCOS)方法,通过在隐藏特征空间中生成受几何约束的虚拟异常样本,结合对比正则化,提升图像分类模型对分布外样本的鲁棒性。

Comments 19 pages, accepted to ICML 2026

详情
AI中文摘要

用于图像分类的深度神经网络通常对分布外(OOD)样本表现出过度自信。为了解决这个问题,我们引入了几何约束异常合成(GCOS),这是一种训练时正则化框架,旨在提高推理时的OOD鲁棒性。GCOS通过生成隐藏特征空间中尊重分布内(ID)数据学习到的流形结构的虚拟异常,解决了先前合成方法的局限性。合成分两个阶段进行:(i)从训练特征中提取的主方差子空间识别出几何信息引导的离流形方向;(ii)由校准集中非一致性得分的经验分位数定义的一个类共形壳,自适应地控制合成幅度以产生边界样本。该壳确保生成的异常既不是微不足道可检测的,也不是与分布内数据无法区分的,从而促进更平滑地学习鲁棒特征。这与对比正则化目标相结合,在选定的得分空间(如马氏距离或基于能量的)中促进ID和OOD样本的可分离性。实验表明,在近OOD基准测试(定义为异常与分布内数据共享相同语义域的任务)上,使用标准基于能量的推理时,GCOS优于最先进的方法。作为探索性扩展,该框架自然地过渡到共形OOD推理,将不确定性得分转化为统计上有效的p值,并启用具有形式误差保证的阈值,为更可预测和可靠的OOD检测提供了途径。

英文摘要

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.

2603.07211 2026-05-27 cs.LG 版本更新

CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

CompassDPO: 用于鲁棒安全对齐的动态控制直接偏好优化

Jilong Liu, Yonghui Yang, Pengyang Shao, Wenjian Tao, Hao Zhan, Haokai Ma, Wei Qin, Richang Hong

发表机构 * Hefei University of Technology(合肥工业大学) National University of Singapore(新加坡国立大学)

AI总结 提出CompassDPO,通过隐式DPO奖励边际控制更新方向和幅度,无需外部奖励模型,在PKU-SafeRLHF等基准上提升鲁棒性。

详情
AI中文摘要

直接偏好优化(DPO)已成为安全对齐的标准框架,但其对成对偏好更新的依赖使得训练对不完美监督敏感。现有的鲁棒DPO方法通常通过全局损失校正或外部数据级干预来解决这种敏感性,而很大程度上忽略了不可靠比较如何扭曲批次级优化动态。我们提出CompassDPO,一种无奖励的DPO框架,通过动态控制稳定偏好优化。使用隐式DPO奖励边际作为训练时的指南针,CompassDPO沿着两个互补轴调节样本影响:更新方向和更新幅度。对于方向控制,它应用稀疏、有预算和预热延迟的损失混合,以减弱与新兴偏好方向冲突的更新分量。对于幅度控制,它自适应地软温莎化高损失尾部贡献,减少尾部主导同时保留来自困难样本的有用梯度。两种机制仅使用标准DPO训练期间可用的信号,无需外部奖励模型或额外监督。在PKU-SafeRLHF上跨四个骨干网络和多个分布外安全基准的实验表明,CompassDPO在鲁棒性上持续优于普通DPO和强DPO系列基线,特别是在受控标签翻转噪声下。代码可在https://anonymous.4open.science/r/CompassDPO-4D00获取。

英文摘要

Direct Preference Optimization (DPO) has become a standard framework for safety alignment, but its reliance on pairwise preference updates makes training sensitive to imperfect supervision. Existing robust DPO methods often address this sensitivity through global loss corrections or external data-level interventions, while largely overlooking how unreliable comparisons distort batch-level optimization dynamics. We propose CompassDPO, a reward-free DPO framework that stabilizes preference optimization through dynamics control. Using the implicit DPO reward margin as a training-time compass, CompassDPO regulates sample influence along two complementary axes: update direction and update magnitude. For directional control, it applies sparse, budgeted, and warm-up delayed loss mixing to attenuate update components that conflict with the emerging preference direction. For magnitude control, it adaptively soft-winsorizes high-loss tail contributions, reducing tail dominance while preserving useful gradients from hard examples. Both mechanisms use only signals available during standard DPO training and require no external reward model or additional supervision. Experiments on PKU-SafeRLHF across four backbones and multiple out-of-distribution safety benchmarks show that CompassDPO consistently improves robustness over vanilla DPO and strong DPO-family baselines, especially under controlled label-flip noise. Code is available at https://anonymous.4open.science/r/CompassDPO-4D00
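The "compass" signal mentioned above is the implicit DPO reward margin, which has a standard definition in terms of policy and reference log-probabilities. The sketch below computes that margin and the usual DPO loss, then shows one hypothetical way to damp the high-loss tail. The `soft_cap` function is an assumption made for illustration and is not the soft-winsorization rule from the paper.

```python
import math

def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit reward margin: beta * [(chosen log-ratio) - (rejected log-ratio)]."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(margin):
    """Standard DPO loss -log(sigmoid(margin)), written as a stable softplus."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def soft_cap(loss, cap):
    """Hypothetical soft cap: identity up to `cap`, logarithmic growth beyond it."""
    return loss if loss <= cap else cap + math.log1p(loss - cap)
```

A negative margin means the policy currently ranks the rejected response above the chosen one, which is either a hard example or a flipped label. Because the capped loss keeps growing past the cap, only more slowly, such samples still contribute gradient without dominating the batch.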

2602.13626 2026-05-27 cs.LG 版本更新

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

基准泄露陷阱:我们能信任基于LLM的推荐吗?

Mingqiao Zhang, Qiyao Peng, Yinghui Wang, Hongtao Liu, Yumeng Wang

发表机构 * Nanjing University(南京大学) Tianjin University(天津大学) Beijing Institute of Control and Electronic Technology(北京控制与电子技术研究所)

AI总结 本文识别并研究了基于大语言模型的推荐系统中基准数据泄露问题,通过模拟多种泄露场景揭示了泄露对性能评估的误导性影响。

详情
AI中文摘要

大语言模型(LLMs)在推荐系统中的广泛应用对评估可靠性提出了严峻挑战。本文识别并研究了一个此前被忽视的问题:基于LLM的推荐中的基准数据泄露。当LLMs在预训练或微调过程中暴露于并可能记忆基准数据集时,就会发生这种现象,导致性能指标被人为夸大,无法反映模型真实性能。为验证这一现象,我们通过在战略混合语料库(包括来自域内和域外的用户-物品交互)上对基础模型进行持续预训练,模拟了多种数据泄露场景。我们的实验揭示了数据泄露的双重效应:当泄露数据与领域相关时,会导致显著但虚假的性能提升,误导性地夸大模型能力;相反,与领域无关的泄露通常会降低推荐准确性,突显了这种污染的复杂性和偶然性。我们的发现表明,数据泄露是基于LLM的推荐中一个关键但此前未被考虑的因素,可能影响模型的真实性能。我们在https://github.com/yusba1/LLMRec-Data-Leakage发布了代码。

英文摘要

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual-effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings reveal that data leakage acts as a critical, previously unaccounted-for factor in LLM-based recommendation, which could impact the true model performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.

2603.01800 2026-05-27 cs.LG cs.AI stat.ML stat.OT 版本更新

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Phase-Type变分自编码器用于重尾数据

Abdelhakim Ziani, András Horváth, Paolo Ballarini

发表机构 * Université Paris Saclay, Lab. MICS, CentraleSupélec, Gif-sur-Yvette, France(巴黎萨克雷大学,MICS实验室,CentraleSupélec,法国伊维特河畔吉夫)

AI总结 提出Phase-Type变分自编码器(PH-VAE),通过将解码器分布建模为潜在条件相位型分布(连续时间马尔可夫链的吸收时间),灵活适应重尾行为,在合成和真实基准上优于高斯、Student-t和极值VAE解码器。

详情
AI中文摘要

重尾分布在现实世界数据中无处不在,其中罕见但极端的事件主导了风险和变异性。然而,标准变分自编码器(VAE)采用简单的解码器分布,如高斯分布,无法捕捉重尾行为,而现有的重尾感知扩展仍然局限于预定义的参数族,其尾部行为是预先固定的。我们提出了Phase-Type变分自编码器(PH-VAE),其解码器分布是一个潜在条件的Phase-Type(PH)分布,定义为连续时间马尔可夫链(CTMC)的吸收时间。这种公式组合了多个指数时间尺度,产生了一个灵活且解析可处理的解码器,它直接从观测数据中调整其有限范围的尾部行为。在合成和真实世界基准上的实验表明,PH-VAE能够准确逼近各种重尾分布,在建模观测到的尾部行为和极端分位数方面显著优于基于高斯、Student-t和极值的VAE解码器。在多变量设置中,PH-VAE通过其共享的潜在表示捕捉了现实中的跨维度尾部依赖性。据我们所知,这是首次将Phase-Type分布整合到深度生成建模中的工作,桥接了应用概率论和表示学习。

英文摘要

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions, such as Gaussian distributions, that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution, defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its finite-range tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately approximates diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling observed tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
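A Phase-Type distribution is defined above as the absorption time of a continuous-time Markov chain, which can be simulated directly. The sketch below draws one sample by walking the chain until absorption. The parameter names (`alpha`, `rates`, `jump_probs`) and the fixed parameters are assumptions for illustration; in PH-VAE the decoder would produce such parameters from the latent code.

```python
import random

def sample_phase_type(alpha, rates, jump_probs, rng):
    """Draw one absorption time of a CTMC with transient states 0..n-1.

    alpha[i]         probability of starting in state i
    rates[i]         total exit rate of state i
    jump_probs[i][j] probability of jumping i -> j; the remainder is absorption
    """
    state = rng.choices(range(len(alpha)), weights=alpha)[0]
    elapsed = 0.0
    while state is not None:
        elapsed += rng.expovariate(rates[state])   # exponential holding time
        draw, cumulative, next_state = rng.random(), 0.0, None
        for j, p in enumerate(jump_probs[state]):
            cumulative += p
            if draw < cumulative:
                next_state = j
                break
        state = next_state   # None means the chain was absorbed
    return elapsed
```

For example, `alpha=[1, 0]`, `rates=[2, 2]`, `jump_probs=[[0, 1], [0, 0]]` chains two exponential stages of rate 2, which is an Erlang-2 distribution with mean 1. Mixing states with very different rates is what lets the family cover several time scales at once.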

2603.01327 2026-05-27 cs.SE cs.CL cs.LG 版本更新

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

SWE-Adept:基于LLM的深度代码库分析与结构化问题解决代理框架

Kang He, Kaushik Roy

发表机构 * Electrical and Computer Engineering, Purdue University(电气与计算机工程系,普渡大学)

AI总结 提出SWE-Adept双代理框架,通过代理引导的深度优先搜索进行代码定位,结合自适应规划和结构化问题解决,在SWE-Bench上提升端到端解决率最多4.3%。

详情
AI中文摘要

大型语言模型(LLM)在独立编程任务上表现出色,但在仓库级软件工程(SWE)中仍面临挑战,这需要(1)深度代码库导航与有效上下文管理以实现准确定位,以及(2)系统化的迭代、测试驱动代码修改以解决问题。为应对这些挑战,我们提出SWE-Adept,一个基于LLM的双代理框架,其中定位代理识别与问题相关的代码位置,解决代理实施相应的修复。对于问题定位,我们引入代理引导的深度优先搜索,选择性遍历代码依赖关系。这最小化了代理上下文窗口中的问题无关内容,提高了定位准确性。对于问题解决,我们采用自适应规划和结构化问题求解。我们为代理配备了用于进度跟踪和基于Git的版本控制的专用工具。这些工具与共享工作记忆交互,该记忆存储按执行步骤索引的代码状态检查点,便于精确的检查点检索。这种设计实现了可靠的代理驱动版本控制操作,用于系统化问题解决,包括分支以探索替代方案和回滚失败的编辑。在SWE-Bench Lite和SWE-Bench Pro上的实验表明,SWE-Adept在问题定位和解决方面均持续优于先前方法,端到端解决率提升最多4.3%。

英文摘要

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code modification to resolve issues. To address these challenges, we propose SWE-Adept, an LLM-based two-agent framework where a localization agent identifies issue-relevant code locations and a resolution agent implements the corresponding fixes. For issue localization, we introduce agent-directed depth-first search that selectively traverses code dependencies. This minimizes issue-irrelevant content in the agent's context window and improves localization accuracy. For issue resolution, we employ adaptive planning and structured problem solving. We equip the agent with specialized tools for progress tracking and Git-based version control. These tools interface with a shared working memory that stores code-state checkpoints indexed by execution steps, facilitating precise checkpoint retrieval. This design enables reliable agent-driven version-control operations for systematic issue resolution, including branching to explore alternative solutions and reverting failed edits. Experiments on SWE-Bench Lite and SWE-Bench Pro demonstrate that SWE-Adept consistently outperforms prior approaches in both issue localization and resolution, improving the end-to-end resolve rate by up to 4.3%.
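
The selective traversal described above can be pictured with a minimal sketch: a depth-first search over a dependency graph that only expands nodes a relevance check approves. The graph, the entity names, and the `is_relevant` callback (standing in for the agent's decision) are hypothetical; the abstract does not specify how SWE-Adept implements this step.

```python
def directed_dfs(graph, start, is_relevant):
    """Depth-first traversal that only descends into nodes judged relevant.

    graph       -- dict mapping a code entity to the entities it depends on
    is_relevant -- callable standing in for the agent's choice to expand a node
    """
    visited, order = set(), []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited or not is_relevant(node):
            continue  # pruned nodes never enter the context
        visited.add(node)
        order.append(node)
        # Push dependencies in reverse so the first-listed one is explored first.
        stack.extend(reversed(graph.get(node, [])))
    return order

# Hypothetical dependency graph for an issue about URL parsing.
graph = {
    "handle_request": ["parse_url", "log_event"],
    "parse_url": ["split_query", "decode_percent"],
    "log_event": ["format_timestamp"],
}
relevant = {"handle_request", "parse_url", "split_query", "decode_percent"}
visited_order = directed_dfs(graph, "handle_request", lambda n: n in relevant)
```

Because `log_event` is rejected, its dependency `format_timestamp` is never visited, which mirrors the stated goal of keeping issue-irrelevant content out of the context window.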

2602.22190 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra:训练原生GUI代理进行推理与行动——基于动作感知监督和部分可验证强化学习

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出GUI-Libra训练方案,通过动作感知SFT和部分可验证RL中的KL正则化,解决GUI代理在长程导航任务中推理与定位冲突及部分可验证性问题,显著提升步骤准确率和任务完成率。

Comments 57 pages, 17 figures

详情
AI中文摘要

开源原生GUI代理在长程导航任务上仍落后于闭源系统。这一差距源于两个限制:缺乏高质量、动作对齐的推理数据,以及直接采用忽视GUI代理独特挑战的通用后训练流程。我们识别出这些流程中的两个基本问题:(i) 带有CoT推理的标准SFT常损害定位能力,(ii) 逐步RLVR式训练面临部分可验证性,即多个动作可能正确但仅有一个示范动作用于验证。这使得离线逐步指标成为在线任务成功的弱预测器。在本工作中,我们提出GUI-Libra,一种定制化训练方案以应对这些挑战。首先,为缓解动作对齐推理数据的稀缺性,我们引入数据构建和过滤流程,并发布精心整理的81K GUI推理数据集。其次,为调和推理与定位,我们提出动作感知SFT,混合推理后动作和直接动作数据,并重新加权token以强调动作和定位。第三,为在部分可验证性下稳定RL,我们识别出RLVR中KL正则化被忽视的重要性,并证明KL信任域对改善离线到在线可预测性至关重要;我们进一步引入成功自适应缩放以降低不可靠负梯度的权重。在多种Web和移动基准测试中,GUI-Libra一致地提升了步骤准确率和端到端任务完成率。我们的结果表明,精心设计的后训练和数据整理可以在无需昂贵在线数据收集的情况下,释放显著更强的任务解决能力。我们发布数据集、代码和模型,以促进对具备推理能力的GUI代理的数据高效后训练的进一步研究。

英文摘要

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

2602.20475 2026-05-27 hep-ex cs.LG 版本更新

PhyGHT: Physics-Guided HyperGraph Transformer for Signal Purification at the HL-LHC

PhyGHT:面向HL-LHC信号净化的物理引导超图Transformer

Mohammed Rakib, Luke Vaughan, Shivang Patel, Flera Rizatdinova, Alexander Khanov, Atriya Sen

发表机构 * Department of Computer Science(计算机科学系) Department of Physics(物理系)

AI总结 提出PhyGHT混合架构,结合距离感知局部图注意力和全局自注意力,并引入可解释的、受物理约束的堆积抑制门(PSG),以在极端堆积碰撞噪声下准确重建顶夸克对信号的能量和质量修正因子。

Comments Accepted by KDD 2026

详情
AI中文摘要

欧洲核子研究中心的高亮度大型强子对撞机(HL-LHC)将产生前所未有的数据集,能够揭示宇宙的基本属性。然而,实现其发现潜力面临重大挑战:从主要由约200次同时堆积碰撞构成的压倒性背景中提取微小的信号成分。这种极端噪声严重扭曲了精确重建所需的物理可观测量。为此,我们引入了物理引导超图Transformer(PhyGHT),这是一种混合架构,结合了距离感知的局部图注意力和全局自注意力,以镜像质子-质子碰撞中形成的粒子簇射的物理拓扑。关键的是,我们集成了一个可解释的、受物理约束的堆积抑制门(PSG),该机制在超图聚合之前明确学习过滤软噪声。为了验证我们的方法,我们发布了一个新的顶夸克对产生模拟数据集,以模拟极端堆积条件。PhyGHT在预测信号能量和质量修正因子方面优于来自ATLAS和CMS实验的最先进基线。通过精确重建顶夸克的不变质量,我们展示了机器学习创新和跨学科合作如何直接推动实验物理学前沿的科学发现,并增强HL-LHC的发现潜力。数据集和代码可在https://github.com/rAIson-Lab/PhyGHT获取。

英文摘要

The High-Luminosity Large Hadron Collider (HL-LHC) at CERN will produce unprecedented datasets capable of revealing fundamental properties of the universe. However, realizing its discovery potential faces a significant challenge: extracting small signal fractions from overwhelming backgrounds dominated by approximately 200 simultaneous pileup collisions. This extreme noise severely distorts the physical observables required for accurate reconstruction. To address this, we introduce the Physics-Guided Hypergraph Transformer (PhyGHT), a hybrid architecture that combines distance-aware local graph attention with global self-attention to mirror the physical topology of particle showers formed in proton-proton collisions. Crucially, we integrate a Pileup Suppression Gate (PSG), an interpretable, physics-constrained mechanism that explicitly learns to filter soft noise prior to hypergraph aggregation. To validate our approach, we release a novel simulated dataset of top-quark pair production to model extreme pileup conditions. PhyGHT outperforms state-of-the-art baselines from the ATLAS and CMS experiments in predicting the signal's energy and mass correction factors. By accurately reconstructing the top quark's invariant mass, we demonstrate how machine learning innovation and interdisciplinary collaboration can directly advance scientific discovery at the frontiers of experimental physics and enhance the HL-LHC's discovery potential. The dataset and code are available at https://github.com/rAIson-Lab/PhyGHT

2602.18907 2026-05-27 cs.LG cs.CV cs.CY 版本更新

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR: 利用多模态大语言模型挖掘深度多兴趣用于生成式推荐

Yangchen Zeng, Zhenyu Yu, Zhiyuan Hu, Wenxin Zhang, Jinze Wang, Rongfeng Guo

发表机构 * Southeast University(东南大学)

AI总结 提出DeepInterestGR框架,通过多LLM兴趣挖掘、奖励标记深度兴趣和兴趣增强物品离散化,解决生成式推荐中的浅层兴趣问题,在三个Amazon数据集上显著提升推荐性能。

详情
AI中文摘要

我们介绍了DeepInterestGR,一个将深度兴趣挖掘集成到生成式推荐流程中的新颖框架。这解决了“浅层兴趣”问题——现有的生成方法依赖于表面文本特征,未能捕捉潜在的用户动机,限制了个性化深度和推荐可解释性。我们的方法通过结构化推理提示利用多LLM兴趣挖掘(MLIM),通过奖励标记深度兴趣(RLDI)进行质量控制,通过RQ-VAE进行兴趣增强物品离散化(IEID),并结合由兴趣感知奖励引导的两阶段SFT-GRPO训练流程。我们在三个Amazon Review基准(Beauty、Sports、Instruments)上验证了DeepInterestGR,与包括SASRec、BERT4Rec、TIGER、LC-Rec和S-DPO在内的14个最先进基线进行了比较。我们的方法在HR@10上实现了5.8%-8.3%的相对改进,在NDCG@10上实现了7.7%-9.9%的相对改进,跨领域泛化增益达到+24.8%。这些结果证明,融入深度语义兴趣可以有效改进基于SID的生成式推荐。

英文摘要

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem - existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
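
For readers unfamiliar with the two reported metrics, the sketch below computes HR@k and NDCG@k in the common single-target setting, where each test case has exactly one held-out item. That evaluation protocol is an assumption here; the abstract does not state the paper's exact setup.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG with a single relevant item: 1 / log2(rank + 1), rank is 1-based.

    The ideal DCG is 1 (target at rank 1), so no extra normalization is needed.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1
    return 1.0 / math.log2(rank + 1)

# Hypothetical ranked list for one user.
ranking = ["item_a", "item_b", "item_c"]
hr = hit_rate_at_k(ranking, "item_c", k=10)
ndcg = ndcg_at_k(ranking, "item_c", k=10)
```

Reported scores are averages of these per-user values, so a relative gain on NDCG@10 reflects targets moving closer to the top of the list, not only into it.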

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG 版本更新

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

即时主动适应:面向地理空间发现的相关性引导在线元学习与潜在概念

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学,安阿伯分校) Washington University in St. Louis, St. Louis, MO, USA(华盛顿大学圣路易斯分校)

AI总结 提出一个统一的地理空间发现框架,结合主动学习、在线元学习和概念引导推理,通过概念加权不确定性采样和相关性感知元批次形成策略,在有限数据和动态环境下高效发现隐藏目标。

详情
AI中文摘要

在环境监测中,数据收集通常成本高昂、稀疏且受紧急公共卫生需求影响。这对于致癌的PFAS(全氟和多氟烷基物质)污染尤其如此,与领域专家和环境组织的讨论强调需要在有限的采样预算下战略性地识别高风险、观测不足的区域。更广泛地说,在灾害响应和公共卫生环境中也出现了类似的挑战,动态环境使得从有限的地面实况中高效发现隐藏目标变得至关重要。然而,稀疏且有偏差的地理空间标签限制了现有基于学习方法(如强化学习)的适用性。为了解决这个问题,我们提出了一个统一的地理空间发现框架,该框架集成了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*概念的关键创新,该概念捕捉领域特定因素如何影响目标存在:一个*概念加权不确定性采样策略*,其中不确定性通过从现成概念(如土地覆盖和源距离)学习到的相关性进行调节;以及一个*相关性感知元批次形成策略*,该策略在在线元更新期间促进语义多样性,提高动态环境中的泛化能力。我们在PFAS污染发现任务上评估了我们的框架,这是一个受真实世界启发的环境监测任务,展示了在有限数据和变化条件下鲁棒的目标发现能力。

英文摘要

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.

2510.03352 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University(电气与计算机工程系,德克萨斯A&M大学)

AI总结 提出一种即插即用、无需训练的推理时搜索框架,将侧信息融入现有扩散模型逆问题求解器,显著提升重建质量。

详情
AI中文摘要

扩散模型已被用作解决逆问题的先验。然而,现有方法通常忽略了能够显著提高重建质量的侧信息,尤其是在严重病态设置中。在这项工作中,我们提出了一种新颖的框架,通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题(包括图像修复、超分辨率和几种去模糊任务)以及多种基于扩散模型的逆问题求解器(DPS、DAPS和MPGD)上的大量实验,我们表明,用我们的框架增强每个求解器,其重建质量始终优于相应的原始方法。为了展示我们方法的通用性,我们考虑了多种形式的侧信息,包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取:https://github.com/mahdi-farahbakhsh/DISS。

英文摘要

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

2602.15919 2026-05-27 stat.ML cs.AI cs.LG 版本更新

Assessing Per-Sample Membership Inference Vulnerability without Retraining

无需重训练的逐样本成员推断脆弱性评估

Valentin Dorseuil, Jamal Atif, Olivier Cappé

发表机构 * ENS, École normale supérieure(巴黎高等师范学院) Université PSL, CNRS(巴黎文理研究大学、法国国家科学研究中心) Institut Polytechnique de Paris(巴黎理工学院)

AI总结 提出一种基于数据依赖几何度量的逐样本成员推断脆弱性评分方法,仅需单个训练模型即可高效识别高风险样本。

详情
AI中文摘要

近期隐私文献表明,针对样本的成员推断攻击(MIA)显著优于非针对性方法。受此启发,我们探讨以下问题:能否在不训练影子模型的情况下评估单个训练点的隐私脆弱性?我们表明,逐样本对MIA的暴露程度不仅受其损失影响,还受数据依赖的几何度量控制。在线性设置中,我们推导出个体黑盒MIA脆弱性的闭式分解,将其分解为总体杠杆得分和残差损失项,明确了样本依赖的几何结构如何转化为隐私暴露。由于大多数现代架构的最后一层是线性的,我们将此框架扩展到深度网络,并提出一种基于最后一层表示的替代评分,仅需单个训练模型且无需影子模型。跨不同数据集和架构的实验表明,我们的评分在识别最先进攻击下的最高风险点时优于损失和梯度范数基线,为逐样本隐私风险评估提供了计算高效且有理论依据的工具。

英文摘要

Recent work in the privacy literature shows that sample-targeted membership inference attacks (MIAs) outperform untargeted approaches by a wide margin. Motivated by this observation, we address the following question: can the privacy vulnerability of individual training points be assessed without training shadow models? We show that per-sample exposure to MIA is governed not only by a point's loss, but also by a data-dependent geometric measure. In the linear setting, we derive a closed-form decomposition of individual black-box MIA vulnerability into a population leverage score and a residual loss term, making explicit how sample-dependent geometry translates into privacy exposure. Since the final layer of most modern architectures is linear, we extend this framework to deep networks and propose a surrogate score operating on last-layer representations that requires only a single trained model and no shadow models. Empirical evaluations across diverse datasets and architectures show that our score outperforms loss and gradient-norm baselines at identifying the highest-risk points under state-of-the-art attacks, providing a computationally efficient and theoretically grounded tool for per-sample privacy risk assessment.

2602.12833 2026-05-27 cs.LG cs.AI cs.MA 版本更新

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Vital Trace: 协议约束的患者状态推理用于纵向临床轨迹

Zhan Qu, Michael Färber

发表机构 * TU Dresden(德累斯顿理工大学)

AI总结 提出Vital Trace,一个协议约束的多智能体框架,通过紧凑的持久患者状态记忆和四个协调智能体(Router、Reasoner、Auditor、Steward)进行分阶段推理,以解决长期临床轨迹推理中的上下文漂移和不稳定问题,在MIMIC-IV和eICU数据集上预测未来血管加压药、呼吸、肾脏支持和恶化任务中优于自由形式多智能体基线。

详情
AI中文摘要

纵向临床推理需要跟踪电子健康记录中患者轨迹的生理测量、实验室结果和干预措施。现有的基于LLM的临床推理系统通常依赖于重复序列化患者历史或交换无约束的文本智能体消息,导致上下文漂移、推理不稳定以及长期推理成本增加。我们提出了Vital Trace,一个协议约束的多智能体框架,用于在动态ICU轨迹上进行未来临床风险预测。Vital Trace不维护无界文本历史,而是使用紧凑的持久患者状态记忆以及由四个协调智能体(Router、Reasoner、Auditor和Steward)执行的分阶段推理。为了支持时间上连贯的推理,我们引入了一个手动策划的全局协议,包含生理状态转换规则和动态患者状态表示,随时间跟踪血流动力学、呼吸、肾脏、代谢和炎症不稳定性。我们在MIMIC-IV和eICU上使用未来血管加压药支持、呼吸支持、肾脏支持和恶化预测任务评估Vital Trace。结果表明,与自由形式多智能体基线相比,结构化的协议约束推理提高了时间一致性、通信稳定性、校准性和可解释性,同时在长期ICU轨迹上实现了强大的预测性能。

英文摘要

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM-based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol-constrained multi-agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient-state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state-transition rules and a dynamic patient-state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC-IV and eICU using future vasopressor-support, respiratory-support, renal-support, and deterioration prediction tasks. Results show that structured protocol-constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free-form multi-agent baselines while achieving strong predictive performance across long ICU trajectories.

2507.11486 2026-05-27 cs.LG 版本更新

Exploring the robustness of TractOracle methods in RL-based tractography

探索基于强化学习的纤维追踪中TractOracle方法的鲁棒性

Jeremi Levesque, Antoine Théberge, Maxime Descoteaux, Pierre-Marc Jodoin

发表机构 * Department of Computer Science, Faculty of Science, University of Sherbrooke(谢布罗克大学计算机科学系)

AI总结 本文通过整合强化学习的最新进展,扩展了TractOracle-RL框架,并引入迭代奖励训练(IRT)方法,实验表明基于oracle的RL方法在准确性和解剖有效性上显著优于传统纤维追踪技术。

Comments 38 pages, 8 figures. Submitted to Medical Image Analysis

详情
Journal ref Medical Image Analysis, December 2025
AI中文摘要

纤维追踪算法利用扩散MRI重建大脑白质的纤维结构。在机器学习方法中,强化学习(RL)已成为纤维追踪的一个有前景的框架,在几个关键方面优于传统方法。TractOracle-RL是一种最新的基于RL的方法,通过基于奖励的机制将解剖先验纳入训练过程,减少了假阳性。在本文中,我们通过整合RL的最新进展,研究了原始TractOracle-RL框架的四种扩展,并在五个不同的扩散MRI数据集上评估了它们的性能。结果表明,无论使用何种具体方法或数据集,将oracle与RL框架结合始终能产生鲁棒且可靠的纤维追踪。我们还提出了一种新的RL训练方案,称为迭代奖励训练(IRT),其灵感来自人类反馈强化学习(RLHF)范式。IRT不依赖人类输入,而是利用束过滤方法在训练过程中迭代优化oracle的指导。实验结果表明,使用oracle反馈训练的RL方法在准确性和解剖有效性方面显著优于广泛使用的纤维追踪技术。

英文摘要

Tractography algorithms leverage diffusion MRI to reconstruct the fibrous architecture of the brain's white matter. Among machine learning approaches, reinforcement learning (RL) has emerged as a promising framework for tractography, outperforming traditional methods in several key aspects. TractOracle-RL, a recent RL-based approach, reduces false positives by incorporating anatomical priors into the training process via a reward-based mechanism. In this paper, we investigate four extensions of the original TractOracle-RL framework by integrating recent advances in RL, and we evaluate their performance across five diverse diffusion MRI datasets. Results demonstrate that combining an oracle with the RL framework consistently leads to robust and reliable tractography, regardless of the specific method or dataset used. We also introduce a novel RL training scheme called Iterative Reward Training (IRT), inspired by the Reinforcement Learning from Human Feedback (RLHF) paradigm. Instead of relying on human input, IRT leverages bundle filtering methods to iteratively refine the oracle's guidance throughout training. Experimental results show that RL methods trained with oracle feedback significantly outperform widely used tractography techniques in terms of accuracy and anatomical validity.

2602.10450 2026-05-27 cs.LG cs.AI math.OC 版本更新

Constructing Industrial-Scale Optimization Modeling Benchmark

构建工业规模优化建模基准

Zhong Li, Hongliang Lu, Tao Wei, Yuxuan Chen, Wenyu Liu, Yuan Lan, Fan Zhang, Zaiwen Wen

发表机构 * Great Bay University(大湾大学) Peking University(北京大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出MIPLIB-NL基准,通过结构感知逆向构建方法从真实混合整数线性规划中生成自然语言规范与求解器代码,以评估大语言模型在工业规模优化建模中的性能。

Comments This paper was accepted by ICML'26 for publication

详情
AI中文摘要

优化建模支撑着物流、制造、能源和金融领域的决策,然而将自然语言需求转化为正确的优化公式和可执行求解器代码仍然需要大量人力。尽管大语言模型(LLMs)已被探索用于此任务,但评估仍以玩具级或合成基准为主,掩盖了具有$10^{3}$--$10^{6}$(或更多)变量和约束的工业问题的难度。一个关键瓶颈是缺乏将自然语言规范与基于真实优化模型的参考公式/求解器代码对齐的基准。为填补这一空白,我们引入了MIPLIB-NL,它通过一种结构感知的逆向构建方法从MIPLIB~2017中的真实混合整数线性规划构建而成。我们的流程(i)从平坦的求解器公式中恢复紧凑、可复用的模型结构,(ii)在统一的模型-数据分离格式下,逆向生成明确关联到该恢复结构的自然语言规范,以及(iii)通过专家评审和人类-LLM交互以及独立的逆向检查进行迭代语义验证。这产生了223个一对一的重构,保留了原始实例的数学内容,同时实现了现实的自然语言到优化评估。实验表明,在现有基准上表现良好的系统在MIPLIB-NL上性能显著下降,暴露了在玩具规模下不可见的失败模式。

英文摘要

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$--$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World: 面向视频世界模型的潜在动作定向

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore Research (A*STAR), Singapore

AI总结 提出SeqΔ-REPA对齐目标,通过冻结自监督视频编码器的时序特征差异锚定潜在动作,实现无标签视频中可迁移的动作控制世界模型预训练。

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

详情
AI中文摘要

扩展动作可控世界模型受限于动作标签的稀缺性。虽然潜在动作学习有望从无标签视频中提取控制接口,但学习到的潜在表示往往难以跨上下文迁移:它们纠缠了场景特定线索,缺乏共享坐标系。这是因为标准目标仅在每个片段内操作,没有提供跨上下文对齐动作语义的机制。我们的关键洞察是,尽管动作未被观测到,但其语义效果是可观测的,可以作为共享参考。我们引入SeqΔ-REPA,一种序列级控制效果对齐目标,将集成潜在动作锚定到来自冻结自监督视频编码器的时序特征差异。基于此,我们提出Olaf-World,一个从大规模被动视频中预训练动作条件视频世界模型的流程。大量实验表明,我们的方法学习了更结构化的潜在动作空间,从而在零样本动作迁移和适应新控制接口的数据效率上优于最先进的基线方法。

英文摘要

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

2602.09842 2026-05-27 math.OC cs.LG 版本更新

Step-Size Stability in Stochastic Optimization: A Theoretical Perspective

随机优化中的步长稳定性:理论视角

Fabian Schaipp, Robert M. Gower, Adrien Taylor

发表机构 * Inria, Departement d'Informatique de l'Ecole Normale Superieure, PSL Research University, Paris, France(法国国家信息与自动化研究所,巴黎高等师范学院计算机系,巴黎文理研究大学,法国) CCM, Flatiron Institute, New York City(Flatiron研究所,纽约市)

AI总结 本文通过理论分析识别关键量,量化步长过大时性能下降程度,证明自适应步长方法(如SPS、NGN)比SGD更鲁棒,并实验验证理论界对实际性能的定性反映。

详情
AI中文摘要

我们提出了随机优化方法关于步长敏感性的理论分析。我们识别出一个关键量,对于每种方法,它描述了当步长过大时性能如何下降。对于凸问题,我们证明该量直接影响方法的次优性界。最重要的是,我们的分析提供了直接的理论证据,表明自适应步长方法(如SPS或NGN)比SGD更鲁棒。这使我们能够量化这些自适应方法超越经验评估的优势。最后,我们通过实验表明,即使对于非凸问题,我们的理论界也定性地反映了实际性能随步长的变化。

英文摘要

We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as the step size becomes too large. For convex problems, we show that this quantity directly impacts the suboptimality bound of the method. Most importantly, our analysis provides direct theoretical evidence that adaptive step-size methods, such as SPS or NGN, are more robust than SGD. This allows us to quantify the advantage of these adaptive methods beyond empirical evaluation. Finally, we show through experiments that our theoretical bound qualitatively mirrors the actual performance as a function of the step size, even for non-convex problems.
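
To make the robustness claim concrete, the toy sketch below runs plain SGD and a capped stochastic Polyak step size (SPS) with the same nominal step size on a one-dimensional least-squares problem. The problem, the curvature values, the constant `c`, and the function names are illustrative assumptions, not the paper's experiments; the SPS rule used is the commonly cited capped form min((f_i(x) - f_i*) / (c * ||g||^2), gamma_max).

```python
def run(step_rule, steps=30, x0=5.0):
    """Minimize the average of f_i(x) = 0.5 * a_i**2 * (x - 1)**2, cycling over i."""
    curvatures = [2.0, 3.0]  # hypothetical per-sample a_i values
    x = x0
    for t in range(steps):
        a = curvatures[t % len(curvatures)]
        loss = 0.5 * a**2 * (x - 1.0) ** 2
        grad = a**2 * (x - 1.0)
        x -= step_rule(loss, grad) * grad
    return x

def sgd_step(gamma):
    """Constant step size, independent of the current loss and gradient."""
    return lambda loss, grad: gamma

def sps_step(gamma_max, c=1.0, f_star=0.0):
    """Capped Polyak step; f_star = 0 because every f_i is minimized at x = 1."""
    def rule(loss, grad):
        if grad == 0.0:
            return 0.0  # already at a stationary point of this sample
        return min((loss - f_star) / (c * grad**2), gamma_max)
    return rule

x_sgd = run(sgd_step(1.0))  # step size far above the stable range
x_sps = run(sps_step(1.0))  # same cap, but the Polyak ratio shrinks the step
```

With these values the SGD iterate is multiplied by 1 - a_i^2 at each step and blows up, while the Polyak ratio equals 0.5 / a_i^2 and halves the distance to the minimizer regardless of the cap.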

2511.06625 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

可解释的跨疾病推理:基于低剂量计算机断层扫描的心血管风险评估

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Department of Radiation Oncology(放射肿瘤学部) Winship Cancer Institute, Emory University(埃默里大学Winship癌症研究所)

AI总结 提出一种可解释的跨疾病推理框架,通过提取肺部发现、基于医学知识进行跨器官机制推理,并结合心脏子体积特征,从低剂量胸部CT中实现心血管风险评估,在NLST队列中AUC达0.919。

详情
AI中文摘要

低剂量胸部计算机断层扫描(LDCT)在一次扫描中捕获肺部和心脏结构,使得能够联合评估肺部和心血管健康。现有方法通常独立建模这些领域,并未明确表示它们的生理交互。我们提出了一种可解释的跨疾病推理框架,用于从LDCT进行心血管风险评估。该框架遵循受限的临床信息路径:它提取肺部发现,将跨器官机制基于医学知识进行推理,并生成带有自然语言理由的心血管预测。它结合了四个组件:一个冻结的肺风险先验、一个肺部感知模块、一个代理推理模块和一个心脏子体积特征提取器。它们的输出被融合,以将局部心脏证据与机制层面的肺部上下文整合。在国家肺筛查试验队列中,该框架在CVD筛查中达到0.919的AUC,在CVD死亡率预测中高达0.838,优于心脏特异性、单疾病和基础模型基线。目标对照表明,这些增益不能仅由额外的胸部视觉特征、固定规则传播或单一推理后端解释。因此,所提出的框架提供了一种可审计的方法,用于从LDCT进行跨疾病心血管风险评估。

英文摘要

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

2506.15199 2026-05-27 cs.LG stat.ML 版本更新

Interpretability and Generalization Bounds for Learning Spatial Physics

学习空间物理的可解释性与泛化界

Alejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang

发表机构 * OpenAI Google(谷歌) Sandia National Laboratories(桑迪亚国家实验室)

AI总结 利用数值分析技术,严格量化了应用于线性微分方程的机器学习模型在参数发现或求解中的准确性、收敛率和泛化界,并基于格林函数表示引入科学模型的可解释性视角。

Comments To appear in ICML 2026. 18 pages, 13 figures

详情
AI中文摘要

尽管机器学习在科学问题上的许多应用看起来很有前景,但视觉可能具有欺骗性。利用数值分析技术,我们严格量化了某些应用于线性微分方程进行参数发现或求解的机器学习模型的准确性、收敛率和泛化界。除了数据的数量和离散化之外,我们发现数据的函数空间对模型的泛化至关重要。对于常用模型(包括物理特定技术),我们通过实验证明了类似的泛化不足。与直觉相反,我们发现不同类别的模型可能表现出相反的泛化行为。基于我们的理论分析,我们还引入了一种新的科学模型机械可解释性视角,即可以从黑箱模型的权重中提取格林函数表示。我们的结果为测量物理系统泛化性提供了一种新的交叉验证技术,该技术可作为基准。

英文摘要

While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.

2601.21008 2026-05-27 cs.LG cs.AI math.OC 版本更新

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

ORLoopBench:运筹学中自我修正与行为理性的求解器在环基准测试

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

AI总结 提出ORLoopBench基准套件,通过将不可行模型修复形式化为求解器在环马尔可夫决策过程,利用不可约不可行子系统(IIS)反馈,结合经求解器验证的可验证奖励强化学习(RLVR)训练,使8B模型在LP修复上超越前沿API(95.3% vs 92.4% RR@5),并揭示全模型代码再生中的语义漂移问题。

Comments 58 pages, accepted by ICML 2026

详情
AI中文摘要

运筹学从业者通过迭代过程调试不可行模型:检查不可约不可行子系统(IIS),识别约束冲突,并修复公式直至恢复可行性。现有的LLM基准大多将OR视为从问题描述到求解器代码的一次性翻译,忽略了这一诊断循环。我们将不可行模型修复形式化为一个求解器在环马尔可夫决策过程,其中每个动作触发求解器重新执行和IIS重新计算,产生确定性的、可验证的反馈。我们引入ORLoopBench,一个包含两个组件的基准套件:OR-Debug-Bench发布5,362个LP/MILP修复实例,而OR-Bias-Bench评估库存设置中的闭式运营决策理性。求解器验证的RLVR训练使8B模型在LP修复上超越前沿API(95.3% vs 92.4% RR@5),改善诊断行为,并迁移到MILP修复。同样的评估暴露了全模型代码再生中的语义漂移:可行的再生MILP可能解决错误的问题。使用求解器预言机的过程级评估能够为可靠的OR自我修正进行针对性训练。

英文摘要

Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and repairing formulations until feasibility is restored. Existing LLM benchmarks mostly treat OR as one-shot translation from problem descriptions to solver code, omitting this diagnostic loop. We formalize infeasible-model repair as a solver-in-the-loop Markov Decision Process in which each action triggers solver re-execution and IIS recomputation, yielding deterministic, verifiable feedback. We introduce ORLoopBench, a benchmark suite with two components: OR-Debug-Bench releases 5,362 LP/MILP repair instances, while OR-Bias-Bench evaluates closed-form operational decision rationality across inventory settings. Solver-verified RLVR training enables an 8B model to surpass frontier APIs on LP repair (95.3% vs 92.4% RR@5), improves diagnostic behavior, and transfers to MILP repair. The same evaluation exposes semantic drift in whole-model code regeneration: feasible regenerated MILPs can solve the wrong problem. Process-level evaluation with solver oracles enables targeted training for reliable OR self-correction.
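
The notion of an Irreducible Infeasible Subsystem can be illustrated with the classic deletion filter on a deliberately tiny model: bound constraints on a single variable. The constraint encoding and function names are hypothetical, and a real solver would compute the IIS itself; this is only a sketch of the idea, not the benchmark's tooling.

```python
def is_feasible(constraints):
    """Constraints are ("ge", v) for x >= v or ("le", v) for x <= v on one variable."""
    lower = max((v for kind, v in constraints if kind == "ge"), default=float("-inf"))
    upper = min((v for kind, v in constraints if kind == "le"), default=float("inf"))
    return lower <= upper

def deletion_filter(constraints):
    """Shrink an infeasible constraint set to an irreducible infeasible subset."""
    if is_feasible(constraints):
        raise ValueError("input must be infeasible")
    kept = list(range(len(constraints)))
    for i in range(len(constraints)):
        trial = [j for j in kept if j != i]
        # If the model stays infeasible without constraint i, drop it for good.
        if not is_feasible([constraints[j] for j in trial]):
            kept = trial
    return [constraints[j] for j in kept]

# Hypothetical infeasible model: x >= 5 conflicts with x <= 3.
model = [("ge", 5), ("le", 3), ("le", 10), ("ge", 0)]
iis = deletion_filter(model)
```

Each feasibility check plays the role of one solver call, which is why the repair loop described above yields deterministic feedback: the same edit always produces the same feasibility verdict and the same conflict set.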

2501.06708 2026-05-27 cs.LG cs.AI 版本更新

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

通过模仿模型权重评估样本效用以实现高效数据选择

Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

发表机构 * Apple(苹果公司)

AI总结 提出基于梯度和几何的Mimic Score指标,通过Grad-Mimic框架在线重加权样本加速训练、离线构建数据过滤器,在六个图像数据集上提升数据效率和CLIP模型性能。

Comments This work appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026) and was selected as an Oral paper at the ICML 2025 DataWorld Workshop

详情
AI中文摘要

大规模网络爬取数据集包含噪声、偏差和不相关信息,因此需要数据选择技术。现有方法依赖于手工启发式、下游数据集或需要昂贵的基于影响力的计算——所有这些都限制了可扩展性并引入了不必要的数据依赖性。为了解决这个问题,我们引入了Mimic Score,一种简单且基于几何的数据质量指标,通过测量样本梯度与预训练参考模型诱导的目标方向之间的对齐来评估效用。这利用了现成的模型权重,避免了验证数据集的需求,并且计算开销最小。基于该指标,我们提出了Grad-Mimic,一个两阶段框架,在线重新加权样本以加速训练,并离线聚合样本效用以构建有效的数据过滤器。实验表明,使用模仿分数指导训练提高了数据效率,加速了收敛,在六个图像数据集上取得了一致的性能提升,并以减少20.7%的训练步骤增强了CLIP模型。此外,基于模仿分数的过滤器增强了现有过滤技术,使得用更少470万个样本训练的CLIP模型得到改进。

英文摘要

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or require expensive influence-based computations -- all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overheads. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
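One way to read "alignment between a sample's gradients and a target direction induced by a pre-trained reference model" is as a projection of the descent direction onto the vector pointing from the current weights to the reference weights. The sketch below follows that reading only; the exact definition used by the authors is not given in the abstract, and the function name `mimic_score` and the normalization are assumptions.

```python
import math


def mimic_score(grad, weights, ref_weights):
    """Projection of the descent direction (-grad) onto the unit vector
    pointing from the current weights to the reference weights.
    This normalization is an illustrative assumption."""
    direction = [r - w for r, w in zip(ref_weights, weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return 0.0  # already at the reference weights; no direction defined
    return sum(-g * d for g, d in zip(grad, direction)) / norm


# A gradient step on this sample moves the weights toward the reference.
print(mimic_score([-1.0, 0.0], [0.0, 0.0], [2.0, 0.0]))
# A gradient step on this sample moves the weights away from it.
print(mimic_score([1.0, 0.0], [0.0, 0.0], [2.0, 0.0]))
```

Under this reading, samples with positive scores pull the model toward the reference and could be up-weighted online, while persistently negative scores could feed an offline filter.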

2602.04990 2026-05-27 cs.LG cs.GT 版本更新

Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives

立场:机器学习用于心脏移植分配政策优化应考虑激励机制

Ioannis Anagnostides, Itai Zilberstein, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

发表机构 * Department of Computer Science, Carnegie Mellon University(计算机科学系,卡内基梅隆大学) Department of Surgery, Division of Cardiothoracic Surgery, Medical University of South Carolina(外科系心胸外科分部,南卡罗来纳医科大学) Strategy Robot, Inc., Strategic Machine, Inc., Optimized Markets, Inc.

AI总结 本文指出当前机器学习优化器官分配政策忽视了激励机制问题,提出下一代分配政策应具有激励意识,并呼吁整合机制设计、策略分类、因果推断和社会选择等研究。

Comments To appear at ICML 2026 (position paper track). V3 incorporates reviewers' feedback

详情
AI中文摘要

稀缺供体器官的分配构成了医疗保健中最具影响力的算法挑战之一。尽管该领域正迅速从僵化的、基于规则的系统转向机器学习和数据驱动的优化,我们认为当前的方法常常忽视了一个基本障碍:激励机制。在这篇立场论文中,我们强调器官分配不仅仅是一个优化问题,而是一个涉及器官获取组织、移植中心、临床医生、患者和监管机构的复杂博弈。聚焦于美国成人心脏移植分配,我们识别了决策流程中的关键激励错位,并展示了表明这些错位正在产生不良后果的数据。我们的主要立场是,下一代分配政策应具有激励意识。我们为机器学习社区概述了一个研究议程,呼吁整合机制设计、策略分类、因果推断和社会选择,以确保在面对各组成群体的策略行为时,系统具有鲁棒性、效率、公平性和信任度。

英文摘要

The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely an optimization problem, but rather a complex game involving organ procurement organizations, transplant centers, clinicians, patients, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior from the various constituent groups.

2512.06609 2026-05-27 cs.LG cs.CV 版本更新

Training-Free Vector Quantization via Gaussian VAEs

基于高斯VAE的无训练向量量化

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

发表机构 * AIR, Tsinghua University(清华大学智能产业研究院) CST, Tsinghua University(清华大学计算机科学与技术系) University of Cambridge(剑桥大学)

AI总结 提出Gaussian Quant (GQ)方法,通过约束训练高斯VAE并直接转换为VQ-VAE,无需额外训练,在UNet和ViT架构上优于现有VQ-VAE。

详情
AI中文摘要

向量量化变分自编码器(VQ-VAEs)是将图像压缩为离散标记的离散自编码器。然而,由于离散化,它们难以训练。在本文中,我们提出了一种简单而有效的技术,称为Gaussian Quant (GQ),它首先在特定约束下训练高斯VAE,然后将其转换为VQ-VAE,无需额外训练。对于转换,GQ生成随机高斯噪声作为码本,并找到最接近后验均值的噪声向量。理论上,我们证明当码本大小的对数超过高斯VAE的bits-back编码率时,可以保证较小的量化误差。实际上,我们提出了一种启发式方法来训练高斯VAE以实现有效转换,称为目标散度约束(TDC)。实验上,我们表明GQ在UNet和ViT架构上均优于先前的VQ-VAE,如VQGAN、FSQ、LFQ和BSQ。此外,TDC还改进了先前的离散化方法,如TokenBridge。源代码见https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE。

英文摘要

Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
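The conversion step described above (draw a random Gaussian codebook, then pick the entry closest to the posterior mean) can be sketched in a few lines. This is an illustration rather than the authors' implementation: the use of squared Euclidean distance, the fixed seed, and the names `make_codebook` and `quantize` are assumptions.

```python
import random


def make_codebook(size, dim, seed=0):
    """Draw `size` codewords of dimension `dim` from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]


def quantize(posterior_mean, codebook):
    """Return the index of the codeword nearest to the posterior mean
    (squared Euclidean distance is an assumption of this sketch)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sq_dist(posterior_mean, codebook[i]))


codebook = make_codebook(size=16, dim=4)
token = quantize([0.1, -0.3, 0.7, 0.0], codebook)
print(token, codebook[token])
```

Because the codebook is regenerated from a seed rather than learned, encoder and decoder only need to share that seed, and no extra training is involved in the conversion itself.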

2602.04931 2026-05-27 cs.LG cs.AI 版本更新

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

大型语言模型中跨深度的涌现因果几何动力学

Shahar Haim, Daniel C McNamee

发表机构 * Champalimaud Centre for the Unknown(查普拉米乌德未知中心)

AI总结 通过结合几何分析与因果干预,揭示了解码器-only大型语言模型中从上下文处理到预测形成的跨层转变,并发现后期层中角度结构参数化下一词分布相似性并实现选择性因果控制。

详情
AI中文摘要

对大型语言模型(LLM)表征的几何分析揭示了跨深度的结构化变化,但就token预测的形成而言,这类分析本质上仍停留在相关性层面。同时,因果干预揭示了依赖于深度的效能曲线,但缺乏对其表征动力学的统一解释。对LLM功能的完整解释需要说明表征结构如何跨深度演化以因果性地产生预测。我们通过将几何分析与机制性干预相结合,明确将跨深度动力学作为解释LLM功能的组织轴,综合了这些视角。在仅解码器(decoder-only)LLM中,我们识别出从上下文处理到预测形成计算的急剧转变,伴随着跨层的表征几何的更渐进重组。这种综合揭示了一种后期层几何编码,其中角度结构参数化下一词分布相似性,并能够对预测进行选择性因果控制,而表征范数编码的信息与预测基本解耦。总之,我们的结果提供了因果和几何视角的综合,产生了关于语言模型中跨深度的控制相关几何动力学如何将上下文转化为预测的机制性解释。这一视角调和了先前令人困惑的发现,并表明逐层功能不能孤立地理解或有效干预,而只能在网络涌现的全局动力学结构中理解。

英文摘要

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlational with respect to token prediction formation. Meanwhile, causal interventions expose depth-dependent efficacy profiles without a unifying account of their representational dynamics. A complete account of LLM function requires explaining how representational structure evolves across depth to causally produce predictions. We synthesize these perspectives by combining geometric analysis with mechanistic interventions, explicitly centralizing depth-wise dynamics as the organizing axis for interpreting LLM function. In decoder-only LLMs, we identify a sharp transition from context-processing to prediction-forming computation, accompanied by a more gradual reorganization of representational geometry across layers. This synthesis reveals a late-layer geometric code in which angular structure parameterizes next-token distributional similarity and enables selective causal control over predictions, while representation norms encode information largely decoupled from prediction. Together, our results provide a synthesis of causal and geometric perspectives, yielding a mechanistic account of how control-relevant geometric dynamics across depth transform context into prediction in language models. This perspective reconciles previously puzzling findings and implies that layer-wise function cannot be understood or effectively intervened upon in isolation, but only within the emergent global dynamical structure of the network.

2602.04599 2026-05-27 cs.LG 版本更新

Stochastic Decision Horizons for Constrained Reinforcement Learning

约束强化学习的随机决策视界

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

发表机构 * Max Planck Institute for Human Cognitive and Brain Sciences(马克斯·普朗克人类认知与脑科学研究所) Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)(可扩展数据分析与人工智能中心 (ScaDS.AI)) Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience(赫尔特临床脑研究所与整合神经科学中心) University of Tübingen(图宾根大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 提出随机决策视界(SDH)框架,通过状态-动作延续概率实现每步约束满足,并开发了首个离策略和正则化算法(AS-SAC和VT-MPO),在90肌肉人形机器人上以4倍更少的环境步数达到最先进步态真实度。

详情
AI中文摘要

我们提出随机决策视界(SDH),这是一个理论基础的框架,用于解决具有每步约束满足的约束强化学习问题,这在许多实际应用中是一个理想属性。在SDH中,违反约束通过状态-动作延续概率有效缩短视界。利用控制作为推理,我们开发了首个用于即时约束RL的离策略和正则化算法。我们确定了违反后决策的两种原则性语义。吸收状态语义终止决策过程,因此只有存活的决策支付熵成本,产生最大熵AS-SAC。虚拟终止保持决策过程活跃,同时停止奖励信用,产生KL正则化VT-MPO。为了连接SDH与CMDP,我们跟踪违反沿轨迹的累积(它们的违反深度剖面)。SDH有效地通过每个轨迹的总违反的指数加权;这正好在违反发生在单一特征尺度时匹配加性CMDP预算,并且我们指出它不能匹配的情况:当罕见的深度违反与频繁的浅层违反混合时。实验验证了理论。在90肌肉H2190人形机器人(Hyfydy)上,VT-MPO以4倍更少的环境步数和更稳定的训练达到最先进的步态真实度。在Safety Gymnasium上,违反深度剖面正确识别了SDH提供强奖励-违反权衡的机制。

英文摘要

We propose stochastic decision horizons (SDH), a theoretically grounded framework for solving constrained RL problems with every-step constraint satisfaction, a desirable property in many real-world applications. In SDH, a constraint violation yields an effective shortening of horizon via a state-action continuation probability. Using Control as Inference, we develop the first off-policy and regularized algorithms for RL with instantaneous constraints. We identify two principled semantics for what counts as a decision after a violation. Absorbing-state semantics end the decision process, so only surviving decisions pay entropy cost, yielding max-entropy AS-SAC. Virtual-termination keeps the decision process alive while stopping reward credit, yielding KL-regularized VT-MPO. To connect SDH with CMDPs, we track how violations accumulate along trajectories (their violation-depth profile). SDH effectively weights each trajectory by the exponential of its total violation; this matches an additive CMDP budget exactly when violations occur at a single characteristic scale, and we pinpoint where it cannot: when rare, deep violations mix with frequent, shallow ones. Experiments validate the theory. On the 90-muscle H2190 humanoid (Hyfydy), VT-MPO matches state-of-the-art gait realism with $4\times$ fewer environment steps and substantially more stable training. On Safety Gymnasium, violation-depth profiles correctly identify the regimes in which SDH delivers strong reward-violation trade-offs.
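The idea that a violation shortens the effective horizon through a continuation probability can be made concrete with a small calculation. The sketch below is a simplified reading, not the paper's algorithm: it assumes the reward at a step is still collected and only later steps are down-weighted, and the name `survival_weighted_return` is hypothetical.

```python
def survival_weighted_return(rewards, continuation_probs):
    """Expected return when the process continues past step t with
    probability continuation_probs[t] (1.0 means no violation)."""
    survive = 1.0  # probability of having reached the current step
    total = 0.0
    for reward, cont in zip(rewards, continuation_probs):
        total += survive * reward
        survive *= cont  # a violation (cont < 1) discounts all later steps
    return total


# No violations: the full horizon counts.
print(survival_weighted_return([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
# A violation at the second step halves the weight of everything after it.
print(survival_weighted_return([1.0, 1.0, 1.0], [1.0, 0.5, 1.0]))
```

Since the survival weight is a product of per-step continuation probabilities, taking logarithms turns it into a sum of per-step penalties, which is consistent with the abstract's remark that SDH weights a trajectory by the exponential of its total violation.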

2602.04397 2026-05-27 cs.GT cs.LG 版本更新

Optimal Rates for Feasible Payoff Set Estimation in Games

博弈中可行收益集合估计的最优速率

Annalisa Barbara, Riccardo Poiani, Martino Bernasconi, Andrea Celli

发表机构 * Bocconi University, Milano, Italy(博科尼大学,米兰,意大利) ETH Zurich, Switzerland(苏黎世联邦理工学院,瑞士)

AI总结 针对双矩阵博弈中观察玩家动作但不知收益函数的情形,提出以Hausdorff度量下高概率估计可行收益集合的方法,并给出零和与一般和博弈中精确及近似均衡下首次极小极大最优速率。

详情
AI中文摘要

我们研究一种设置,其中两个玩家进行一个双矩阵博弈的(可能近似)纳什均衡,而一个学习器仅观察他们的动作,对均衡或潜在博弈一无所知。一个自然的问题是学习器能否通过推断玩家的收益函数来合理化观察到的行为。逆博弈论旨在识别与观察行为一致的全部收益集合,而不是产生单个收益估计,从而支持下游应用,如反事实分析和机制设计,涵盖拍卖、定价和安全博弈等场景。我们关注以高概率且Hausdorff度量精度$ε$估计可行收益集合的问题。我们提供了零和博弈以及一般和博弈中精确和近似均衡下首次极小极大最优速率。我们的结果为多智能体环境中集合值收益推断提供了学习理论基础。

英文摘要

We study a setting in which two players play a (possibly approximate) Nash equilibrium of a bimatrix game, while a learner observes only their actions and has no knowledge of the equilibrium or the underlying game. A natural question is whether the learner can rationalize the observed behavior by inferring the players' payoff functions. Rather than producing a single payoff estimate, inverse game theory aims to identify the entire set of payoffs consistent with observed behavior, enabling downstream use in, e.g., counterfactual analysis and mechanism design across applications like auctions, pricing, and security games. We focus on the problem of estimating the set of feasible payoffs with high probability and up to precision $ε$ on the Hausdorff metric. We provide the first minimax-optimal rates for both exact and approximate equilibrium play, in zero-sum as well as general-sum games. Our results provide learning-theoretic foundations for set-valued payoff inference in multi-agent environments.

2602.03517 2026-05-27 cs.LG 版本更新

Rank-Learner: Orthogonal Ranking of Treatment Effects

Rank-Learner:治疗效果的正交排序

Henri Arno, Dennis Frauen, Emil Javurek, Thomas Demeester, Stefan Feuerriegel

发表机构 * Ghent University - imec(根特大学 - imec) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心 (MCML))

AI总结 提出一种名为Rank-Learner的两阶段学习器,通过成对学习目标直接学习治疗效果排序,无需显式估计条件平均处理效应,具有Neyman正交性和模型无关性。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

许多决策问题需要根据治疗效果对个体进行排序,而不是估计确切的效果大小。例如,优先考虑患者进行预防性护理干预,或根据广告的预期增量影响对客户进行排名。令人惊讶的是,尽管因果效应估计在文献中受到了广泛关注,但直接学习治疗效果排序的问题在很大程度上仍未得到探索。在本文中,我们介绍了Rank-Learner,一种新颖的两阶段学习器,它直接从观测数据中学习治疗效果的排序。我们首先表明,基于精确治疗效果估计的朴素方法解决了一个比排序所需更困难的问题,而我们的Rank-Learner优化了一个成对学习目标,该目标恢复了真实的治疗效果顺序,无需显式的CATE估计。我们进一步证明,我们的Rank-Learner是Neyman正交的,因此具有强大的理论保证,包括对 nuisance 函数估计误差的鲁棒性。此外,我们的Rank-Learner是模型无关的,可以用任意机器学习模型(例如神经网络)实例化。我们通过大量实验证明了我们方法的有效性,其中Rank-Learner始终优于标准的CATE估计器和非正交排序方法。总的来说,我们为从业者提供了一种新的、正交的两阶段学习器,用于按治疗效果对个体进行排序。

英文摘要

Many decision-making problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learning rankings of treatment effects has largely remained unexplored. In this paper, we introduce Rank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while our Rank-Learner optimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that our Rank-Learner is Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, our Rank-Learner is model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments where Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

2602.02518 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer: 通过两阶段课程后训练训练LLMs在图上的探索与推理

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

发表机构 * Texas A&M University(德克萨斯农工大学) University of Waterloo(滑铁卢大学) Lambda(Lambda公司) University of Oregon(俄勒冈大学)

AI总结 提出GraphDancer两阶段后训练框架,通过图感知课程逐步增加任务难度,使LLMs学会在异构图上进行自然语言推理与函数调用交织的探索与推理,仅用3B骨干模型即在跨域基准上超越更强基线。

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖外部知识来提高事实性,然而许多真实世界的知识源被组织为异构图而非纯文本。在此类图上进行推理要求模型通过精确的函数调用遵循模式定义的关系,并在多轮交互中聚合证据。我们提出GraphDancer,一个两阶段后训练框架,通过将自然语言推理与图函数执行交织来教导LLMs在图上的推理。第一阶段教导模型在基于规则的奖励下如何与图交互,而第二阶段进一步教导其偏好更基于事实且高效的交互轨迹。GraphDancer的关键创新在于一个图感知课程,该课程根据信息寻求轨迹的结构复杂性组织两个阶段,在训练期间逐步增加任务难度。我们在一个多领域基准上评估GraphDancer,仅在一个领域上训练,并在未见过的领域和分布外问题类型上进行测试。尽管仅使用3B骨干模型,GraphDancer仍优于配备更大/更强骨干的基线,展示了图探索和推理技能的强大跨域泛化能力。我们的代码可在https://github.com/leopoldwhite/GraphDancer找到。

英文摘要

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

2602.01941 2026-05-27 cond-mat.mtrl-sci cs.CE cs.LG physics.comp-ph 版本更新

FluxNet: Learning Capacity-Constrained Local Transport Operators for Conservative and Bounded PDE Surrogates

FluxNet: 学习容量受限的局部传输算子用于保守且有界的PDE代理模型

Zishuo Lan, Junjie Li, Lei Wang, Jincheng Wang

发表机构 * State Key Laboratory of Solidification Processing, Northwestern Polytechnical University, Xi'an, 710072, China(凝固技术国家重点实验室,西北工业大学,西安,710072,中国)

AI总结 提出FluxNet,通过学习累积传输量实现精确离散守恒,并利用模块化容量受限传输头强制物理边界,解决自回归PDE代理中的守恒和边界违反问题。

Comments ICML2026

详情
AI中文摘要

时间步进算子的自回归学习为数据驱动的偏微分方程(PDE)模拟提供了有效方法,但对于守恒律,它们面临一个基本挑战:学习到的更新可能在长时间滚动中违反全局守恒。对于质量守恒型方程的重要子类,固有的物理边界(例如,非负性或[0,1]中的浓度)使问题更加复杂,违反这些边界会进一步破坏预测的稳定性。我们引入FluxNet,它学习累积传输量,表示在完整代理时间间隔内每个单元与可配置邻域之间重新分配的总守恒量。保守更新通过构造保证精确的离散守恒;模块化容量受限传输头(L、U和D)通过架构设计强制下界、上界或接近零的双边界违反。与需要时间积分并因此继承CFL约束的通量率代理不同,FluxNet不涉及此类积分;可配置的传输邻域允许在全空间分辨率下进行大时间步预测。虚拟单元将框架扩展到非周期边界。在四个基准测试(一维对流扩散、二维浅水、一维交通流、二维Cahn-Hilliard)上的实验证明了精确守恒、结构边界保持、架构模块化以及在大的时间步长下优于通量率代理的稳定性。代码公开于:https://github.com/Lan-zs/FluxNet。

英文摘要

Autoregressive learning of time-stepping operators provides an effective approach to data-driven partial differential equation (PDE) simulation, yet for conservation laws, they face a fundamental challenge: learned updates may violate global conservation over long rollouts. For the important subclass of mass-conservation-type equations, the problem is compounded by inherent physical bounds (e.g., nonnegativity or concentrations in [0,1]) whose violation further destabilizes predictions. We introduce FluxNet, which learns cumulative transport amounts representing the total conserved quantity redistributed between each cell and a configurable neighborhood over the full surrogate interval. A conservative update guarantees exact discrete conservation by construction; modular capacity-constrained transport heads (L, U, and D) enforce lower bounds, upper bounds, or near-zero dual-bound violations through architectural design. Unlike flux-rate surrogates that require temporal integration and thus inherit CFL constraints, FluxNet involves no such integration; configurable transport neighborhoods enable large-timestep prediction at full spatial resolution. Ghost cells extend the framework to non-periodic boundaries. Experiments on four benchmarks (1D convection--diffusion, 2D shallow water, 1D traffic flow, 2D Cahn--Hilliard) demonstrate exact conservation, structural bound preservation, architecture modularity, and superior stability over flux-rate surrogates at large temporal strides. The code is publicly available at: https://github.com/Lan-zs/FluxNet.
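The claim that a transport-based update conserves the total by construction, and that capping outflow by what a cell holds preserves a lower bound, can be checked on a toy 1D periodic grid. The sketch below is an illustration under stated assumptions and does not reproduce FluxNet's L, U, or D heads: `transfer[i]` is a hypothetical signed amount moved from cell `i` to cell `i + 1`, the neighborhood is nearest neighbors only, and the initial field is assumed nonnegative.

```python
def conservative_step(u, transfer):
    """One conservative, nonnegativity-preserving update on a periodic grid."""
    n = len(u)
    # Split signed transfers into nonnegative outflows per cell.
    out_right = [max(transfer[i], 0.0) for i in range(n)]      # cell i -> i+1
    out_left = [max(-transfer[i - 1], 0.0) for i in range(n)]  # cell i -> i-1
    # Capacity constraint: a cell cannot give away more than it holds.
    for i in range(n):
        total_out = out_right[i] + out_left[i]
        if total_out > u[i]:
            scale = u[i] / total_out
            out_right[i] *= scale
            out_left[i] *= scale
    new_u = []
    for i in range(n):
        inflow = out_right[i - 1] + out_left[(i + 1) % n]
        new_u.append(u[i] - out_right[i] - out_left[i] + inflow)
    return new_u


u0 = [1.0, 0.0, 2.0, 0.5]
u1 = conservative_step(u0, [0.3, -0.1, 5.0, -0.2])
print(u1, sum(u0), sum(u1))
```

Every amount removed from one cell is added to exactly one neighbor, so the sum is unchanged up to floating-point rounding no matter what values the (here hand-written, in practice learned) transfers take.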

2602.00959 2026-05-27 cs.LG cs.CL 版本更新

Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

探测知识边界:一种用于深度知识提取的交互式智能体框架

Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Westlake University(西湖大学)

AI总结 提出一种交互式智能体框架,通过四种自适应探索策略和三级知识处理流水线,系统性地提取和量化大语言模型的知识,发现递归分类法最有效,并揭示了知识缩放定律、Pass@1与Pass@k的权衡以及训练数据对知识轮廓的影响。

Comments Homepage: https://ulab-uiuc.github.io/KnowledgeExtraction/

详情
AI中文摘要

大型语言模型(LLMs)可被视为压缩的知识库,但尚不清楚它们真正包含哪些知识以及其知识边界延伸多远。现有基准大多是静态的,对系统性知识探测的支持有限。本文提出一种交互式智能体框架,用于系统性地提取和量化LLMs的知识。我们的方法包括四种自适应探索策略,以不同粒度探测知识。为确保提取知识的质量,我们引入了一个三级知识处理流水线,结合基于向量的过滤以去除严格重复、基于LLM的裁决以解决模糊语义重叠,以及领域相关性审计以保留有效的知识单元。通过大量实验,我们发现递归分类法是最有效的探索策略。我们还观察到清晰的知识缩放定律,即更大的模型始终能恢复更多知识。此外,我们识别出Pass@1与Pass@k之间的权衡:领域专用模型初始准确率更高但退化迅速,而通用模型在长时间提取中保持稳定性能。最后,我们的结果表明,训练数据组成的差异导致不同模型家族具有独特且可测量的知识轮廓,反映了预训练如何塑造每个模型的参数化知识。

英文摘要

Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundary extends. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularity. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove strict duplicates, LLM-based adjudication to resolve ambiguous semantic overlap, and domain relevance auditing to retain valid knowledge units. Through extensive experiments, we find that Recursive Taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently recover more knowledge. In addition, we identify a Pass@1 versus Pass@k trade-off: domain-specialized models achieve higher initial accuracy but experience rapid degradation, while general-purpose models maintain stable performance over extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families, reflecting how pretraining shapes each model's parametric knowledge.
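The first stage of the processing pipeline, vector-based filtering of strict duplicates, can be sketched as a greedy cosine-similarity filter. The embeddings, the 0.95 threshold, and the names `cosine` and `filter_duplicates` below are assumptions made for illustration; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def filter_duplicates(items, vectors, threshold=0.95):
    """Keep an item only if it is not too similar to any item kept so far."""
    kept = []
    for item, vec in zip(items, vectors):
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((item, vec))
    return [item for item, _ in kept]


facts = ["Paris is the capital of France",
         "France's capital is Paris",
         "Water boils at 100 C at sea level"]
# Hypothetical 2-D embeddings standing in for a real embedding model.
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(filter_duplicates(facts, embeddings))
```

Pairs that fall just below the threshold are the ambiguous cases the abstract hands to the later LLM-based adjudication stage.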

2602.00827 2026-05-27 cs.LG stat.ML 版本更新

Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization

过度对齐 vs 过拟合:特征学习强度在泛化中的作用

Taesun Yeom, Taehyeok Ha, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科技大学(POSTECH))

AI总结 本文通过实验和理论分析,揭示了深度网络中特征学习强度存在最优值,过大导致过度对齐、过小导致过拟合,从而影响泛化性能。

Comments ICML 2026

详情
AI中文摘要

特征学习强度(FLS),即模型有效输出缩放的倒数,在塑造神经网络的优化动态中起着关键作用。尽管其影响已在渐近区域(训练时间和FLS)得到广泛研究,但现有理论对FLS如何影响实际设置中的泛化(例如,当训练在达到目标训练风险时停止)提供的见解有限。在这项工作中,我们研究了在实际条件下FLS对深度网络泛化的影响。通过实证研究,我们首先发现了一个$\textit{最优FLS}$的存在——既不太小也不太大——它能带来显著的泛化收益。这一发现与更强的特征学习普遍改善泛化的主流直觉相悖。为了解释这一现象,我们开发了对使用逻辑损失训练的两层ReLU网络中的梯度流动力学的理论分析,其中FLS通过初始化尺度控制。我们的主要理论结果建立了最优FLS的存在性,它源于两种竞争效应之间的权衡:过大的FLS会导致$\textit{过度对齐}$现象,降低泛化性能,而过小的FLS则会导致$\textit{过拟合}$。

英文摘要

Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes -- both in training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.

2502.03946 2026-05-27 cs.LG 版本更新

CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

CleanSurvival:使用强化学习为时间事件模型自动数据预处理

Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer

发表机构 * German University in Cairo(开罗德国大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) University of Saarland(萨尔大学) University of Kaiserslautern–Landau (RPTU)(凯撒斯劳滕-兰道大学(RPTU))

AI总结 提出基于强化学习的CleanSurvival框架,自动优化生存分析的数据预处理流程,提升Cox、随机森林、神经网络等时间事件模型的预测性能。

Comments Resubmitted after Peer Review Feedback to BMC Medical Informatics and Decision Making

详情
AI中文摘要

在机器学习中,数据预处理往往被忽视,尽管它对模型性能有潜在的重大影响。虽然自动化机器学习管道开始认识到并将数据预处理集成到分类和回归任务的解决方案中,但对于更专业的任务(如针对删失数据的时间事件模型)却缺乏这种集成。因此,生存分析不仅面临数据预处理的一般挑战,还缺乏针对性的自动化解决方案。为填补这一空白,本文提出了CleanSurvival,一种基于强化学习的解决方案,用于优化预处理流程,并专门扩展到生存分析。该框架可处理连续和分类变量。它基于Learn2Clean的Q学习,选择数据插补、异常值检测和特征提取技术的组合,以针对Cox、随机森林、神经网络或用户提供的时间事件模型实现最佳性能。Python包可在GitHub上获取:https://github.com/datasciapps/CleanSurvival。在真实世界数据集上的实验基准表明,基于Q学习的数据预处理相对于简单基线可以提高预测性能,而运行时行为依赖于条件,在覆盖最好的基准单元中最清晰可解释。此外,模拟研究证明了在不同类型和水平的缺失和噪声下的有效性。随着机器学习的使用增加,将AutoML管道推广到包括生存分析在内的各种模型变得重要。像CleanSurvival这样集成生存分析预处理的工具,可以使生存研究更容易、更快速地进行,并使结果更稳健。

英文摘要

Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like time-to-event models for censored data. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents CleanSurvival, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables. It builds upon Learn2Clean's Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The Python package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing can improve predictive performance relative to simple baselines, while runtime behavior is condition-dependent and most clearly interpretable in the best-covered benchmark cells. Furthermore, a simulation study demonstrates effectiveness across different types and levels of missingness and noise. With an increase in the use of machine learning, it becomes important to generalise AutoML pipelines to a variety of models now present, including survival analysis. Tools like CleanSurvival, which integrate preprocessing for survival analysis, can make survival studies easier and quicker to perform, as well as make the results more robust.

2601.22648 2026-05-27 cs.AI cs.LG 版本更新

UCPO: Uncertainty-Aware Policy Optimization

UCPO:不确定性感知策略优化

Xianzhou Zeng, Jing Huang, Chunmei Xie, Gongrui Nan, Siye Chen, Mengyu Lu, Weiqi Xiong, Qixuan Zhou, Junhao Zhang, Qiang Zhu, Yadong Li, Xingzhong Xu

AI总结 针对现有强化学习范式在不确定性奖励下存在的优势偏差和过度自信问题,提出三元优势解耦和动态不确定性奖励调整机制,显著提升模型在知识边界外的可靠性。

Comments Accepted by ICML 2026

详情
AI中文摘要

构建可信赖的大语言模型的关键在于赋予其内在的不确定性表达能力,从而减轻高风险应用中的过度自信错误。然而,现有的强化学习范式(如GRPO)由于二元决策空间和静态不确定性奖励,常常遭受优势偏差,导致过度保守或过度自信。为了解决这一挑战,本文揭示了当前结合不确定性奖励的强化学习范式中奖励破解和过度自信的根本原因,并在此基础上提出了不确定性感知策略优化(UCPO)框架。UCPO采用三元优势解耦来分离并独立归一化确定性和不确定性轨迹,从而消除优势偏差。此外,动态不确定性奖励调整机制根据模型演化和实例难度实时调整不确定性权重。在数学推理和通用任务上的实验结果表明,UCPO有效解决了奖励不平衡问题,显著提高了模型在知识边界外的可靠性。

英文摘要

The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism adapts uncertainty weights in real-time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly improving the reliability of the model beyond their knowledge boundaries.
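The core of the decoupling idea, normalizing deterministic and uncertain rollouts separately so that one group's reward scale cannot bias the other's advantages, can be sketched as per-group standardization. The abstract does not spell out the three cases behind "Ternary", so the two-way split, the zero-variance handling, and the name `decoupled_advantages` below are assumptions.

```python
from statistics import mean, pstdev


def decoupled_advantages(rewards, is_uncertain):
    """Standardize rewards within each group (deterministic vs. uncertain)
    instead of across the whole batch of rollouts."""
    advantages = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, u in enumerate(is_uncertain) if u == flag]
        if not idx:
            continue  # no rollouts of this kind in the batch
        group = [rewards[i] for i in idx]
        mu, sigma = mean(group), pstdev(group)
        for i in idx:
            # A group with identical rewards carries no learning signal.
            advantages[i] = (rewards[i] - mu) / sigma if sigma > 0 else 0.0
    return advantages


rewards = [1.0, 0.0, 0.5, 0.5]
uncertain = [False, False, True, True]
print(decoupled_advantages(rewards, uncertain))
```

With a single batch-wide normalization, the constant uncertainty reward of 0.5 would sit exactly at the batch mean here and could shift the advantages of the answered rollouts whenever the mix of groups changes; normalizing per group avoids that coupling.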

2601.22384 2026-05-27 cs.LG cs.AI 版本更新

Graph is a Substrate Across Data Modalities

图是跨数据模态的基板

Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, Chuxu Zhang

发表机构 * University of Connecticut(康涅狄格大学) University of Notre Dame(圣母大学) National University of Singapore(新加坡国立大学)

AI总结 提出G-Substrate框架,通过统一结构模式和交错角色训练策略,使图结构作为共享基板跨模态和任务积累,优于孤立和朴素多任务方法。

Comments Graph structure across data modalities, accepted by ICML26

详情
AI中文摘要

图提供了跨不同领域出现的自然关系结构表示。尽管无处不在,图结构通常以模态和任务隔离的方式学习,即在单个任务上下文中构建图表示,然后丢弃。因此,跨模态和任务的结构规律被反复重建,而不是在中间图表示级别积累。这引发了一个表示学习问题:如何组织图结构,使其能够跨异构模态和任务持久存在并积累?我们采用以表示为中心的视角,将图结构视为跨学习上下文持久存在的结构基板。为了实例化这一视角,我们提出了G-Substrate,一个围绕共享图结构组织学习的图基板框架。G-Substrate包含两个互补机制:一个统一的结构模式,确保跨异构模态和任务的图表示兼容性;以及一个交错基于角色的训练策略,在学习过程中将同一图结构暴露给多个功能角色。跨多个领域、模态和任务的实验表明,G-Substrate优于任务隔离和朴素多任务学习方法。代码库、模型和数据集可在https://github.com/zmli6/G-Substrate获取。

英文摘要

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods. The codebase, model, and datasets are available at https://github.com/zmli6/G-Substrate.

2509.21906 2026-05-27 math.ST cs.LG stat.ML stat.TH 版本更新

Error Analysis of Discrete Flow with Generator Matching

生成器匹配的离散流误差分析

Zhengyan Wan, Yidong Ouyang, Qiang Yao, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng

Affiliations * School of Statistics, East China Normal University; Department of Statistics and Data Science, University of California, Los Angeles; Department of Industrial and Systems Engineering, University of Minnesota; School of Data Science, The Chinese University of Hong Kong, Shenzhen

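AI Summary Building on stochastic calculus, this paper uses a Girsanov-type theorem to analyze the convergence of discrete flow models in a unified way, derives non-asymptotic error bounds covering transition-rate estimation error and early-stopping error, and provides the first error analysis for discrete flow models.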

详情
AI中文摘要

离散流模型为学习离散状态空间上的分布提供了强大的框架,并且与离散扩散模型相比表现出更优的性能。然而,它们的收敛性质和误差分析仍然在很大程度上未被探索。在这项工作中,我们开发了一个基于随机微积分理论的统一框架,以系统地研究离散流模型的理论性质。具体来说,通过利用两个连续时间马尔可夫链(CTMC)路径测度的Girsanov型定理,我们提出了一个全面的误差分析,该分析同时考虑了转移率估计误差和提前停止误差。实际上,现有工作中很少关注转移率的估计误差。与离散扩散模型不同,离散流不会因在噪声过程中截断时间范围而产生初始化误差。基于生成器匹配和均匀化,我们在没有对Oracle转移率施加有界性条件的情况下,建立了分布估计的非渐近误差界。此外,我们推导了在有界性条件下估计分布的总变差收敛的更快速率,得到了关于样本量的近乎最优的速率。我们的结果为离散流模型提供了首次误差分析。我们还基于模拟结果研究了不同设置下的模型性能。

英文摘要

Discrete flow models offer a powerful framework for learning distributions over discrete state spaces and have demonstrated superior performance compared to discrete diffusion models. However, their convergence properties and error analysis remain largely unexplored. In this work, we develop a unified framework grounded in stochastic calculus theory to systematically investigate the theoretical properties of discrete flow models. Specifically, by leveraging a Girsanov-type theorem for the path measures of two continuous-time Markov chains (CTMCs), we present a comprehensive error analysis that accounts for both transition rate estimation error and early stopping error. In fact, the estimation error of transition rates has received little attention in existing work. Unlike discrete diffusion models, discrete flow incurs no initialization error caused by truncating the time horizon in the noising process. Building on generator matching and uniformization, we establish non-asymptotic error bounds for distribution estimation without the boundedness condition on oracle transition rates. Furthermore, we derive a faster rate of total variation convergence for the estimated distribution under the boundedness condition, yielding a nearly optimal rate in terms of sample size. Our results provide the first error analysis for discrete flow models. We also investigate model performance under different settings based on simulation results.

2601.21845 2026-05-27 cs.LG 版本更新

Constrained Meta Reinforcement Learning with Provable Test-Time Safety

具有可证明测试时安全性的约束元强化学习

Tingting Ni, Maryam Kamgarpour

Affiliations * Sycamore Lab, EPFL, Lausanne, Switzerland

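AI Summary Proposes a constrained meta reinforcement learning algorithm that learns a near-optimal policy on test tasks with provable safety and sample complexity guarantees, and proves a sample complexity lower bound.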

详情
AI中文摘要

元强化学习允许智能体利用在可随意训练的任务分布上的经验,从而在新测试任务上更快地学习最优策略。尽管在提高测试任务样本复杂度方面取得了成功,但许多实际应用(如机器人和医疗保健)在测试期间施加了安全约束。约束元强化学习为将安全性整合到元强化学习中提供了一个有前景的框架。约束元强化学习中的一个开放问题是如何确保策略在真实世界测试任务上的安全性,同时降低样本复杂度,从而更快地学习最优策略。为了解决这一差距,我们提出了一种算法,该算法精炼训练期间学到的策略,具有可证明的安全性和样本复杂度保证,用于在测试任务上学习近似最优策略。我们进一步推导了一个匹配的下界,表明该样本复杂度是紧的。

英文摘要

Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Although meta RL has succeeded in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task while reducing the sample complexity and thus enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near-optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.

2601.21789 2026-05-27 cs.LG cs.AI stat.ML 版本更新

ECSEL: Explainable Classification via Signomial Equation Learning

ECSEL: 通过符号方程学习的可解释分类

Adia Lumadjeng, Ilker Birbil, Erman Acar

Affiliations * Amsterdam Business School, University of Amsterdam, Amsterdam, the Netherlands; Institute for Informatics, University of Amsterdam; Institute for Logic, Language and Computation, University of Amsterdam

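AI Summary Proposes ECSEL, which achieves explainable classification by learning closed-form expressions in the form of signomial equations; on symbolic regression benchmarks it recovers more target equations with less computation, while maintaining classification accuracy and interpretability.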

Comments 9 pages, 4 figures, accepted at ICML 2026

详情
AI中文摘要

我们引入ECSEL,一种可解释的分类方法,它学习形如符号方程的正式表达式,其动机是观察到许多符号回归基准具有紧凑的符号结构。ECSEL直接构建一个结构化的闭式表达式,同时作为分类器和解释。在标准符号回归基准上,我们的方法比竞争的最新方法恢复更大比例的目标方程,同时需要更少的计算。利用这种效率,ECSEL在不牺牲可解释性的情况下实现了与已建立的机器学习模型竞争的分类精度。此外,我们展示了ECSEL在全局特征行为、决策边界分析和局部特征归因方面满足一些理想性质。在基准数据集和两个真实世界案例研究(即电子商务和欺诈检测)上的实验表明,学习到的方程暴露了数据集偏差,支持反事实推理,并产生可操作的见解。

英文摘要

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies, i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
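
For readers unfamiliar with the term, a signomial is a sum of terms of the form c_k * prod_i x_i^(a_ki), with real coefficients and real exponents over strictly positive inputs. The sketch below only evaluates such an expression and thresholds it at zero. The decision rule, the function names, and the example coefficients are illustrative assumptions; they are not ECSEL's learning procedure or its actual classifier.

```python
from math import prod

def signomial(x, coeffs, exponents):
    """Evaluate sum_k coeffs[k] * prod_i x[i] ** exponents[k][i] for positive x."""
    if any(v <= 0 for v in x):
        raise ValueError("this sketch assumes strictly positive inputs")
    return sum(
        c * prod(v ** a for v, a in zip(x, row))
        for c, row in zip(coeffs, exponents)
    )

def classify(x, coeffs, exponents):
    """Hypothetical decision rule: class 1 if the signomial is positive, else 0."""
    return 1 if signomial(x, coeffs, exponents) > 0 else 0

# Made-up example model: f(x) = 2 * x0 * x1**0.5 - x1**2
COEFFS = [2.0, -1.0]
EXPONENTS = [[1.0, 0.5], [0.0, 2.0]]
```

Because every term is a product of powers, each feature's influence can be read off directly from its exponents and the sign of the coefficient, which is the kind of closed-form transparency the abstract emphasizes.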

2511.16870 2026-05-27 cs.CV cs.LG 版本更新

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

对齐与反转:通过表示对齐解决扩散和流模型中的逆问题

Loukas Sfountouris, Giannis Daras, Paris Giampouras

Affiliations * Massachusetts Institute of Technology

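AI Summary Proposes aligning the internal representations of diffusion or flow-based models with a pretrained self-supervised encoder (DINOv2) via representation alignment (REPA) to guide inverse-problem reconstruction at inference time, substantially improving reconstruction quality and perceptual realism.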

详情
AI中文摘要

最近研究表明,强制扩散或流生成模型的内部表示与预训练自监督编码器的表示对齐,提供了强大的归纳偏置,改善了收敛性和样本质量。在这项工作中,我们将这一思想扩展到逆问题,其中预训练生成模型被用作先验。我们提出在扩散或流模型与DINOv2视觉编码器之间应用表示对齐(REPA),以在推理时指导重建过程。尽管逆问题中无法获得真实信号,但我们实验表明,对齐模型对近似目标特征的表示可以显著提升重建质量和感知真实感。我们提供了理论结果,显示(a) REPA正则化可以视为在DINOv2嵌入空间中最小化散度度量的变分方法,(b) 在一定的正则性假设下,REPA更新将潜在扩散状态引导向干净图像的状态。这些结果揭示了REPA在提升感知保真度中的作用。最后,我们通过将REPA集成到多个最先进的逆问题求解器中证明了方法的通用性,并在超分辨率、框内补全、高斯去模糊和运动去模糊上进行了大量实验,证实我们的方法一致地改善了重建质量,同时通过减少所需的离散化步骤数提高了效率。

英文摘要

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations with approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how, under certain regularity assumptions, REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, confirming that our method consistently improves reconstruction quality while also providing efficiency gains by reducing the number of required discretization steps.

2601.20796 2026-05-27 cs.CL cs.LG 版本更新

Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

解析多模态上下文学习:现代Transformer中的模态不对称性与电路动力学

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

Affiliations * Technical University of Munich; Munich Center for Machine Learning; DeepMind; Beijing University of Posts and Telecommunications

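AI Summary Through controlled experiments, studies the basic mechanisms of multimodal in-context learning in modern Transformers, finds a learning asymmetry between modalities, and reveals the induction-style circuit mechanism behind it.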

Comments ICML 2026 Spotlight

详情
AI中文摘要

基于Transformer的多模态大语言模型通常展现出上下文学习(ICL)能力。受此现象启发,我们提出疑问:Transformer如何从上下文示例中跨模态关联信息?我们通过在合成分类任务上训练的小型Transformer进行可控实验来研究这一问题,从而能够精确操控数据统计和模型架构。我们首先重新审视现代Transformer中单模态ICL的核心原理。虽然多个先前发现得以复现,但我们发现旋转位置编码(RoPE)提高了ICL的数据复杂度阈值。扩展到多模态设置揭示了一个基本的学习不对称性:当在来自主要模态的高多样性数据上预训练时,次要模态中令人惊讶的低数据复杂度就足以使多模态ICL出现。机制分析表明,两种设置都依赖于一种归纳式机制,该机制从匹配的上下文示例中复制标签;多模态训练则跨模态细化和扩展这些电路。我们的发现为理解现代Transformer中的多模态ICL提供了机制基础,并为未来研究引入了一个可控的测试平台。代码可在 https://github.com/YiranHuangIrene/multimodal-icl 获取。

英文摘要

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation. Code is available at: https://github.com/YiranHuangIrene/multimodal-icl

2508.02806 2026-05-27 cs.CV cs.LG 版本更新

PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

PyCAT4: 基于层次化视觉Transformer的3D人体姿态估计框架

Zongyou Yang, Jonathan Loo, Yinghan Hou

Affiliations * Department of Computer Science; University College London; School of Electronic Engineering; Queen Mary University of London; Department of Earth Science; Imperial College London

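AI Summary Proposes the PyCAT4 framework, which optimizes the Pymaf network by introducing a self-attention-based Transformer feature extraction layer, feature temporal fusion, and a spatial pyramid structure, significantly improving detection capability for 3D human pose estimation on the COCO and 3DPW datasets.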

Comments 10 pages, 20 figures

详情
AI中文摘要

近年来,通过将卷积神经网络与金字塔网格对齐反馈循环相结合,3D人体姿态估计的准确性得到了显著提升。此外,基于Transformer的时间分析架构的采用在计算机视觉领域取得了创新性突破。鉴于这些进展,本研究旨在深度优化和改进现有的Pymaf网络架构。本文的主要创新包括:(1) 引入基于自注意力机制的Transformer特征提取网络层,以增强对低级特征的捕获;(2) 通过特征时间融合技术增强对视频序列中时间信号的理解和捕获;(3) 实现空间金字塔结构以实现多尺度特征融合,有效平衡不同尺度下的特征表示差异。本研究得到的新PyCAT4模型在COCO和3DPW数据集上进行了实验验证。结果表明,所提出的改进策略显著提升了网络在人体姿态估计中的检测能力,进一步推动了人体姿态估计技术的发展。

英文摘要

Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing differences in feature representations across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.

2512.01556 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

Affiliations * University of Electronic Science and Technology of China; Munich Center for Machine Learning; University of North Carolina at Chapel Hill; Shandong University; Tongji University; City University of Hong Kong

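AI Summary Proposes the LEC framework, which recasts selective prediction as a decision problem through a linear expectation constraint, uses a calibration set under an exchangeability assumption to compute a risk-constrained, retention-maximizing threshold, and extends to two-model routing systems to achieve selection-conditioned error control.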

Comments Accepted by ICML 2026 Regular

详情
AI中文摘要

基础模型常常生成不可靠的答案,而启发式不确定性估计器无法完全区分正确与错误输出,导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题,旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此,我们提出了LEC,一个原则性框架,将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率,这对应于选择条件下的边际错误概率。在可交换性下,我们推导出一个仅依赖于保留校准集的有限样本充分条件,从而能够计算风险约束下的保留最大化阈值。此外,我们将LEC扩展到双模型路由系统:如果主模型的不确定性超过其校准阈值,则输入被委托给后续模型,同时保持系统级的选择条件误差控制。在封闭式和开放式问答(QA)以及视觉问答(VQA)上的实验表明,LEC在接受的预测中维持了规定的风险水平,并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.
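
To make the linear constraint concrete: requiring the ratio of accepted errors to accepted predictions to be at most alpha is equivalent to requiring the sum of (error - alpha) over accepted items to be non-positive. The sketch below applies that check empirically on a held-out calibration set and returns the largest uncertainty threshold that satisfies it. It is a plain empirical version under stated assumptions; the function name is hypothetical, and the paper's finite-sample correction under exchangeability is not reproduced here.

```python
def calibrate_threshold(uncertainties, errors, alpha):
    """Largest threshold t such that calibration items with uncertainty <= t
    satisfy sum(error - alpha) <= 0, i.e. wrong <= alpha * accepted.

    uncertainties: list of floats, lower means more confident.
    errors: list of 0/1 flags, 1 if the model's answer was wrong.
    Returns None if no threshold meets the risk level.
    """
    pairs = sorted(zip(uncertainties, errors))
    best = None
    accepted = 0
    wrong = 0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Accept every item tied at this uncertainty value together.
        while i < len(pairs) and pairs[i][0] == t:
            accepted += 1
            wrong += pairs[i][1]
            i += 1
        # Linear form of the ratio constraint wrong / accepted <= alpha.
        if wrong <= alpha * accepted:
            best = t
    return best
```

Scanning all candidate thresholds, rather than stopping at the first violation, matters because the empirical error rate is not monotone in the threshold; keeping the largest feasible value is what maximizes retention. In a routing setup as described above, inputs whose uncertainty exceeds the returned threshold would be handed to the second model.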

2508.03774 2026-05-27 cs.LG cs.AI 版本更新

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

用于三维PEC目标微波散射分析的物理信息分层神经网络

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Wenbo Wang

Affiliations * Key Laboratory of Universal Wireless Communication, Ministry of Education, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications; Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

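AI Summary Proposes a U-shaped physics-informed neural network (U-PINet) that combines a near-field graph encoder with an octree-based hierarchical multi-scale fusion module and is trained on the residual of the electric-field integral equation, enabling efficient and accurate microwave scattering analysis of 3D PEC targets.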

Comments Submitted to an IEEE Journal

详情
AI中文摘要

在微波频率下精确建模三维完美电导体(PEC)目标的散射是计算电磁学的一个基本目标,特别是在雷达截面(RCS)预测和微波散射分析中。经典求解器,如矩量法和多层快速多极子算法(MLFMA),虽然提供高物理保真度,但在涉及多次入射配置或频率的重复查询场景下变得昂贵,而纯数据驱动的代理模型通常在几何复杂目标上缺乏准确性。本文提出一种U形物理信息人工神经网络(U-PINet)用于三维微波散射分析。受MLFMA的近远场分解启发,U-PINet结合了由可学习单变量基函数参数化的近场图编码器,以及在八叉树分区上组织的分层多尺度融合模块。所提出的网络在表面配置点处针对电场积分方程的离散残差进行训练,无需参考电流标签。在多个频率和极化配置下,对典型和几何复杂的三维PEC目标进行的实验,并通过双站RCS重建评估,表明U-PINet优于代表性的物理信息基线,并在重复查询场景下相比经典MLFMA求解器实现了显著的运行时间节省。

英文摘要

Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies constitutes a fundamental objective in computational electromagnetics, particularly for radar cross section (RCS) prediction and microwave scattering analysis. Classical solvers, such as the method of moments and the Multilevel Fast Multipole Algorithm (MLFMA), provide high physical fidelity but become costly under scenarios of repeated queries involving many incidence configurations or frequencies, whereas purely data-driven surrogates often lack accuracy on geometrically complex targets. This paper proposes a U-shaped physics-informed artificial neural network (U-PINet) for 3D microwave scattering analysis. Inspired by the near-far field decomposition of MLFMA, U-PINet combines a near-field graph encoder, parameterized by learnable univariate basis functions, with a hierarchical multi-scale fusion module organized on an octree partition. The proposed network is trained against a discretized residual of the electric-field integral equation at surface collocation points, without requiring reference current labels. Experiments on canonical and geometrically complex 3D PEC targets, conducted under multiple frequency and polarization configurations and assessed through bistatic RCS reconstruction, show that U-PINet outperforms representative physics-informed baselines and yields substantial runtime savings over the classical MLFMA solver under repeated-query scenarios.

2601.12809 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

CLIP风格视觉语言模型在合成空间关系数据训练中的左右对称性破缺

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

Affiliations * InfoTech, Toyota Motor Corporation

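AI Summary Using a controllable 1D image-text testbed, studies how Transformer-based vision-language encoders trained with CLIP-style contrastive learning develop left-right relational understanding through interactions between positional and token embeddings, and finds that label diversity matters more than layout diversity.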

Comments Accepted at ICML 2026

详情
AI中文摘要

空间理解仍然是视觉语言模型中的一个关键挑战。然而,这种理解是否真正获得,如果是,通过什么机制,目前尚不清楚。我们提出了一个可控的一维图像文本测试平台,以探究在基于Transformer的视觉和文本编码器中,使用CLIP风格的对比目标训练时,左右关系理解是如何出现的。我们在单对象和双对象场景的配对描述上端到端地训练轻量级基于Transformer的视觉和文本编码器,并评估对未见对象对的泛化能力,同时系统性地改变标签和布局多样性。我们发现对比训练学习了左右关系,并且标签多样性(而非布局多样性)是这种情况下泛化的主要驱动因素。为了获得机制性理解,我们进行了注意力分解,并表明位置嵌入和标记嵌入之间的相互作用导致了水平注意力梯度,从而打破了编码器中的左右对称性;消除这一贡献会显著降低左右辨别能力。我们的结果提供了关于CLIP风格模型何时以及如何获得关系能力的机制性见解。

英文摘要

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain a mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.

2601.08146 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

超越迁移准确率:用于受控低资源适应的忠实电路

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

Affiliations * Monash University Indonesia; Institut Teknologi Bandung; MBZUAI; Boston University

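AI Summary Adapts Contextual Decomposition for Transformers (CD-T) to circuit discovery, using label-balanced activation means and task-directional relevance scoring to find circuits without counterfactuals, and applies Circuit-Targeted Supervised Fine-Tuning (CT-SFT) to minimize catastrophic forgetting in low-resource cross-lingual sentiment transfer, offering a safer alternative to global fine-tuning.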

详情
AI中文摘要

现有的电路发现方法依赖于具有干净反事实的模板化任务,限制了它们在多样化自然文本上的使用。我们通过标签平衡激活均值和任务方向相关性评分,将上下文分解方法适配到非结构化设置(CD-T),实现了无反事实的电路发现。我们利用这些电路进行电路目标监督微调(CT-SFT),将参数更新限制在任务相关的注意力头和层归一化上。在NusaX跨语言情感迁移上的实验表明,CT-SFT在低资源适应中极具竞争力。虽然非电路稀疏更新和全微调有时通过能力招募达到目标准确率,但CT-SFT独特地最小化灾难性遗忘,保留了源语言和相关任务的性能。在XNLI上的扩展证实了这些发现在更广泛的任务和模型家族中成立,表明电路目标适应提供了一种更安全、基于因果关系的全局微调替代方案。

英文摘要

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.

2601.11334 2026-05-27 cs.IT cs.LG math.IT 版本更新

Information Theoretic Perspective on Representation Learning

表示学习的信息论视角

Deborah Pereg, Michael Wand

Affiliations * Scuola Universitaria Professionale della Svizzera Italiana (SUPSI); Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA)

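AI Summary Proposes an information-theoretic framework for analyzing last-layer embedding representations in regression tasks, defines representation rate, representation capacity, and representation rate-distortion, and derives the achievable capacity and representation rate together with their converses.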

详情
AI中文摘要

引入了一个信息论框架来分析最后一层嵌入,重点关注回归任务的学习表示。我们定义了表示率,并推导了输入-输出信息可以可靠地表示为输入源熵所固有决定的程度。我们进一步定义了扰动设置下的表示容量,以及压缩输出的表示率失真。我们推导了可达容量、可达表示率及其逆命题。最后,我们将结果统一在一个框架中。

英文摘要

An information-theoretic framework is introduced to analyze last-layer embeddings, focusing on learned representations for regression tasks. We define representation-rate and derive limits on how reliably input-output information can be represented, as inherently determined by the input-source entropy. We further define representation capacity in a perturbed setting, and representation rate-distortion for a compressed output. We derive the achievable capacity, the achievable representation-rate, and their converses. Finally, we combine the results in a unified setting.

2512.01572 2026-05-27 cs.LG cs.AI physics.app-ph 版本更新

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

使用自编码器-扩散级联从极度稀疏测量中重建多尺度物理场

Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

Affiliations * Internet of Things Thrust; Intelligent Transportation Thrust; Marine Hydrodynamic Research Facility; Department of Civil and Environmental Engineering

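AI Summary Proposes the Cascaded Sensing framework, which cascades a coarse-scale deterministic estimate with a fine-scale conditional diffusion model to address the ill-posedness and multimodal posterior of physical field reconstruction from extremely sparse measurements.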

Comments 34 pages, 22 figures

详情
AI中文摘要

极端传感器稀疏性使得全场重建成为科学传感中一个根本性的不适定问题,其目标是从稀疏测量中推断物理场。在此情况下,后验严重欠约束且固有地多模态,使其近似高度病态。具体而言,确定性映射会坍塌不确定性,直接条件学习无法覆盖可能的观测条件解空间,而似然引导采样对噪声和传感器配置高度敏感。这些限制导致后验估计不稳定,并突显了以结构化方式建模不确定性的必要性。为此,我们提出了Cascaded Sensing,一个跨尺度重构后验推理的分层框架。Cas-Sensing不直接建模全场后验,而是首先通过确定性粗阶段估计器解决全局结构模糊性。一个基于神经算子的功能自编码器,使用掩码输入训练,将稀疏观测映射到粗尺度结构场,其作用类似于最大后验估计器,选择主导全局配置。该结构锚点固定了后验的主要自由度,并将问题转化为一个条件更好的残差推理任务。然后,一个条件扩散模型仅学习细化尺度的残差分布,将采样限制在合理解的稳定邻域内,并抑制观测一致模式之间的竞争。为了增强在不同传感条件下的鲁棒性,我们引入了掩码级联训练,通过中间粗重建使模型暴露于多样的稀疏观测模式。在推理过程中,流形约束引导将观测一致性作为细化机制而非全局模式选择过程来实施。

英文摘要

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer physical fields from sparse measurements. In this regime, the posterior is severely underconstrained and inherently multimodal, making its approximation highly ill-conditioned. Specifically, deterministic mappings collapse uncertainty, direct conditional learning cannot cover the space of possible observation-conditioned solutions, and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations. These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner. To this end, we propose Cascaded Sensing, a hierarchical framework that restructures posterior inference across scales. Rather than modeling the full-field posterior directly, Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator. A neural-operator-based functional autoencoder, trained with masked inputs, maps sparse observations to a coarse-scale structural field, acting analogously to a maximum a posteriori estimator that selects the dominant global configuration. This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task. A conditional diffusion model then learns only the refined-scale residual distribution, confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes. To enhance robustness under varying sensing conditions, we introduce mask-cascade training, which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions. During inference, manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.

2601.03525 2026-05-27 cs.LG cs.AI 版本更新

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

超越二元:将部分成功转化为代码生成中强化学习的密集可验证奖励

Longwen Wang, Yirui Liu, Xuan'er Wu, Xiaohui Hu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

Affiliations * Institute of Artificial Intelligence, China Telecom (TeleAI); Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University

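AI Summary Proposes the VeRPO framework, which uses partial success on code tests as a verifiable dense reward, corrects cardinality bias with a dynamic, density-calibrated local reward, and combines it with global execution outcomes to improve reinforcement learning for code generation.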

详情
AI中文摘要

有效的奖励设计是代码生成强化学习(RL)中的核心挑战。主流的测试套件级结果奖励强制执行功能正确性但导致稀疏性,而外部奖励模型(RM)提供密集监督但代价是错位和额外开销。由于代码评估自然产生多个测试用例级结果,部分成功(即通过部分测试用例)提供了内在的、可验证的密集监督来源。在本文中,我们提出VeRPO(可验证密集奖励策略优化),一个系统地将可验证的部分成功转化为可靠密集奖励的RL框架。我们使用加权和公式分析部分成功奖励,理论上识别出一个关键的基数偏差,导致策略更新不成比例地偏向于从简单测试成功中获益,而非在前沿测试上取得进展。基于此,VeRPO引入了一个动态的、密度校准的局部奖励,明确纠正这种偏差,并从部分成功中提供稳健的密集监督。为了增强与端到端功能正确性的一致性,VeRPO进一步将局部密集奖励与全局执行结果相结合。在多种基准和设置上的大量实验表明,VeRPO优于结果驱动和基于RM的基线,实现了高达+8.83 pass@1的提升,且时间成本可忽略不计(<0.02%),GPU内存开销为零。

英文摘要

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense Reward Policy Optimization), an RL framework that systematically turns verifiable partial success into reliable dense rewards. We analyze partial-success rewards using a weighted sum formulation, theoretically identifying a critical cardinality bias that causes policy updates to disproportionately favor gains from easy-test successes over progress on frontier tests. Based on this, VeRPO introduces a dynamic, density-calibrated local reward that explicitly corrects this bias and provides robust dense supervision from partial success. To enhance alignment with end-to-end functional correctness, VeRPO further integrates the local dense reward with global execution outcomes. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO outperforms outcome-driven and RM-based baselines, achieving up to +8.83 pass@1 gain with negligible time cost (< 0.02%) and zero GPU memory overhead.
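
The general shape of a weighted-sum partial-success reward combined with a global outcome term can be sketched as follows. The abstract does not give VeRPO's density calibration, so the weighting used here (tests that fewer sampled rollouts pass receive more weight) is a stand-in assumption, and the function name and the `global_bonus` parameter are hypothetical.

```python
def dense_rewards(pass_matrix, global_bonus=1.0):
    """Compute one reward per rollout from test-case-level outcomes.

    pass_matrix[r][t] is 1 if rollout r passed test t, else 0.
    All rollouts are assumed to be samples for the same problem.
    """
    n_rollouts = len(pass_matrix)
    n_tests = len(pass_matrix[0])
    # Fraction of rollouts that pass each test.
    pass_rate = [
        sum(row[t] for row in pass_matrix) / n_rollouts for t in range(n_tests)
    ]
    # Assumed weighting: rarely passed ("frontier") tests count more,
    # so many easy passes cannot dominate the reward.
    raw = [1.0 - p for p in pass_rate]
    total = sum(raw)
    if total == 0:
        weights = [1.0 / n_tests] * n_tests  # every rollout passes every test
    else:
        weights = [w / total for w in raw]
    rewards = []
    for row in pass_matrix:
        local = sum(w * passed for w, passed in zip(weights, row))
        bonus = global_bonus if all(row) else 0.0  # full test-suite success
        rewards.append(local + bonus)
    return rewards
```

With uniform weights, a rollout that passes only the test everyone passes would still earn a third of the local reward in a three-test suite; under the assumed weighting it earns nothing, which is one simple way to avoid favoring easy-test successes.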

2601.05028 2026-05-27 cs.LG 版本更新

Approximate Equivariance via Projection-based Regularisation

基于投影正则化的近似等变性

Torben Berndt, Jan Stühmer

Affiliations * Heidelberg Institute for Theoretical Studies; Karlsruhe Institute of Technology

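AI Summary Proposes a projection-based regularization method that decomposes linear layers into equivariant and non-equivariant components and penalizes the operator norm of the non-equivariant part, achieving efficient and exact approximate equivariance and outperforming sample-based methods on continuous groups such as SO(3).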

详情
AI中文摘要

等变性是神经网络中一种强大的归纳偏置,能够提高泛化能力和物理一致性。然而,最近非等变模型因其更好的运行时性能以及现实应用中可能出现的不完美对称性而重新受到关注。这推动了近似等变模型的发展,这些模型在尊重对称性和拟合数据分布之间取得了平衡。该领域现有的方法通常使用基于样本的正则化器,这些正则化器依赖于训练时的数据增强,导致较高的样本复杂度,特别是对于$SO(3)$等连续群。相反,本文通过基于投影的正则化器来处理近似等变性,该正则化器利用线性层到等变和非等变分量的正交分解。与现有方法不同,本文在算子层面上对整个群轨道上的非等变性进行惩罚,而不是逐点惩罚。我们提出了一个数学框架,用于在空间域和谱域中精确且高效地计算非等变性惩罚。在我们的实验中,我们的方法在模型性能和效率上始终优于先前的近似等变性方法,与基于样本的正则化器相比,实现了显著的运行时增益。

英文摘要

Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.
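
As a small illustration of the orthogonal decomposition mentioned above, consider a finite group rather than $SO(3)$: cyclic shifts acting on vectors of length n. The linear maps equivariant to these shifts are exactly the circulant matrices, and averaging a weight matrix over the group gives its orthogonal projection onto them. The sketch below penalizes the squared Frobenius norm of what is left over. The choice of group, the choice of norm, and the function names are assumptions made for illustration; the paper's exact penalty and its spatial and spectral computations for continuous groups are not reproduced.

```python
def project_equivariant(W):
    """Project a square matrix onto shift-equivariant (circulant) matrices
    by averaging W over all cyclic shifts of rows and columns together."""
    n = len(W)
    return [
        [sum(W[(i + g) % n][(j + g) % n] for g in range(n)) / n for j in range(n)]
        for i in range(n)
    ]

def non_equivariance_penalty(W):
    """Squared Frobenius norm of the non-equivariant component W - P(W)."""
    n = len(W)
    P = project_equivariant(W)
    return sum((W[i][j] - P[i][j]) ** 2 for i in range(n) for j in range(n))
```

The penalty depends only on the weights, so no augmented samples are needed to evaluate it, which is the contrast with sample-based regularisers that the abstract draws. It is zero exactly when the layer is already equivariant.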

2410.00995 2026-05-27 cs.LG 版本更新

CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence

CktGen: 基于生成式人工智能的自动化模拟电路设计

Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang

Affiliations * College of Computer Science and Technology; Australian Artificial Intelligence Institute; School of Aeronautics and Astronautics; School of Computing

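AI Summary Proposes CktGen, a conditional variational autoencoder-based method for analog circuit generation that decouples circuit and specification encoding and uses contrastive training and classifier guidance to generate valid circuits from target specifications, substantially outperforming existing methods.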

Comments Paper accepted by Engineering

Details
Abstract

The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.

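2601.03089 2026-05-27 cs.CL cs.AI cs.LG Updated version

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

Xin Huang, Antoni B. Chan

Affiliations: City University of Hong Kong

AI summary: Existing soft-perturbation faithfulness metrics are biased because attribution methods retain different numbers of words. The paper proposes the π-Soft-NC and π-Soft-NS framework, which compares attribution methods fairly by controlling the expected retaining probability, and introduces Grad-ELLM, a gradient-based attribution method designed for autoregressive decoder-only LLMs.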

Details
Abstract

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $π$-Soft-NC and $π$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $π$-Soft-NC, while there is no dominant method under $π$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.

2512.22666 2026-05-27 cs.CV cs.LG Updated version

INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

Affiliations: Clinic for Diagnostic and Interventional Radiology, University Hospital Bonn, Germany; Department of Ophthalmology, Friedrich-Alexander University Erlangen-Nürnberg, Germany; TUM School of Computation, Information and Technology, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany; Helmholtz AI, Helmholtz Center Munich, Germany

AI summary: Proposes INTERACT-CMIL, a multi-task deep learning framework that jointly predicts five histopathological axes through shared feature learning, combinatorial partial supervision, and an inter-task consistency loss. On a dataset of 486 conjunctival biopsy patches it achieves relative macro F1 gains of up to 55.1% over CNN and foundation-model baselines.

Details
Journal ref
IEEE ISBI 2026
Abstract

Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains of up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

2512.19332 2026-05-27 cs.LG cs.LO Updated version

A Logical View of GNN-Style Computation and the Role of Activation Functions

Pablo Barceló, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche

Affiliations: Institute for Mathematical and Computational Engineering; Pontifical Catholic University of Chile; IMFD; CENIA; Department of Computer Science, University of Antwerp; Institute for Logic and Computation, TU Wien; Data Science Institute, Universiteit Hasselt

AI summary: Studies the computational power of graph neural networks from a logical perspective by defining the language MPLang, focusing on how activation functions (in particular ReLU versus bounded activations) affect numerical and Boolean expressiveness. It gives the first proof that, in the presence of linear layers, ReLU is strictly more expressive for numerical queries than eventually constant activations.

Details
Abstract

We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power, both numerical and Boolean, and that this power subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.
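
To make "walk-summed features" concrete, the sketch below runs activation-free linear message passing (sum aggregation over neighbours) on a small graph. After k rounds each node holds the number of length-k walks that start there. It also defines ReLU and a truncated ReLU as one example of an eventually constant activation. The helper names are illustrative assumptions, and this is plain Python rather than MPLang syntax.

```python
def walk_counts(adj, length):
    """Number of walks of the given length starting at each node.

    adj maps each node to the list of its neighbours. One round of
    sum aggregation without any activation extends every walk by one edge.
    """
    counts = {v: 1 for v in adj}  # one walk of length 0 per node
    for _ in range(length):
        counts = {v: sum(counts[u] for u in adj[v]) for v in adj}
    return counts


def relu(x):
    """Unbounded activation: grows linearly for positive inputs."""
    return max(x, 0.0)


def truncated_relu(x):
    """Eventually constant activation: constant for x <= 0 and for x >= 1."""
    return min(max(x, 0.0), 1.0)


# Path graph a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(walk_counts(path, 1))  # {'a': 1, 'b': 2, 'c': 1}
print(walk_counts(path, 2))  # {'a': 2, 'b': 2, 'c': 2}
```

Walk counts are unbounded in the graph size, which gives some intuition for why an activation that saturates, such as `truncated_relu`, can discard numerical information that `relu` preserves.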

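2512.18540 2026-05-27 eess.SY cs.LG cs.SY math.OC Updated version

Distributed Control of Network Systems in the Space of Stabilizing Graph Neural Network Policies

John Cao, Luca Furieri

Affiliations: Department of Engineering Science, University of Oxford

AI summary: Embeds graph neural networks into a Youla-like magnitude-direction parameterization to obtain distributed stochastic controllers that guarantee closed-loop stability by design, and proves their robustness to perturbations in the graph topology and model parameters.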

Details
Abstract

We study distributed control of networked systems through reinforcement learning, where neural policies must be simultaneously scalable, expressive and stabilizing. We introduce a policy parameterization that embeds Graph Neural Networks (GNNs) into a Youla-like magnitude-direction parameterization, yielding distributed stochastic controllers that guarantee network-level closed-loop stability by design. The magnitude is implemented as a stable operator consisting of a GNN acting on disturbance feedback, while the direction is a GNN acting on local observations. We prove robustness of the policy to perturbations in both the graph topology and model parameters. Numerical experiments validate the effectiveness of the proposed approach.

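2512.17090 2026-05-27 cs.LG cs.AI Updated version

How to Square Tensor Networks and Circuits Without Squaring Them

Lorenzo Loconte, Adrián Javaloy, Antonio Vergari

Affiliations: School of Informatics, University of Edinburgh, UK

AI summary: Proposes parameterizations that use orthogonality and determinism conditions to simplify marginalization in squared tensor networks and circuits, avoiding the extra complexity while preserving expressiveness and improving learning efficiency on distribution estimation tasks.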

Details
Abstract

Squared tensor networks (TNs) and their extension as computational graphs, squared circuits, have been used as expressive distribution estimators that still support closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as circuits can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and of determinism in circuits, which enables tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations that differ from TNs but are encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show that our proposed conditions on squared circuits come with no loss of expressiveness while enabling more efficient learning.
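
The role of orthogonality in canonical forms can be illustrated with a small squared matrix-product state (a tensor train) over three discrete variables. This is a sketch of the standard TN canonical-form idea the abstract refers to, not the authors' circuit parameterization. It assumes NumPy is available, and the sizes `d` and `r` are arbitrary choices. When the first cores are isometries, the partition function reduces to the squared Frobenius norm of the last core, so no sum over all configurations is needed.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
d, r = 3, 2  # values per variable and bond dimension (arbitrary choices)

# Isometric cores from QR factorizations: the columns of Q are orthonormal.
A1, _ = np.linalg.qr(rng.normal(size=(d, r)))      # shape (d, r)
Q2, _ = np.linalg.qr(rng.normal(size=(r * d, r)))  # shape (r * d, r)
A2 = Q2.reshape(r, d, r)                           # (left bond, value, right bond)
A3 = rng.normal(size=(r, d))                       # last core is unconstrained


def psi(x1, x2, x3):
    """Unnormalized amplitude; the squared model is p(x) proportional to psi(x) ** 2."""
    return A1[x1] @ A2[:, x2, :] @ A3[:, x3]


# Brute force: sum psi^2 over all d ** 3 configurations.
brute_force_Z = sum(psi(*x) ** 2 for x in itertools.product(range(d), repeat=3))

# Canonical form: the isometries contract to identities, leaving only A3.
closed_form_Z = float(np.sum(A3 ** 2))

print(brute_force_Z, closed_form_Z)
```

The two values agree up to floating-point error, while the closed form costs only the size of the last core rather than a contraction over the squared network.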

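2512.16702 2026-05-27 cond-mat.mtrl-sci cs.LG physics.chem-ph Updated version

How accurate are foundational machine learning interatomic potentials for heterogeneous catalysis?

Luuk H. E. Kempen, Raffaele Cheula, Mie Andersen

Affiliations: Center for Interstellar Catalysis, Department of Physics and Astronomy, Aarhus University

AI summary: Systematically evaluates the zero-shot performance of 80 foundational machine learning interatomic potentials on heterogeneous catalysis tasks. The models are highly accurate in specific applications (such as vacancy formation energies of perovskite oxides) but fail on magnetic materials, structure relaxation increases the error, and no single model is universally best.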

Comments 16 pages, 5 figures, 1 table + supplementary information (37 pages, 16 figures, 15 tables)

Details
Journal ref
J. Chem. Phys. 164, 194119 (2026)
Abstract

Foundational machine learning interatomic potentials (MLIPs) are being developed at a rapid pace, promising ever closer approximation to ab initio accuracy. This unlocks the possibility of simulating much larger length and time scales. However, benchmarks for these MLIPs are usually limited to ordered, crystalline and bulk materials. Hence, reported performance does not necessarily reflect MLIP performance in real applications such as heterogeneous catalysis. Here, we systematically analyze the zero-shot performance of 80 different MLIPs, evaluating tasks typical for heterogeneous catalysis across a range of data sets, including adsorption and reaction on surfaces of alloyed metals, oxides, and metal-oxide interfacial systems. We demonstrate that current-generation foundational MLIPs can already perform at high accuracy for applications such as predicting vacancy formation energies of perovskite oxides or zero-point energies of supported nanoclusters. However, limitations also exist. We find that many MLIPs fail catastrophically when applied to magnetic materials, and that structure relaxation in the MLIP generally increases the energy prediction error compared to single-point evaluation of a previously optimized structure. Comparing low-cost task-specific models to foundational MLIPs, we highlight some core differences between these modeling approaches and show that, if only accuracy is considered, the task-specific models can compete with the current generation of best-performing MLIPs. Furthermore, we show that no single MLIP universally performs best, requiring users to investigate MLIP suitability for their desired application.

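2512.16111 2026-05-27 cs.LG eess.SP Updated version

BUILD with Precision: Bottom-Up Inference of Linear DAGs

Hamed Ajorlou, Samuel Rey, Gonzalo Mateos, Geert Leus, Antonio G. Marques

Affiliations: University of Rochester; Universidad Rey Juan Carlos; Delft University of Technology

AI summary: Proposes BUILD, an algorithm that exploits the distinctive structure of the ensemble precision matrix of observations under a linear Gaussian SEM with equal noise variances to reconstruct the DAG exactly through a deterministic stepwise procedure, and improves robustness with finite data by periodically re-estimating the precision matrix.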

Details
Abstract

Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with fewer variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
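
The bottom-up procedure can be sketched from the abstract's description alone. Assume the SEM convention $x_j = \sum_i W_{ij} x_i + n_j$ with noise variance $\sigma^2$, so that the precision matrix is $\Theta = (I - W)(I - W)^\top / \sigma^2$. Under that assumption a leaf has the smallest diagonal entry ($1/\sigma^2$), its column holds the scaled negative parent weights, and pruning it amounts to a Schur complement. This is a reconstruction with hypothetical function names that takes the exact precision matrix as input; it is not the authors' implementation and omits their re-estimation step.

```python
def precision_from_weights(W, sigma2):
    """Precision matrix (I - W)(I - W)^T / sigma2 for x_j = sum_i W[i][j] * x_i + n_j."""
    n = len(W)
    M = [[(1.0 if i == k else 0.0) - W[i][k] for k in range(n)] for i in range(n)]
    return [[sum(M[i][k] * M[j][k] for k in range(n)) / sigma2 for j in range(n)]
            for i in range(n)]


def build_from_precision(theta, tol=1e-9):
    """Recover the weighted adjacency matrix by repeatedly pruning leaves."""
    n = len(theta)
    theta = [row[:] for row in theta]
    active = list(range(n))            # original indices of remaining nodes
    W = [[0.0] * n for _ in range(n)]
    while len(active) > 1:
        m = len(active)
        # A leaf has no children, so its diagonal entry is minimal (1 / sigma2).
        j = min(range(m), key=lambda i: theta[i][i])
        tjj = theta[j][j]
        # Column j of theta holds -W[i][leaf] / sigma2 for every other node i.
        for i in range(m):
            weight = -theta[i][j] / tjj
            if i != j and abs(weight) > tol:
                W[active[i]][active[j]] = weight
        # Marginalize the leaf out: Schur complement on the remaining nodes.
        keep = [i for i in range(m) if i != j]
        theta = [[theta[a][b] - theta[a][j] * theta[j][b] / tjj for b in keep]
                 for a in keep]
        active = [active[i] for i in keep]
    return W


# Small DAG with edges 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
W_true = [[0.0, 0.8, -0.5, 0.0],
          [0.0, 0.0, 0.0, 1.2],
          [0.0, 0.0, 0.0, 0.7],
          [0.0, 0.0, 0.0, 0.0]]
W_hat = build_from_precision(precision_from_weights(W_true, sigma2=0.5))
```

With an estimated rather than exact precision matrix, the leaf test and the threshold `tol` become sensitive to estimation error, which is the error-accumulation issue the abstract mentions.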

2512.08371 2026-05-27 cs.LG stat.ML Updated version

A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

Affiliations: Department of Biostatistics, University of Arkansas for Medical Sciences; Arkansas Children’s Research Institute; Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington; Department of Psychology, Emory University

AI summary: For multi-label data whose labels differ widely in frequency and are mutually dependent, proposes a weighted sampling algorithm based on the multivariate Bernoulli distribution that estimates weights for label combinations to reach target distribution characteristics. Applied to Web of Science research articles, it improves the representation of minority categories.

Details
Abstract

Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculates weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a variety of datasets, including a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.

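2511.20586 2026-05-27 cs.AI cs.LG Updated version

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Houda Labiod, Frank Kargl

AI summary: Proposes PaTAS, a framework that uses Subjective Logic to propagate trust in parallel with neural network computation. It quantifies the trust of inputs, parameters, and activations through Trust Nodes and Trust Functions, and defines a Parameter Trust Update mechanism and an Inference-Path Trust Assessment method to provide interpretable trust estimates under adversarial or degraded conditions.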

Details
Abstract

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.
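
For readers unfamiliar with Subjective Logic, the sketch below shows the standard binomial opinion (belief, disbelief, uncertainty, base rate), the usual mapping from positive and negative evidence counts with a non-informative prior weight of 2, and the projected probability. These are general Subjective Logic definitions; the class and function names are illustrative, and the code does not reproduce PaTAS's Trust Functions or update rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial opinion: belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self):
        """Belief plus the share of uncertainty assigned by the base rate."""
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(positive, negative, prior_weight=2.0, base_rate=0.5):
    """Map evidence counts to an opinion; less evidence means more uncertainty."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total, base_rate)


no_evidence = opinion_from_evidence(0, 0)      # fully uncertain
well_supported = opinion_from_evidence(8, 0)   # mostly belief, some uncertainty
print(no_evidence.projected_probability())     # 0.5
print(well_supported.projected_probability())  # approximately 0.9
```

Unlike a single confidence score, an opinion separates "probably wrong" (high disbelief) from "not enough evidence" (high uncertainty), which is the distinction that trust estimates of this kind rely on.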

2412.20505 2026-05-27 cs.AI cs.CL cs.LG Updated version

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong Polytechnic University

AI summary: Proposes LiPUP-MA, a multi-agent framework that alternates simulated residential living with experience-driven plan revision. It uses a graph-based experience bank and a spatially constrained, skill-augmented planner to address the grounding of living experience and the spatialization of feedback in participatory urban planning.

Details
Abstract

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference elicitation and one-shot stakeholder discussions, overlooking the cyclical nature of real-world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living-in-the-loop Participatory Urban Planning (LiPUP), a closed-loop paradigm that alternates between simulated residential living and experience-driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP-MA, an LLM-based multi-agent framework that constructs a Plan-centric Graph-based Experience Bank to organize urban-grounded residential feedback from living simulation and equips a Spatially-constrained Skill-augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP-MA consistently outperforms baselines on both conventional static planning metrics and living-based metrics, while iterative LiPUP cycles further improve plan quality.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV Updated version

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

Affiliations: Advanced Micro Devices Inc.; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes Athena-PRM, a multimodal process reward model that efficiently generates high-quality process labels by using prediction consistency between weak and strong completers, and markedly improves step-level evaluation on complex reasoning problems with only 5,000 samples.

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

Details
Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily because reasoning steps require step-level annotations. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM improves performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B via reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
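
The consistency criterion can be sketched as a simple filter. The code below assumes that each reasoning step has a list of rollout outcomes from a weak completer and from a strong completer (True when a rollout continued from that step reaches the correct final answer), and that a step is labeled correct when at least one rollout succeeds, which is one common Monte Carlo labeling rule. Both the data layout and the labeling rule are assumptions for illustration; the abstract does not specify them.

```python
def monte_carlo_label(rollout_outcomes):
    """Label a step 1 if any rollout continued from it reaches the correct answer."""
    return int(any(rollout_outcomes))


def consistent_labels(weak_rollouts, strong_rollouts):
    """Keep only the steps on which the weak and strong completers agree.

    Returns a dict mapping step index to the agreed label; steps with
    conflicting labels are treated as unreliable and dropped.
    """
    kept = {}
    for step, (weak, strong) in enumerate(zip(weak_rollouts, strong_rollouts)):
        weak_label = monte_carlo_label(weak)
        strong_label = monte_carlo_label(strong)
        if weak_label == strong_label:
            kept[step] = weak_label
    return kept


# Hypothetical outcomes for a three-step solution, two rollouts per step.
weak = [[True, False], [False, False], [False, False]]
strong = [[True, True], [True, False], [False, False]]
print(consistent_labels(weak, strong))  # {0: 1, 2: 0}
```

Step 1 is dropped because only the strong completer recovers from it, so its label says more about the completer than about the step itself.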

2511.01724 2026-05-27 cs.CV cs.LG Updated version

PRBench: A Standardized Probabilistic Robustness Benchmark

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Affiliations: WMG, University of Warwick; Department of Computer Science, University of Liverpool; College of Computer Science, Nankai University; School of Computing, National University of Singapore

AI summary: Proposes the PRBench benchmark, which uses a unified evaluation protocol and theoretical analysis to compare adversarial training and probabilistic-robustness-targeted training methods in terms of clean accuracy, robustness, and generalization error.

Details
Abstract

Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong adversarial training (AT) baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis of the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 229 trained models across 7 datasets and 10 model architectures is publicly available at https://wellzline.github.io/PRBenchLeaderboard/.
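
The definition of PR suggests a direct Monte Carlo estimate: sample random perturbations around an input and count how often the prediction stays correct. The sketch below assumes uniform noise in an $L_\infty$ ball of radius `eps` and a toy linear classifier. The function name, the noise model, and the classifier are illustrative assumptions and not part of PRBench.

```python
import random


def probabilistic_robustness(predict, x, label, eps, n_samples=1000, seed=0):
    """Fraction of uniformly sampled perturbations in the L-infinity ball of
    radius eps around x for which the prediction still equals the label."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        noisy = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(noisy) == label:
            correct += 1
    return correct / n_samples


def toy_classifier(v):
    """Predict class 1 when the coordinates sum to a positive value."""
    return int(v[0] + v[1] > 0)


# Far from the decision boundary: no perturbation in the ball can flip the label.
pr_far = probabilistic_robustness(toy_classifier, [1.0, 1.0], label=1, eps=0.5)
# Close to the boundary: adversarial examples exist, yet many perturbations are harmless.
pr_near = probabilistic_robustness(toy_classifier, [0.1, 0.0], label=1, eps=0.5)
print(pr_far, pr_near)
```

The second input is not adversarially robust at this radius, since a worst-case perturbation flips the label, but its PR is still well above zero. This is the gap between the worst-case and statistical views that the abstract describes.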

2511.17852 2026-05-27 cs.LG stat.ML Updated version

Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu

Affiliations: School of Electronics and Computer Science, University of Southampton, United Kingdom; Independent Researcher

AI summary: Gives a unified analysis of the dynamics of a transformer fine-tuned with RL (process rewards) or SFT to learn recursively decomposable k-sparse Boolean functions. It proves that both can learn functions such as k-PARITY, k-AND, and k-OR, but that RL learns the whole CoT chain simultaneously whereas SFT learns it step by step.

Comments 50 pages, 12 figures

Details
Abstract

Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end. In this work, we specifically examine RL with process rewards and SFT for learning $k$-sparse Boolean functions with a one-layer transformer through intermediate reasoning steps akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We first analyze the learning dynamics of RL fine-tuning with process reward and SFT in a unified way. This allows us to identify sufficient conditions under which the transformer provably learns these sparse Boolean functions. We then verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating their learnability via both RL and SFT. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step by step. Overall, our findings provide insights on the mechanisms underlying RL and SFT and how they differ in triggering the CoT capabilities of transformers, and suggest that the comparison between RL and SFT may need to consider the reward design and the use of teacher forcing.
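
The recursive decomposition can be made concrete with a few lines of code. The sketch below evaluates a $k$-sparse Boolean function as a chain of applications of one fixed 2-input function, recording each intermediate result as a CoT-style step. The function name `chain_of_thought` and the example input are illustrative; the code shows only the target computation, not the transformer or its training.

```python
from operator import and_, or_, xor


def chain_of_thought(x, support, op):
    """Evaluate a k-sparse Boolean function step by step.

    x is the full input bit list, support lists the k relevant positions,
    and op is the fixed 2-input Boolean function. The returned list holds
    the k - 1 intermediate results; its last entry is the function value.
    """
    bits = [x[i] for i in support]
    steps, acc = [], bits[0]
    for b in bits[1:]:
        acc = op(acc, b)
        steps.append(acc)
    return steps


x = [1, 1, 0, 0, 1, 1]   # 6 input bits
support = [0, 2, 4, 5]   # only these k = 4 positions matter

print(chain_of_thought(x, support, xor))   # 4-PARITY: [1, 0, 1]
print(chain_of_thought(x, support, and_))  # 4-AND:    [0, 0, 0]
print(chain_of_thought(x, support, or_))   # 4-OR:     [1, 1, 1]
```

Each entry of the returned list corresponds to one intermediate reasoning step, which is the unit that a process reward or a teacher-forced SFT target would supervise.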

2505.13775 2026-05-27 cs.LG cs.AI Updated version

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

Affiliations: School of Computing and AI; Arizona State University; Amazon AGI; Yale University

AI summary: By training transformer models from scratch on formally verifiable reasoning traces, finds that models trained on correct and on corrupted traces perform similarly and that corrupted traces generalize better on out-of-distribution tasks, challenging the assumption that intermediate tokens reflect or induce predictable reasoning behavior.

Comments Published in Transactions on Machine Learning Research (TMLR)

Details
Abstract

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they influence it, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to. We notice that, despite gains over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, perform similarly to those trained on correct ones, and even generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved. These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their seemingly plausible forms) as evidence of human-like or algorithmic behaviors in language models.

2511.14993 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0:图像与视频生成的基础模型系列

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

发表机构 * Kandinsky Lab(Kandinsky 实验室)

AI总结 本文介绍Kandinsky 5.0系列模型,通过多阶段训练、自监督微调和强化学习后训练,实现高分辨率图像和10秒视频的高质量生成。

Comments Website: https://kandinskylab.ai/

AI中文摘要

本报告介绍了Kandinsky 5.0,一系列用于高分辨率图像和10秒视频合成的最先进基础模型。该框架包含三个核心模型系列:Kandinsky 5.0 Image Lite——6B参数的图像生成模型系列,Kandinsky 5.0 Video Lite——快速轻量级的2B参数文本到视频和图像到视频模型,以及Kandinsky 5.0 Video Pro——19B参数模型,实现了卓越的视频生成质量。我们全面回顾了数据策展生命周期——包括收集、处理、过滤和聚类——用于多阶段训练流程,该流程涉及广泛的预训练,并融入了质量增强技术,如自监督微调(SFT)和基于强化学习(RL)的后训练。我们还介绍了新颖的架构、训练和推理优化,使Kandinsky 5.0能够在各种任务上实现高生成速度和最先进的性能,如人类评估所示。作为一个大规模、公开可用的生成框架,Kandinsky 5.0充分利用其预训练及后续阶段的全部潜力,以适应广泛的生成应用。我们希望本报告,连同我们开源代码和训练检查点的发布,将大大促进高质量生成模型的研究社区发展和可访问性。

英文摘要

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2511.14075 2026-05-27 cs.LG cs.AI 版本更新

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

CFG-OEC: 带正交误差校正的无分类器引导

Nakgyu Yang, Yechan Lee, SooJean Han

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science and Technology(韩国科学技术院电气工程学院)

AI总结 针对扩散模型中无分类器引导的采样规则与训练目标不匹配导致的误差,提出正交误差校正方法(CFG-OEC)通过减少条件与无条件预测误差的交互项来提升采样质量,并在Stable Diffusion上验证了FID和CLIP分数的改进。

AI中文摘要

无分类器引导是扩散模型中条件采样的标准方法,但其采样规则与训练中使用的目标不一致。这种不匹配通过条件预测误差和无条件预测误差的相互作用引入了结构性采样误差。我们通过将采样误差分解为基础项和由两个误差对齐决定的交叉项来分析该问题。基于此分析,我们提出了带正交误差校正的无分类器引导(CFG-OEC),这是一种减少交互项的结构性修改。对于无法观测到真实噪声的实际场景,我们引入了一个从模型预测计算得到的代理量,以及一种跨扩散时间步稳定校正的动态方法。在受控环境下的实验验证了我们的理论误差分解和代理量构造。在Stable Diffusion v1.5和Stable Diffusion XL上的图像生成表明,CFG-OEC在多个采样器和引导机制下比CFG和CFG++改进了FID和CLIP分数。

英文摘要

Classifier-free guidance (CFG) is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis, we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground-truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation experiments on Stable Diffusion v1.5 and Stable Diffusion XL show that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.
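
To make the "base term plus cross term" idea concrete, here is a minimal numerical sketch. It assumes the common CFG combination rule $\hat\epsilon = \epsilon_u + w(\epsilon_c - \epsilon_u)$ and simply expands the squared error algebraically. The paper's own decomposition, its proxy, and the OEC correction are not reproduced here, and all function names are illustrative.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cfg_combine(eps_u, eps_c, w):
    """Common CFG rule: unconditional prediction plus w times the guidance direction."""
    return [u + w * (c - u) for u, c in zip(eps_u, eps_c)]


def error_terms(eps_true, eps_u, eps_c, w):
    """Split the squared CFG error into a base term and a cross term.

    With e_u = eps_u - eps_true and e_c = eps_c - eps_true, the CFG error is
    w * e_c + (1 - w) * e_u, so its squared norm expands into the two terms below.
    """
    e_u = [u - t for u, t in zip(eps_u, eps_true)]
    e_c = [c - t for c, t in zip(eps_c, eps_true)]
    base = w ** 2 * dot(e_c, e_c) + (1 - w) ** 2 * dot(e_u, e_u)
    cross = 2 * w * (1 - w) * dot(e_c, e_u)  # depends on how the two errors align
    return base, cross


def squared_error(eps_true, eps_u, eps_c, w):
    diff = [g - t for g, t in zip(cfg_combine(eps_u, eps_c, w), eps_true)]
    return dot(diff, diff)


rng = random.Random(0)
eps_true = [rng.gauss(0, 1) for _ in range(16)]
eps_u = [t + rng.gauss(0, 0.1) for t in eps_true]
eps_c = [t + rng.gauss(0, 0.1) for t in eps_true]
base, cross = error_terms(eps_true, eps_u, eps_c, w=7.5)
print(abs(squared_error(eps_true, eps_u, eps_c, 7.5) - (base + cross)) < 1e-9)
```

If the two error vectors are orthogonal, the cross term vanishes and only the base term remains, which is the quantity an orthogonality-based correction would target under this reading.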

2511.04711 2026-05-27 cs.CR cs.AI cs.LG 版本更新

SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

SWAP:通过顺序水印实现软提示的版权审计

Wenyuan Yang, Yichen Sun, Changzheng Chen, Zhixuan Chu, Jiaheng Zhang, Yiming Li, Dacheng Tao

发表机构 * Sun Yat-sen University(中山大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学) Nanyang Technological University(南洋理工大学)

AI总结 针对软提示的版权保护问题,提出一种基于顺序水印的审计方法SWAP,通过将水印嵌入到更复杂的输出分布顺序空间中,实现无害且鲁棒的版权验证。

Comments This paper has been accepted by the International Journal of Computer Vision (IJCV), 2026. The first two authors contributed equally to this work. 28 pages

AI中文摘要

大规模视觉语言模型,尤其是CLIP,在各种下游任务中展现了卓越的性能。软提示作为精心设计的模块,能够高效地将视觉语言模型适应特定任务,因此需要有效的版权保护。本文通过审计可疑的第三方模型是否使用了受保护的软提示,来研究模型版权保护。虽然这可以视为模型所有权审计的一个特例,但我们的分析表明,由于提示学习的独特特性,现有技术效果不佳。非侵入式审计在独立模型与受害模型共享相似数据分布时,本质上容易产生误报。侵入式方法也失败:为CLIP设计的后门方法无法嵌入功能性触发器,而将传统DNN后门技术扩展到提示学习则面临有害性和模糊性挑战。我们发现,侵入式审计的这些失败源于同一个根本原因:水印与主任务在同一决策空间中运行,却追求相反的目标。基于这些发现,我们提出了软提示的顺序水印(SWAP),将水印植入一个不同且更复杂的空间。SWAP通过防御者指定的分布外类别的特定顺序来编码水印,灵感来自CLIP的零样本预测能力。这种嵌入在更复杂空间中的水印保持原始预测标签不变,从而减少与主任务的冲突。我们进一步为SWAP设计了基于假设检验的验证协议,并提供了验证何时有效的理论分析。在11个数据集上的大量实验证明了SWAP的有效性、无害性以及对潜在攻击的鲁棒性。

英文摘要

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning's unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide a theoretical analysis of when verification works. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential attacks.
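
The order-based watermark and its hypothesis test can be sketched as follows. This is an illustrative reading, not the paper's protocol: it assumes the verifier checks whether the scores of $m$ defender-specified out-of-distribution classes appear in a secret strict order, and that an independent model would produce any of the $m!$ orders with equal probability. The function names and the significance level are hypothetical.

```python
from math import comb, factorial


def order_matches(scores, secret_order):
    """True if the class scores are strictly decreasing along the secret order."""
    return all(scores[a] > scores[b] for a, b in zip(secret_order, secret_order[1:]))


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def verify(all_scores, secret_order, alpha=0.01):
    """One-sided binomial test against the assumed uniform-order null hypothesis."""
    n = len(all_scores)
    k = sum(order_matches(s, secret_order) for s in all_scores)
    p0 = 1.0 / factorial(len(secret_order))  # chance of a match under the null
    p_value = binomial_tail(n, k, p0)
    return p_value < alpha, p_value


secret = ("zebra", "anchor", "violin")       # hypothetical OOD classes
watermarked = [{"zebra": 0.5, "anchor": 0.3, "violin": 0.2}] * 10
print(verify(watermarked, secret))
```

Because the test reads only the relative order of classes outside the primary task, the predicted label on the primary task is left untouched, which matches the harmlessness argument in the abstract.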

2511.02525 2026-05-27 cs.LG cs.AI 版本更新

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

一种用于求解带容量约束选址-路径问题的端到端学习方法

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统全国重点实验室)

AI总结 提出基于深度强化学习与异构查询机制(DRLHQ)的端到端方法,首次将编码器-解码器结构应用于带容量约束的选址-路径问题(CLRP)及其开放变体(OCLRP),通过异构查询注意力机制动态协调选址与路径决策,在合成和基准数据集上优于传统方法和现有DRL基线。

AI中文摘要

带容量约束的选址-路径问题(CLRPs)是组合优化中的经典问题,需要同时做出选址和路径决策。在CLRPs中,复杂的约束以及各种决策之间的复杂关系使得问题难以求解。随着深度强化学习(DRL)的出现,它已被广泛应用于解决车辆路径问题及其变体,而与CLRPs相关的研究仍有待探索。在本文中,我们提出了带有异构查询的DRL(DRLHQ)来分别求解CLRP和开放CLRP(OCLRP)。我们是首个为CLRPs提出端到端学习方法的工作,遵循编码器-解码器结构。具体而言,我们将CLRPs重新表述为一个针对各种决策量身定制的马尔可夫决策过程,这是一个通用的建模框架,可适用于其他基于DRL的方法。为了更好地处理选址和路径决策之间的相互依赖关系,我们还引入了一种新颖的异构查询注意力机制,旨在动态适应不同的决策阶段。在合成和基准数据集上的实验结果表明,我们提出的方法在求解CLRP和OCLRP时,相较于代表性的传统方法和基于DRL的基线,具有更优的解质量和更好的泛化性能。

英文摘要

Capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization that require making location and routing decisions simultaneously. In CLRPs, the complex constraints and the intricate relationships between the various decisions make the problem challenging to solve. Deep reinforcement learning (DRL) has been applied extensively to the vehicle routing problem and its variants, but its use for CLRPs remains largely unexplored. In this paper, we propose DRL with heterogeneous query (DRLHQ) to solve both the CLRP and the open CLRP (OCLRP). We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate CLRPs as a Markov decision process tailored to the various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency between location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to the different decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
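
For readers unfamiliar with the problem, the sketch below evaluates a candidate solution. It is not the DRLHQ policy. It assumes Euclidean distances, a single vehicle capacity, and a fixed opening cost per depot. Depot capacities and fixed vehicle costs, which many CLRP formulations include, are omitted, and the data layout is hypothetical. The `open_routes` flag captures the usual difference in the open variant, where vehicles do not return to the depot.

```python
from math import dist


def solution_cost(depots, customers, routes, vehicle_capacity, open_routes=False):
    """Cost of a location-routing solution.

    depots:    {depot_id: ((x, y), opening_cost)}
    customers: {customer_id: ((x, y), demand)}
    routes:    list of (depot_id, [customer_id, ...]), one entry per vehicle
    """
    opened = {depot_id for depot_id, _ in routes}
    total = sum(depots[d][1] for d in opened)           # location decision
    for depot_id, stops in routes:                      # routing decision
        load = sum(customers[c][1] for c in stops)
        if load > vehicle_capacity:
            raise ValueError(f"route from {depot_id} exceeds vehicle capacity")
        path = [depots[depot_id][0]] + [customers[c][0] for c in stops]
        if not open_routes:
            path.append(depots[depot_id][0])            # closed route returns home
        total += sum(dist(a, b) for a, b in zip(path, path[1:]))
    return total


depots = {"D1": ((0, 0), 10)}
customers = {"c1": ((3, 0), 1), "c2": ((3, 4), 1)}
routes = [("D1", ["c1", "c2"])]
print(solution_cost(depots, customers, routes, vehicle_capacity=2))                    # 22.0
print(solution_cost(depots, customers, routes, vehicle_capacity=2, open_routes=True))  # 17.0
```

A constructive policy of the kind the abstract describes would build `routes` step by step, and a function like this could supply the negative cost as the episode reward.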

2510.23905 2026-05-27 eess.SP cs.LG 版本更新

Inferring Group Intent as a Cooperative Game. An NLP-based Framework for Trajectory Analysis

将群体意图推断作为合作博弈:基于NLP的轨迹分析框架

Yiming Zhang, Vikram Krishnamurthy, Shashwat Jain

发表机构 * Cornell University(康奈尔大学)

AI总结 提出一个基于NLP的生成模型和合作博弈框架,通过Fisher信息特征函数和Graph Transformer神经网络从噪声观测中推断群体轨迹意图。

AI中文摘要

本文研究群体目标轨迹意图,将其视为合作博弈的结果,其中复杂时空轨迹使用基于NLP的生成模型建模。在我们的框架中,群体意图由合作博弈的特征函数指定,合作博弈中参与者的分配由核心、Shapley值或核仁指定。由此产生的分配诱导概率分布,这些分布控制目标的协调时空轨迹,反映群体的潜在意图。我们解决两个关键问题:(1)如何将群体轨迹的意图最优形式化为合作博弈的特征函数?(2)如何从目标的噪声观测中推断这种意图?为回答第一个问题,我们引入基于Fisher信息的合作博弈特征函数,该函数产生生成协调时空模式的概率分布。作为这些模式的生成模型,我们开发了一个基于形式语法的NLP生成模型,能够创建逼真的多目标轨迹数据。为回答第二个问题,我们训练一个Graph Transformer神经网络(GTNN),从观测数据中以高精度推断群体轨迹意图(表示为合作博弈的特征函数)。GTNN的自注意力机制依赖于轨迹估计。因此,该公式和算法提供了一种多层方法,涵盖目标跟踪(贝叶斯信号处理)和GTNN(用于群体意图推断)。

英文摘要

This paper studies group target trajectory intent as the outcome of a cooperative game in which the complex spatio-temporal trajectories are modeled using an NLP-based generative model. In our framework, the group intent is specified by the characteristic function of a cooperative game, and allocations for players in the cooperative game are specified by either the core, the Shapley value, or the nucleolus. The resulting allocations induce probability distributions that govern the coordinated spatio-temporal trajectories of the targets and reflect the group's underlying intent. We address two key questions: (1) How can the intent of a group trajectory be optimally formalized as the characteristic function of a cooperative game? (2) How can such intent be inferred from noisy observations of the targets? To answer the first question, we introduce a Fisher-information-based characteristic function of the cooperative game, which yields probability distributions that generate coordinated spatio-temporal patterns. As a generative model for these patterns, we develop an NLP-based generative model built on formal grammar, enabling the creation of realistic multi-target trajectory data. To answer the second question, we train a Graph Transformer Neural Network (GTNN) to infer group trajectory intent (expressed as the characteristic function of the cooperative game) from observational data with high accuracy. The self-attention function of the GTNN depends on the track estimates. Thus, the formulation and algorithms provide a multi-layer approach that spans target tracking (Bayesian signal processing) and the GTNN (for group intent inference).
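
Of the three allocation rules named above, the Shapley value is the easiest to compute directly for small games. The sketch below uses the standard definition (average marginal contribution over all player orderings). The toy characteristic function `v` is a placeholder and is not the Fisher-information-based function introduced in the paper.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders.

    v maps a frozenset of players to the worth of that coalition.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def v(coalition):
    """Toy game: targets A and B create value only together; C adds nothing."""
    return 1.0 if {"A", "B"} <= coalition else 0.0


print(shapley_values(["A", "B", "C"], v))  # {'A': 0.5, 'B': 0.5, 'C': 0.0}
```

The allocation sums to the worth of the grand coalition, so normalising it by that worth gives a probability vector over targets, which is one simple way an allocation can induce a distribution of the kind the abstract mentions. Enumerating all orderings costs $n!$ evaluations and is practical only for small groups.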

2510.23486 2026-05-27 cs.LG 版本更新

Learning to Reason Efficiently with Discounted Reinforcement Learning

通过折扣强化学习高效推理

Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane

发表机构 * Amazon(亚马逊公司) University of Alberta(阿尔伯塔大学)

AI总结 针对大型推理模型消耗过多token导致计算成本高的问题,提出使用折扣强化学习(解释为小token成本)惩罚推理token,结合Blackwell最优性分析,在保持准确性的同时缩短推理链。

AI中文摘要

大型推理模型(LRMs)通常消耗过多的token,增加了计算成本和延迟。更广泛地说,在目标到达的序列决策问题中,我们通常希望快速到达目标,而LRM推理可以从这个角度看待。我们挑战了较长响应能提高准确性的假设。通过使用折扣强化学习设置(可解释为小的token成本)惩罚推理token,并分析受限策略类中的Blackwell最优性,我们鼓励简洁而准确的推理,类似于在随机最短路径问题中偏好更短的成功轨迹。实验证实了我们的理论结果,即这种方法在保持准确性的同时缩短了思维链。

英文摘要

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal-reaching sequential decision problems we often want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
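
The "small token cost" interpretation can be checked with a few lines of arithmetic. The sketch assumes a terminal reward of 1 for a correct answer and 0 otherwise, discounted once per generated token. That reward placement and the function names are assumptions for illustration and may differ from the paper's exact setup.

```python
def discounted_return(num_tokens, correct, gamma):
    """Terminal reward of 1 for a correct answer, discounted once per token."""
    return gamma ** num_tokens if correct else 0.0


def linear_token_cost_return(num_tokens, correct, gamma):
    """First-order approximation: gamma**T is close to 1 - (1 - gamma) * T when (1 - gamma) * T is small."""
    return 1.0 - (1.0 - gamma) * num_tokens if correct else 0.0


gamma = 0.9999
for tokens in (200, 1000):
    print(tokens,
          round(discounted_return(tokens, True, gamma), 4),
          round(linear_token_cost_return(tokens, True, gamma), 4))
```

With $\gamma$ close to 1, each extra token costs roughly $1-\gamma$ of the reward, so a shorter correct response always scores higher than a longer correct one, while an incorrect response scores zero regardless of length.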

2510.19420 2026-05-27 cs.CR cs.AI cs.LG cs.MA math.OC 版本更新

Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

通过节点贡献反向传播保护多智能体系统免受恶意破坏

Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun

发表机构 * Peking University(北京大学)

AI总结 针对多智能体系统中对抗性智能体注入误导信息的问题,提出一种基于有向无环图的反向传播动态防御方法,通过计算每个智能体对最终决策的贡献来识别和隔离恶意智能体,实验表明该方法优于现有防御机制。

Comments ICML 2026

AI中文摘要

多智能体系统(MAS)已成为大型语言模型(LLM)应用的普遍范式。然而,MAS中复杂的多智能体设计引入了独特的可信度问题:对抗性智能体可以注入误导信息,这些信息通过系统传染性地传播,破坏良性智能体并导致错误输出。现有的基于图的防御将智能体建模为节点,通信建模为边,但仅限于静态图防御。在本文中,我们提出了一种动态防御范式,将MAS通信建模为带符号的有向无环图,并通过反向传播计算每个智能体对最终决策的贡献,从而能够准确识别和隔离恶意智能体,以保护多智能体任务协作。在复杂和动态的MAS环境中的实验结果表明,我们的方法显著优于现有的MAS防御机制,为可信赖的MAS部署提供了有效的保障。我们的代码可在https://github.com/ChengcanWu/BPD获取。

英文摘要

Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent design in MAS introduces unique trustworthiness concerns: adversarial agents can inject misleading information that propagates contagiously through the system, corrupting benign agents and leading to false outputs. Existing graph-based defenses model agents as nodes and communications as edges, yet are limited to static-graph defenses. In this paper, we propose a dynamic defense paradigm that models MAS communication as a signed directed acyclic graph and computes each agent's contribution to the final decision via backward propagation, enabling accurate identification and isolation of malicious agents to secure multi-agent task collaboration. Experimental results in complex and dynamic MAS environments demonstrate that our method notably outperforms existing MAS defense mechanisms, providing an effective guardrail for trustworthy MAS deployment. Our code is available at https://github.com/ChengcanWu/BPD.
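
One way to picture contribution backpropagation on a signed DAG is the linear rule sketched below. It is an illustrative assumption rather than the authors' algorithm (their code is at the URL above). Here an edge weight of +1 means the receiver's output agreed with the sender's message and -1 means it contradicted it, and a node's contribution is the signed sum of the contributions it reaches along its outgoing edges.

```python
from collections import defaultdict


def backward_contributions(edges, sink):
    """Propagate contribution from the final decision node back to every agent.

    edges: {(src, dst): signed_weight} describing a DAG of messages.
    The sink has contribution 1; every other node sums weight * contribution
    over its outgoing edges (memoised depth-first traversal).
    """
    out = defaultdict(list)
    nodes = {sink}
    for (src, dst), w in edges.items():
        out[src].append((dst, w))
        nodes.update((src, dst))
    memo = {sink: 1.0}

    def contrib(node):
        if node not in memo:
            memo[node] = sum(w * contrib(nxt) for nxt, w in out[node])
        return memo[node]

    return {n: contrib(n) for n in nodes}


edges = {
    ("agent_a", "agent_c"): +1.0,   # c adopted a's message
    ("agent_b", "agent_c"): -1.0,   # c rejected b's message
    ("agent_c", "final"): +1.0,
}
scores = backward_contributions(edges, "final")
suspects = sorted(n for n, s in scores.items() if s < 0)
print(suspects)  # ['agent_b']
```

Under this toy rule, agents whose messages were consistently contradicted on the way to the final decision receive negative scores and become candidates for isolation. A real system would need a principled way to assign edge signs and weights.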

2510.17759 2026-05-27 cs.CR cs.CL cs.CV cs.LG stat.ML 版本更新

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V:用于破解视觉语言模型的变分推断框架

Qilin Liao, Anamika Lochab, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University, USA(美国普渡大学计算机科学系)

AI总结 提出VERA-V变分推断框架,通过联合后验分布生成隐蔽的文本-图像对抗输入,以系统性地发现视觉语言模型的多模态漏洞,在多个基准上攻击成功率最高提升53.75%。

Comments 18 pages, 7 figures

AI中文摘要

视觉语言模型(VLM)通过视觉推理扩展了大语言模型,但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖脆弱的模板,专注于单一攻击设置,并且仅暴露了漏洞的一小部分。为了解决这些限制,我们引入了VERA-V,一个变分推断框架,将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得能够生成绕过模型防护的隐蔽、耦合的对抗输入。我们训练一个轻量级攻击者来近似后验分布,从而能够高效采样多样化的越狱方法,并提供对漏洞的分布性洞察。VERA-V进一步整合了三种互补策略:(i)基于排版的文本提示,嵌入有害线索;(ii)基于扩散的图像合成,引入对抗信号;(iii)结构化干扰物,分散VLM的注意力。在HarmBench和HADES基准上的实验表明,VERA-V在开源和前沿VLM上均持续优于最先进的基线方法,在GPT-4o上相比最佳基线实现了高达53.75%的攻击成功率(ASR)提升。我们在项目页面提供了代码,地址为:https://github.com/kxwhiowo/VERA-V

英文摘要

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o. We include the code on the project page available here: https://github.com/kxwhiowo/VERA-V

2510.14542 2026-05-27 eess.SY cs.LG cs.SY 版本更新

A Deep State-Space Model Compression Method using Upper Bound on Output Error

一种基于输出误差上界的深度状态空间模型压缩方法

Hiroki Sakamoto, Kazuhiro Sato

发表机构 * Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo(数学信息学系,信息科学和技术研究生院,东京大学)

AI总结 本文提出一种基于输出误差上界的深度状态空间模型压缩方法,通过推导层间LQO系统的h²误差范数上界并优化该上界,实现无需重训练即可减少约60%可训练参数并保持模型性能。

AI中文摘要

我们研究包含线性二次输出(LQO)系统作为内部块的深度状态空间模型(Deep SSMs),并提出一种具有可证明输出误差保证的压缩方法。我们首先推导两个Deep SSM之间输出误差的上界,并证明该上界可以用逐层LQO系统之间的$h^2$误差范数表示。特别地,我们表明减小浅层中LQO系统的$h^2$逼近误差能有效降低推导出的输出误差上界。接下来,我们针对推导出的上界制定一个优化问题,并开发一种基于梯度的模型降阶方法。在数值实验中,使用LRA基准中的IMDb任务,我们展示了所提出的基于上界的压缩方法的有效性。特别地,我们表明无需重训练即可将可训练参数数量减少约60%,同时保持原始模型的性能。

英文摘要

We study deep state-space models (Deep SSMs) that contain linear quadratic-output (LQO) systems as internal blocks and present a compression method with a provable output error guarantee. We first derive an upper bound on the output error between two Deep SSMs and show that the bound can be expressed in terms of the $h^2$-error norms between the layerwise LQO systems. In particular, we show that reducing the $h^2$ approximation errors of the LQO systems placed in shallow layers is effective in reducing the derived upper bound on the output error. Next, we formulate an optimization problem for the derived upper bound and develop a gradient-based model order reduction (MOR) method. In the numerical experiments, using the IMDb task from the LRA benchmark, we demonstrate the effectiveness of the proposed upper-bound-based compression method. In particular, we show that the number of trainable parameters can be reduced by approximately 60% without retraining while maintaining the performance of the original model.

2510.13217 2026-05-27 cs.IR cs.LG 版本更新

LLM-guided Hierarchical Search for End-to-end Reasoning Intensive Retrieval

LLM引导的层次化搜索用于端到端推理密集型检索

Nilesh Gupta, Wei-Cheng Chang, Ngot Bui, Cho-Jui Hsieh, Inderjit S. Dhillon

发表机构 * UT Austin(得克萨斯大学奥斯汀分校) UCLA(加州大学洛杉矶分校) Google(谷歌)

AI总结 本文提出LATTICE,一种无需嵌入模型的LLM引导层次化搜索方法,通过LLM构建搜索索引并校准路径聚合遍历,在推理密集型基准上达到与最优微调集成基线相当的性能。

AI中文摘要

搜索系统越来越多地用于推理密集型查询,其中文档的相关性需要理解或推理查询-文档关系,而不是依赖表面词汇或主题相似性。标准方案——廉价的基于嵌入的检索器后接LLM验证器——仅在嵌入模型将正确文档置于其top-k中时才有效,而最近的推理密集型IR基准显示,即使对于最先进的嵌入模型,这一假设也常常不成立。最近的查询端修复方法(如查询重写和智能体循环)将LLM保持在廉价检索器的上游,但仍然容易受到嵌入器失败和LLM从其参数知识重写查询能力的影响。在本文中,我们探索了一种不同的范式——LLM引导的层次化搜索——其中LLM通过层次可导航搜索索引直接与语料库交互,搜索时无需嵌入模型参与。我们提出了LATTICE,其包含两项技术贡献:(i) 使用LLM对多级文档摘要的判断进行自上而下的LLM引导搜索索引构建,以及(ii) 通过跨分支参考节点减轻噪声、依赖列表的LLM分数的校准路径聚合LLM引导遍历。在推理密集型BRIGHT基准上,使用单个现成LLM的基础LATTICE实现了46.7 nDCG@10——与最佳微调集成基线整体匹配——而轻量级集成LATTICE++将LATTICE与廉价检索融合,达到49.1 nDCG@10。与滑动窗口重排序的受控相同LLM比较显示,在低token预算下重排序提供更好的权衡,但LATTICE在适度预算后收敛到更高的渐近线。LATTICE也适用于开放权重LLM,并在传统IR基准(NQ、SciFact、SciDocs)上保持竞争力。

英文摘要

Search systems are increasingly used for reasoning-intensive queries, where what makes a document relevant requires understanding or reasoning over the query-document relation rather than relying on surface vocabulary or topical similarity. The standard recipe - a cheap embedding-based retriever followed by an LLM verifier - works only when the embedding model places the right documents in its top-k, an assumption that recent reasoning-intensive IR benchmarks show often fails to hold even for SOTA embedding models. Recent query-side fixes such as query rewriting and agentic loops keep the LLM upstream of the cheap retriever and remain brittle to the embedder's failures and to the LLM's ability to rewrite the query from its parametric knowledge. In this paper, we explore a different paradigm - LLM-guided hierarchical search - in which an LLM interacts with the corpus directly via a hierarchically navigable search index, with no embedding model in the loop at search time. We propose LATTICE, an instantiation with two technical contributions: (i) a top-down LLM-guided construction of the search index using LLM judgements over multi-level document summaries, and (ii) a calibrated, path-aggregated LLM-guided traversal that mitigates noisy, slate-dependent LLM scores via cross-branch reference nodes. On the reasoning-intensive BRIGHT benchmark, base LATTICE with a single off-the-shelf LLM achieves 46.7 nDCG@10 - matching the best fine-tuned ensemble baseline overall - and a lightweight ensemble LATTICE++ that fuses LATTICE with cheap retrieval reaches 49.1 nDCG@10. A controlled same-LLM comparison against sliding-window reranking shows reranking offers a better tradeoff at low token budgets, but LATTICE converges to a higher asymptote after a moderate budget. LATTICE also works with open-weight LLMs and remains competitive on traditional IR benchmarks (NQ, SciFact, SciDocs).
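
The traversal idea can be sketched as a best-first search over a summary tree. This is a simplified illustration under stated assumptions. The `score` callable stands in for an LLM relevance judgement and must return values in (0, 1]. A node's path score is taken to be the product of local scores from the root, and LATTICE's calibration and cross-branch reference nodes are not reproduced. All names are hypothetical.

```python
import heapq


def hierarchical_search(tree, root, score, top_k=2, max_expansions=10):
    """Best-first traversal of a hierarchy whose leaves are documents.

    tree:  {node: [child, ...]}; nodes missing from the dict are leaves.
    score: callable(node) -> relevance in (0, 1] for the current query.
    Returns up to top_k (leaf, path_score) pairs in decreasing path score.
    """
    frontier = [(-1.0, root)]          # max-heap via negated path score
    found = []
    expansions = 0
    while frontier and len(found) < top_k:
        neg_path, node = heapq.heappop(frontier)
        children = tree.get(node, [])
        if not children:               # reached a document
            found.append((node, -neg_path))
            continue
        if expansions >= max_expansions:
            continue                   # budget spent: stop opening new branches
        expansions += 1
        for child in children:
            heapq.heappush(frontier, (neg_path * score(child), child))
    return found


tree = {"root": ["physics", "biology"],
        "physics": ["doc_p1", "doc_p2"],
        "biology": ["doc_b1"]}
toy_scores = {"physics": 0.9, "biology": 0.2,
              "doc_p1": 0.8, "doc_p2": 0.3, "doc_b1": 0.9}
print(hierarchical_search(tree, "root", toy_scores.get))
```

Each expansion corresponds to one scoring call over a node's children, so `max_expansions` acts as a rough proxy for the token budget discussed in the comparison with sliding-window reranking.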

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG 版本更新

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

ParsVoice: 面向文本到语音合成的大规模多说话人波斯语语音语料库

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

发表机构 * School of Electrical and Computer Engineering, University of Tehran(德黑兰大学电气与计算机工程学院) Institute for Research in Fundamental Sciences (IPM)(基础科学研究所(IPM))

AI总结 提出ParsVoice,目前最大的公开波斯语语音-文本语料库,通过可扩展的流水线从长篇有声读物构建高质量数据,用于训练多说话人TTS系统,并验证了其在零样本多说话人TTS中的有效性。

AI中文摘要

波斯语在开放的语音-文本资源中仍然严重不足,限制了多说话人文本到语音(TTS)、语音语言建模和低资源语音处理的进展。我们介绍了ParsVoice,这是目前最大的公开波斯语语音-文本语料库,专为训练多说话人TTS系统而设计,同时提供了一个可扩展的流水线,用于从长篇有声读物录音中构建高质量的语音-文本数据。该流水线结合了微调的ParsBERT句子补全分类器、基于ASR的边界优化、标点恢复、说话人识别以及涵盖音频和波斯语特定文本属性的多维质量评估。最终发布的版本包含一个2200小时的TTS就绪子集,包含来自1815个自动识别说话人ID的136万个对齐片段,比之前最大的公开波斯语TTS数据集大25倍以上。为了验证该语料库,我们微调了XTTS,一个直接操作原始波斯语文本(无需音素表示)的零样本多语言TTS模型,实现了自然度MOS为3.6/5,说话人相似度MOS为4.0/5。ParsVoice数据集公开在:https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice。

英文摘要

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

2510.09405 2026-05-27 cs.LG 版本更新

Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training

基于特征解耦与对抗训练的射频指纹识别跨接收机泛化

Yuhao Pan, Xiucheng Wang, Fushuo Huo, Nan Cheng, Wenchao Xu

发表机构 * Division of Integrative Systems and Design, Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学综合系统与设计学部,中国香港) State Key Laboratory of ISN and School of Telecommunications Engineering, Xidian University, Xi’an 710071, China(西安电子科技大学综合业务网理论及关键技术国家重点实验室及通信工程学院,中国西安,710071) School of Cyber Science and Engineering, Southeast University, Nanjing, China(东南大学网络空间安全学院,中国南京)

AI总结 提出一种特征解耦与对抗训练框架,通过分离发射机与接收机特征并抑制接收机信息,解决射频指纹识别中接收机更换导致的性能下降问题。

AI中文摘要

射频指纹识别(RFFI)是无线网络安全的关键技术,利用硬件固有缺陷实现发射机识别。尽管深度神经网络能有效提取判别性射频特征,但在实际部署中,其性能受接收机引入的变异性显著影响。真实场景中,射频信号天然地混合了发射机特定特征与接收机依赖失真,导致模型在相同设备上训练和评估时会捕获接收机相关模式。因此,部署时更换接收机常导致性能显著下降。为解决此问题,我们提出一种跨接收机鲁棒的RFFI框架,明确解耦发射机特定和接收机特定表示。该方法整合对抗域对齐与接收机感知正则化,抑制发射机特征中的残余接收机信息,同时强制接收机特定表示的内部一致性。进一步引入特征分离约束,在潜在空间中解耦两个组件。在多接收机WiFi数据集上的大量实验表明,所提方法在跨接收机评估中持续优于最先进基线,并显著提升对接收机更换的鲁棒性。

英文摘要

Radio frequency fingerprint identification (RFFI) is a key technique for wireless network security, leveraging intrinsic hardware imperfections to enable transmitter identification. Although deep neural networks are effective at extracting discriminative RF features, their performance is significantly affected by receiver-induced variability in practical deployments. In real-world scenarios, RF signals inherently entangle transmitter-specific characteristics with receiver-dependent distortions, leading models to capture receiver-related patterns when training and evaluation are conducted on the same device. Consequently, replacing the receiver during deployment often results in notable performance degradation. To address this issue, we propose a cross-receiver robust RFFI framework that explicitly disentangles transmitter-specific and receiver-specific representations. The proposed method integrates adversarial domain alignment with receiver-aware regularization to suppress residual receiver information in transmitter features while enforcing intra-receiver consistency in receiver-specific representations. A feature separation constraint is further introduced to decouple the two components in the latent space. Extensive experiments on multi-receiver WiFi datasets demonstrate that the proposed method consistently outperforms state-of-the-art baselines under cross-receiver evaluation and significantly improves robustness to receiver replacement.

2510.09250 2026-05-27 physics.flu-dyn cs.LG 版本更新

Smart navigation of a gravity-driven glider with adjustable centre-of-mass

可调质心的重力驱动滑翔机的智能导航

X. Jiang, J. Qiu, K. Gustavsson, B. Mehlig, L. Zhao

发表机构 * AML, Department of Engineering Mechanics, Tsinghua University, 100084 Beijing, China(AML,工程力学系,清华大学,北京100084,中国) Department of Physics, Gothenburg University, 41296 Gothenburg, Sweden(物理系,哥德堡大学,瑞典41296哥德堡)

AI总结 通过直接数值模拟和强化学习,研究了可调质心的紧凑型滑翔机在粘性流体中沉降时的最优导航策略,揭示了粒子雷诺数对策略选择的关键影响。

Comments 13 pages, 8 figures

Journal ref
Phys. Rev. Research 7 (2025) 043200
AI中文摘要

人工滑翔机被设计为在流体中沉降时分散,需要精确导航以到达目标位置。我们展示了一个在粘性流体中沉降的紧凑型滑翔机可以通过动态调整其质心来导航。使用完全解析的直接数值模拟(DNS)和强化学习,我们发现了两种最优导航策略,使滑翔机能够准确到达目标位置。这些策略敏感地依赖于滑翔机与周围流体的相互作用方式。这种相互作用的性质随着粒子雷诺数Re$_p$的变化而变化。我们的结果解释了最优策略如何依赖于Re$_p$。在大的Re$_p$下,滑翔机学会通过在其方向改变时移动质心来快速翻滚。这产生了大的水平惯性升力,使滑翔机能够远距离移动。相比之下,在小的Re$_p$下,高粘度阻碍了翻滚。在这种情况下,滑翔机学会调整其质心,使其以稳定的倾斜方向沉降,从而产生水平粘性力。水平范围比大Re$_p$时小得多,因为这种粘性力远小于大Re$_p$下的惯性升力。

英文摘要

Artificial gliders are designed to disperse as they settle through a fluid, requiring precise navigation to reach target locations. We show that a compact glider settling in a viscous fluid can navigate by dynamically adjusting its centre-of-mass. Using fully resolved direct numerical simulations (DNS) and reinforcement learning, we find two optimal navigation strategies that allow the glider to reach its target location accurately. These strategies depend sensitively on how the glider interacts with the surrounding fluid. The nature of this interaction changes as the particle Reynolds number Re$_p$ changes. Our results explain how the optimal strategy depends on Re$_p$. At large Re$_p$, the glider learns to tumble rapidly by moving its centre-of-mass as its orientation changes. This generates a large horizontal inertial lift force, which allows the glider to travel far. At small Re$_p$, by contrast, high viscosity hinders tumbling. In this case, the glider learns to adjust its centre-of-mass so that it settles with a steady, inclined orientation that results in a horizontal viscous force. The horizontal range is much smaller than for large Re$_p$, because this viscous force is much smaller than the inertial lift force at large Re$_p$. *These authors contributed equally.

2510.08932 2026-05-27 cs.LG cs.IR 版本更新

MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths

MATT-CTR:一种模型无关的测试时范式,用于通过置信度引导的推理路径进行CTR预测

Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出一种模型无关的测试时范式MATT,利用特征组合的置信度分数生成多条推理路径并聚合预测,以缓解低置信度特征对CTR预测的影响。

AI中文摘要

近期,越来越多的研究致力于优化CTR模型架构以更好地建模特征交互,或改进训练目标以辅助参数学习,从而获得更好的预测性能。然而,以往的工作主要集中在训练阶段,很大程度上忽视了推理阶段的优化机会。特别是,不常出现的特征组合会降低预测性能,导致不可靠或低置信度的输出。为了释放已训练CTR模型的预测潜力,我们提出了一种模型无关的测试时范式(MATT),该范式利用特征组合的置信度分数来指导生成多条推理路径,从而减轻低置信度特征对最终预测的影响。具体来说,为了量化特征组合的置信度,我们引入了一种层次概率哈希方法来估计不同阶数特征组合的出现频率,这些频率作为对应的置信度分数。然后,以置信度分数作为采样概率,通过迭代采样生成多条实例特定的推理路径,并随后聚合来自多条路径的预测分数以进行稳健预测。最后,广泛的离线实验和在线A/B测试强有力地验证了MATT在现有CTR模型上的兼容性和有效性。

英文摘要

Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
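
The sampling-and-aggregation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the plain frequency counter stands in for the hierarchical probabilistic hashing, only first-order features are scored, and the names `count_features`, `sample_paths`, `predict_with_paths`, the `cap` parameter, and the toy model are all hypothetical.

```python
import random
from collections import Counter


def count_features(train_rows):
    """Count how often each (field, value) pair occurs in the training rows."""
    counts = Counter()
    for row in train_rows:
        counts.update(row.items())
    return counts


def sample_paths(row, counts, cap, n_paths, rng):
    """Draw n_paths feature subsets. Each feature is kept with probability
    equal to its assumed confidence, min(1, count / cap)."""
    paths = []
    for _ in range(n_paths):
        kept = {
            field: value
            for field, value in row.items()
            if rng.random() < min(1.0, counts[(field, value)] / cap)
        }
        paths.append(kept)
    return paths


def predict_with_paths(model, row, counts, cap=3, n_paths=8, seed=0):
    """Average the model's scores over the sampled inference paths."""
    rng = random.Random(seed)
    paths = sample_paths(row, counts, cap, n_paths, rng)
    scores = [model(path) for path in paths]
    return sum(scores) / len(scores)


# Toy usage: "u1" is frequent, the item value "rare" was never seen in training.
train = [{"user": "u1", "item": "a"}] * 3
counts = count_features(train)
toy_model = lambda feats: 0.9 if "item" in feats else 0.2  # hypothetical scorer
score = predict_with_paths(toy_model, {"user": "u1", "item": "rare"}, counts)
```

In this toy setup the unseen item value has confidence 0 and is dropped from every path, so the aggregated score reflects only the well-supported user feature.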

2505.07894 2026-05-27 cs.NI cs.ET cs.LG eess.SP math.ST stat.TH 版本更新

EnvCDiff: Joint Refinement of Environmental Information and Channel Fingerprints via Conditional Generative Diffusion Model

EnvCDiff:通过条件生成扩散模型联合优化环境信息与信道指纹

Zhenzhou Jin, Li You, Xiang-Gen Xia, Xiqi Gao

发表机构 * National Mobile Communications Research Laboratory, Southeast University(东南大学移动通信国家重点实验室) Purple Mountain Laboratories(紫金山实验室) Department of Electrical and Computer Engineering, University of Delaware(特拉华大学电气与计算机工程系)

AI总结 针对环境信息和信道指纹粗粒度问题,提出条件生成扩散模型(CDiff)同时细化两者,从粗粒度重建细粒度EnvCF,实验表明性能显著提升。

Comments 6 pages, 2 figures

详情
Journal ref
IEEE Transactions on Vehicular Technology, vol. 75, no. 4, pp. 6846-6851, Apr. 2026
AI中文摘要

从环境无感知通信向智能环境感知通信的范式转变有望促进未来无线通信中信道状态信息的获取。信道指纹(CF)作为环境感知通信的新兴使能技术,为目标通信区域内潜在位置提供信道相关知识。然而,由于用于感知环境信息和测量信道相关知识的实际设备有限,大多数获取的环境信息和CF是粗粒度的,不足以指导无线传输设计。为此,本文提出一种深度条件生成学习方法,即定制的条件生成扩散模型(CDiff)。所提出的CDiff同时细化环境信息和CF,从其粗粒度对应物重建包含环境信息的细粒度CF,称为EnvCF。实验结果表明,与基线相比,所提方法显著提高了EnvCF构建的性能。

英文摘要

The paradigm shift from environment-unaware communication to intelligent environment-aware communication is expected to facilitate the acquisition of channel state information for future wireless communications. Channel Fingerprint (CF), as an emerging enabling technology for environment-aware communication, provides channel-related knowledge for potential locations within the target communication area. However, due to the limited availability of practical devices for sensing environmental information and measuring channel-related knowledge, most of the acquired environmental information and CF are coarse-grained, insufficient to guide the design of wireless transmissions. To address this, this paper proposes a deep conditional generative learning approach, namely a customized conditional generative diffusion model (CDiff). The proposed CDiff simultaneously refines environmental information and CF, reconstructing a fine-grained CF that incorporates environmental information, referred to as EnvCF, from its coarse-grained counterpart. Experimental results show that the proposed approach significantly improves the performance of EnvCF construction compared to the baselines.

2506.23274 2026-05-27 cs.LG cs.AI 版本更新

Real-Time Progress Prediction in Reasoning Language Models

推理语言模型中的实时进度预测

Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 研究通过离散化推理轨迹训练线性探针和微调模型生成0-100%进度估计,实现推理语言模型中的实时进度预测,并在数学推理任务上达到0.161 MAE。

详情
AI中文摘要

最近的推理语言模型,特别是那些采用长潜在思维链的模型,在复杂的智能体任务上表现出色。然而,随着这些模型在越来越长的时间范围内运行,其内部进展对用户变得不透明,使得期望管理和实时监督变得困难。在这项工作中,我们研究了对此类模型进行实时进度预测的可行性。我们首先通过离散化推理轨迹并训练线性探针对推理状态进行分类,测试隐藏状态是否编码进度信息。然后,我们微调模型以在思维链推理过程中生成0-100%的进度估计。我们最强的进度报告检查点在数学推理轨迹上达到了0.161的平均绝对误差,并在此设置中优于位置基线。最后,我们通过测量相同部分展开中隐含进度值的变化程度,量化了进度标签的内在模糊性。这种模糊性在Qwen3-4B中最低,其延续产生的展开离散度最小,表明更大的模型可以通过减少剩余解决方案长度的变化来使进度标签更稳定。

英文摘要

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0 to 100% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.
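
To make the MAE figure concrete, here is a minimal sketch of how progress labels and the error metric might be computed. The labeling rule (fraction of tokens emitted so far) and the constant-guess baseline are assumptions for illustration; the abstract does not specify the exact label definition or the position baselines used.

```python
def progress_labels(num_tokens):
    """Assumed label: fraction of the trace completed after each token."""
    return [(i + 1) / num_tokens for i in range(num_tokens)]


def mean_absolute_error(pred, target):
    """Mean absolute difference between predicted and true progress."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


labels = progress_labels(4)      # progress after each of 4 tokens
constant_guess = [0.5] * 4       # hypothetical baseline: always predict 50%
mae = mean_absolute_error(constant_guess, labels)
```

Under this assumed labeling, progress is a fraction in [0, 1], so an MAE of 0.161 would correspond to an average error of about 16 percentage points.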

2510.06381 2026-05-27 cs.LG cs.AI 版本更新

Monte Carlo Permutation Search

蒙特卡洛排列搜索

Tristan Cazenave

AI总结 提出一种改进GRAVE算法的通用蒙特卡洛树搜索算法MCPS,通过利用路径上所有节点的统计信息,在多种游戏中优于GRAVE,并给出了统计权重公式的数学推导。

详情
AI中文摘要

我们提出蒙特卡洛排列搜索(MCPS),一种改进GRAVE算法的通用蒙特卡洛树搜索(MCTS)算法。当深度强化学习不可行或游戏前可用计算资源有限时(如通用游戏博弈),MCPS具有相关性。MCPS的原理是在节点的探索项中包含从根节点到该节点路径上所有走法的所有模拟的统计信息。我们在多种游戏上测试MCPS:Hex、Go、AtariGo、NoGo和一个Wargame。MCPS几乎总是优于GRAVE。我们还提供了用于加权三种统计来源的公式的数学推导。这些公式是对GRAVE公式的改进,因为它们不再使用GRAVE的偏差超参数。

英文摘要

We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option or when the computing power available before play is not substantial, such as in General Game Playing. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We test MCPS on a variety of games: Hex, Go, AtariGo, NoGo and a Wargame. MCPS almost always outperforms GRAVE. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE.

2510.01168 2026-05-27 math.OC cs.LG cs.NA math.NA stat.ML 版本更新

A first-order method for constrained nonconvex-nonconcave minimax optimization

约束非凸-非凹极小极大优化的一阶方法

Zhaosong Lu, Xiangyuan Wang

发表机构 * Department of Industrial and Systems Engineering, University of Minnesota, USA(明尼苏达大学工业与系统工程系)

AI总结 针对内层最大化含复杂约束的非凸-非凹极小极大问题,通过提升重构和局部KL条件,提出基于序列凸规划的不精确近端梯度法并证明收敛性。

Comments 27 pages

详情
AI中文摘要

我们研究一类约束非凸-非凹极小极大优化问题,其中内层最大化涉及潜在复杂约束。在假设新型提升极小极大重构的内层问题满足局部Kurdyka-Lojasiewicz (KL)条件的情况下,我们证明原问题的极大函数具有局部广义Hölder光滑性。我们还提出了一种求解约束优化问题的序列凸规划(SCP)方法,并在局部KL条件下建立了其收敛速率。利用这些结果,我们为原始极小极大问题开发了一种不精确近端梯度法,其中极大函数的不精确梯度通过将SCP方法应用于局部KL结构的子问题来计算。最后,我们为所提方法在计算原始极小极大问题的近似稳定点方面建立了复杂度保证。

英文摘要

We study a class of constrained nonconvex-nonconcave minimax optimization problems in which the inner maximization involves potentially complex constraints. Under the assumption that the inner problem of a novel lifted minimax reformulation satisfies a local Kurdyka-Lojasiewicz (KL) condition, we show that the maximal function of the original problem enjoys a local generalized Hölder smoothness property. We also propose a sequential convex programming (SCP) method for solving constrained optimization problems and establish its convergence rate under a local KL condition. Leveraging these results, we develop an inexact proximal gradient method for the original minimax problem, where the inexact gradient of the maximal function is computed via the SCP method applied to a locally KL-structured subproblem. Finally, we establish complexity guarantees for the proposed method in computing an approximate stationary point of the original minimax problem.

2510.01336 2026-05-27 cs.CL cs.AI cs.LG 版本更新

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

发表机构 * Department of Electrical and Computer Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校电子与计算机工程系)

AI总结 提出HiSpec框架,利用早期退出模型进行低开销中间验证,通过重用键值缓存和隐藏状态提高吞吐量,平均加速1.28倍,最高2.01倍,且不损失准确性。

详情
AI中文摘要

推测解码通过使用较小的草稿模型推测令牌,再由较大的目标模型验证,从而加速LLM推理。验证通常是瓶颈(例如,当3B模型为70B目标模型推测时,验证速度比令牌生成慢4倍),但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间,但现有方法在引入中间验证器时会产生大量训练开销,增加内存占用以协调中间验证步骤,并依赖近似启发式方法损害准确性。我们提出分层推测解码(Hierarchical Speculative Decoding,HiSpec),一种高吞吐量推测解码框架,利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出,并经过显式训练,使得选定层的隐藏状态可解释,从而在不显著增加计算和内存开销的情况下,非常适合中间验证。为了进一步提高资源效率,我们设计了一种方法,使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性,HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明,与基线单层推测相比,HiSpec平均提高吞吐量1.28倍,最高达2.01倍,且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.
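
The draft, intermediate-verify, target-verify pipeline described above can be sketched with toy greedy "models" (plain functions from a token list to the next token). This is a simplified illustration under stated assumptions, not HiSpec itself: it uses exact-match greedy acceptance, calls the target on every step rather than periodically, and omits the early-exit layers and cache reuse. The names `accept_prefix` and `hierarchical_step` are hypothetical.

```python
def accept_prefix(verifier, context, draft):
    """Keep the longest prefix of draft tokens that the verifier would also
    have produced greedily from the same context."""
    accepted = []
    for token in draft:
        if verifier(context + accepted) != token:
            break
        accepted.append(token)
    return accepted


def hierarchical_step(draft_model, mid_verifier, target, context, k=4):
    """One decoding step: draft k tokens, filter them with a cheap
    intermediate verifier, then confirm the survivors with the target."""
    draft = []
    for _ in range(k):
        draft.append(draft_model(context + draft))
    survivors = accept_prefix(mid_verifier, context, draft)
    accepted = accept_prefix(target, context, survivors)
    # The target always contributes one token, so the step makes progress
    # even when every draft token is rejected.
    accepted.append(target(context + accepted))
    return accepted


# Toy models: the target emits len(context) % 3; the draft errs at position 2.
target = lambda ctx: len(ctx) % 3
mid_verifier = target  # assume a perfectly calibrated intermediate verifier
draft_model = lambda ctx: 9 if len(ctx) == 2 else len(ctx) % 3
step = hierarchical_step(draft_model, mid_verifier, target, [])
```

Because every emitted token is checked against the target's greedy choice, the output of this sketch matches plain greedy decoding with the target; the intermediate verifier only reduces how many draft tokens the target has to inspect.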

2509.21167 2026-05-27 cs.LG cs.CV 版本更新

A Unified Framework for Diffusion Model Unlearning with f-Divergence

基于f-散度的扩散模型遗忘统一框架

Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

发表机构 * University of Klagenfurt, Austria(克拉根福大学) Sapienza University of Rome, Italy(罗马萨皮恩扎大学) Imperial College London, UK(帝国理工学院)

AI总结 提出一个基于f-散度的统一框架,将扩散模型概念遗忘中的MSE损失推广到任意f-散度,并通过理论分析和实验验证不同散度对遗忘效果的影响。

Comments Accepted at ICML 2026

详情
AI中文摘要

现有的大多数文本到图像扩散模型概念遗忘方法最小化基于目标概念和锚定概念的去噪器输出之间的均方误差(MSE)损失,这隐式地是两个高斯分布之间的KL散度。我们将这一目标推广到任意$f$-散度,将MSE恢复为KL实例,并识别出一族$\alpha$-散度,其高斯闭式形式产生廉价、类似MSE的训练目标。对于剩余的$f$-散度,我们基于$f$-散度的变分公式提供了一个最小-最大目标。我们从理论上分析并数值验证了不同$f$-散度如何影响梯度幅度和算法的收敛性质,从而影响遗忘质量。例如,我们观察到Hellinger闭式实例在多种场景下始终优于MSE。更一般地,所提出的统一框架为根据应用和用户目标选择最优散度提供了灵活的范式,允许对遗忘效果与生成保真度之间的权衡进行更精细的控制。

英文摘要

Most existing methods for concept unlearning in text-to-image diffusion models minimize a mean squared error (MSE) loss between the denoiser outputs conditioned on a target and an anchor concept, which is implicitly the KL divergence between two Gaussians. We generalize this objective to any $f$-divergence, recovering MSE as the KL instance, and identify a family of $\alpha$-divergences whose Gaussian closed-form yields cheap, MSE-like training objectives. For the remaining $f$-divergences, we provide a min-max objective based on the variational formulation of the $f$-divergence. We theoretically analyze and numerically validate how different $f$-divergences impact the gradient magnitude and the convergence properties of the algorithm, affecting the quality of unlearning. For instance, we observe that the Hellinger closed-form instance consistently dominates MSE across multiple scenarios. More generally, the proposed unified framework offers a flexible paradigm for selecting the optimal divergence based on the application and user goal, allowing for finer control over the trade-off between unlearning efficacy and generative fidelity.
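
As a worked check of the statement that the MSE loss is implicitly a KL divergence, consider two isotropic Gaussians with a shared variance, $P=\mathcal{N}(\mu_1,\sigma^2 I)$ and $Q=\mathcal{N}(\mu_2,\sigma^2 I)$. Identifying $\mu_1$ and $\mu_2$ with the two denoiser outputs and fixing a common $\sigma^2$ are assumptions made here for illustration; the paper's exact parameterization and its $\alpha$-divergence objectives may differ. The standard closed forms are

$$
D_{\mathrm{KL}}(P\,\|\,Q)=\frac{\|\mu_1-\mu_2\|^2}{2\sigma^2},
\qquad
H^2(P,Q)=1-\exp\!\left(-\frac{\|\mu_1-\mu_2\|^2}{8\sigma^2}\right),
$$

with the convention $H^2(P,Q)=1-\int\sqrt{p\,q}$. The KL term is the squared error between the means up to the constant $1/(2\sigma^2)$. Differentiating with respect to $\mu_1$ gives

$$
\nabla_{\mu_1}D_{\mathrm{KL}}=\frac{\mu_1-\mu_2}{\sigma^2},
\qquad
\nabla_{\mu_1}H^2=\frac{\mu_1-\mu_2}{4\sigma^2}\exp\!\left(-\frac{\|\mu_1-\mu_2\|^2}{8\sigma^2}\right),
$$

so under these assumptions the KL gradient grows linearly with the distance between the means while the Hellinger gradient is damped when they are far apart, which is one way the choice of divergence can change gradient magnitude.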

2509.15121 2026-05-27 hep-ph cs.LG hep-ex 版本更新

Shedding Light on Dark Matter at the LHC with Machine Learning

利用机器学习在LHC上揭示暗物质

Ernesto Arganda, Martín de los Rios, Andres D. Perez, Subhojit Roy, Rosa M. Sandá Seoane, Carlos E. M. Wagner

发表机构 * Departamento de Física Teórica, Universidad Autónoma de Madrid(马德里自治大学理论物理系) Instituto de Física Teórica UAM-CSIC(UAM-CSIC理论物理研究所) SISSA - International School for Advanced Studies(国际高等研究院SISSA) Instituto de Astronomía Teórica y Experimental, CONICET - UNC(理论与实验天文学研究所,CONICET-UNC) HEP Division, Argonne National Laboratory(阿贡国家实验室高能物理部) Enrico Fermi Institute, Physics Department, University of Chicago(恩里科·费米研究所,芝加哥大学物理系) Kavli Institute for Cosmological Physics, University of Chicago(芝加哥大学卡弗里宇宙物理研究所) Leinweber Center for Theoretical Physics, University of Chicago(芝加哥大学Leinweber理论物理中心) Perimeter Institute for Theoretical Physics, Waterloo, Ontario N2L 2Y5, Canada(加拿大滑铁卢圆周理论物理研究所)

AI总结 研究在Z3对称的次最小超对称标准模型中,通过机器学习分析辐射衰变中性微子(neutralino)产生的多光子信号,实现LHC上对singlino(单态微子)主导的LSP暗物质的5σ发现能力。

Comments 26 pages + references, 6 figures, 8 tables, 1 appendix (version published in PRD)

详情
Journal ref
Phys. Rev. D 113 (2026) 9, 095013
AI中文摘要

我们在$Z_3$对称的次最小超对称标准模型(NMSSM)中,研究了一个以singlino(单态微子)主导的最轻超对称粒子(LSP)形式的WIMP暗物质(DM)候选者。该框架产生了参数空间区域,其中暗物质通过与邻近的higgsino-like电弱微子(electroweakino)共同湮灭而获得,且暗物质直接探测信号被抑制,即所谓的“盲点”。另一方面,由于higgsino到singlino主导的LSP和光子的辐射衰变模式增强,而不是衰变成轻子或强子,对撞机特征仍然有希望。与具有轻bino-like和wino-like电弱微子的MSSM情景相比,NMSSM允许来自级联辐射衰变的多光子末态,提供了独特的对撞机特征。这激发了寻找辐射衰变中性微子(neutralino)的研究。然而,这些信号面临巨大的背景挑战,因为LSP和higgsino-like共同湮灭伙伴之间的质量差($\Delta m$)很小,衰变产物通常较软。我们应用了一种数据驱动的机器学习(ML)分析,提高了对这些微弱信号的灵敏度,为发现新物理情景提供了对传统搜索策略的有力补充。使用LHC在$14~\mathrm{TeV}$下积分亮度为$100~\mathrm{fb}^{-1}$的数据,该方法对higgsino质量高达$225~\mathrm{GeV}$且$\Delta m\!\lesssim\!12~\mathrm{GeV}$实现了$5\sigma$发现能力,对高达$285~\mathrm{GeV}$且$\Delta m\!\lesssim\!20~\mathrm{GeV}$实现了$2\sigma$排除能力。这些结果突显了对撞机搜索在探测当前直接探测实验隐藏的暗物质候选者方面的能力,并为LHC合作组使用ML方法进行搜索提供了动机。

英文摘要

We investigate a WIMP dark matter (DM) candidate in the form of a singlino-dominated lightest supersymmetric particle (LSP) within the $Z_3$-symmetric Next-to-Minimal Supersymmetric Standard Model (NMSSM). This framework gives rise to regions of parameter space where DM is obtained via co-annihilation with nearby higgsino-like electroweakinos and DM direct detection signals are suppressed, the so-called "blind spots". On the other hand, collider signatures remain promising due to enhanced radiative decay modes of higgsinos into the singlino-dominated LSP and photons, rather than into leptons or hadrons. Compared to MSSM scenarios with light bino- and wino-like electroweakinos, the NMSSM allows for final states with multiple photons arising from cascade radiative decays, providing a distinctive collider signature. This motivates searches for radiatively decaying neutralinos. However, these signals face substantial background challenges, as the decay products are typically soft due to the small mass splittings ($\Delta m$) between the LSP and the higgsino-like co-annihilation partners. We apply a data-driven Machine Learning (ML) analysis that improves sensitivity to these subtle signals, offering a powerful complement to traditional search strategies to discover a new physics scenario. Using an LHC integrated luminosity of $100~\mathrm{fb}^{-1}$ at $14~\mathrm{TeV}$, the method achieves a $5\sigma$ discovery reach for higgsino masses up to $225~\mathrm{GeV}$ with $\Delta m\!\lesssim\!12~\mathrm{GeV}$, and a $2\sigma$ exclusion up to $285~\mathrm{GeV}$ with $\Delta m\!\lesssim\!20~\mathrm{GeV}$. These results highlight the power of collider searches to probe DM candidates that remain hidden from current direct detection experiments, and provide a motivation for a search by the LHC collaborations using ML methods.

2503.20507 2026-05-27 cs.AR cs.DC cs.LG 版本更新

Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning

Harmonia: 通过多智能体强化学习增强混合存储系统中的数据放置与迁移

Rakesh Nadig, Vamanan Arulchelvan, Rahul Bera, Taha Shahroodi, Gagandeep Singh, Andreas Kakolyris, Ismail Emir Yuksel, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

发表机构 * ETH Zürich(苏黎世联邦理工学院) AMD Research(AMD研究)

AI总结 针对混合存储系统中数据放置与迁移策略相互依赖导致性能次优的问题,提出基于多智能体强化学习的协同优化方法Harmonia,显著提升系统性能。

详情
AI中文摘要

现代高性能计算环境依赖混合存储系统(HSS),该系统结合了具有不同延迟、带宽、耐久性和容量特性的多种存储设备,以满足数据密集型应用的性能、容量和成本要求。HSS的性能高度依赖于两个关键数据管理策略:(1)数据放置,决定存储应用数据的最合适存储设备;(2)数据迁移,动态重组跨存储设备的先前存储数据(即预取热数据和驱逐冷数据),以维持高HSS性能。这些策略紧密相互依赖,因此,不考虑另一个而改进其中一个会导致HSS性能次优。不幸的是,先前的工作只专注于优化其中一个策略。我们的目标是设计一种整体数据管理技术,同时优化数据放置和数据迁移策略,以充分利用HSS的潜力。为此,我们提出了Harmonia,一种基于多智能体强化学习(RL)的数据管理技术。Harmonia采用两个轻量级自主RL智能体,即数据放置智能体和数据迁移智能体,它们在相互协调的同时,根据当前工作负载和HSS配置调整各自的策略。我们在具有多达四个异构存储设备和25个数据密集型工作负载的真实HSS配置上评估了Harmonia。在具有两个异构存储设备的性能优化(成本优化)HSS上,Harmonia平均比先前最佳方法高出29.3%(44.8%)。在具有三个(四个)设备的HSS上,Harmonia平均比先前最佳工作高出38.9%(39.2%)。Harmonia的性能优势伴随着低延迟(推理为240纳秒)和存储开销(两个RL智能体合计206 KiB的DRAM)。

英文摘要

Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and cost requirements of data-intensive applications. The performance of an HSS highly depends on two key data-management policies: (1) data placement, which determines the most suitable storage device to store application data, and (2) data migration, which dynamically reorganizes previously-stored data across storage devices (i.e., prefetching hot data and evicting cold data) to sustain high HSS performance. These policies are tightly interdependent, and thus, improving one without considering the other leads to suboptimal HSS performance. Unfortunately, prior works focus on optimizing only one of the policies. Our goal is to design a holistic data-management technique that optimizes both data-placement and data-migration policies to fully exploit the potential of an HSS. To this end, we propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique. Harmonia employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, that adapt their policies for the current workload and HSS configuration while coordinating with each other. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and 25 data-intensive workloads. On a performance- (cost-) optimized HSS with two heterogeneous storage devices, Harmonia outperforms the best-performing prior approach by 29.3% (44.8%) on average. On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 38.9% (39.2%) on average. Harmonia's performance benefits come with low latency (240 ns for inference) and storage (206 KiB in DRAM for both RL agents combined) overheads.

2507.13762 2026-05-27 cs.LG q-bio.BM 版本更新

MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

MolPIF: 一种用于分子生成的参数插值流模型

Yaowei Jin, Junjie Wang, Yufan Tang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi

发表机构 * Lingang Laboratory(临港实验室) School of Information Science and Technology(信息科学与技术学院) ShanghaiTech University(上海科技大学) Drug Discovery and Design Center, State Key Laboratory of Drug Research(药物发现与设计中心、新药研究国家重点实验室) Shanghai Institute of Materia Medica, Chinese Academy of Sciences(中国科学院上海药物研究所) Global Institute of Future Technology(未来技术全球研究院) Shanghai Jiao Tong University(上海交通大学) College of Computer Science and Artificial Intelligence(计算机科学与人工智能学院)

AI总结 提出参数插值流模型MolPIF,通过参数空间分布插值统一连续坐标与离散原子类型的生成,在CrossDocked2020数据集上优于基线方法。

Comments Accepted to Bioinformatics

详情
AI中文摘要

动机:基于结构的药物设计(SBDD)随着深度生成模型的发展而进步,但弥合连续原子坐标与离散原子类型之间的差距仍然是一个挑战。当前的方法,如扩散和流匹配模型,通常未能统一这些异质模态,依赖于分离的策略或对离散变量不合适的欧几里得度量。缺乏一致的框架限制了生成模型捕捉蛋白质-配体复合物的几何和化学结构的能力。结果:我们提出了MolPIF,一种参数插值流机制,旨在统一连续和离散分子变量的生成。与在样本空间中运行的传统流模型不同,MolPIF在参数空间中对分布进行插值,理论上恢复了连续坐标的Wasserstein-2最优传输,并建立了离散原子类型的Fisher-Rao测地线。我们进一步整合了几何增强学习策略,以改善原子上下文的捕捉。在CrossDocked2020数据集上的广泛评估表明,MolPIF在结合亲和力、化学有效性、几何保真度和化学空间覆盖方面优于基线。此外,MolPIF在先导优化中表现出多功能性,并提供灵活的先验分布选择(如Laplace),为SBDD建立了一个稳健的范式。可用性:源代码可在https://github.com/BLEACH366/MolPIF免费获取。补充信息:补充数据可在Bioinformatics上获取。

英文摘要

Motivation: Structure-based drug design (SBDD) has advanced with deep generative models, but bridging the gap between continuous atomic coordinates and discrete atom types remains a challenge. Current approaches, such as diffusion and flow matching models, often fail to unify these heterogeneous modalities, relying on separate strategies or ill-fitting Euclidean metrics for discrete variables. This lack of a consistent framework limits generative models' ability to capture the geometric and chemical structure of protein-ligand complexes. Results: We present MolPIF, a parameter interpolation flow mechanism designed to unify the generation of continuous and discrete molecular variables. Unlike traditional flow models that operate in sample space, MolPIF interpolates between distributions in the parameter space, theoretically recovering Wasserstein-2 optimal transport for continuous coordinates and establishing Fisher-Rao geodesics for discrete atom types. We further incorporate a geometry-enhanced learning strategy to improve the capture of atomic contexts. Extensive evaluations on the CrossDocked2020 dataset demonstrate that MolPIF outperforms baselines in binding affinity, chemical validity, geometric fidelity and chemical space coverage. Additionally, MolPIF exhibits versatility in lead optimization and offers flexible prior distribution selection (such as Laplace), establishing a robust paradigm for SBDD. Availability: Source code is freely available at https://github.com/BLEACH366/MolPIF. Supplementary information: Supplementary data are available at Bioinformatics.
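
For the discrete atom types, the Fisher-Rao geodesic mentioned above has a well-known closed form on the probability simplex: take element-wise square roots, interpolate along the great circle on the unit sphere, and square the result. The sketch below shows that textbook construction only; it is not taken from the MolPIF code, and the function name `fisher_rao_geodesic` is hypothetical.

```python
import math


def fisher_rao_geodesic(p, q, t):
    """Point at time t in [0, 1] on the Fisher-Rao geodesic between two
    categorical distributions p and q (lists of probabilities)."""
    root_p = [math.sqrt(x) for x in p]
    root_q = [math.sqrt(x) for x in q]
    # Angle between the square-root vectors, which lie on the unit sphere.
    cos_theta = min(1.0, sum(a * b for a, b in zip(root_p, root_q)))
    theta = math.acos(cos_theta)
    if theta < 1e-12:  # p and q coincide; nothing to interpolate
        return list(p)
    weight_p = math.sin((1 - t) * theta) / math.sin(theta)
    weight_q = math.sin(t * theta) / math.sin(theta)
    return [(weight_p * a + weight_q * b) ** 2 for a, b in zip(root_p, root_q)]


# Halfway between "certainly type 0" and "certainly type 1".
midpoint = fisher_rao_geodesic([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
```

Every interpolated point stays on the simplex because the spherical interpolation preserves unit norm, which is one reason this path is a natural fit for discrete variables, in contrast to straight-line Euclidean interpolation of one-hot vectors.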

2506.11253 2026-05-27 cs.CV cs.LG 版本更新

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

将数据追踪的机器遗忘提升为基础模型的知识追踪

Yuwen Tan, Boqing Gong

发表机构 * Boston University(波士顿大学)

AI总结 本文提出将数据追踪的机器遗忘提升为基础模型的知识追踪,以应对多样化遗忘请求,并更接近人类遗忘机制,通过视觉语言模型案例展示实现范式。

Comments Accepted to TMLR

详情
AI中文摘要

机器遗忘从AI模型中移除特定训练数据点及其影响(例如,当数据所有者撤销其同意允许模型从数据中学习时)。在这篇立场论文中,我们提出将数据追踪的机器遗忘提升为基础模型(FMs)的知识追踪。我们基于实际需求和认知研究的见解支持这一立场。实际上,追踪数据无法满足对FMs的多样化遗忘请求,这些请求可能来自监管机构、企业用户、产品团队等,他们无法访问FMs的大量训练数据。相反,这些相关方更便于针对FMs应当(或不应当)拥有的知识或能力提出遗忘请求。认知上,知识追踪遗忘比追踪单个训练数据点更接近人脑的遗忘方式。我们进一步讨论了知识追踪机器遗忘范式中的重大挑战。最后,我们提供了一个关于视觉语言FMs的具体案例研究,以说明遗忘者如何实例化知识追踪机器遗忘范式。代码可在:https://1yuwen.github.io/Knowledge-Tracing-MU-Page 获取。

英文摘要

Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

2502.06963 2026-05-27 cs.LG cs.AI cs.DC cs.MA 版本更新

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

车辆边缘计算中的智能卸载:深度强化学习方法与架构综述

Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor(温莎大学电气与计算机工程系)

AI总结 本文综述了基于深度强化学习的车辆边缘计算卸载方法,分类比较了学习范式、系统架构和优化目标,并分析了马尔可夫决策过程的应用及未来研究方向。

Comments 33 Pages, 6 Figures, 7 Tables. Machine Learning, Reinforcement Learning, Multi Agent Reinforcement Learning, Computational Offloading and Edge Computing

详情
AI中文摘要

智能交通系统(ITS)日益复杂,导致对计算卸载到边缘服务器、车辆节点和无人机等外部基础设施的兴趣显著增加。这些动态异构环境给传统卸载策略带来了挑战,促使人们探索强化学习(RL)和深度强化学习(DRL)作为自适应决策框架。本综述全面回顾了基于DRL的车辆边缘计算(VEC)卸载的最新进展。我们根据学习范式(如单智能体、多智能体)、系统架构(如集中式、分布式、分层式)和优化目标(如延迟、能量、公平性)对现有工作进行分类和比较。此外,我们分析了马尔可夫决策过程(MDP)公式的应用方式,并强调了奖励设计、协调机制和可扩展性方面的新兴趋势。最后,我们确定了开放挑战,并概述了未来研究方向,以指导下一代ITS鲁棒且智能的卸载策略的开发。

英文摘要

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision-making frameworks. This survey presents a comprehensive review of recent advances in DRL-based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single-agent, multi-agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next-generation ITS.

2411.02355 2026-05-27 cs.LG cs.AI 版本更新

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

“给我BF16,否则给我死亡”?LLM量化中的精度-性能权衡

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

发表机构 * Red Hat AI(红帽AI) Institute of Science and Technology Austria(奥地利科学与技术研究院)

AI总结 本文通过超过50万次评估,全面研究了FP8、INT8和INT4量化在Llama-3.1模型族上的精度-性能权衡,发现FP8基本无损、INT8精度损失低、INT4仅权重量化具有竞争力,并基于vLLM框架给出了不同部署场景下的最优量化格式建议。

Comments Accepted to ACL 2025

详情
AI中文摘要

量化是加速大型语言模型(LLM)推理的强大工具,但不同格式下的精度-性能权衡仍不明确。在本文中,我们进行了迄今为止最全面的实证研究,评估了FP8、INT8和INT4量化在整个Llama-3.1模型族上的学术基准和实际任务。通过超过50万次评估,我们的研究得出了几个关键发现:(1)FP8(W8A8-FP)在所有模型规模上基本无损;(2)良好调优的INT8(W8A8-INT)实现了令人惊讶的低精度下降(1-3%);(3)INT4仅权重量化(W4A16-INT)比预期更具竞争力,可与8位量化相媲美。此外,我们通过流行的vLLM框架分析推理性能,研究了不同部署场景下的最优量化格式。我们的分析提供了明确的部署建议:W4A16是同步设置中最具成本效益的,而W8A8在异步连续批处理中占主导地位。对于混合工作负载,最优选择取决于具体用例。我们的发现为大规模部署量化LLM提供了实用的、数据驱动的指导,确保速度、效率和精度之间的最佳平衡。

英文摘要

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.
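
To show what the integer weight formats mean numerically, here is a minimal sketch of symmetric, per-tensor, round-to-nearest quantization. It is an illustration under simplifying assumptions, not the study's pipeline: the abstract does not say which calibration or quantization algorithms were used, and the function names are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: scale={scale:.4f}, worst-case error={worst:.4f}")
```

With round-to-nearest, the per-weight error is bounded by half the scale, so moving from 8 to 4 bits makes the grid roughly 18 times coarser for the same tensor (127 versus 7 levels per side).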

2505.18728 2026-05-27 cs.LG cs.AI 版本更新

Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

消息传递状态空间模型:利用现代序列建模改进图学习

Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

发表机构 * University of Pisa(比萨大学) University of Cambridge(剑桥大学)

AI总结 提出MP-SSM,将现代状态空间模型的核心计算嵌入消息传递神经网络,实现静态和时序图上的高效、置换等变和长程信息传播,并通过精确敏感性分析刻画深层信息流问题。

详情
AI中文摘要

状态空间模型(SSM)在序列建模中的近期成功推动了其向图学习的迁移,催生了图状态空间模型(GSSM)。然而,现有的GSSM通过将SSM模块应用于从图中提取的序列,往往损害了置换等变性、消息传递兼容性和计算效率等核心属性。本文引入了一种新视角,将现代SSM计算的关键原理直接嵌入消息传递神经网络框架,从而为静态图和时序图提供统一的方法论。我们的方法MP-SSM能够实现高效、置换等变和长程信息传播,同时保持消息传递的架构简洁性。关键的是,MP-SSM支持精确的敏感性分析,我们利用该分析从理论上刻画信息流,并评估深层网络中的梯度消失和过压缩等问题。此外,我们的设计选择允许类似现代SSM的高度优化并行实现。我们在包括节点分类、图属性预测、长程基准和时空预测在内的广泛任务上验证了MP-SSM,展示了其多功能性和强大的实证性能。

英文摘要

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

发表机构 * South China University of Technology(华南理工大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出OCR-Reasoning基准,包含1069个人工标注样本,覆盖6种核心推理能力和18个实际推理任务,通过双标注(最终答案和逐步推理过程)评估多模态大语言模型在文本丰富图像推理中的能力,发现最先进模型准确率均低于50%。

Comments ICLR 2026

Abstract

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

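2505.16942 2026-05-27 cs.CV cs.LG version update

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

Karlis Martins Briedis, Markus Gross, Christopher Schroers

Affiliations DisneyResearch|Studios; ETH Zürich

AI summary Proposes a memory- and compute-efficient algorithm that implements the exact mathematical operation of all-pairs correlation volume sampling, substantially improving speed while keeping memory usage low, and applies it to high-resolution optical flow estimation to reach state-of-the-art performance.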

Comments CVPR 2026

Abstract

Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is significantly slower in practice and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for both memory and compute-efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.
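
The trade-off described in this abstract, between a dense all-pairs volume and on-demand cost computation, can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a 1-D "image", integer window offsets instead of RAFT's bilinear sampling, and hypothetical helper names (`dense_lookup`, `on_demand_lookup`). It only shows that both baselines return the same window costs, while the dense one stores an N x N volume.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_lookup(feat1, feat2, centers, radius):
    """Build the full N x N correlation volume, then read one window per pixel."""
    scale = 1.0 / math.sqrt(len(feat1[0]))
    volume = [[dot(f1, f2) * scale for f2 in feat2] for f1 in feat1]  # O(N^2) memory
    n = len(feat2)
    return [[volume[i][c + d] if 0 <= c + d < n else 0.0
             for d in range(-radius, radius + 1)]
            for i, c in enumerate(centers)]

def on_demand_lookup(feat1, feat2, centers, radius):
    """Compute only the correlations inside each window; no N x N volume is stored."""
    scale = 1.0 / math.sqrt(len(feat1[0]))
    n = len(feat2)
    return [[dot(feat1[i], feat2[c + d]) * scale if 0 <= c + d < n else 0.0
             for d in range(-radius, radius + 1)]
            for i, c in enumerate(centers)]

# Toy data: 8 pixels, 4-dimensional features, window radius 2 (all assumed values).
random.seed(0)
n_pixels, dim, radius = 8, 4, 2
feat1 = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_pixels)]
feat2 = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_pixels)]
centers = [random.randrange(n_pixels) for _ in range(n_pixels)]  # current match estimates

dense = dense_lookup(feat1, feat2, centers, radius)
sparse = on_demand_lookup(feat1, feat2, centers, radius)
```

Both functions evaluate the same dot products, so their outputs agree. The difference is that the dense variant pays quadratic memory up front, while the on-demand variant recomputes costs at every lookup.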

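2502.17666 2026-05-27 cs.LG cs.AI version update

Yes, Q-learning Helps Offline In-Context RL

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

Affiliations Reinforcement Learning Journal

AI summary Integrates RL objectives into the offline in-context RL framework. Experiments on more than 150 datasets show that directly optimizing RL objectives improves performance by about 30% on average over Algorithm Distillation, and that conservatism in value learning brings additional gains.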

Abstract

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.

2505.02974 2026-05-27 cs.LG version update

PLAID: A Unified Data Model for Machine Learning on Heterogeneous Physics Simulations

Fabien Casenave, Xavier Roynard, Brian Staber, Alexandre Devaux-Rivière, William Piat, Michele Alessandro Bucci, Nissrine Akkari, Abbas Kabalan, Xuan Minh Vuong Nguyen, Luca Saverio, Raphaël Carpintero Perez, Anthony Kalaydjian, Samy Fouché, Thierry Gonon, Ghassan Najjar, Thomas Daniel, Emmanuel Menier, Matthieu Nastorg, Giovanni Catalani, Christian Rey

Affiliations EPFL; Inria

AI summary Proposes PLAID, a unified data layer that standardizes heterogeneous physics simulation data, and releases six benchmark datasets to address the lack of large-scale, diverse datasets for machine learning surrogate models.

Comments Presented at EuRIPS 2025 and accepted at the AI4Physics Workshop @ ICML 2026

Abstract

Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows, but their adoption is limited by the lack of large-scale, diverse, and standardized datasets for physics-based simulations. Existing benchmarks often focus on narrow domains or rely on simplified data models, and fail to capture the heterogeneity arising from variable geometries, meshes, and topologies, which is critical for assessing generalization in realistic settings. We introduce PLAID (Physics-Learning AI Data model), a unified and extensible data layer for heterogeneous physics simulations. It preserves the full complexity of simulation data while enabling efficient and scalable machine learning workflows, together with a library for dataset construction and manipulation (https://github.com/PLAID-lib/plaid). We release six datasets covering structural mechanics and computational fluid dynamics, designed to reflect realistic industrial scenarios and provide standardized benchmarks. The framework includes reproducible evaluation protocols and is integrated with Hugging Face to enable open, community-driven benchmarking with active user participation (https://huggingface.co/PLAIDcompetitions).

2503.21510 2026-05-27 cs.LG cs.CV stat.ML version update

An uncertainty-aware Bayesian framework for machine learning classification models: A case study in land cover classification

Samuel Bilson, Miles McCrory, Anna Pustogvar

Affiliations National Physical Laboratory, Teddington, UK; Department of Data Science; Department of Thermal & Radiometric Metrology; School of Geography, Geology & the Environment

AI summary Proposes a Bayesian framework for generative classification models that accounts for input measurement uncertainty, validated with a Bayesian quadratic discriminant analysis model on land cover datasets. The model outperforms random forests and neural networks in interpretability, uncertainty modelling, and computational efficiency.

Comments 38 pages, 16 figures

Abstract

Ensuring that predictions of machine learning (ML) classification models are accompanied by uncertainty estimates is one of the main pillars of trustworthy AI. Current research in uncertainty quantification focuses mainly on epistemic uncertainty of the ML model, but rarely takes account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian framework for generative ML classification models that takes account of input measurement uncertainty. We take the specific case of a Bayesian quadratic discriminant analysis (BQDA) model, and apply it to metrological land cover datasets from Copernicus Sentinel-2 from 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. To validate and assess the generalisability of such a model, we also run simulations over synthetic classification data, varying distribution type and strength of the input measurement noise. We find for both real and synthetic data, the BQDA model presented is more trustworthy, in the sense that it is more interpretable, explicitly models the input measurement uncertainty, and maintains predictive performance of class probability outputs across datasets over different domains and sizes, whilst also being more computationally efficient.

2504.02993 2026-05-27 eess.SY cs.LG cs.SY version update

Route Recommendations for Traffic Management Under Learned Partial Driver Compliance

Heeseung Bang, Jung-Hoon Cho, Cathy Wu, Andreas A. Malikopoulos

Affiliations School of Civil and Environmental Engineering, Cornell University; Department of Civil and Environmental Engineering, Massachusetts Institute of Technology

AI summary Proposes a route recommendation framework that learns partial driver compliance and uses stochastic optimization to minimize the gap between the system-optimal flow and the realized flow, significantly reducing travel time in grid-network simulations.

Comments 6 pages

Abstract

In this paper, we aim to mitigate congestion in traffic management systems by guiding travelers along system-optimal (SO) routes. However, we recognize that most theoretical approaches assume perfect driver compliance, which often does not reflect reality, as drivers tend to deviate from recommendations to fulfill their personal objectives. Therefore, we propose a route recommendation framework that explicitly learns partial driver compliance and optimizes traffic flow under realistic adherence. We first compute an SO edge flow through flow optimization techniques. Next, we train a compliance model based on historical driver decisions to capture individual responses to our recommendations. Finally, we formulate a stochastic optimization problem that minimizes the gap between the target SO flow and the realized flow under conditions of imperfect adherence. Our simulations conducted on a grid network reveal that our approach significantly reduces travel time compared to baseline strategies, demonstrating the practical advantage of incorporating learned compliance into traffic management.

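2504.02775 2026-05-27 cs.CV cs.LG version update

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung, Jaewoo Park, Jaeho Yoon, Kuan-Chuan Peng, Wonchul Kim, Andrew Beng Jin Teoh, Octavia Camps

Affiliations Northeastern University; AiV Co.; Yonsei University; Mitsubishi Electric Research Laboratories

AI summary To handle normal datasets that are contaminated with defects and whose class distribution is long-tailed but unknown, proposes TailSampler, which estimates class sizes so that tail classes and noise can be handled independently, and builds the memory-based anomaly detection model TailedCore, which achieves state-of-the-art performance in unsupervised long-tail noisy anomaly detection.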

Comments Accepted to CVPR2025

Abstract

We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing to handle them separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

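2504.00944 2026-05-27 hep-ph cs.LG hep-th version update

Diffusion-model approach to flavor models: A case study for $S_4^\prime$ modular flavor model

Satsuki Nishimura, Hajime Otsuka, Haruki Uchiyama

Affiliations Department of Physics, Kyushu University

AI summary Proposes a numerical method that uses diffusion models (a type of generative AI) to search for flavor-model parameters under experimental constraints. Taking the $S_4^\prime$ modular flavor model as an example, a neural network is built to reproduce quark masses, the CKM matrix, and the Jarlskog invariant; new phenomenologically interesting parameter regions are found, and spontaneous CP violation is confirmed.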

Comments 19 pages, 2 figures

Journal ref
Prog Theor Exp Phys (2026)

Abstract

We propose a numerical method of searching for parameters with experimental constraints in generic flavor models by utilizing diffusion models, which are classified as a type of generative artificial intelligence (generative AI). As a specific example, we consider the $S_4^\prime$ modular flavor model and construct a neural network that reproduces quark masses, the CKM matrix, and the Jarlskog invariant by treating free parameters in the flavor model as generating targets. By generating new parameters with the trained network, we find various phenomenologically interesting parameter regions where an analytical evaluation of the $S_4^\prime$ model is challenging. Additionally, we confirm that the spontaneous CP violation occurs in the $S_4^\prime$ model. The diffusion model enables an inverse problem approach, allowing the machine to provide a series of plausible model parameters from given experimental data. Moreover, it can serve as a versatile analytical tool for extracting new physical predictions from flavor models.

2504.00307 2026-05-27 cs.LG physics.ao-ph version update

Generating realistic global precipitation fields from modelled atmospheric circulation

Michael Aich, Sebastian Bathiany, Philipp Hess, Yu Huang, Niklas Boers

Affiliations Technical University of Munich; Munich Climate Center; TUM School of Engineering and Design; Department of Aerospace and Geodesy; Earth System Modelling Group; Potsdam Institute for Climate Impact Research; Global Systems Institute; Department of Mathematics; University of Exeter

AI summary Proposes a generative machine learning method that combines a conditional diffusion model with a UNet architecture to generate high-resolution global precipitation fields from a small set of prognostic atmospheric variables, as an alternative to traditional parameterization schemes that reduces biases and enables efficient ensemble prediction.

Comments Accepted for publication at Climate Dynamics

Abstract

Improving the representation of precipitation in Earth system models (ESMs) is critical for assessing the impacts of climate change and especially of extreme events like floods and droughts. In existing ESMs, precipitation is not resolved explicitly, but represented by parameterizations. These typically rely on resolving approximated but computationally expensive column-based physics, not accounting for interactions between locations. They struggle to capture fine-scale precipitation processes and introduce significant biases. We present a novel approach, based on generative machine learning, which integrates a conditional diffusion model with a UNet architecture to generate accurate, high-resolution (0.25°) global daily precipitation fields from a small set of prognostic atmospheric variables. Unlike traditional parameterizations, our framework efficiently produces ensemble predictions, capturing uncertainties in precipitation, and does not require fine-tuning by hand. We train our model on the ERA5 reanalysis and present a method that allows us to apply it to unseen ESM data, enabling fast generation of probabilistic forecasts and climate scenarios. By leveraging interactions between global prognostic variables, our approach provides an alternative parameterization scheme that mitigates biases present in the ESM precipitation while maintaining consistency with its large-scale (annual) trends. This work demonstrates that complex precipitation patterns can be learned directly from large-scale atmospheric variables, offering a computationally efficient method to obtain high-resolution precipitation without the cost of running the dynamical model at such high resolution.

2306.13985 2026-05-27 stat.ML cs.AI cs.LG stat.ME version update

Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance

Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta

Affiliations Indian Statistical Institute, Kolkata, India; School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA; Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Saudi Arabia; Applied Statistics Unit, Indian Statistical Institute, Kolkata, India; Department of Mathematics and Statistics, Indian Institute of Technology Kanpur, India

AI summary For high-dimensional low sample size data, proposes robust classifiers that need no tuning parameters and no moment conditions, achieve perfect classification in the asymptotic regime, and show advantages on simulated and real data.

Comments Published at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2023

Journal ref
In: ECML PKDD 2023: Research Track. Lecture Notes in Computer Science, vol 14173. Springer, Cham (2023)

Abstract

Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they are devoid of any moment conditions of the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.

2502.06567 2026-05-27 stat.ML cs.LG version update

Membership Inference Risks in Quantized Models: A Theoretical and Empirical Study

Eric Aubinais, Philippe Formont, Pablo Piantanida, Elisabeth Gassiat

Affiliations Université Paris-Saclay, CNRS, Laboratoire de mathématiques d’Orsay, France; Université Paris-Saclay, ILLS, MILA, ÉTS, Montreal, Canada; ILLS, MILA, CNRS, CentraleSupélec, Montreal, Canada

AI summary Studies, through theoretical analysis and empirical methods, how post-training quantization affects the membership inference privacy risk of machine learning models, and proposes a new membership inference security indicator.

Journal ref
AISTATS 2026

Abstract

Quantizing machine learning models has demonstrated its effectiveness in lowering memory and inference costs while maintaining performance levels comparable to those of the original models. In this work, we investigate the impact of quantization procedures on privacy in data-driven models, focusing on their vulnerability to membership inference attacks. Membership Inference Security (MIS) has recently been proposed to characterize the privacy of machine learning models against the most powerful (and possibly unknown) attacks. However, quantifying MIS appears to be computationally very difficult. In this paper, we propose a new MIS indicator for post-training quantization procedures of machine learning models that minimizes an empirical loss. This new indicator is a byproduct of a theoretical asymptotic analysis of the MIS in this context. We also present a methodology for empirically estimating our MIS indicator. Using synthetic datasets and real-world data (in the context of drug discovery), we demonstrate the effectiveness of our approach in assessing and ranking the MIS of different quantizers.

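2501.00520 2026-05-27 cs.CV cs.LG version update

Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

Bao Q. Bui, Tien T. T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham

AI summary Proposes an architecture that combines graph transformer networks with a traditional deep neural network, together with a Balanced Cross-Entropy loss and an ensemble method, achieving high-accuracy silicosis and pneumonia classification on a self-built chest X-ray dataset.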

Comments Withdrawn by the authors because the manuscript contains incomplete and potentially misleading descriptions of the dataset construction and evaluation protocol, particularly in the Dataset and Experimental Setup sections. The work should not be cited or used as an independent reference in its current form

Abstract

This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX that is tailored to the nuances of lung inflammation caused by distinct agents, providing a valuable resource for silicosis and pneumonia research community; and 2) we propose a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.

2410.19248 2026-05-27 cs.LG version update

CHESTNUT: A QoS Dataset for Mobile Edge Environments

Guobing Zou, Fei Zhao, Shengxiang Hu

Affiliations School of Computer Engineering and Science, Shanghai University

AI summary Because existing QoS datasets ignore dynamic attributes such as time and geographic location, proposes the CHESTNUT dataset, which accurately records time and geographic location during collection to support QoS prediction in mobile edge environments.

Abstract

Quality of Service (QoS) is an important metric for measuring the performance of network services. It is now widely used in mobile edge environments to evaluate service quality when mobile devices request services from edge servers. QoS usually involves multiple dimensions, such as bandwidth, latency, jitter, and packet loss rate. However, most existing QoS datasets, such as the common WS-Dream dataset, focus mainly on static QoS metrics of network services and ignore dynamic attributes such as time and geographic location. In particular, they lack a detailed record of the mobile device's location at the time of the service request and of the chronological order in which requests were made. These dynamic attributes are crucial for understanding and predicting the actual performance of network services, as QoS performance typically fluctuates with time and geographic location. To this end, we propose a novel dataset that accurately records temporal and geographic location information during the collection process, aiming to provide more accurate and reliable data to support future QoS prediction in mobile edge environments.

2410.00357 2026-05-27 cs.LG stat.ML version update

Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

Hao Liu, Zecheng Zhang, Wenjing Liao, Hayden Schaeffer

Affiliations Department of Mathematics, Hong Kong Baptist University; Department of ACMS, University of Notre Dame; School of Mathematics, Georgia Institute of Technology; Department of Mathematics, UCLA

AI summary Analyzes the approximation and generalization errors of deep operator networks to establish a theoretical framework that quantifies neural scaling laws, relating the errors to network model size and training data size, with an extension to deep ReLU networks.

Abstract

Neural scaling laws play a pivotal role in the performance of deep neural networks and have been observed in a wide range of tasks. However, a complete theoretical framework for understanding these scaling laws remains underdeveloped. In this paper, we explore the neural scaling laws for deep operator networks, which involve learning mappings between function spaces, with a focus on the Chen and Chen style architecture. These approaches, which include the popular Deep Operator Network (DeepONet), approximate the output functions using a linear combination of learnable basis functions and coefficients that depend on the input functions. We establish a theoretical framework to quantify the neural scaling laws by analyzing its approximation and generalization errors. We articulate the relationship between the approximation and generalization errors of deep operator networks and key factors such as network model size and training data size. Moreover, we address cases where input functions exhibit low-dimensional structures, allowing us to derive tighter error bounds. These results also hold for deep ReLU networks and other similar structures. Our results offer a partial explanation of the neural scaling laws in operator learning and provide a theoretical foundation for their applications.
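
The "linear combination of learnable basis functions and coefficients that depend on the input functions" in this abstract can be written out as a formula. The notation below is an assumption for illustration and does not come from the abstract: $G$ is the target operator, $u$ the input function observed at $m$ sensor points $x_1, \dots, x_m$, $b_k$ the learned coefficient (branch) networks, $\tau_k$ the learned basis (trunk) functions, and $p$ the number of basis functions.

```latex
\[
  G(u)(y) \;\approx\; \sum_{k=1}^{p} b_k\bigl(u(x_1), \dots, u(x_m)\bigr)\, \tau_k(y)
\]
```

In this form, the model size enters through $p$ and through the widths and depths of the networks for $b_k$ and $\tau_k$, which are the quantities a scaling-law analysis of this kind would relate to approximation and generalization error.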

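2408.05560 2026-05-27 cs.LG math.OC stat.ML version update

Incremental Gauss-Newton Descent for Machine Learning

Mikalai Korbit, Mario Zanon

Affiliations IMT School for Advanced Studies Lucca

AI summary For scalar-output losses evaluated one sample at a time, proposes Incremental Gauss-Newton Descent (IGND), whose closed-form scalar normalization of the stochastic gradient gives an efficient update without storing or solving with a curvature matrix, and proves a stationarity result for the update.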

Abstract

Stochastic gradient updates are widely used for their efficiency and scalability, but their effective step sizes can depend strongly on feature scaling and local model sensitivity. Gauss-Newton methods address such scale effects through curvature information, but in their standard mini-batch form they require matrix-vector products, linear solves, or structured approximations. This paper studies the special case of scalar-output losses evaluated one sample at a time. In this setting, the generalized Gauss-Newton matrix has rank at most one, and its only possible nonzero curvature direction is aligned with the stochastic gradient. As a result, the damped Gauss-Newton direction reduces to a closed-form scalar normalization of the sample gradient. The resulting update, Incremental Gauss-Newton Descent (IGND), requires no curvature matrix storage, factorization, or iterative linear solve. We derive the update, characterize its behavior, and relate it to normalized gradient descent, adaptive first-order methods, stochastic Polyak step sizes, and mini-batch Gauss-Newton updates. Under explicit smoothness, alignment, and stochastic approximation assumptions, we prove a stationarity result for the IGND update. Experiments on supervised learning, a controlled test of scale robustness, and a linear-quadratic control case study show that IGND improves robustness to sensitivity scaling and can be competitive with, or complementary to, common stochastic optimizers while retaining a simple incremental update.
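
The claim that the damped Gauss-Newton direction collapses to a scalar rescaling of the sample gradient can be checked numerically. The sketch below is a minimal illustration, not the paper's implementation. It assumes a squared-error loss on a single sample, so the Gauss-Newton matrix is the rank-one outer product of the model-output gradient `grad_f`. The function names and the `damping` parameter are hypothetical.

```python
import random

def closed_form_step(grad_f, residual, damping):
    """Damped Gauss-Newton step for one sample, as a scalar rescaling of grad_f."""
    scale = -residual / (sum(g * g for g in grad_f) + damping)
    return [scale * g for g in grad_f]

def gn_system_residual(grad_f, residual, damping, step):
    """Return (J^T J + damping*I) @ step + residual * J^T, which is zero at the exact step.

    J is the 1 x n Jacobian of the scalar model output, stored here as grad_f.
    """
    j_dot_step = sum(g * s for g, s in zip(grad_f, step))
    return [g * j_dot_step + damping * s + residual * g
            for g, s in zip(grad_f, step)]

random.seed(0)
grad_f = [random.gauss(0.0, 1.0) for _ in range(5)]  # gradient of the model output
residual = 0.7    # prediction minus target for this sample (assumed value)
damping = 1e-2    # Levenberg-Marquardt style damping (assumed value)

step = closed_form_step(grad_f, residual, damping)
max_violation = max(abs(v) for v in gn_system_residual(grad_f, residual, damping, step))
```

The check works because `grad_f` is an eigenvector of the rank-one matrix: multiplying `J^T J + damping*I` by `grad_f` scales it by `||grad_f||^2 + damping`. Dividing by that scalar therefore solves the linear system without forming or factorizing any matrix.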

2306.09344 2026-05-27 cs.CV cs.LG version update

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

DreamSim: 使用合成数据学习人类视觉相似性的新维度

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

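Affiliations * MIT; Weizmann Institute of Science; Adobe Research

AI summary This paper proposes the DreamSim metric, trained on synthetic data, which aligns with human perception of mid-level similarities in image layout, object pose, and semantic content, and outperforms existing metrics on retrieval and reconstruction tasks.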

Comments Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim

Details
Journal ref
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
AI Chinese abstract

当前的感知相似性度量在像素和补丁级别上操作。这些度量在低层颜色和纹理方面比较图像,但未能捕捉图像布局、对象姿态和语义内容中的中层相似性和差异。在本文中,我们开发了一种整体评估图像的感知度量。第一步是收集一个关于以多种方式相似的图像对的人类相似性判断的新数据集。该数据集的关键在于判断几乎是自动的,并且所有观察者共享。为了实现这一点,我们使用最近的文本到图像模型创建沿不同维度扰动的合成对。我们观察到流行的感知度量无法解释我们的新数据,因此我们引入了一个新的度量DreamSim,调整以更好地与人类感知对齐。我们分析了不同视觉属性如何影响我们的度量,发现它主要关注前景对象和语义内容,同时对颜色和布局敏感。值得注意的是,尽管在合成数据上训练,我们的度量能够泛化到真实图像,在检索和重建任务上取得了强劲的结果。此外,我们的度量在这些任务上优于先前学习的度量和最近的大型视觉模型。

English abstract

Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
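
The abstract does not spell out how the metric is computed. As a rough illustration of how a learned perceptual metric is commonly applied to retrieval, the sketch below assumes each image has already been mapped to an embedding vector by some model and ranks gallery items by cosine distance to the query. The embedding values, the choice of cosine distance, and every function name here are assumptions made for illustration; none of this is DreamSim's actual API.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve_nearest(query_embedding, gallery_embeddings, k=3):
    """Return the indices of the k gallery items closest to the query."""
    distances = np.array(
        [cosine_distance(query_embedding, g) for g in gallery_embeddings]
    )
    return np.argsort(distances)[:k].tolist()


# Hypothetical 2-D embeddings standing in for the output of a perceptual model.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
nearest = retrieve_nearest(query, gallery, k=2)  # items 0 and 2 point the same way as the query
```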

2210.02573 2026-05-27 cs.LG version update

Efficient Learning of Mesh-Based Physical Simulation with BSMS-GNN

基于BSMS-GNN的网格物理模拟高效学习

Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

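Affiliations * Department of Computer Science, UCLA, Los Angeles, USA; AR Perception, Google, Los Angeles, USA; Department of Mathematics, UCLA, Los Angeles, USA

AI summary To address the scaling complexity and over-smoothing of graph neural networks in large-scale mesh-based physical simulation, the paper proposes BSMS-GNN, built on a bi-stride pooling strategy inspired by bipartite graph determination. It needs no manually drawn coarse meshes, avoids wrong edges across geometry boundaries, and significantly improves accuracy and computational efficiency.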

Comments Update summary: fixed the missing remark for Yadi and Menglei (* noting that the work was partially done while they were at Snap Inc.)

Details
AI Chinese abstract

使用平面图神经网络(GNN)和堆叠消息传递(MP)在大规模网格上学习物理模拟具有挑战性,因为其扩展复杂度与节点数量相关且存在过平滑问题。社区对引入多尺度结构到GNN用于物理模拟的兴趣日益增长。然而,当前最先进的方法受限于依赖人工绘制粗网格或基于空间邻近性构建粗层级,这可能在几何边界引入错误边。受二分图确定启发,我们提出了一种新颖的池化策略——双步幅(bi-stride),以解决上述限制。双步幅在广度优先搜索(BFS)的每个其他前沿上池化节点,无需手动绘制粗网格,并避免了空间邻近性导致的错误边。此外,它实现了每层级单次MP方案以及通过插值进行非参数化池化和反池化,类似于U-Net,显著降低了计算成本。实验表明,所提出的框架BSMS-GNN在代表性物理模拟中,在精度和计算效率方面均显著优于现有方法。

English abstract

Learning physical simulation on large-scale meshes with flat Graph Neural Networks (GNNs) and stacked Message Passing (MP) layers is challenging due to the scaling complexity w.r.t. the number of nodes and over-smoothing. There has been growing interest in the community in introducing multi-scale structures to GNNs for physical simulation. However, current state-of-the-art methods are limited by their reliance on the labor-intensive drawing of coarser meshes or on building coarser levels based on spatial proximity, which can introduce wrong edges across geometry boundaries. Inspired by bipartite graph determination, we propose a novel pooling strategy, bi-stride, to tackle the aforementioned limitations. Bi-stride pools nodes on every other frontier of the breadth-first search (BFS), without the need for manual drawing of coarser meshes, and avoids the wrong edges introduced by spatial proximity. Additionally, it enables a one-MP scheme per level and non-parametrized pooling and unpooling by interpolation, resembling U-Nets, which significantly reduces computational costs. Experiments show that the proposed framework, BSMS-GNN, significantly outperforms existing methods in terms of both accuracy and computational efficiency in representative physical simulations.
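
The node-selection step of bi-stride pooling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the mesh is given as an adjacency dictionary, the BFS seed is supplied by the caller, and the function name `bi_stride_pool` is hypothetical. How the seed is chosen and how the coarse level's edges are rebuilt are not described in the abstract and are left out.

```python
from collections import deque


def bi_stride_pool(adjacency, seed):
    """Keep the nodes that lie on every other BFS frontier (even depths) from `seed`.

    Only the connected component containing `seed` is visited.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return sorted(n for n, d in depth.items() if d % 2 == 0)


# A five-node path graph 0 - 1 - 2 - 3 - 4, searched from node 0.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pooled_nodes = bi_stride_pool(path_graph, seed=0)  # [0, 2, 4]
```

Because the selection depends only on graph distance, two nodes that are close in space but far apart along the mesh are never treated as neighbors, which is the property the abstract contrasts with proximity-based coarsening.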

2302.13473 2026-05-27 cs.LG version update

Towards Interpretable Federated Learning

迈向可解释的联邦学习

Anran Li, Rui Liu, Ming Hu, Yuanyuan Chen, Shipeng Wang, Lizhen Cui, Han Yu

Affiliations * Department of Biomedical Informatics and Data Science, School of Medicine at Yale University; School of Computer Science and Engineering, Nanyang Technological University; School of Software, Shandong University; Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University

AI summary This paper presents what the authors describe as the first survey of interpretable federated learning (IFL), proposing a unique taxonomy that covers model explanation, model debugging, and data contribution evaluation, and analyzing representative methods, evaluation metrics, and future directions.

Comments Survey of interpretable federated learning

Details
AI Chinese abstract

联邦学习(FL)使多个数据所有者能够在不暴露私有本地数据的情况下协作构建机器学习模型。为了使FL得到广泛采用,平衡性能、隐私保护和可解释性的需求至关重要,尤其是在金融和医疗等关键任务应用中。因此,可解释联邦学习(IFL)已成为一个新兴的研究课题,吸引了学术界和工业界的极大兴趣。其跨学科性质对新研究人员来说可能具有挑战性。在本文中,我们通过提供(据我们所知)第一篇关于IFL的综述来弥合这一差距。我们提出了一个独特的IFL分类法,涵盖了使FL模型能够解释预测结果、支持模型调试以及提供关于单个数据所有者或数据样本贡献的见解的相关工作,这对于公平分配奖励以激励在FL中积极可靠的参与至关重要。我们对代表性的IFL方法、常用的性能评估指标以及构建多功能IFL技术的有前景方向进行了全面分析。

English abstract

Federated learning (FL) enables multiple data owners to build machine learning models collaboratively without exposing their private local data. For FL to achieve widespread adoption, it is important to balance the need for performance, privacy preservation, and interpretability, especially in mission-critical applications such as finance and healthcare. Thus, interpretable federated learning (IFL) has become an emerging research topic, attracting significant interest from academia and industry alike. Its interdisciplinary nature can make it challenging for new researchers to pick up. In this paper, we bridge this gap by providing (to the best of our knowledge) the first survey on IFL. We propose a unique IFL taxonomy that covers relevant works enabling FL models to explain prediction results, support model debugging, and provide insights into the contributions made by individual data owners or data samples, which in turn is crucial for allocating rewards fairly to motivate active and reliable participation in FL. We conduct a comprehensive analysis of the representative IFL approaches, the commonly adopted performance evaluation metrics, and promising directions towards building versatile IFL techniques.

2009.11997 2026-05-27 cs.LG cs.AI cs.RO version update

Continual Model-Based Reinforcement Learning with Hypernetworks

基于超网络的连续模型强化学习

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti

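Affiliations * Division of Engineering Science, University of Toronto, Canada; Department of Computer Science, University of Toronto, Canada

AI summary Proposes HyperCRL, which uses task-conditional hypernetworks to continually learn dynamics models across a sequence of tasks, avoiding retraining from scratch and keeping storage fixed. On robot locomotion and manipulation tasks it outperforms existing continual learning alternatives that rely on fixed-capacity networks.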

Comments Updated link to project website in the abstract. 7 pages (+2 pages in appendix), 8 figures. In proceedings of the 2021 IEEE International Conference on Robotics and Automation

Details
AI Chinese abstract

在基于模型的强化学习(MBRL)和模型预测控制(MPC)中,有效规划依赖于学习到的动力学模型的准确性。在MBRL和MPC的许多实例中,该模型被假定为平稳的,并且定期从头开始重新训练,使用从环境交互开始收集的状态转移经验。这意味着训练动力学模型所需的时间——以及计划执行之间的暂停时间——随着收集的经验规模线性增长。我们认为这对于终身机器人学习来说太慢,并提出了HyperCRL,一种使用任务条件超网络在序列任务中持续学习所遇到动力学的方法。我们的方法有三个主要特点:首先,它包括不重新访问先前任务训练数据的动力学学习会话,因此只需存储最近固定大小的状态转移经验;其次,它使用固定容量的超网络来表示非平稳且任务感知的动力学;第三,它优于依赖固定容量网络的现有持续学习替代方案,并且与记忆不断增长的过去经验核心集的基线方法相比具有竞争力。我们展示了HyperCRL在机器人 locomotion 和 manipulation 场景(如推和开门任务)中在连续基于模型的强化学习中的有效性。我们的项目网站(含视频)位于此链接:https://rvl.cs.toronto.edu/blog/hypercrl

English abstract

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and performs competitively with baselines that remember an ever-increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at this link: https://rvl.cs.toronto.edu/blog/hypercrl
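
The idea of a task-conditional hypernetwork can be illustrated with a toy forward pass. The sketch below is an assumption-laden simplification, not the HyperCRL architecture: the hypernetwork is a single linear map, the target dynamics model is a linear residual model, and the class name `TinyHypernetwork` and all of its parameters are hypothetical. Training, and whatever mechanism the method uses to protect earlier tasks, are not shown.

```python
import numpy as np


class TinyHypernetwork:
    """Maps a learned task embedding to the weights of a small dynamics model."""

    def __init__(self, embed_dim, state_dim, action_dim, num_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.in_dim = state_dim + action_dim
        self.out_dim = state_dim
        n_target = self.out_dim * self.in_dim + self.out_dim
        # One embedding per task; the hypernetwork weights are shared by all tasks.
        self.task_embeddings = rng.normal(size=(num_tasks, embed_dim))
        self.hyper_weights = rng.normal(scale=0.1, size=(n_target, embed_dim))
        self.hyper_bias = np.zeros(n_target)

    def target_params(self, task_id):
        """Generate the dynamics model's weight matrix and bias for one task."""
        flat = self.hyper_weights @ self.task_embeddings[task_id] + self.hyper_bias
        split = self.out_dim * self.in_dim
        return flat[:split].reshape(self.out_dim, self.in_dim), flat[split:]

    def predict_next_state(self, task_id, state, action):
        """Predict the next state as the current state plus a generated linear residual."""
        weights, bias = self.target_params(task_id)
        return state + weights @ np.concatenate([state, action]) + bias


hnet = TinyHypernetwork(embed_dim=4, state_dim=3, action_dim=2, num_tasks=2)
state, action = np.ones(3), np.zeros(2)
pred_task0 = hnet.predict_next_state(0, state, action)
pred_task1 = hnet.predict_next_state(1, state, action)
```

The point of the design is that the number of stored parameters stays fixed as tasks are added, apart from one small embedding per task, while each task still gets its own dynamics weights.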

1909.08210 2026-05-27 cs.LG stat.ML version update

Reformulation of RBM to Unify Linear and Nonlinear Dimensionality Reduction

RBM的重新表述以统一线性和非线性降维

Jiangsheng You, Chun-Yen Liu

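Affiliations * Aspen Technology Inc

AI summary Using maximum a posteriori estimation and the expectation maximization algorithm, this paper shows that a contrastive divergence algorithm without MCMC converges, reformulates the restricted Boltzmann machine as a deterministic model, and unifies linear and nonlinear dimensionality reduction for scalar and vector variables.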

Comments 16 pages with 7 figures

Details
AI Chinese abstract

受限玻尔兹曼机(RBM)是一种具有共享权重的两层神经网络,在文献中已被广泛研究用于降维、数据表示和推荐系统。传统的RBM需要对两层上的值进行概率解释,并在训练期间使用马尔可夫链蒙特卡洛(MCMC)过程生成样本。对比散度(CD)算法能高效训练RBM,但其收敛性尚未得到数学证明。在本文中,利用最大后验(MAP)估计和期望最大化(EM)算法,我们证明了无MCMC的CD算法对于条件似然目标函数是收敛的。本文的另一个关键贡献是将RBM重新表述为确定性模型。在重新表述的RBM中,无MCMC的CD算法近似于梯度下降(GD)方法。这种重新表述的RBM可以在节点上采用连续的标量和向量变量,并灵活选择激活函数。数值实验显示了其在线性和非线性降维中的能力,并且对于非线性降维,通过选择合适的激活函数,重新表述的RBM可以优于主成分分析(PCA)。最后,我们展示了其在CIFAR-10数据集(彩色图像)和多变量序列数据上的向量值节点应用,这些应用无法用传统RBM自然配置。这项工作不仅为传统RBM提供了理论见解,而且统一了标量和向量变量的线性和非线性降维。

English abstract

A restricted Boltzmann machine (RBM) is a two-layer neural network with shared weights and has been extensively studied in the literature for dimensionality reduction, data representation, and recommendation systems. The traditional RBM requires a probabilistic interpretation of the values on both layers and a Markov chain Monte Carlo (MCMC) procedure to generate samples during training. The contrastive divergence (CD) algorithm trains the RBM efficiently, but its convergence has not been proved mathematically. In this paper, using a maximum a posteriori (MAP) estimate and the expectation maximization (EM) algorithm, we show that the CD algorithm without MCMC is convergent for the conditional likelihood objective function. Another key contribution of this paper is the reformulation of the RBM into a deterministic model. Within the reformulated RBM, the CD algorithm without MCMC approximates the gradient descent (GD) method. This reformulated RBM can take continuous scalar and vector variables on the nodes, with flexibility in choosing the activation functions. Numerical experiments show its capability in both linear and nonlinear dimensionality reduction, and, for nonlinear dimensionality reduction, the reformulated RBM can outperform principal component analysis (PCA) when proper activation functions are chosen. Finally, we demonstrate its application to vector-valued nodes for the CIFAR-10 dataset (color images) and multivariate sequence data, which cannot be configured naturally with the traditional RBM. This work not only provides theoretical insights regarding the traditional RBM but also unifies linear and nonlinear dimensionality reduction for scalar and vector variables.
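
A deterministic two-layer model with shared weights can be sketched as an encode/decode pair that reuses one weight matrix. The code below is only an illustration of that structure under assumptions: the function names, the zero biases, and the orthonormal weight matrix are chosen here for the example, and the paper's training procedure is not shown. With identity activations the pair acts as a linear projection onto a lower-dimensional subspace; passing `np.tanh` instead would give a nonlinear variant.

```python
import numpy as np


def encode(v, weights, hidden_bias, activation):
    """Map visible values to hidden values using the shared weight matrix."""
    return activation(weights @ v + hidden_bias)


def decode(h, weights, visible_bias, activation):
    """Map hidden values back to the visible layer using the transposed weights."""
    return activation(weights.T @ h + visible_bias)


def identity(z):
    return z


rng = np.random.default_rng(0)
# Orthonormal rows (assumed for the demo): with identity activations and zero
# biases, decode(encode(v)) is an orthogonal projection onto a 2-D subspace.
q, _ = np.linalg.qr(rng.normal(size=(4, 2)))
weights = q.T  # shape (hidden=2, visible=4)
hidden_bias, visible_bias = np.zeros(2), np.zeros(4)

v = rng.normal(size=4)
h = encode(v, weights, hidden_bias, identity)
v_hat = decode(h, weights, visible_bias, identity)
# Projecting the reconstruction a second time leaves it unchanged.
v_hat_again = decode(
    encode(v_hat, weights, hidden_bias, identity), weights, visible_bias, identity
)
```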