arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18753 2026-05-19 cs.CL cs.AI cs.LG Version update

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

Affiliations: Tsinghua University; Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; Carnegie Mellon University; Sapienza University of Rome; University of Edinburgh; TransPerfect; ELLIS Unit Lisbon

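AI summary: This work proposes DashAttention, a differentiable and adaptive sparse hierarchical attention mechanism. It uses the adaptively sparse α-entmax transformation to select a variable number of blocks, which keeps the whole hierarchy differentiable and improves long-context modeling. Experiments show it outperforms existing methods at high sparsity.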

Comments Preprint

Abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $α$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.
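
The contrast between fixed top-k selection and adaptive sparse selection can be sketched in a few lines. This is only an illustration, not the authors' implementation: it uses sparsemax, the α = 2 special case of α-entmax, and the coarse block scores and function names are hypothetical.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; low scores get exactly zero."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:      # zj still belongs to the support
            tau = (cumsum - 1.0) / j   # threshold for a support of size j
    return [max(s - tau, 0.0) for s in scores]


def top_k_blocks(scores, k):
    """Baseline: always keep exactly k blocks, whatever the scores look like."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def selected_blocks(scores):
    """Keep every block with nonzero weight; the count varies per query."""
    return [i for i, p in enumerate(sparsemax(scores)) if p > 0.0]


# Hypothetical coarse scores of two queries over three KV blocks.
peaked_query = [2.0, 1.0, 0.1]
flat_query = [1.0, 0.8, 0.1]
print(selected_blocks(peaked_query), selected_blocks(flat_query))
```

With these scores the peaked query keeps one block and the flatter query keeps two, whereas `top_k_blocks` would return the same count for both. Because the sparsemax weights are piecewise linear in the scores, they also carry gradients back to the coarse stage.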

2605.18750 2026-05-19 cs.DC cs.LG Version update

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang, Guang Yang, Alan Zhao, Wei Xu

Affiliations: Tsinghua University; Scitix AI

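AI summary: This paper proposes a readiness-driven pipeline runtime that addresses the stage misalignment and idle bubbles caused by runtime variability in pipeline-parallel training. It treats the schedule as a non-binding hint order for ranking currently ready work, which improves resource utilization.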

Comments 29 pages, including appendices

Abstract

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\times$ speedup on language-only workloads and up to 2.77$\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84$\times$ while preserving training correctness.
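
The difference between following a schedule and using it as a hint can be shown with a toy dispatcher. This is a minimal sketch under assumed names (`fixed_order_next`, `readiness_first_next`, and the task labels are all hypothetical); it ignores communication and tensor-parallel coordination, which the abstract lists as separate components.

```python
def fixed_order_next(schedule, ready, done):
    """Fixed-order consumption: only the first unfinished task may run."""
    for task in schedule:
        if task not in done:
            # If the head of the schedule is not ready, the stage idles (None).
            return task if task in ready else None
    return None


def readiness_first_next(hint_order, ready, done):
    """Readiness-driven consumption: run the ready task ranked earliest by the hint."""
    for task in hint_order:
        if task in ready and task not in done:
            return task
    return None


# Hypothetical stage state: F1's input is late, but B0 and F2 are ready.
order = ["F0", "F1", "B0", "F2"]
done = {"F0"}
ready = {"B0", "F2"}
print(fixed_order_next(order, ready, done))      # stage waits on F1
print(readiness_first_next(order, ready, done))  # stage runs B0 instead
```

The hint still matters in the second function: among the ready tasks it decides which one goes first, so a good hint order keeps its influence without ever forcing a stall.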

2605.18735 2026-05-19 cs.CV cs.GR cs.LG Version update

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

Miguel Farinha, Ronald Clark

Affiliations: Department of Computer Science, University of Oxford

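AI summary: PIXLRelight links physically based rendering with learned image synthesis through intrinsic conditioning to make single-image relighting controllable. The conditioning is obtained from real photographs or PBR renders and is used in both training and inference, giving high-quality relighting while preserving image detail.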

Comments Project page: https://mlfarinha.github.io/pixl-relight/. Under review

Abstract

We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at https://mlfarinha.github.io/pixl-relight/.

2605.18732 2026-05-19 cs.CL cs.AI cs.LG Version update

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé

Affiliations: International Development Research Centre Canada; University of Cape Town; Global Center on AI Governance; SnT, University of Luxembourg; CITADEL AI Centre of Excellence, Burkina Faso

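AI summary: This study examines how predictable factual recall is in large language models. It finds that model size and topic frequency in the training data are the key factors behind recall quality, and that together they explain 60% to 94% of the variance.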

Comments 18 pages, 5 figures, 6 tables

Abstract

While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references evaluated by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.
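
The functional form described above, a sigmoid applied to a log-linear combination of two predictors, can be written out as a small function. The coefficients `a`, `b`, and `c` below are placeholders chosen for illustration, not fitted values from the paper, and the use of base-10 logarithms is an assumption.

```python
import math


def recall_quality(param_count, topic_freq, a, b, c):
    """Sigmoid of a log-linear combination of model size and topic frequency."""
    u = a * math.log10(param_count) + b * math.log10(topic_freq) + c
    return 1.0 / (1.0 + math.exp(-u))


# Placeholder coefficients, for illustration only.
a, b, c = 1.0, 1.0, -10.0
small_rare = recall_quality(1e9, 10, a, b, c)       # u = 9 + 1 - 10 = 0
large_rare = recall_quality(1e11, 10, a, b, c)      # more parameters
small_common = recall_quality(1e9, 1000, a, b, c)   # more frequent topic
print(small_rare, large_rare, small_common)
```

With positive `a` and `b`, increasing either predictor moves the model along the same S-shaped curve, which is why a rare topic can be recalled by a large model and a common topic by a small one.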

2605.18704 2026-05-19 eess.SP cs.LG Version update

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

Kenan Majewski, Marcin Żugaj

Affiliations: Institute of Aeronautics and Applied Mechanics, Warsaw University of Technology

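AI summary: This paper proposes the N-Deep Recurrent Sage-Husa filter, which improves on the conventional Kalman filter with a learned memory attenuation policy to make UAV state estimation more robust in dynamic environments.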

Comments 49 pages, 9 figures. Preprint submitted to Aerospace Science and Technology

Abstract

Unmanned Aerial Vehicles in dynamic environments face telemetry outages, structural vibrations, and regime-dependent noise that invalidate the stationary covariance assumptions of classical Kalman filters. The Sage-Husa Kalman Filter (SHKF) estimates noise statistics online, but its reliance on a static, scalar forgetting factor forces a strict compromise between steady-state stability and transient responsiveness. We introduce the N-Deep Recurrent Sage-Husa Filter (NDR-SHKF), which replaces this scalar parameter with a vector-valued memory attenuation policy learned by a hierarchical recurrent network operating on whitened innovation sequences. A bifurcated architecture routes shallow recurrent states to capture instantaneous sensor anomalies and deep states to encode sustained dynamic trends, while an auxiliary reconstruction objective prevents feature collapse. The complete filter, including recursive covariance updates, is trained end-to-end via backpropagation through time to directly minimize state estimation error. Evaluations on topologically distinct chaotic attractors demonstrate cross-domain generalization, outperforming purely data-driven baselines that diverge under out-of-distribution dynamics. Furthermore, evaluations on recorded real-world UAV flight datasets validate the framework's practical viability, demonstrating its capacity to bridge transitions into proprioceptive dead reckoning and outperform classical adaptive estimators during sensor outages.
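
For context, the static scalar forgetting factor that the paper replaces can be sketched as follows. This is the commonly cited textbook form of the Sage-Husa measurement-noise update, reduced to a scalar measurement, and not the NDR-SHKF itself; the function names and the symbol `hph` (the predicted measurement variance term) are choices made for this sketch.

```python
def fading_weight(k, b):
    """Sage-Husa weight d_k = (1 - b) / (1 - b**(k + 1)) for forgetting factor 0 < b < 1."""
    return (1.0 - b) / (1.0 - b ** (k + 1))


def update_noise_variance(r_prev, innovation, hph, k, b):
    """Blend the previous noise estimate with the newest innovation-based sample."""
    d = fading_weight(k, b)
    return (1.0 - d) * r_prev + d * (innovation ** 2 - hph)


# The weight starts at 1 and settles at 1 - b, so b fixes the memory length for good.
print(fading_weight(0, 0.95), fading_weight(10_000, 0.95))
```

A value of `b` close to 1 gives a long memory and smooth estimates but slow reaction to regime changes, while a small `b` does the opposite. That single trade-off is what a learned, vector-valued policy is meant to relax. Note that the raw sample `innovation ** 2 - hph` can be negative, so practical implementations usually guard the estimate.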

2605.18703 2026-05-19 cs.CL cs.LG Version update

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo

Affiliations: LARK, HKUST (GZ); University of Cambridge; UCL; Huawei Technologies Co., Ltd

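AI summary: This paper proposes the EnvFactory framework, which automatically synthesizes executable environments and applies robust RL to address limited environment scalability and scarce training data when scaling tool-use agents, markedly improving training efficiency and downstream performance.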

Comments 11 pages

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $τ^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

2605.18702 2026-05-19 cs.LG cs.AI Version update

Distilling Tabular Foundation Models for Structured Health Data

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu, Pratinav Seth

Affiliations: Lexsi Labs

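AI summary: This paper studies how knowledge distillation can transfer the predictive behavior of tabular foundation models to lightweight tabular models, using stratified out-of-fold teacher labeling to avoid context leakage. On 19 healthcare datasets, distilled students keep a high AUC while running much faster at inference, and multi-teacher averaging does not always beat the best single teacher.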

Abstract

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.
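
The leakage problem and its out-of-fold remedy can be sketched with a toy teacher. This is a minimal illustration, not the authors' pipeline: `base_rate_teacher` is a stand-in for an in-context TFM, and all function names are hypothetical. The point is only that each row's soft label comes from a teacher whose context excludes that row.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign indices to k folds round-robin within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds


def out_of_fold_labels(X, y, k, fit_predict):
    """Label each held-out fold with a teacher that only sees the other folds."""
    soft = [None] * len(y)
    for held in stratified_folds(y, k):
        held_set = set(held)
        ctx = [i for i in range(len(y)) if i not in held_set]
        preds = fit_predict([X[i] for i in ctx], [y[i] for i in ctx],
                            [X[i] for i in held])
        for i, p in zip(held, preds):
            soft[i] = p
    return soft


def base_rate_teacher(X_ctx, y_ctx, X_query):
    """Stand-in teacher: predicts the positive rate of its context set."""
    rate = sum(y_ctx) / len(y_ctx)
    return [rate] * len(X_query)


X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
soft_labels = out_of_fold_labels(X, y, k=3, fit_predict=base_rate_teacher)
print(soft_labels)
```

A student model would then be trained on `X` with `soft_labels` as targets. Labeling the training rows with a teacher that already holds them in context would instead let the teacher echo their true labels.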

2605.18701 2026-05-19 cs.LG q-bio.QM Version update

Learning Normal Representations for Blood Biomarkers

Aashna P. Shah, Michelle M. Li, Yash Lal, Seffi Cohen, Liat F. Antwarg, Morgan Sanchez, James A. Diao, Chirag J. Patel, Ben Y. Reis, Ran D. Balicer, Noa Dagan, Arjun K. Manrai

Affiliations: Department of Biomedical Informatics, Harvard Medical School; Department of Systems Biology, Harvard Medical School; Department of Medicine, Brigham and Women’s Hospital; Department of Mathematics, Johns Hopkins University; Computational Health Informatics Program (CHIP), Boston Children’s Hospital; The Ivan and Francesca Berkowitz Family Living Laboratory Collaboration at Harvard Medical School and Clalit Research Institute; Clalit Research Institute, Innovation Division, Clalit Health Services; Faculty of Computer and Information Science, Ben Gurion University

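AI summary: This study proposes the NORMA framework, which combines patient history with population-level data to produce more precise reference intervals, improving individualized interpretation of blood biomarkers and avoiding the misdiagnosis risk that comes with over-personalization.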

Abstract

Blood-based biomarkers underpin clinical diagnosis and management, yet their interpretation relies largely on fixed population reference intervals that ignore stable, intra-patient variability. As such, population-based interpretation can mask meaningful deviation from an individual's baseline, risking delayed disease detection. To remedy this, there have been increasing efforts to personalize blood biomarker interpretation using individual testing histories. However, these methods may overfit to sparse data, inflating false-positive rates and unnecessary follow-up, and can also unwittingly include unrecognized or subclinical disease. Here, we leverage nearly 2 billion longitudinal laboratory measurements from over 1.6 million individuals across North America, the Middle East, and East Asia, to show that while laboratory values are highly individual, purely personalized intervals routinely overfit, classifying up to 68% of measurements as abnormal, without corresponding associations with adverse clinical outcomes. We then introduce NORMA, a conditional transformer-based framework that generates reference intervals by conditioning on both a patient's history and population-level data about "normal" variation. NORMA-derived intervals achieve higher precision for predicting outcomes, including mortality, acute kidney injury, and chronic disease. These findings caution against over-personalization in laboratory medicine and demonstrate that anchoring individual trajectories to population-level priors outperforms either approach alone. To promote transparency, we publicly release the model, code, and an interactive user interface for accessible, individualized laboratory interpretation.

2605.18696 2026-05-19 cs.LG cs.AI Version update

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi, Nassim Bouarour, Vinay Kumar Sankarapu

Affiliations: Lexsi Labs

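AI summary: This paper studies ensembling of tabular foundation models (TFMs). Although ensembling usually improves performance, pools of modern TFMs are nearly redundant, and some ensemble strategies do poorly on accuracy and calibration. Greedy selection is recommended as the practical default.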

Abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
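
The pairwise Q-statistic mentioned above is a standard diversity measure (Yule's Q) computed from which examples two classifiers get right. The sketch below assumes per-example correctness vectors as input; the function name and the toy vectors are hypothetical.

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-example correctness flags."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both right
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only the first is right
        else:
            n01 += 1          # only the second is right
    agree, disagree = n11 * n00, n01 * n10
    if agree + disagree == 0:
        return float("nan")   # undefined when both products vanish
    return (agree - disagree) / (agree + disagree)


same_errors = q_statistic([1, 1, 0, 0], [1, 1, 0, 0])
independent = q_statistic([1, 1, 0, 0], [1, 0, 1, 0])
print(same_errors, independent)
```

Two models that fail on exactly the same examples score 1, and statistically independent errors score 0. A pool averaging near 1 therefore has little disagreement for an ensemble to exploit.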

2605.18689 2026-05-19 cond-mat.quant-gas cs.LG physics.atom-ph quant-ph Version update

Can machine learning for quantum-gas experiments be explainable?

I. B. Spielman and J. P. Zwolak

Affiliations: National Institute of Standards and Technology; Department of Physics, University of Maryland, College Park, MD 20742, USA; Joint Quantum Institute, University of Maryland, College Park, MD 20742, USA; Joint Center for Quantum Information and Computer Science, University of Maryland, College Park, MD 20742, USA; Computer Science, University of Maryland, College Park, MD 20742, USA

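AI summary: This paper discusses applications of machine learning in quantum-gas experiments, focusing on image denoising and the identification of solitonic waves in Bose-Einstein condensates, and comments on the relationship between performance, model complexity, and interpretability.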

Abstract

Virtually all aspects of many-body atomic physics are challenging: experiments are technically demanding, datasets have become enormous, and the memory and CPU requirements for classical simulation of generic quantum systems often scale exponentially with system size. Machine learning (ML) methods are already assisting in each of these areas and are poised to become transformative. Here, we focus on two specific applications of ML to cold-atom-based quantum simulators. These devices generally generate data in the form of images; we first showcase denoising of raw images and then identify solitonic waves in Bose-Einstein condensates. In both of these examples, we comment on the interplay between performance, model complexity, and interpretability.

2605.18681 2026-05-19 cs.AI cs.LG Version update

Learning Quantifiable Visual Explanations Without Ground-Truth

Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, Dimosthenis Karatzas

Affiliations: Computer Vision Center, Barcelona, Spain; Autonomous University of Barcelona, Spain

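AI summary: This paper proposes a quantifiable metric based on continuous input perturbation for assessing the quality of XAI methods, together with a new XAI method that fine-tunes a model with a differentiable approximation of the metric to produce causal explanations without hurting model performance.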

Abstract

Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.

2605.18675 2026-05-19 cs.LG cs.AI Version update

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

Affiliations: Department of Mechanical Engineering, Iowa State University; Department of Computer Science, Iowa State University; Department of Industrial and Manufacturing Systems Engineering, Iowa State University; Translational AI Center, Iowa State University

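AI summary: This paper proposes COOPO, which cycles between offline training and online fine-tuning to address the distributional shift and limited performance of offline RL and the high environment-interaction cost of online RL. Periodically returning to offline training reduces forgetting and drift and improves sample efficiency and performance.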

Abstract

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.

2605.18667 2026-05-19 cs.CV cs.LG Version update

Better Together: Evaluating the Complementarity of Earth Embedding Models

Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė, Marc Rußwurm, Ioannis N Athanasiadis

Affiliations: Wageningen University; University College London; University of Bonn

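AI summary: This paper studies the complementarity of Earth embedding models, proposes fusing embeddings to improve performance, and evaluates four models across different tasks, finding that complementarity depends on both task and location.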

Abstract

Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth's surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.

2605.18666 2026-05-19 cs.LG cs.CR Version update

A No-Defense Defense Against Gradient-Based Adversarial Attacks on ML-NIDS: Is Less More?

Mohamed elShehaby, Ashraf Matrawy

Affiliations: Carleton University; Computer Engineering, Carleton University, Ottawa, Canada; School of Information Technology, Carleton University, Ottawa, Canada

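AI summary: This paper investigates whether careful architectural choices alone can yield an inherently robust DNN-based network intrusion detection system (NIDS) without additional explicit defenses. Across thousands of experiments, shallower networks, reduced feature sets, and ReLU activations lower vulnerability to adversarial attacks, and a simple model outperforms deeper ones while keeping high clean-traffic detection and low training time.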

Abstract

Gradient-based adversarial attacks subtly manipulate inputs of Machine Learning (ML) models to induce incorrect predictions. This paper investigates whether careful architectural choices alone can yield an inherently robust Deep Neural Network (DNN)-based Network Intrusion Detection Systems (NIDS), without any additional explicit defenses. Through thousands of experiments, around 2200, varying network depth, feature dimensionality, activation functions, and dropout across FGSM, PGD, and BIM attacks, we show that shallower networks, reduced feature sets, and ReLU activation consistently and jointly reduce adversarial vulnerability. Moreover, a simple model following this recipe outperforms deeper, fully-featured adversarially trained models, while maintaining near-perfect clean-traffic detection and lower training times. Nevertheless, while less is more, the selection of the right less is what truly matters.
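
As a reminder of what a gradient-based attack does, here is FGSM on a logistic-regression detector, where the input gradient has a closed form. This is a toy sketch rather than the paper's experimental setup: the weights, the two-feature input, and `eps` are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bce_loss(w, b, x, y):
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each feature by eps in the direction that raises the loss."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]   # d(loss)/dx for logistic regression
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0                 # made-up detector weights
x, y, eps = [0.5, 0.5], 1, 0.1          # a sample labeled as an attack (y = 1)
x_adv = fgsm(w, b, x, y, eps)
print(bce_loss(w, b, x, y), bce_loss(w, b, x_adv, y))
```

The perturbed sample stays within `eps` of the original in every feature, yet the loss rises and the predicted attack probability falls. PGD and BIM, which the paper also evaluates, repeat steps of this kind iteratively.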

2605.18663 2026-05-19 cs.AI cs.CL cs.LG Version update

GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel, Alexandre Rezende, Steven McClain

Affiliations: Meta Superintelligence Labs

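AI summary: This paper proposes the GIM benchmark, which evaluates models on tasks that integrate multiple cognitive domains. Its 820 original problems combine broadly accessible knowledge with several cognitive operations so that reasoning stays grounded in realistic tasks. Ability estimates are calibrated with a 2PL IRT model, a leaderboard covers 22 models and 47 test configurations, and the trade-off between test-time compute and model capability is studied in depth.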

Comments 56 pages, 27 figures, 4 tables. Code: https://github.com/facebookresearch/gim ; Dataset: https://huggingface.co/datasets/facebook/gim

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
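
The role of the 2PL model in the abstract above can be illustrated with a minimal sketch. It assumes the standard 2PL curve, in which the expected score on an item is a logistic function of ability minus difficulty, scaled by discrimination. The function names, the item parameters, the squared-error fit, and the grid search are all illustrative assumptions, not the GIM calibration code.

```python
import math

def expected_score(theta, a, b):
    """Standard 2PL curve: expected score in [0, 1] for ability theta,
    item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(scores, items, grid=None):
    """Grid-search the ability that best fits observed continuous scores.

    scores: dict item_id -> observed score in [0, 1] (missing items are skipped)
    items:  dict item_id -> (a, b) calibrated parameters (hypothetical values)
    """
    grid = grid or [g / 100.0 for g in range(-400, 401)]

    def loss(theta):
        # Squared error between observed and model-expected scores.
        return sum((s - expected_score(theta, *items[i])) ** 2
                   for i, s in scores.items())

    return min(grid, key=loss)

# Hypothetical item bank: (discrimination, difficulty) per problem.
items = {"q1": (1.0, -1.0), "q2": (1.5, 0.0), "q3": (0.8, 1.0)}
strong = estimate_ability({"q1": 0.95, "q2": 0.9, "q3": 0.7}, items)
weak = estimate_ability({"q1": 0.6, "q3": 0.1}, items)  # q2 is missing
```

Because ability is fitted only on the items a configuration actually answered, two configurations with different missing items can still be placed on one scale, which is the property the abstract relies on when raw accuracy is distorted.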

2605.18662 2026-05-19 cs.LG 版本更新

Efficient and Noise-Tolerant PAC Learning of Multiclass Linear Classifiers

高效且抗噪声的多类线性分类器PAC学习

Rita Adhikari, Shiwei Zeng

发表机构 * Augusta University(奥古斯塔大学)

AI总结 本文研究了在存在恶意噪声的情况下,如何高效学习多类线性分类器,并提出了一种在混合分布和边际条件下的PAC学习算法,该算法在常数噪声率下仅需O(k²·(d log d + log k))个样本。

详情
AI中文摘要

自上个世纪以来,噪声容忍的PAC学习线性模型一直是机器学习社区的核心关注点。近年来,许多计算高效的算法已被提出,用于在多种噪声模型下学习线性阈值函数。然而,当问题考虑多类学习设置,即当类别数k至少为3时,尚不清楚是否存在计算高效的PAC学习算法,当数据集被恶意破坏时。在本文中,我们假设边际分布是有限方差分布的混合,并且数据集同时满足边际条件。我们证明存在一种计算高效的算法,能够在常数速率的恶意噪声下,使用至多O(k²·(d log d + log k))个样本来PAC学习多类线性分类器{h_w:x↦argmax_{y∈[k]}w_y·x, x∈R^d, w∈R^{kd}}。我们的算法包含两个主要成分:基于聚类的修剪方案和标准的多类合页损失最小化程序。即使在二元设置的特殊情况下,即k=2时,我们的结果也严格优于所有先前工作。

英文摘要

Noise-tolerant PAC learning of linear models has been of central interest to the machine learning community since the last century. In recent years, many computationally-efficient algorithms have been proposed for the problem of learning linear threshold functions under multiple noise models. Yet, when the problem is considered under multiclass learning settings, i.e., when the number of classes $k$ is at least $3$, it is unknown whether there exist computationally-efficient PAC learning algorithms when the data sets are maliciously corrupted. In this paper, we assume that the marginal distribution is a mixture of bounded-variance distributions and that the data sets satisfy a margin condition. We show that there exists a computationally-efficient algorithm that PAC learns multiclass linear classifiers $\{h_w:x\mapsto \arg\max_{y\in[k]}w_y\cdot x, x\in \mathbb{R}^d, w\in\mathbb{R}^{kd}\}$ using at most $O(k^2\cdot (d\log d+\log k))$ samples even under a constant rate of nasty noise. Our algorithm consists of two main ingredients: a cluster-based pruning scheme and a standard multiclass hinge loss minimization program. Even in the special case of the binary setting, i.e., $k=2$, our result is strictly stronger than all prior works.

2605.18656 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning

统计界限与差分隐私联邦学习的高效算法

Arnab Auddy, Xiangni Peng, Subhadeep Paul

发表机构 * Department of Statistics(统计系)

AI总结 本文研究了差分隐私联邦学习中估计精度、隐私约束和通信成本之间的权衡,提出了FedHybrid和FedNewton两种高效算法,通过减少通信成本提升准确性,并建立了均方误差的上界和下界以评估算法性能。

详情
AI中文摘要

联邦学习是训练机器学习和人工智能模型的一种主流框架,用于在众多用户设备或数据库之间协同训练。我们研究了差分隐私(DP)联邦M估计中估计精度、隐私约束和通信成本之间的权衡。文献中的两种标准方法是FedAvg,可能面临较高的联邦偏差,以及FedSGD,可能导致较高的通信成本。为了在减少通信成本的同时提高准确性,我们提出了FedHybrid,它使用FedSGD,但起始时通过FedAvg估计器改进初始化。我们还提出了FedNewton,通过平均本地牛顿迭代来减少FedAvg的偏差,从而在客户端数量增长缓慢时,以更少的通信轮次达到与FedSGD相当的估计精度。我们建立了这些估计器的DP版本的均方误差率的有限样本上界,作为客户端数量、本地样本大小、隐私预算和迭代次数的函数。我们进一步推导了任何迭代私有联邦过程的均方误差的最小最大下界,以作为评估这些方法最优性差距的基准。我们还通过在MNIST和CIFAR-10计算机视觉数据集上训练逻辑回归和神经网络来数值评估我们的方法。

英文摘要

Federated Learning is a leading framework for training ML and AI models collaboratively across numerous user devices or databases. We study the trade-offs among estimation accuracy, privacy constraints, and communication cost for differentially private (DP) federated M-estimation. The two standard methods in the literature are FedAvg, which may suffer from high federation bias, and FedSGD, which can incur high communication cost. To improve accuracy at a reduced communication cost, we propose FedHybrid, which runs FedSGD from an improved initialization given by the FedAvg estimator. We also propose FedNewton, which averages local Newton iterations to reduce the bias of FedAvg, achieving an estimation accuracy comparable to FedSGD with far fewer communication rounds when the number of clients grows sufficiently slowly. We establish finite-sample upper bounds on the mean-squared error (MSE) rates of the DP versions of these estimators as functions of the number of clients, local sample sizes, privacy budget, and number of iterations. We further derive a minimax lower bound on the MSE of any iterative private federated procedure, which provides a benchmark to assess the optimality gap of these methods. We numerically evaluate our methods by training a logistic regression and a neural network on the computer vision datasets MNIST and CIFAR-10.

2605.18654 2026-05-19 cs.LG cs.AI 版本更新

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

口袋基础模型:将TFMs压缩成CPU可用的梯度提升树

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文提出了一种将高性能表格基础模型(TFMs)压缩成CPU原生梯度提升树的方法,以解决实时欺诈评分需求与现有模型性能之间的差距,同时在多个数据集上验证了该方法的有效性。

详情
AI中文摘要

一个欺诈评分器需要在2毫秒内响应。最好的表格基础模型(TFMs)在GPU上需要151-1275毫秒。我们通过将TFM离线压缩成XGBoost或CatBoost的学生模型,该模型可以在CPU上原生运行,从而缩小这一差距。核心障碍是特定于上下文学习(ICL)教师:他们在评分自己的训练集时会泄露标签,导致软目标崩溃为近一热向量,不再有可供压缩的类间结构。分层出折(OOF)教师标注可以防止这一问题。在153个来自TALENT、OpenML-CC18、TabZilla和TabArena的数据集上,将TabICLv2压缩成XGBoost在CPU上达到0.882宏均AUC(96.5%的教师AUC),在1.9毫秒内,比教师-学生对的教师模型快38到860倍,且在统计上显著优于调优的CatBoost基线(Wilcoxon p=0.0008;51%胜率)。四个进一步发现:教师排名精确转移到学生排名;收益集中在低维数据(<21个特征:比CatBoost高0.011 vs. >21个特征:高0.001);多教师平均有助于MLP学生(+0.006,p=0.003)但对树学生增加不到0.001;在高维任务中,当教师本身落后于CatBoost时,压缩反而使情况更糟。完整的流水线作为TabTune库的一部分开源。

英文摘要

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
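
The label-leak problem and the out-of-fold remedy described above can be sketched with a toy stand-in for an in-context teacher. Everything here is an assumption for illustration: `toy_icl_teacher` is an inverse-distance class vote rather than a real TFM, the data are made up, and the fold logic is a plain-Python approximation of stratified splitting.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample a fold id so every class is spread evenly over folds."""
    fold_of = [0] * len(labels)
    seen = defaultdict(int)
    for i, y in enumerate(labels):
        fold_of[i] = seen[y] % n_folds
        seen[y] += 1
    return fold_of

def toy_icl_teacher(ctx_x, ctx_y, query_x, n_classes):
    """Stand-in for an in-context teacher: inverse-distance-weighted class vote.
    A query that also sits in the context dominates its own vote (label leak)."""
    out = []
    for q in query_x:
        votes = [0.0] * n_classes
        for x, y in zip(ctx_x, ctx_y):
            votes[y] += 1.0 / (abs(q - x) + 1e-9)
        total = sum(votes)
        out.append([v / total for v in votes])
    return out

def oof_soft_labels(xs, ys, n_classes, n_folds=2):
    """Label each fold with a teacher whose context excludes that fold."""
    fold_of = stratified_folds(ys, n_folds)
    soft = [None] * len(xs)
    for f in range(n_folds):
        ctx = [i for i in range(len(xs)) if fold_of[i] != f]
        qry = [i for i in range(len(xs)) if fold_of[i] == f]
        preds = toy_icl_teacher([xs[i] for i in ctx], [ys[i] for i in ctx],
                                [xs[i] for i in qry], n_classes)
        for i, p in zip(qry, preds):
            soft[i] = p
    return soft

xs = [0.0, 0.4, 0.5, 0.6, 1.0, 1.1]
ys = [0, 0, 1, 0, 1, 1]
in_sample = toy_icl_teacher(xs, ys, xs, n_classes=2)  # leaky: near one-hot
out_of_fold = oof_soft_labels(xs, ys, n_classes=2)    # keeps class overlap
```

In a full pipeline the `out_of_fold` vectors would serve as the soft targets for the tree student; the in-sample vectors carry almost no inter-class structure because each row sees its own label in the context.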

2605.18648 2026-05-19 cs.LG cs.AI cs.CL 版本更新

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

对软标签学习和校准中人类与模型不确定性的评估

Maja Pavlovic, Silviu Paun, Massimo Poesio

发表机构 * Queen Mary University London(伦敦女王玛丽大学) Amazon(亚马逊) University of Utrecht(乌得勒支大学)

AI总结 本文通过对比人类和模型标签在软标签学习中的效果,发现人类标签不仅提升了模型准确性,还通过正则化作用改善了模型在困难样本上的校准和训练稳定性。

详情
AI中文摘要

人类对齐的人工智能的核心在于理解人类提取的标签相对于合成标签的优势。虽然人类软标签通过捕捉不确定性来提高校准,但先前研究将这些好处与隐含的错误标签修正(模式偏移)混淆了,从而掩盖了软标签的真实效果。我们对MNIST和一个合成变体上的软标签学习进行了受控审计,重新标注子集以提取人类不确定性。通过将软标签监督与底层标签模式偏移解耦,我们发现虽然人类软标签确实提供了准确性提升,但其更大的价值在于作为正则化器,改善模型在困难样本上的校准并促进训练运行中的稳定收敛。数据集制图显示,训练于人类软标签的模型能反映人类不确定性,而训练于合成标签的模型则无法与人类对齐。广泛而言,这项工作提供了一个用于人类-人工智能不确定性对齐的诊断测试平台。

英文摘要

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring the true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals that models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.

2605.18635 2026-05-19 cs.LG cs.AI 版本更新

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

数据呈现与架构:用于表格基础模型的信用风险预测重采样策略

Aditya Tanna, Mitul Solanki, Mohamed Bouadi, Nassim Bouarour, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文研究了在信用风险预测中,通过不同的上下文构建策略对表格基础模型性能的影响,发现上下文构建策略比模型架构对AUC-ROC指标的贡献更大。

详情
AI中文摘要

信用违约预测是一个具有严重类别不平衡、异质特征和严格延迟预算的表格学习问题。表格基础模型(TFMs)通过上下文学习来解决这个问题,其预测结果对上下文窗口的构建方式敏感。我们在Home Credit和Lending Club数据集上基准测试了四种经典模型和五种TFMs,变化上下文构建策略(七种选项)和上下文大小(1K到50K)。在两个数据集上,上下文策略的选择对AUC-ROC的方差解释比模型家族的选择更大:平衡和混合采样比均匀采样增加3到4个AUC点,且差距超过了TFMs之间的差异。使用5K到10K的平衡上下文,最强的TFMs达到经典基线模型在完整数据上训练的AUC,同时恢复了默认类别召回率,而默认阈值GBDTs无法做到。我们将此视为证据,表明在不平衡信用风险设置中,上下文构建而非架构选择是TFMs的主要部署杠杆。

英文摘要

Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
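
One of the context-construction options named above, balanced sampling, can be sketched as follows. The abstract does not specify the exact procedure, so the equal-per-class quota, the sampling with replacement for scarce classes, and the synthetic labels are assumptions.

```python
import random

def balanced_context(labels, context_size, seed=0):
    """Pick row indices for an in-context set with equal counts per class.
    Classes with too few rows are sampled with replacement (an assumption)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    per_class = context_size // len(by_class)
    chosen = []
    for y, idxs in sorted(by_class.items()):
        if len(idxs) >= per_class:
            chosen.extend(rng.sample(idxs, per_class))
        else:
            chosen.extend(rng.choices(idxs, k=per_class))
    rng.shuffle(chosen)
    return chosen

labels = [1] * 50 + [0] * 950  # 5% default rate, hypothetical data
context = balanced_context(labels, context_size=200)
n_defaults = sum(labels[i] for i in context)
```

A uniform sample of the same size would contain about 10 defaults on this data, whereas the balanced context contains 100, which is the kind of difference in what the TFM sees that the study varies.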

2605.18632 2026-05-19 cs.LG cs.AI 版本更新

Position: Weight Space Should Be a First-Class Generative AI Modality

权重空间应成为一种第一类生成式AI模态

Zhangyang Wang, Peihao Wang, Kai Wang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Tencent Hy(腾讯实验室)

AI总结 本文提出将模型检查点视为第一类数据模态,并主张在权重空间中进行生成式建模应成为机器学习的核心原始操作。通过最近的进展表明,神经网络权重可以按需合成,通常在减少适应成本的规模下达到微调性能。本文认为这些结果反映了权重空间中高性能模型占据的低维、高度结构化区域的结构事实。基于此观点,本文将现有方法组织成五阶段流程,调查该方法已实际应用的领域,并澄清当前限制:适配器规模和条件生成正在迅速发展,而无限制的前沿规模检查点合成仍处于开放状态。

Comments AI systems routinely improve or create other AI systems

详情
AI中文摘要

神经网络检查点已悄然成为大规模数据资源:现在存在数百万个训练好的权重向量,每个都编码任务、领域和架构特定的知识。本文立场论文认为,模型检查点应被视为第一类数据模态,并且在权重空间中的生成式建模应被标准化为机器学习的核心基本操作。最近的进展表明,神经权重可以按需合成,通常在减少适应成本的规模下达到微调性能。我们主张这些结果反映了底层的结构事实:高性能模型占据由对称性、平坦性、模块性和共享子空间形状的权重空间中的低维、高度结构化区域。基于这一观点,我们组织现有方法为五阶段流程,调查该方法已实际应用的领域,并澄清当前限制:适配器规模和条件生成正在迅速发展,而无限制的前沿规模检查点合成仍处于开放状态。我们的目标是将社区的默认思维从按任务优化模型转变为从学习的权重分布中采样模型,加速迈向一个AI系统定期改进或创建其他AI系统的时代。

英文摘要

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.

2605.18624 2026-05-19 cs.CR cs.LG 版本更新

Learning to Look Benign: Targeted Evasion of Malware Detectors via API Import Injection

学习看起来无害:通过API导入注入实现针对恶意软件检测器的定向逃避

Juozas Dautartas, Olga Kurasova, Juozapas Rokas Čypas, Viktor Medvedev

发表机构 * Institute of Data Science and Digital Technologies, Faculty of Mathematics and Informatics, Vilnius University(数据科学与数字技术研究所,数学与信息学学院,维尔纽斯大学)

AI总结 本文研究了通过添加少量特定良性软件类别的Win32 API导入,将恶意软件样本故意误分类为特定良性类别而非仅仅非恶意软件的可能性。提出了一种基于条件变分自编码器(CVAE)的框架,其解码器严格加法,能够引入新的API调用但不移除现有调用,从而保留恶意软件功能。对于每个恶意软件样本,该框架自动识别其最接近的良性类别并将其作为逃避目标。

详情
AI中文摘要

基于机器学习的恶意软件检测器广泛应用于杀毒和端点检测系统,但其对静态特征的依赖使其容易受到对抗性操纵。本文研究了一种恶意软件样本是否可以通过添加少量具有所选类别特征的Win32 API导入,故意被误分类为特定良性软件类别,而不仅仅是非恶意软件。我们提出了一种以条件变分自编码器(CVAE)为核心的框架,其解码器严格加法。该框架可以引入新的API调用但永远不会移除现有的调用,通过设计保留恶意软件功能。对于每个恶意软件样本,该框架会自动识别其最接近的良性类别并将其作为逃避目标。一个知识蒸馏的可微代理使能够基于梯度训练对抗非可微的集成检测器。在六个类别二进制Win32 API导入向量数据集上的实验表明,针对一个达到87.5%恶意软件召回率的检测器,添加仅20个API导入可将召回率降低至30%。在k=20时,逃过检测的样本中99%被分类为预期的目标类别。CVAE在所有测试的注入大小(k=5到50)中均优于基于频率的基线和随机选择。在真实PE文件提交到VirusTotal的验证中确认,该攻击能够转移到商业静态检测引擎,平均减少标记引擎的标记率54.5%。这些发现暴露了基于API的恶意软件分类器中的具体漏洞,并证明通过最小化、功能保留的修改可以实现针对所选良性类别的定向逃避。

英文摘要

Machine learning-based malware detectors are widely deployed in antivirus and endpoint detection systems, yet their reliance on static features makes them vulnerable to adversarial manipulation. This paper investigates whether a malware sample can be intentionally misclassified as a specific benign software category, not merely as "not malware", by adding a small number of Win32 API imports characteristic of that selected category, without removing any existing imports or retraining the detector. We propose a framework centered on a Conditional Variational Autoencoder (CVAE) whose decoder is strictly additive. It can introduce new API calls but never remove existing ones, preserving malware functionality by design. For each malware sample, the framework automatically identifies which benign category it most closely resembles and uses that as the evasion target. A knowledge-distilled differentiable proxy enables gradient-based training against the non-differentiable ensemble detector. Experiments on a six-class dataset of binary Win32 API import vectors extracted from 3,799 Windows executables (five benign categories, one malware class) show that, against a detector achieving 87.5% malware recall, adding just 20 API imports reduces recall to 30%. At k=20, among samples that evaded detection, 99% are classified as the intended target category. The CVAE outperforms both a frequency-based baseline and random selection at every tested injection size (k = 5 to 50). Validation on real PE files submitted to VirusTotal confirms that the attack transfers to commercial static detection engines, with an average 54.5% reduction in flagging engines. These findings expose a concrete vulnerability in API-based malware classifiers and demonstrate that targeted evasion into a chosen benign category is achievable with minimal, functionality-preserving modifications.

2605.18610 2026-05-19 cs.CV cs.AI cs.LG 版本更新

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA: 通过冲突厌恶任务算术实现持续机器去学习

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

发表机构 * Fujian Normal University(福建师范大学) Nanyang Technological University(南洋理工大学) Xidian University(西安电子科技大学)

AI总结 本文首次研究了视觉语言模型的持续去学习问题,提出CATA方法,通过冲突厌恶任务算术有效解决去学习中的有效性、模型保真度和持续性挑战。

详情
AI中文摘要

视觉语言模型(VLMs)在对齐视觉和文本表示方面表现出色,能够支持多种多模态应用。然而,其大规模训练数据不可避免地引发了隐私、版权和不良内容的担忧,这使得机器去学习变得必要。尽管现有研究主要关注单次去学习,但实际VLM部署往往涉及随时间推移的连续删除请求,从而产生持续机器去学习。在本文中,我们首次研究了VLMs的持续去学习,并识别出该设置中的三个关键挑战:去除目标知识的有效性、保留模型效用的保真度以及在连续更新下防止知识重新出现的持续性。为了解决这些挑战,我们提出了CATA,一种冲突厌恶任务算术方法,将每个遗忘请求表示为一个去学习任务向量。通过维护历史任务向量并执行符号感知的冲突厌恶聚合,CATA抑制可能削弱先前遗忘效果的冲突更新组件。在单次和持续设置下的大量实验表明,CATA在遗忘有效性、模型保真度和遗忘持续性方面均优于基线方法。

英文摘要

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
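
The idea of sign-aware aggregation over task vectors can be sketched in a few lines. The abstract does not give the aggregation rule, so the zero-out rule below (drop any coordinate of the new vector whose sign opposes the accumulated history) is a hypothetical stand-in, and the function names are illustrative.

```python
def conflict_averse_merge(history, new_vector):
    """Add a new unlearning task vector to the accumulated history,
    dropping coordinates whose sign opposes the accumulated direction.
    This zero-out rule is an illustrative assumption, not the CATA rule."""
    if history:
        accumulated = [sum(col) for col in zip(*history)]
    else:
        accumulated = [0.0] * len(new_vector)
    kept = [0.0 if a * n < 0 else n for a, n in zip(accumulated, new_vector)]
    merged = [a + k for a, k in zip(accumulated, kept)]
    return merged, kept

def apply_task_vector(weights, task_vector, scale=1.0):
    """Task arithmetic: edit the model by adding a scaled task vector."""
    return [w + scale * t for w, t in zip(weights, task_vector)]

history = [[1.0, -2.0, 0.5]]  # task vector of an earlier forget request
merged, kept = conflict_averse_merge(history, [0.5, 1.0, -0.1])
new_weights = apply_task_vector([0.0, 0.0, 0.0], merged)
```

Here the second and third coordinates of the new request would partly undo the earlier forgetting direction, so only the first coordinate is kept.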

2605.18609 2026-05-19 cs.LG 版本更新

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

在经典动量加速下实现小批量SGD的完美并行化

Sachin Garg, Michał Dereziński

发表机构 * University of Michigan(密歇根大学)

AI总结 本文提出了一种通用的小批量优化理论,展示了经典动量对梯度小批量大小的加速比例关系,从而实现小批量计算的完美并行化。

详情
AI中文摘要

利用经典动量方案(如Polyak的重球方案)加速随机梯度方法,在训练大规模机器学习模型中证明了其高度成功,特别是在结合大规模小批量计算的硬件加速时。然而,经典动量对随机小批量优化的影响在理论上理解甚微,先前工作需要强噪声假设和极大的小批量。在本文中,我们开发了一种通用的随机动量加速理论,用于在插值域中优化二次函数,这是一门研究深度学习动态的流行抽象,也包括随机Kaczmarz和坐标下降等经典方法。我们的框架涵盖了重球和Nesterov式动量,允许任意小批量大小,并对随机噪声做出最小假设。特别地,我们证明了经典动量的加速与梯度小批量大小成正比(除了自然饱和点),从而实现小批量计算的完美并行化。我们的理论还提供了一个简单的动量参数选择,该选择在经验上被证明是有效的。

英文摘要

Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.
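
The setting studied above, heavy-ball momentum on a quadratic in the interpolation regime, can be sketched as mini-batch SGD on a consistent least-squares problem. The step size, momentum value, and problem sizes are arbitrary choices for illustration and are not the parameter choice derived in the paper.

```python
import random

def heavy_ball_sgd(A, b, batch_size, step, momentum, n_steps, seed=0):
    """Mini-batch SGD with Polyak heavy-ball momentum on
    f(x) = (1/2n) * sum_i (a_i . x - b_i)^2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x, x_prev = [0.0] * d, [0.0] * d
    for _ in range(n_steps):
        batch = rng.sample(range(n), batch_size)
        grad = [0.0] * d
        for i in batch:
            resid = sum(a * xi for a, xi in zip(A[i], x)) - b[i]
            for j in range(d):
                grad[j] += resid * A[i][j] / batch_size
        # x_{t+1} = x_t - step * grad + momentum * (x_t - x_{t-1})
        x_next = [xi - step * g + momentum * (xi - xp)
                  for xi, g, xp in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x

rng = random.Random(1)
d, n = 5, 50
x_star = [rng.gauss(0, 1) for _ in range(d)]
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# Consistent system: every row is fit exactly by x_star (interpolation).
b = [sum(a * s for a, s in zip(row, x_star)) for row in A]

x_hat = heavy_ball_sgd(A, b, batch_size=10, step=0.1, momentum=0.5, n_steps=2000)
error = max(abs(u - v) for u, v in zip(x_hat, x_star))
```

Since the gradient noise vanishes at the solution in this regime, the iterates can converge to `x_star` with a constant step size; rerunning with different `batch_size` values is one way to explore how the useful amount of momentum changes with the batch.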

2605.18607 2026-05-19 cs.CL cs.LG 版本更新

Forecasting Downstream Performance of LLMs With Proxy Metrics

通过代理指标预测大语言模型的下游性能

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

发表机构 * Mila – Quebec AI Institute & McGill University(魁北克AI研究院与麦吉尔大学) CIFAR AI Chair(CIFAR人工智能主席) ServiceNow Research Periodic Labs(ServiceNow研究周期实验室)

AI总结 本文提出通过聚合候选模型的下一个token分布中的token级统计信息(如熵、top-k准确率和专家token排名)来构建代理指标,以更准确地预测大语言模型的下游性能,优于传统的损失和计算量基线方法。

Comments Preprint. 31 pages

详情
AI中文摘要

语言模型的发展进步往往由比较决策驱动:选择哪种架构、哪种预训练语料库或哪种训练配方。做出这些决策需要可靠的性能预测,但常用的两个信号从根本上受到限制。交叉熵损失与下游能力不匹配,而直接下游评估成本高、稀疏且在早期训练阶段信息有限。相反,我们提出通过聚合候选模型的下一个token分布中的token级统计信息(如熵、top-k准确率和专家token排名)来构建代理指标。在三个设置中,我们的代理指标始终优于基于损失和计算量的基线方法:1)在跨家族模型选择中,它们对异质推理模型的排名平均Spearman Rho为0.81(与交叉熵损失的Rho为0.36相比);2)在预训练数据选择中,它们能以大约10,000倍更低的计算成本可靠地对25个候选语料库进行排名,推动帕累托前沿超越现有方法;3)在训练时间预测中,它们在18倍计算范围内预测下游准确性时,误差大约是现有方法的一半。这些结果表明,专家轨迹是评估模型能力广泛有用的信息源,使整个模型开发生命周期中的性能预测变得可靠。

英文摘要

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
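
The token-level statistics named above can be sketched for a toy vocabulary. The abstract does not say how the statistics are aggregated into a proxy, so the simple averaging over an expert trajectory, the function names, and the example distributions are assumptions.

```python
import math

def token_stats(probs, expert_token, k=5):
    """Statistics of one next-token distribution at one expert-written position.

    probs: list of probabilities over the vocabulary (sums to 1)
    expert_token: index of the token the expert solution actually uses
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means the expert token is the model's most likely token.
    rank = 1 + sum(1 for p in probs if p > probs[expert_token])
    return {"entropy": entropy, "expert_rank": rank, "top_k_hit": rank <= k}

def proxy_metrics(distributions, expert_tokens, k=5):
    """Average the per-token statistics over an expert trajectory."""
    stats = [token_stats(p, t, k) for p, t in zip(distributions, expert_tokens)]
    n = len(stats)
    return {
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "mean_expert_rank": sum(s["expert_rank"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
    }

# Two positions over a three-token vocabulary (hypothetical numbers).
dists = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
metrics = proxy_metrics(dists, expert_tokens=[1, 2], k=1)
```

Unlike cross-entropy, which only looks at the probability of the expert token, the rank and top-k statistics depend on how that token compares with the alternatives the model prefers.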

2605.18598 2026-05-19 cs.LG cond-mat.stat-mech math.FA math.PR math.ST stat.TH 版本更新

Pointwise Generalization in Deep Neural Networks

深度神经网络中的逐点泛化

Shaojie Li, Yunbei Xu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种深度神经网络逐点泛化的理论框架,通过分析全连接网络的点wise Riemannian 维度,建立了新的表示学习统计基础,提供了更精确的泛化界限。

详情
AI中文摘要

我们通过建立全连接网络的点wise泛化理论,探讨了深度神经网络为何能够泛化的根本问题。该框架解决了长期以来在刻画丰富非线性特征学习领域中的障碍,并为表示学习建立了新的统计基础。对于每个训练好的模型,我们通过从各层学习的特征表示的本征值推导出点wise Riemannian 维度来表征假设。这建立了一个有原则的框架,用于推导依赖假设的、具有表示意识的泛化界限。这些界限在理论和实验上都比基于模型大小、范数乘积和无限宽度线性化的方法有数量级更紧的保证。在分析上,我们识别了深度网络可 tractable 的结构属性和数学原理。在经验上,点wise Riemannian 维度表现出显著的特征压缩,随着过度参数化程度的增加而减小,并捕捉了优化器的隐含偏置。综合来看,我们的结果表明,深度网络在实际情况下是数学上可 tractable 的,并且其泛化性可以通过点wise、特征谱意识的复杂性得到清晰解释。

英文摘要

We address the fundamental question of why deep neural networks generalize by establishing a pointwise generalization theory for fully connected networks. This framework resolves long-standing barriers to characterizing the rich nonlinear feature-learning regime and builds a new statistical foundation for representation learning. For each trained model, we characterize the hypothesis via a pointwise Riemannian Dimension, derived from the eigenvalues of the learned feature representations across layers. This establishes a principled framework for deriving hypothesis-dependent, representation-aware generalization bounds. These bounds offer a systematic upgrade over approaches based on model size, products of norms, and infinite-width linearizations, yielding guarantees that are orders of magnitude tighter in both theory and experiment. Analytically, we identify the structural properties and mathematical principles that explain the tractability of deep networks. Empirically, the pointwise Riemannian Dimension exhibits substantial feature compression, decreases with increased over-parameterization, and captures the implicit bias of optimizers. Taken together, our results indicate that deep networks are mathematically tractable in practical regimes and that their generalization is sharply explained by pointwise, feature-spectrum-aware complexity.

2605.18591 2026-05-19 cs.LG cs.AI 版本更新

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

随机优势变换(RAT):通过直接反向传播计算自然策略梯度

Mingfei Sun

发表机构 * The University of Manchester, United Kingdom(曼彻斯特大学,英国)

AI总结 本文提出RAT方法,通过直接反向传播估计正则化自然策略梯度,解决了传统方法中估计和求逆Fisher矩阵成本高的问题,实验证明其在连续和视觉控制基准上性能优异且易于实现。

Comments Accepted to ICML 2026

详情
AI中文摘要

自然策略梯度通过考虑分布空间的几何特性来提高优化效果,但其实际应用受限于估计和求逆Fisher矩阵的成本。我们提出了随机优势变换(RAT),一种通过直接反向传播估计Tikhonov正则化自然策略梯度的方法。通过应用Woodbury公式,我们将正则化自然策略梯度重新表述为带有变换优势的普通策略梯度。RAT通过在在线小批量上应用随机块Kaczmarz迭代高效计算这种变换,避免了显式Fisher构造、共轭梯度求解器和架构特定的近似。我们为RAT提供了收敛保证,并实验证明其在连续和视觉控制基准上与现有自然梯度方法相媲美或更优,同时保持简单易用且兼容各种架构。

英文摘要

Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural policy gradients as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.
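
The reformulation can be made concrete with the push-through form of the Woodbury identity. The notation is an assumption for illustration: $J \in \mathbb{R}^{n \times p}$ stacks the score vectors $\nabla_\theta \log \pi_\theta(a_i \mid s_i)$ of a mini-batch, $A \in \mathbb{R}^n$ holds the advantages, $\hat F = \tfrac{1}{n} J^\top J$ is the empirical Fisher matrix, and $\lambda > 0$ is the Tikhonov parameter.

```latex
\begin{align}
g &= \tfrac{1}{n} J^\top A
  && \text{(vanilla policy gradient)} \\
(\hat F + \lambda I_p)^{-1} g
  &= \tfrac{1}{n} \bigl(\tfrac{1}{n} J^\top J + \lambda I_p\bigr)^{-1} J^\top A \\
  &= \tfrac{1}{n} J^\top \bigl(\tfrac{1}{n} J J^\top + \lambda I_n\bigr)^{-1} A
   = \tfrac{1}{n} J^\top \tilde A ,
\qquad
\bigl(\tfrac{1}{n} J J^\top + \lambda I_n\bigr) \tilde A = A .
\end{align}
```

The middle step uses $\bigl(\tfrac{1}{n} J^\top J + \lambda I_p\bigr) J^\top = J^\top \bigl(\tfrac{1}{n} J J^\top + \lambda I_n\bigr)$. Under these assumptions the regularized natural gradient is a vanilla policy gradient with $A$ replaced by $\tilde A$, and $\tilde A$ comes from an $n \times n$ linear system that a row-action solver such as block Kaczmarz can approximate without forming $\hat F$.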

2605.18580 2026-05-19 cs.AI cs.LG 版本更新

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

当结果看似正确但纪律却失败:基于轨迹的评估在隐藏对手状态下的应用

Peiying Zhu, Sidi Chang

发表机构 * Blossom AI Blossom AI Labs(Blossom AI 实验室)

AI总结 本文提出了一种基于轨迹的评估方法,用于评估在隐藏对手状态下的行为纪律稳定性,通过轨迹诊断、机制分离和转移测试来改进强化学习策略,特别是在酒店定价和隐藏预算竞标任务中。

详情
AI中文摘要

仅结果的评估可能无法保证经济安全的智能体:一种策略可能在达到业务KPI的同时,违反可部署的行为纪律。在酒店定价中,当存在隐藏的对手状态时,学习者可能在看似合理的每间房收入上取得成绩,却无法保持规则基于的收益管理对手的定价纪律。我们引入了纪律稳定性,一种基于轨迹的评估范式:定义基准行为,限制观察到部署阶段,从失败中诱导轨迹诊断,通过消融分离机制,并测试转移和部署。在两个酒店基准和一个紧凑的隐藏预算竞标任务中,仅奖励的PPO变体无法实现轨迹对齐;揭示隐藏状态可减少标签不确定性;确定性复制可压缩不确定性;而轨迹先验或修正历史策略能更好地保持价格或投标分布。纯粹的行为克隆在对称模仿中几乎足够,而轨迹先验强化学习在容量不对称情况下增加有限的适应性。本文的贡献是一种评估和基准范式,而不是新的优化器或关于多智能体强化学习的普遍声明。

英文摘要

Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL.

2605.18576 2026-05-19 cs.LG 版本更新

scHelix: Asymmetric Dual-Stream Integration via Explicit Gene-Level Disentanglement

scHelix: 通过显式基因层面解缠实现非对称双流整合

Xichen Yan, Zelin Zang, Changxi Chi, Jingbo Zhou, Chang Yu, Jinlin Wu, Shenghui Cheng, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

发表机构 * Jinan University(济南大学) Westlake University(西湖大学)

AI总结 scHelix通过显式基因层面解缠实现非对称双流整合,解决单细胞RNA测序数据整合中消除批次效应与保持生物学忠实性之间的矛盾,通过双流稀疏扩散编码器和非对称对齐-细化-融合协议提升整合效果。

Comments 17 pages, 8 figures, accepted by KDD 26

详情
AI中文摘要

单细胞RNA测序(scRNA-seq)数据整合中一个关键挑战是解决消除批次效应与保持生物学忠实性之间的张力。尽管近期证据表明批次效应在基因层面异质性表现,但大多数现有方法对转录组进行统一处理,常导致过度校正和细微生物学信号的丢失。为此,我们提出了scHelix,一个数据自适应框架,通过在输入层面显式将基因划分为领域不变的Anchors和领域敏感的Variants。scHelix利用配备停止梯度图缓存的双流稀疏扩散编码器,高效学习多尺度结构表示。我们的核心方法是一种新的非对称Align-Refine-Fuse协议:首先将不稳定的Variant流对齐到稳定的Anchor流拓扑结构,随后进行保守细化阶段,其中Anchor流通过有界残差门吸收去噪细节。这种分而治之的架构防止了捷径学习,确保在不损害生物簇完整性的情况下实现稳健的批次去除。广泛基准测试表明,scHelix在性能上优于现有最先进方法。

英文摘要

A critical challenge in single-cell RNA sequencing (scRNA-seq) integration is resolving the tension between eliminating batch effects and maintaining biological fidelity. While recent evidence indicates that batch effects manifest heterogeneously across genes, most existing methods process the transcriptome uniformly, frequently resulting in over-correction and loss of subtle biological signals. To address this, we present scHelix, a dataset-adaptive framework that fundamentally changes how features are processed by explicitly partitioning genes into domain-invariant Anchors and domain-sensitive Variants at the input level. scHelix utilizes a dual-stream sparse diffusion encoder equipped with stop-gradient graph caching to efficiently learn multi-scale structural representations. The core of our approach is a novel asymmetric Align-Refine-Fuse protocol: the unstable Variant stream is first aligned to the robust topology of the Anchor stream, followed by a conservative refinement phase where the Anchor stream absorbs denoised details via bounded residual gating. This divide-and-conquer architecture prevents shortcut learning and ensures robust batch removal without compromising the integrity of biological clusters. Extensive benchmarking demonstrates that scHelix outperforms state-of-the-art methods.

2605.18567 2026-05-19 cs.CL cs.LG 版本更新

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

GUT-IS: 一种数据驱动的方法,用于整合信息系统的构念及其关系

Maximilian Reinhardt, Jonas Scharfenberger, Burkhardt Funk

发表机构 * Institute of Information Systems(信息系统研究所)

AI总结 本文提出了一种数据驱动的方法,通过结合任务适应的文本嵌入和聚类技术,生成构念分组候选集,并利用显式权衡语义纯度和聚类数量简洁性的损失函数选择最优解,从而分析构念分组及其关系在优先级从纯度转向简洁性时的变化。

Comments Accepted at the 34th European Conference on Information Systems (ECIS 2026), Milan, Italy

详情
AI中文摘要

结构方程建模在信息系统研究中被广泛应用。然而,不一致的构念定义阻碍了知识的累积发展。在本工作中,我们提出了一种旨在将结构方程模型整合到统一模型中的方法:我们使用任务适应的文本嵌入和聚类技术生成构念分组的候选集。随后,我们利用一个损失函数来显式权衡语义纯度和聚类数量的简洁性,通过显式权衡,我们的方法允许分析构念分组及其关系如何在优先级从纯度转向简洁性时发生变化。实证上,我们对两个来自信息系统领域的数据集进行了评估和探索。

英文摘要

Structural equation modeling is widely used in IS research. However, inconsistent construct definitions impede the cumulative development of knowledge. In this work, we present an approach that aims at the integration of structural equation models into a unified model: we use a combination of task-adapted text embeddings and clustering to produce a candidate set of construct groupings. Subsequently, we select the optimal solution using a loss function that explicitly trades off semantic purity and parsimony in the number of clusters. By making this trade-off explicit, our approach makes it possible to analyze how construct groupings and their relations change as one shifts the priority from purity to parsimony. Empirically, we evaluate and explore the proposed methodology on two datasets from the IS domain.
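The selection step described above can be illustrated with a small sketch. The additive loss, the weight `lam`, the normalization by `n_constructs`, and the candidate numbers are all hypothetical stand-ins, since the abstract does not state the actual loss. The sketch only shows how the chosen grouping moves toward fewer clusters as the weight on parsimony grows.

```python
# Hypothetical candidates: each has a cluster count k and a semantic purity in [0, 1].
CANDIDATES = [
    {"k": 40, "purity": 0.95},
    {"k": 20, "purity": 0.88},
    {"k": 8, "purity": 0.70},
]


def tradeoff_loss(candidate, lam, n_constructs):
    """Assumed loss: impurity plus lam times the fraction of clusters kept."""
    impurity = 1.0 - candidate["purity"]
    parsimony_penalty = candidate["k"] / n_constructs
    return impurity + lam * parsimony_penalty


def select_grouping(candidates, lam, n_constructs=100):
    """Return the candidate grouping with the smallest assumed loss."""
    return min(candidates, key=lambda c: tradeoff_loss(c, lam, n_constructs))


# Sweep the weight from "purity only" toward "parsimony matters a lot".
for lam in (0.0, 1.0, 3.0):
    chosen = select_grouping(CANDIDATES, lam)
    print(f"lam={lam}: k={chosen['k']}, purity={chosen['purity']}")
```

With these made-up numbers the selected grouping shifts from 40 clusters to 20 and then to 8 as `lam` increases, which is the kind of sweep the abstract describes for studying how groupings change.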

2605.18562 2026-05-19 stat.ME cs.AI cs.LG stat.AP 版本更新

Estimating Item Difficulty with Large Language Models as Experts

利用大语言模型作为专家估算项目难度

Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis, Maria Bolsinova

发表机构 * Department of Methodology and Statistics, Tilburg University(蒂尔堡大学方法学与统计学系) Smart Business Technologies(智能商务技术公司) Department of Psychological Methods, University of Amsterdam(阿姆斯特丹大学心理方法系) Prowise Learn, Amsterdam(Prowise Learn公司,阿姆斯特丹) Department of Information and Computing Sciences, Utrecht University(乌得勒支大学信息与计算科学系)

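AI summary This paper studies how to use large language models to estimate the difficulty of newly created items. Comparing model performance across configurations, it finds that pairwise-comparison configurations perform better without additional refinements, while absolute-judgement configurations that incorporate token probabilities and examples of known difficulty also show moderate-to-high alignment.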

Comments 24 pages, 2 figures, 9 tables

详情
AI中文摘要

准确估计项目难度对于有效的评估和适应性学习至关重要。然而,对于新创建的任务,响应数据通常不可用。预测试和专家判断可能成本高且耗时,而机器学习方法通常需要大量标记训练数据。最近的研究表明,大语言模型(LLMs)可能有所帮助。然而,关于如何通过提示配置来模拟专家进行难度估计的证据有限。本研究通过评估三种现成的LLMs作为新任务的难度评估者,填补了这一空白。使用一个在线学习系统中的项目库,研究了6个小学数学领域,将经验难度作为参考。研究采用全因子设计,交叉三个因素:判断格式(绝对vs对偶比较)、决策类型(硬决策vs基于token概率的估计)和提示策略(零样本vs少量样本)。LLM生成的难度估计与经验难度通过斯皮尔曼等级相关性进行比较。在各领域中,LLM生成的估计与经验项目难度表现出中等至强正相关。对于简单的算术任务,某些配置接近之前研究中人类专家报告的准确性范围的上限。对偶比较在无额外优化时始终优于绝对判断。然而,当结合token级概率并提供已知难度的项目示例时,绝对判断配置也表现出中等至高水平的对齐度。本研究将LLMs定位为初始项目校准的有前途的工具,并提供了有效工作流程配置的见解。

英文摘要

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates serving as the reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.
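The evaluation metric named above, the Spearman rank correlation, can be sketched in a few lines of standard-library Python. The two lists of numbers are invented for illustration and are not results from the study. The function names are likewise hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Invented example: LLM difficulty scores and empirical difficulties for five items.
llm_estimates = [0.2, 0.5, 0.4, 0.9, 0.7]
empirical = [-1.3, -0.2, 0.1, 1.8, 0.9]
rho = spearman_rho(llm_estimates, empirical)
print(round(rho, 3))
```

Because only ranks enter the computation, the LLM scores and the empirical difficulties do not need to share a scale, which is convenient when the model outputs are ordinal judgements or probabilities.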

2605.18557 2026-05-19 cs.LG cs.NE q-bio.NC 版本更新

Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data

自监督局部学习规则学习高维数据的隐藏层次结构

Ariane Delrocq, Wu S. Zihan, Guillaume Bellec, Wulfram Gerstner

发表机构 * School of Life Science and School of Computer and Communications Sciences, EPFL(生命科学学院和计算机与通信科学学院,瑞士联邦理工学院) Machine Learning Research Unit, TU Wien(机器学习研究单位,维也纳技术大学)

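AI summary This paper studies local learning rules on the Random Hierarchy Model. It finds that rules of the first type fail, a failure traced to input-specific nonlinearities (masking), whereas rules of the second type learn the hierarchical structure effectively while being data-efficient and biologically plausible.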

详情
AI中文摘要

大脑学习高维感觉输入的抽象表示,但使这种学习成为可能的可塑性规则尚不明确。我们研究了生物合理的算法在随机层次模型(RHM)上的表现,RHM是一个人工数据集,用于研究深度神经网络如何学习高维数据的内在层次结构。我们专注于两种类型的局部学习规则,它们避免了长收敛时间和对称误差网络的使用。第一类使用直接反馈信号来近似从输出层的误差传播。第二类使用分层自监督对比或非对比损失函数,不显式近似输出层的误差。我们证明所有第一类规则都无法解决RHM的任务,并追溯这种失败到输入特定的非线性(masking),这些非线性在完全反向传播中被实现,并对学习复杂任务至关重要。然而,第二类算法能够学习RHM任务的层次隐藏结构,并且与监督反向传播训练一样高效,同时与已知的皮层突触可塑性规则兼容。

英文摘要

The brain learns abstract representations of high-dimensional sensory input, but the plasticity rules that enable such learning are unknown. We study biologically plausible algorithms on the Random Hierarchy Model (RHM), an artificial dataset designed to investigate how deep neural networks learn the intrinsic hierarchical structure of high-dimensional data. We focus on two types of local learning rules that avoid both a long convergence time and the use of a symmetric error network. The first type uses direct feedback signals to approximate error propagation from the output layer. The second type uses layerwise self-supervised contrastive or non-contrastive loss functions that do not explicitly approximate errors at the output layer. We show that all rules of the first type fail to solve the tasks of the RHM and trace this failure back to input-specific nonlinearities ("masking") that are implemented in full backpropagation and are essential for learning complex tasks. However, algorithms of the second type are able to learn the hierarchical hidden structure of the RHM tasks and are as data-efficient as supervised backpropagation training, while being compatible with known rules of synaptic plasticity in cortex.

2605.18554 2026-05-19 cs.LG stat.ML 版本更新

Federated Martingale Posterior Sampling

联邦鞅后验采样

Boning Zhang, Matteo Zecchin, Mingzhao Guo, Dongzhu Liu, Osvaldo Simeone

发表机构 * School of Computing, University of Glasgow(格拉斯哥大学计算学院) Communication Systems Department, EURECOM(EURECOM通信系统部门) Institute for Intelligent Networked Systems, Northeastern University London(伦敦东北大学智能网络系统研究所)

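AI summary This paper proposes federated martingale posterior sampling, which recovers parameter uncertainty from a predictive distribution without sharing local datasets, improving model calibration in federated learning.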

Comments 5 pages

详情
AI中文摘要

联邦贝叶斯神经网络需要为模型参数指定先验分布和似然函数。在现代过参数化模型的权重空间上设定有意义的先验分布非常困难,且任一组件的设定错误都会严重降低准确性和校准性。受预测模型(如大语言模型)快速发展的启发,鞅后验(也称为预测贝叶斯)用预测分布替代先验-似然对,并通过反复抽取预测样本和重新拟合模型来恢复参数不确定性。然而,直接实现联邦版本需要客户端共享本地数据集。本文提出联邦鞅后验(FMP)采样,这是一种单轮、易并行(embarrassingly parallel)的协议,其中每个客户端上传一小组可训练的数据嵌入,服务器集中运行预测采样器。在MNIST、CIFAR-10和CIFAR-100上的实验表明,FMP与集中式方法的结果非常接近,并相对于共识式基线显著提升校准性。

英文摘要

Federated Bayesian neural networks require fixing a prior on the model parameters together with a likelihood. Eliciting meaningful priors on the weight space of modern overparameterized models is notoriously difficult, and misspecification of either component can severely degrade accuracy and calibration. Motivated by the rapid progress of predictive models such as large language models, the martingale posterior, also known as predictive Bayes, replaces the prior-likelihood pair with a predictive distribution and recovers parameter uncertainty by repeatedly drawing predictive samples and refitting the model. A direct federated implementation, however, would require clients to share their local data sets. This letter proposes federated martingale posterior (FMP) sampling, a one-shot embarrassingly parallel protocol in which each client uploads a small set of trainable data embeddings and the server runs the predictive sampler centrally. Experiments on MNIST, CIFAR-10, and CIFAR-100 show that FMP closely matches the centralized counterpart and significantly improves calibration over consensus-style baselines.
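The "draw predictive samples, then refit" loop behind the martingale posterior can be sketched on a toy problem. This is a centralized illustration only, not the federated protocol from the paper. It assumes the simplest possible predictive (resampling from the data seen so far, a Polya-urn scheme) and treats "refitting the model" as recomputing a mean. The function name and all numbers are hypothetical.

```python
import random


def martingale_posterior_mean(data, n_future=1000, n_draws=200, seed=0):
    """Draw approximate posterior samples of the mean by predictive resampling.

    Each draw extends the observed data with n_future imagined observations,
    sampled one at a time from the empirical predictive of everything seen so
    far, and then refits the parameter (here the mean) on the completed data.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        completed = list(data)
        for _ in range(n_future):
            # Predictive step: the next observation copies a uniformly chosen
            # earlier one, so the running mean is a martingale.
            completed.append(rng.choice(completed))
        # Refit step: the parameter estimate on the imagined full data set.
        draws.append(sum(completed) / len(completed))
    return draws


# Toy data: 14 successes and 6 failures.
data = [1] * 14 + [0] * 6
draws = martingale_posterior_mean(data)
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 2), round(min(draws), 2), round(max(draws), 2))
```

The spread of `draws` plays the role of parameter uncertainty, and no prior or likelihood was written down. The difficulty the abstract points to is that this loop reads the raw data directly, which a federated setting does not allow.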

2605.18552 2026-05-19 cs.LG q-bio.BM q-bio.QM 版本更新

Protein Fold Classification at Scale: Benchmarking and Pretraining

大规模蛋白质折叠分类:基准测试与预训练

Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt

发表机构 * Max Planck Institute of Biochemistry(马克斯·普朗克生物化学研究所) Computer Science Department, University of Stuttgart(斯图加特大学计算机科学系)

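AI summary This paper presents TEDBench, a large-scale, non-redundant benchmark for protein fold classification built from the Encyclopedia of Domains and Foldseek-clustered AlphaFold structures. On top of this benchmark, the authors propose Masked Invariant Autoencoders (MiAE), a framework that learns protein structure representations with a high masking ratio and an SE(3)-invariant encoder and achieves strong performance on TEDBench.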

Comments Accepted at ICML 2026 (spotlight)

详情
AI中文摘要

对蛋白质拓扑进行分类对于解析生物学功能至关重要,但进展受限于缺乏大规模基准和无法扩展的模型。我们引入TEDBench,一个大规模、非冗余的蛋白质折叠分类基准,由Encyclopedia of Domains (TED)和Foldseek-clustered AlphaFold结构构建。我们证明在TEDBench上,当前的蛋白质表示学习方法要么需要非常大的模型,要么无法提供强大的性能。为解决这一挑战,我们提出了Masked Invariant Autoencoders (MiAE),一种自监督的蛋白质结构表示学习框架。MiAE使用高达90%的高掩码率,结合SE(3)-不变编码器和轻量级解码器,从潜在表示和掩码标记中重建骨架坐标。MiAE具有良好的扩展性,并在TEDBench上优于监督方法和最先进的基线,建立了蛋白质折叠分类的强大配方。为了测试超越AlphaFold结构的迁移能力,我们进一步在CATH v4.4的实验结构数据集上进行基准测试。TEDBench可在https://github.com/BorgwardtLab/TEDBench获取。

英文摘要

Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of large-scale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to 90% with an $\mathrm{SE(3)}$-invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH v4.4. TEDBench is available at https://github.com/BorgwardtLab/TEDBench.

2605.18537 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Probing for Representation Manifolds in Superposition

在叠加中探测表示流形

Alexander Modell

发表机构 * Department of Mathematics(数学系)

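AI summary This paper proposes the Manifold Probe, a method for discovering representation manifolds in superposition by learning the space of features that can be linearly predicted from the representations together with the directions that encode them, revealing manifolds that are causally involved in model behaviour.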

Comments 19 pages, 7 figures

详情
AI中文摘要

本文介绍了一个名为Manifold Probe的监督方法,用于在叠加中发现表示流形。该方法通过学习一个概念的特征空间,该空间可以线性预测自表示,然后学习用于编码这些特征的方向。我们展示了该方法在Llama 2-7b中时间与空间的表示上,发现每个案例中都能线性表示可解释的特征集合。在时间案例中,我们展示了通过沿流形引导,可以影响模型对著名歌曲、电影和书籍发布年份的完成,提供了证据表明Manifold Probe能够发现与模型行为因果相关的流形。

英文摘要

This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model's completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.

2605.18535 2026-05-19 cs.LG cs.MA 版本更新

Beyond Scaling: Agents Are Heading to the Edge

超越扩展:智能体正走向边缘

Chunlin Tian, Dongqi Cai, Wanru Zhao, Nicholas D. Lane

发表机构 * University of Cambridge(剑桥大学) University of Macau(澳门大学) Nanjing University(南京大学)

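AI summary This paper argues that the bottleneck in agent technology has shifted from compressing world knowledge into a single model to executing a coordinated system, and that personal-agent architecture must move to the edge to accommodate structural coupling with high-fidelity local context and the need for zero-latency execution loops.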

详情
AI中文摘要

有用智能体智能的瓶颈已从将世界知识压缩到单一模型转变为执行协调系统。本文主张个人智能体架构必须走向边缘,因为智能体任务的核心特性,特别是其与高保真局部环境的结构耦合以及对零延迟执行循环的需求,无法与以云为中心的设计兼容。我们通过三个结构性转变来支持这一主张。首先,前额转变:能力的主要边际杠杆已从预训练规模转移到框架级执行控制。此类控制必须保持与行动环境的物理接近,以确保智能体保持认知一致性。其次,数据地理悖论,智能体数据的“暗物质”(本地文件层次结构、实时传感器流和瞬态操作系统状态)在准备传输到云时会退化、消失或失去意义,从而切断智能体与真实环境上下文的联系。第三,交互对齐循环,唯一经济和生态可持续的智能体细化数据来源是通过实时本地交互产生的高保真隐含偏好信号。我们最后提出可检验的预测,用于个人智能体的下一次部署周期。

英文摘要

The bottleneck of useful agentic intelligence has shifted from compressing world knowledge into a single model to executing a coordinated system. This position paper argues that personal-agent architecture must move to the edge because the core properties of agentic intelligence tasks, particularly their structural coupling with high-fidelity local context and the need for zero-latency execution loops, do not sit well with cloud-centric designs. We develop this claim through three structural shifts. First, the Prefrontal Turn: the main marginal lever of capability has moved from pre-training scale to framework-level executive control. Such control must remain physically close to the environment of action if the agent is to preserve cognitive alignment. Second, the Data-Geography Paradox: the "dark matter" of agentic data (local file hierarchies, real-time sensor streams, and transient OS states) degrades, disappears, or loses meaning once prepared for cloud transmission, thereby cutting the agent off from ground-truth context. Third, the interaction-alignment loop: the only economically and ecologically sustainable source of agentic refinement data is the high-fidelity implicit preference signal produced through real-time local interaction. We conclude with falsifiable predictions for the next deployment cycle of personal agents.

2605.18534 2026-05-19 cs.LG 版本更新

XCTFormer: Leveraging Cross-Channel and Cross-Time Dependencies for Enhanced Time-Series Analysis

XCTFormer: 利用跨通道和跨时间依赖性提升时间序列分析

Israel Zexer, Omri Azencot

发表机构 * The Stein Faculty of Computer and Information Science(施坦计算机与信息科学系) Ben-Gurion University of the Negev(本·古里安大学)

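AI summary This paper proposes XCTFormer, which explicitly captures cross-time and cross-channel dependencies in time series via an enhanced attention mechanism to improve time-series analysis, achieving state-of-the-art results on the imputation task in particular.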

Comments TMLR 2026

详情
AI中文摘要

多变量时间序列分析涉及从多个相互依赖变量的序列中提取信息性表示,支持预测、填补和异常检测等任务。在现实场景中,这些变量通常来自共享上下文或底层现象,表明存在时间与通道间的潜在依赖性,可以利用以提高性能。然而,最近的研究发现,假设无变量间依赖性的通道独立(CI)模型往往优于显式建模此类关系的通道依赖(CD)模型。这一意外结果表明,当前CD模型可能由于依赖性捕捉的限制而未能充分发挥潜力。最近的研究重新审视了通道依赖建模,但这些方法通常采用间接建模策略,可能导致有意义的依赖性被忽视。为了解决这个问题,我们引入了XCTFormer,一种基于Transformer的通道依赖(CD)模型,通过增强的注意力机制显式捕捉跨时间和跨通道依赖性。该模型以token到token的方式操作,建模时间与通道之间每对token之间的成对依赖性。架构包括(i)数据处理模块,(ii)新型的跨关系注意力块(CRAB),以增加容量和表达性,以及(iii)可选的依赖压缩插件(DeCoP),以提高可扩展性。通过在三个时间序列基准上的广泛实验,我们证明XCTFormer在与广泛认可的基线相比时取得了强劲的结果;特别是,在填补任务中,它在MSE和MAE上分别比第二好的方法平均高出20.8%和15.3%。

英文摘要

Multivariate time-series analysis involves extracting informative representations from sequences of multiple interdependent variables, supporting tasks such as forecasting, imputation, and anomaly detection. In real-world scenarios, these variables are typically collected from a shared context or underlying phenomenon, suggesting the presence of latent dependencies across time and channels that can be leveraged to improve performance. However, recent findings show that channel-independent (CI) models, which assume no inter-variable dependencies, often outperform channel-dependent (CD) models that explicitly model such relationships. This surprising result indicates that current CD models may not fully exploit their potential due to limitations in how dependencies are captured. Recent studies have revisited channel dependence modeling with various approaches; however, these methods often employ indirect modeling strategies, which can lead to meaningful dependencies being overlooked. To address this issue, we introduce XCTFormer, a transformer-based channel-dependent (CD) model that explicitly captures cross-temporal and cross-channel dependencies via an enhanced attention mechanism. The model operates in a token-to-token fashion, modeling pairwise dependencies between every pair of tokens across time and channels. The architecture comprises (i) a data processing module, (ii) a novel Cross-Relational Attention Block (CRAB) that increases capacity and expressiveness, and (iii) an optional Dependency Compression Plugin (DeCoP) that improves scalability. Through extensive experiments on three time-series benchmarks, we show that XCTFormer achieves strong results compared to widely recognized baselines; in particular, it attains state-of-the-art performance on the imputation task, outperforming the second-best method by an average of 20.8% in MSE and 15.3% in MAE.

2605.18530 2026-05-19 cs.CL cs.AI cs.LG stat.ML 版本更新

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

连续扩散在语言领域中能与离散扩散竞争性地扩展

Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo, Yongxin Chen, Arash Vahdat, Morteza Mardani, John Thickstun

Affiliations: NVIDIA & Cornell; NVIDIA & Georgia Tech; UW-Madison; MBZUAI-IFM; Cornell

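AI summary This paper studies how continuous diffusion models scale for language modeling. By updating Plaid into RePlaid, it shows that continuous diffusion can compete with discrete models in compute efficiency and performance, and it provides supporting theory.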

详情
AI中文摘要

尽管扩散模型近期在语言建模领域受到广泛关注,但连续扩散模型在扩展性方面似乎不如离散方法。为了挑战这一观点,我们重新审视Plaid,一种基于似然的连续扩散语言模型(DLM),并构建RePlaid,通过将Plaid的架构与现代离散DLMs对齐。在统一的设定下,我们建立了第一个连续DLMs的扩展定律,表明RePlaid的计算差距仅为自回归模型的20倍,使用更少的参数优于Duo,并在过训练范围内优于MDLM。我们将RePlaid与最近的连续DLMs进行基准测试:在OpenWebText上,RePlaid实现了连续DLMs中的新状态-of-the-art PPL界值为22.1,并在生成质量上更优。这些结果表明,当通过似然训练时,连续扩散是与离散DLMs高度竞争且可扩展的替代方案。此外,我们提供了理论见解以理解基于似然训练的优势。我们展示了优化噪声调度以最小化ELBO的方差自然会得到时间上的线性交叉熵(信息损失)。这均匀地分配去噪难度,而无需任何特定时间的重参数化。此外,我们发现通过似然优化嵌入会创建结构化的几何形状并驱动最大的似然增益。

英文摘要

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

2605.18522 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

超越形态学:量化颜色特征在癌症分类中的诊断能力

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi

发表机构 * Dept. of Electrical, Computer and Software Engineering(电气、计算机与软件工程系) Ontario Tech University(安大略技术大学) Dept. of Engineering(工程系) Brock University(布鲁克大学)

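AI summary This paper studies the diagnostic power of color features in cancer classification. Evaluating the discriminative power of global color features with morphological information excluded, it finds that color features reach up to 89% accuracy on binary classification tasks, indicating that color distributions carry a non-random diagnostic signal.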

详情
AI中文摘要

在组织病理学中,人类专家主要依靠颜色增强对比度来解读组织形态,而机器视觉模型则将颜色视为原始统计信息。这一区别提出了一个根本性问题:像素强度本身,独立于结构和形态学线索,能支持多少癌症分类?为了解决这个问题,我们系统评估了全局颜色特征的独立判别力,同时刻意排除所有形态学信息。具体而言,我们提取了统计颜色矩,并对RGB和HSV颜色直方图进行离散化处理,然后在十个不同的实验设置中使用经典机器学习分类器评估其性能。我们的结果表明,在二元诊断任务(例如良性与恶性)中,仅颜色特征即可实现强劲的性能,分类准确率可达到89%。这种性能很可能归因于与恶性相关的全局色度变化。重要的是,这些简单的颜色基表示在很大程度上优于随机基线,表明原始颜色分布编码了非随机且具有诊断意义的信号用于癌症检测。因此,本研究表明,简单的、计算高效的色彩特征可以作为一种有效的预筛选工具。通过识别具有强色度指示恶性特征的样本,这些轻量模型可以作为第一道筛选系统,减少对复杂深度学习架构的计算负担。

英文摘要

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
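The two feature families named above, statistical color moments and discretized color histograms, can be sketched with the standard library alone. The choice of two moments per channel, an 8-bin hue histogram, and the three sample pixels are assumptions made for illustration. The abstract does not specify the paper's exact feature set, and no classifier is shown here.

```python
import colorsys


def color_moments(pixels):
    """Mean and standard deviation of each RGB channel, scaled to [0, 1]."""
    n = len(pixels)
    feats = []
    for channel in range(3):
        vals = [p[channel] / 255.0 for p in pixels]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        feats += [mean, var ** 0.5]
    return feats


def hue_histogram(pixels, bins=8):
    """Normalized histogram of HSV hue; pixel positions are ignored entirely."""
    hist = [0] * bins
    for r, g, b in pixels:
        hue, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(hue * bins), bins - 1)] += 1
    return [count / len(pixels) for count in hist]


def color_features(pixels, bins=8):
    """Concatenate moments and histogram into one morphology-free vector."""
    return color_moments(pixels) + hue_histogram(pixels, bins)


# Made-up pinkish pixels standing in for a stained tissue patch.
patch = [(230, 150, 190), (220, 140, 180), (240, 160, 200)]
features = color_features(patch)
print(len(features), [round(f, 3) for f in features])
```

Because every function works on an unordered list of pixels, shuffling the image leaves the feature vector unchanged, which is what excluding morphological information means in practice. Vectors like these would then be passed to a classical classifier.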

2605.18509 2026-05-19 cs.LG 版本更新

Offline Contextual Bandits in the Presence of New Actions

离线情境老虎机中存在新动作的情况

Ren Kishimoto, Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Yuta Saito

Affiliations: Institute of Science Tokyo; Yale University; Sony Group Corporation; Hanjuku-kaso, Co., Ltd.

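AI summary This paper studies off-policy learning (OPL) when new actions are introduced after the logging policy is deployed, and proposes a new OPL method that uses the Local Combination PseudoInverse (LCPI) estimator and the Policy Optimization for Effective New Actions (PONA) algorithm to learn and select new actions effectively while maintaining overall policy performance.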

Comments 12pages, 7 figures

详情
AI中文摘要

自动化决策算法驱动推荐系统和搜索引擎等应用。这些算法通常依赖于离线情境老虎机或离线学习(OPL)。传统上,OPL选择现有动作集中的动作以最大化预期奖励。然而,在许多现实场景中,动作(如新闻文章或视频内容)会持续变化,且在数据收集后,动作空间会随时间演变。我们定义在部署日志策略后引入的动作为新动作,并专注于包含新动作的OPL。现有OPL方法能有效识别现有动作集中的最优动作,但无法学习和选择新动作,因为没有相关数据被记录。为解决这一限制,我们提出了一种新的OPL方法,利用动作特征。我们首先引入局部组合伪逆(LCPI)估计器用于策略梯度,扩展了最初为离线情境老虎机滑动评估提出的伪逆估计器。LCPI在奖励建模条件和数据收集条件之间控制动作特征的权衡,捕捉不同动作特征维度之间的交互效应。此外,我们提出了一种名为Policy Optimization for Effective New Actions(PONA)的通用算法,将专门用于新动作选择的LCPI组件与在现有动作中学习效果出色的双重稳健(DR)算法结合。我们定义PONA为LCPI和DR估计器的加权和,优化现有和新动作的选择,并允许通过权重参数调整新动作选择的比例。通过广泛的实验,我们证明PONA能够高效地选择新动作,同时保持整体策略性能,相较于大多数现有方法无法选择新动作。

英文摘要

Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an existing action set. However, in many real-world scenarios, actions, such as news articles or video content, change continuously, and the action space evolves over time after data collection. We define actions introduced after deploying the logging policy as new actions and focus on OPL with new actions. Existing OPL methods identify optimal actions from the existing set effectively but cannot learn and select new actions because no relevant data are logged. To address this limitation, we propose a new OPL method that leverages action features. We first introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, generalizing the PseudoInverse estimator initially proposed for off-policy evaluation of slate bandits. LCPI controls the trade-off between reward-modeling condition and the condition for data collection regarding the action features, capturing the interaction effects among different dimensions of action features. Furthermore, we propose a generalized algorithm called Policy Optimization for Effective New Actions (PONA), which integrates LCPI, a component specialized for new action selection, with Doubly Robust (DR), which excels at learning within existing actions. We define PONA as a weighted sum of the LCPI and DR estimators, optimizing both the selection of existing and new actions, and allowing the proportion of new action selections to be adjusted by the weight parameter. Through extensive experiments, we demonstrate that PONA efficiently selects new actions while maintaining the overall policy performance as opposed to most existing methods that cannot select new actions.
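As a rough illustration of the two ingredients mentioned above, the sketch below implements the standard doubly robust (DR) estimate of a policy's value from logged data, plus a weighted combination of two estimates. It is not the paper's method: the paper works with policy-gradient estimators, LCPI is not implemented here, and the convex form of the combination, the function names, and the toy log are all assumptions.

```python
def dr_value(logs, target_policy, q_hat):
    """Doubly robust value estimate from logs of (context, action, reward, logging_prob).

    target_policy(x) returns a dict mapping each action to its probability;
    q_hat(x, a) is a reward model. The importance-weighted residual corrects
    the model's error on the logged action.
    """
    total = 0.0
    for x, a, r, p0 in logs:
        probs = target_policy(x)
        baseline = sum(prob * q_hat(x, b) for b, prob in probs.items())
        weight = probs.get(a, 0.0) / p0
        total += baseline + weight * (r - q_hat(x, a))
    return total / len(logs)


def combine(lcpi_estimate, dr_estimate, weight):
    """Assumed convex combination of two estimate vectors of equal length."""
    return [weight * l + (1.0 - weight) * d
            for l, d in zip(lcpi_estimate, dr_estimate)]


# Toy log: one context, two actions logged uniformly, reward equals the action id.
logs = [(0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5)]
always_one = lambda x: {0: 0.0, 1: 1.0}   # target policy: always play action 1
zero_model = lambda x, a: 0.0             # deliberately poor reward model
estimate = dr_value(logs, always_one, zero_model)
print(estimate)

# Placeholder vectors standing in for an LCPI and a DR gradient estimate.
mixed = combine([1.0, 0.0], [0.0, 1.0], weight=0.25)
print(mixed)
```

Even with a useless reward model, the importance-weighted term recovers the true value of 1.0 on this log. For an action that never appears in the log, that correction term has nothing to work with, which is the gap the abstract says action features and LCPI are meant to fill.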

2605.18508 2026-05-19 cs.LG cs.AI 版本更新

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

DiPRL: 通过架构熵正则化学习离散程序性策略

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) Centrum Wiskunde & Informatica(数学与信息学研究中心)

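AI summary This paper proposes DiPRL, a method that learns interpretable programmatic policies through architecture entropy regularization, to avoid a post-hoc refinement stage and improve policy expressivity and task performance.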

详情
AI中文摘要

程序性强化学习(PRL)通过将策略表示为可读可编辑的程序,为深度强化学习提供了一种可解释的替代方案。尽管基于梯度的方法已被开发用于优化程序的连续松弛,但在将连续松弛转换回离散程序时会显著降低性能。事后离散化会丢弃优化的分支和参数,导致策略表达性崩溃和任务性能下降,从而需要额外的微调。为克服这些限制,我们提出了可微离散程序性强化学习(DiPRL),一种在训练过程中使程序接近离散的方法,避免了单独的事后微调阶段。我们首先分析了基于梯度方法事后离散化引入的性能下降固有风险。然后,我们引入了程序架构熵正则化,这使得训练过程平滑且可微,鼓励收敛到离散程序。DiPRL在保持基于梯度优化效率的同时,减轻了事后离散化的风险。在多个离散和连续RL任务中的实验表明,DiPRL可以通过可解释的程序性策略实现强大的性能。

英文摘要

Programmatic reinforcement learning (PRL) offers an interpretable alternative to deep reinforcement learning by representing policies as human-readable and -editable programs. While gradient-based methods have been developed to optimize continuous relaxations of programs, they face a significant performance drop when converting the continuous relaxations back into discrete programs. Post-hoc discretization can discard optimized branches and parameters in a program, which results in a collapse of policy expressivity and lowered task performance, leading in turn to a need for additional fine-tuning. To overcome these limitations, we propose Differentiable Discrete Programmatic Reinforcement Learning (DiPRL), a method that learns programmatic policies that become nearly discrete during training, avoiding a separate post-hoc fine-tuning stage. We first analyze the inherent risks of performance drop introduced by post-hoc discretization of gradient-based methods. Then, we introduce programmatic architecture entropy regularization, which enables smooth, differentiable training that encourages convergence toward a discrete program. DiPRL maintains the efficiency of gradient-based optimization while mitigating the risks of post-hoc discretization. Our experiments across multiple discrete and continuous RL tasks demonstrate that DiPRL can achieve strong performance via interpretable programmatic policies.
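The idea of pushing a relaxed program toward a discrete one with an entropy penalty can be sketched as follows. This is a generic illustration under stated assumptions, not DiPRL's actual regularizer: each architectural choice is assumed to be a softmax over logits, the penalty is assumed to be the summed Shannon entropy of those choices, and `coeff` and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero exactly when the choice is one-hot."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss, architecture_logits, coeff):
    """Task loss plus coeff times the total entropy of all architecture choices."""
    penalty = sum(entropy(softmax(logits)) for logits in architecture_logits)
    return task_loss + coeff * penalty


# One three-way choice (for example, which branch a program node uses).
soft = [[0.1, 0.0, -0.1]]    # undecided relaxation, entropy near log(3)
sharp = [[8.0, 0.0, -8.0]]   # nearly discrete choice, entropy near 0
print(round(regularized_loss(1.0, soft, 0.5), 4))
print(round(regularized_loss(1.0, sharp, 0.5), 4))
```

At equal task loss the nearly one-hot choice scores lower, so gradient descent on this objective is nudged toward programs that lose little when finally discretized. The penalty stays smooth in the logits, so training remains differentiable.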

2605.18498 2026-05-19 cs.LG cs.AI 版本更新

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES: 一种用于评估大规模MoE模型专家专业化程度的系统性基准和度量套件

Jing Wang, Hongxuan Lu, Jazze Young, Shu Wang, Zhimin Xin

发表机构 * Jing Wang(王静) Hongxuan Lu(卢洪轩) Jazze Young(杨杰兹) Shu Wang(王舒) Zhimin Xin(辛志敏)

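AI summary This paper proposes DBES, a systematic benchmark and metric suite that evaluates expert specialization in MoE models with a multi-domain benchmark and five theoretically grounded metrics, and validates the actionability of these metrics in domain-specific post-training, where they yield significant performance gains.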

详情
AI中文摘要

MoE模型中的专家专业化仍缺乏深入理解,传统评估将架构负载均衡与功能专业化混淆。我们引入DBES,一种综合的诊断框架,结合多领域基准和五个理论基础的度量指标:路由专业化、归一化有效秩、领域隔离、路由刚度分数和n-gram专家度量。关键发现显示不同模型展现出不同的专业化范式:Qwen系列表现出模块化专业化,具有高领域隔离,而DeepSeek和GLM采用分布式协作。然而,我们强调专业化是诊断维度,必要但不充分用于下游性能。最重要的是,干预证据验证了这些度量指标的可操作性:通过使用DBES在领域特定后训练中识别高专业化专家路径,我们仅使用15%的原始训练资源,在专业化领域实现了66%至94.48%的性能提升,证明这些诊断工具可以转化为具体的优化算子。本文提供了首个系统性的方法,用于独立于准确度指标评估专家专业化,为下一代MoE系统的设计和后训练优化提供了关键见解。

英文摘要

Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Our key findings reveal distinct specialization paradigms across models: Qwen-series models exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved improvements of 66% to 94.48% in specialized domains with only 15% of the original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems.

2605.18483 2026-05-19 cs.LG cs.AI 版本更新

Modality vs. Morphology: A Framework for Time Series Classification for Biological Signals

模态与形态:生物信号时间序列分类的框架

Jordan Tschida, Matthew Yohe, Edward Kane, Gavin Jager, Emma J. Reid, Tony G. Allen, Mark Story, Leanne Thompson, Joe Hoskins, Brandon Schreiber, Stan Seiferth, Scott Dolvin, David Cornett

发表机构 * UT-Battelle, LLC(UT-巴特勒公司) Oak Ridge National Laboratory(橡树岭国家实验室)

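AI summary This paper proposes a unified morphology-modality framework that, by analyzing the morphological structure of biological signals, shows how morphology shapes model design and performance, stresses its importance for preprocessing and modeling strategies, and identifies future directions including morphological data augmentation and improved evaluation metrics.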

详情
AI中文摘要

生物信号时间序列分类(TSC)已从手工制作的模态特定方法发展为能够表示底层生理过程多样波形结构的深度架构(即形态)。本文综述介绍了一种统一的形态-模态框架,将波形结构与方法论设计连接起来,揭示了尖峰、爆发、振荡、慢漂移和层次节奏如何影响模型设计。通过分析脑电图、肌电图、心电图、脉搏波描记图以及眼动模态(电眼图、瞳孔测量、眼动追踪),本文展示了形态如何决定预处理和建模策略。整合这些生物信号的证据,该框架揭示形态而非模型类别最强烈地决定了性能和可解释性。这提供了深度模型在诱导偏见与底层波形动态一致时为何成功的原因。本文还识别了未来的工作,包括形态数据增强和评估指标改进以提高泛化能力。这些见解将形态意识建模定位为开发跨生物信号通用、可解释和生理意义的TSC模型的统一原则。

英文摘要

Time series classification (TSC) of biological signals has progressed from handcrafted, modality-specific approaches to deep architectures capable of representing the diverse waveform structures of underlying physiological processes (i.e., morphology). This review introduces a unified morphology-modality framework that connects waveform structure to methodological design, revealing how spikes, bursts, oscillations, slow drift, and hierarchical rhythms inform model design. By analyzing electroencephalography, electromyography, electrocardiography, photoplethysmography, and ocular modalities (electrooculography, pupillometry, eye-tracking), the review demonstrates how morphology determines preprocessing and modeling strategies. Integrating evidence across these biological signals, the framework reveals that morphology, not model class, most strongly determines performance and interpretability. This provides insight into why deep models succeed when their inductive biases align with underlying waveform dynamics. This review also identifies future work including morphological data augmentation and evaluation metrics to improve generalization. Together, these insights position morphology-aware modeling as a unifying principle for developing generalizable, interpretable, and physiologically meaningful TSC models across biological signals.

2605.18476 2026-05-19 stat.CO cs.AI cs.LG 版本更新

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

AI4BayesCode: 从自然语言描述到经过验证的模块化状态性贝叶斯采样器

Jungang Zou, Alex Ziyu Jiang, Qixuan Chen

发表机构 * Department of Biostatistics, Columbia University(哥伦比亚大学生物统计学系)

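AI summary This study presents AI4BayesCode, a system that generates runnable, validated MCMC samplers from natural-language descriptions, using a modular design and a recursively stateful coding paradigm to improve the reliability and extensibility of Bayesian model implementation.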

详情
AI中文摘要

编码和计算仍然是马尔可夫链蒙特卡洛(MCMC)工作流程中的主要瓶颈,尤其是在现代采样算法日益复杂的情况下,现有的概率编程系统在模型支持、扩展性和可组合性方面仍然有限。我们介绍了AI4BayesCode,这是一个可扩展的LLM驱动系统,能够将自然语言的贝叶斯模型描述转换为可运行且经过验证的MCMC采样器。为了提高可靠性,AI4BayesCode采用模块化设计,将模型分解为模块化采样块,并将每个块映射到内置的采样组件,从而减少从头实现复杂采样算法的需要。通过预生成模型规范的验证和后生成采样器代码的验证进一步提高了可靠性。AI4BayesCode还引入了一种新的递归状态性编码范式,使模块化采样组件(可能由不同贡献者开发)能够在更大的MCMC过程中协同一致地组成。我们开发了一个基准测试套件来评估AI4BayesCode的采样器生成能力。实验表明,AI4BayesCode能够仅通过自然语言描述实现广泛的贝叶斯模型。作为一项开放系统,其能力可以随着底层AI代理的改进和新增内置块的添加而继续扩展。

英文摘要

Coding and computation remain major bottlenecks in Markov chain Monte Carlo (MCMC) workflows, especially as modern sampling algorithms have become increasingly complex and existing probabilistic programming systems remain limited in model support, extensibility, and composability. We introduce AI4BayesCode, an extensible LLM-driven system that translates natural-language Bayesian model descriptions into runnable, validated MCMC samplers. To improve reliability, AI4BayesCode adopts a modular design that decomposes models into modular sampling blocks and maps each block to a built-in sampling component, reducing the need to implement complex sampling algorithms from scratch. Reliability is further improved through pre-generation validation of model specifications and post-generation validation of generated sampler code. AI4BayesCode also introduces a novel recursively stateful coding paradigm for MCMC, allowing modular sampling components, potentially developed by different contributors, to be composed coherently within larger MCMC procedures. We develop a benchmark suite to evaluate AI4BayesCode for sampler-generation. Experiments show that AI4BayesCode can implement a wide range of Bayesian models from natural-language descriptions alone. As an open-ended system, its capability can continue to expand with improvements in the underlying AI agent and the addition of new built-in blocks.

2605.18475 2026-05-19 cs.LG cs.AI 版本更新

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

GAMMA:在任意预算下为混合精度模型进行全局位分配

Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han

发表机构 * Beihang University(北航) Tsinghua University(清华) ByteDance Inc(字节跳动)

AI总结 本文提出GAMMA框架,在后训练流水线内学习模块级精度偏好,在增广拉格朗日约束下优化教师强制的隐藏状态重建目标,并利用整数规划得到精确满足预算的位分配,从而在任意预算下优于固定精度基线和基于搜索的混合精度方法。

详情
AI中文摘要

混合精度量化通过将更多位分配给敏感模块,改善了大语言模型(LLMs)的预算-精度权衡。然而,在LLM规模上自动化这种分配面临一组独特的约束:可学习方法需要量化感知训练,这对十亿参数级模型不可行;免训练的替代方案依赖静态代理指标,无法捕捉跨模块交互,并且必须针对每个目标预算重新计算;基于搜索的方法成本高昂,且无法保证精确满足预算。我们提出GAMMA,一种与量化器无关的框架,完全在后训练流水线内学习模块级精度偏好。GAMMA在增广拉格朗日约束下优化教师强制的隐藏状态重建目标,并通过整数规划将学习到的偏好投影为精确满足预算的离散分配。其关键性质是分数重用:由于学习到的偏好编码的是稳定的敏感性排序而非特定于预算的权重,单次训练即可服务于任意部署目标,只需重新求解整数规划,从而将每个预算的适配时间从数小时缩短到几分钟。在Llama和Qwen模型(8B-32B)上,GAMMA优于固定精度基线(最高+12.99 Avg.)和基于搜索的混合精度方法(最高+7.00 Avg.),并能以2.5位平均精度达到固定3位的质量,从而以显著更小的内存占用实现部署。

英文摘要

Mixed-precision quantization improves the budget--accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B--32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
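
The budget-projection step described in the abstract (turning per-module preferences into a discrete assignment that respects a bit budget) can be illustrated with a small dynamic program. This is only a sketch under stated assumptions: it is a multiple-choice knapsack solved by dynamic programming, not GAMMA's integer program, and the function name `allocate_bits`, the `costs` table, and the toy module sizes are all hypothetical.

```python
import math

def allocate_bits(costs, sizes, bit_options, budget):
    """Choose one bit-width per module to minimize total cost within a bit budget.

    costs[i][j]  : hypothetical sensitivity cost of module i at bit_options[j]
    sizes[i]     : module size in arbitrary integer units (e.g. parameter blocks)
    budget       : maximum allowed sum of sizes[i] * chosen_bits[i]
    """
    inf = math.inf
    # best[b] = minimal cost over processed modules using exactly b bits in total
    best = [inf] * (budget + 1)
    best[0] = 0.0
    choices = []  # choices[i][b] = option index picked for module i when ending at b bits
    for i, size in enumerate(sizes):
        new_best = [inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == inf:
                continue
            for j, bits in enumerate(bit_options):
                nb = b + size * bits
                if nb <= budget and best[b] + costs[i][j] < new_best[nb]:
                    new_best[nb] = best[b] + costs[i][j]
                    pick[nb] = j
        best = new_best
        choices.append(pick)
    end = min(range(budget + 1), key=lambda b: best[b])
    if best[end] == inf:
        raise ValueError("budget too small for any assignment")
    # Walk back through the recorded choices to recover the assignment.
    assignment = []
    b = end
    for i in range(len(sizes) - 1, -1, -1):
        j = choices[i][b]
        assignment.append(bit_options[j])
        b -= sizes[i] * bit_options[j]
    assignment.reverse()
    return assignment, best[end]

# Toy example: three modules, an average budget of 3 bits over 4 size units.
toy_costs = [[5.0, 2.0, 1.0], [1.0, 0.8, 0.7], [9.0, 3.0, 1.0]]
bits, total_cost = allocate_bits(toy_costs, sizes=[1, 1, 2], bit_options=[2, 3, 4], budget=12)
```

Because the cost table is independent of the budget, changing `budget` only requires re-running the solver, which mirrors the score-reuse idea in spirit.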

2605.18472 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Flowing with Confidence

流中自信

Friso de Kruiff, Dario Coscia, Max Welling, Erik Bekkers

发表机构 * CuspAI AMLab, University of Amsterdam(阿姆斯特丹大学AMLab) mathLab, SISSA(SISSA数学实验室)

AI总结 本文提出带置信度的流匹配(FMwC)方法:在选定层注入依赖输入的乘性噪声,以闭式形式将其方差传播过网络并沿ODE轨迹积分,从而以标准采样成本获得每个样本的置信度评分,可用于提升图像质量和晶体热力学稳定性的过滤、轨迹编辑以及自适应步长。

详情
AI中文摘要

生成模型产生无意义文本、不真实图像和不稳定材料的速度,超过了模拟或人工审查所能消化的速度;没有每个样本的置信度,信任就会流失。现有的补救方案需要运行$k$个集成模型或随机轨迹,消耗$k$倍的计算量,而且衡量的是模型之间的差异,而非模型自身的置信度。我们提出带置信度的流匹配(Flow Matching with Confidence,FMwC)。FMwC在选定的层注入依赖输入的乘性噪声,以闭式形式将其方差传播过网络,并沿ODE轨迹积分,从而以标准采样成本获得每个样本的置信度评分。该评分支持多种用途:过滤可以提高图像质量和晶体的热力学稳定性;编辑可以将轨迹回退到模型做出决定的时刻并重新引导;自适应步长将ODE计算集中在流不明确的地方。我们发现置信度评分与学习到的速度场的散度大小相关,这为理解生成过程提供了一个窗口,并为针对关键时刻的精细引导、新的采样算法以及生成模型的可解释性开辟了道路。

英文摘要

Generative models can produce nonsensical text, unrealistic images, and unstable materials faster than simulation or human review can absorb; without per-sample confidence, trust erodes. Existing fixes run $k$ ensembles or stochastic trajectories at $k\times$ compute, measuring variability between models, not model confidence. We propose Flow Matching with Confidence (FMwC). FMwC injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost. The score supports multiple uses: filtering improves image quality and thermodynamic stability of crystals; editing rewinds trajectories to the points where the model commits and redirects them; and adaptive stepping concentrates ODE compute where the flow is ambiguous. We find that the confidence score correlates with the magnitude of the divergence of the learned velocity field, which gives us a window to understand the generative process, opening up surgical forms of guidance that target the moments that matter, new sampling algorithms and interpretability of generative models.

2605.18460 2026-05-19 cs.AI cs.LG cs.NE 版本更新

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

当萤火虫聚类;通过重心引导萤火虫优化增强自动聚类

MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin, Jaakko Peltonen

发表机构 * Department of Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura(斯里贾亚瓦德纳普拉大学应用科学学院计算机科学系) Faculty of Engineering and Natural Sciences, Tampere University(坦佩雷大学工程与自然科学学院) Department of Mathematics and Statistics, University of Turku(图尔库大学数学与统计学系)

AI总结 本文提出了一种改进的萤火虫算法用于数据聚类,解决了传统方法如K均值在处理非均匀聚类形状、密度以及需要预先定义聚类数的局限性。该算法引入了重心移动策略和多目标适应度函数,平衡了紧凑性、分离性和新的TSP基于的导航惩罚。它能够自动估计最佳聚类数并动态调整聚类边界。在机器人传感器网络中的应用展示了其实际价值,实验表明其聚类质量优于K均值,且减少集群内路径距离。这些结果证实了该算法在复杂空间聚类任务中的鲁棒性,未来可能扩展到更高维和适应性场景。

Comments 34 pages, 19 Figures

详情
AI中文摘要

本文提出了一种新的萤火虫算法变体用于数据聚类,以解决传统方法如K均值在处理非均匀聚类形状、密度以及需要预先定义聚类数的局限性。所提出的算法引入了重心移动策略和多目标适应度函数,该函数平衡了紧凑性、分离性和一个新的基于TSP的导航惩罚。该算法能够自动估计最佳聚类数并动态调整聚类边界。在机器人传感器网络中的应用展示了其实际价值,实验表明其聚类质量优于K均值,且减少集群内路径距离。这些结果证实了该算法在复杂空间聚类任务中的鲁棒性,具有未来扩展到更高维和适应性场景的潜力。

英文摘要

This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm's robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.

2605.18459 2026-05-19 cs.LG stat.ML 版本更新

Adaptive Experimentation for Censored Survival Outcomes

面向删失生存结果的自适应实验

Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Emil Javurek, Stefan Feuerriegel

发表机构 * LMU Munich(慕尼黑大学) Munich Center of Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 本文提出了一种新的自适应实验框架,用于在右删失情况下估计因果效应:通过推导平均生存效应曲线的半参数效率界,得到闭式的效率最优分配策略,并通过数值实验展示了相对于均匀随机化和忽略删失的基线的一致效率提升。

详情
AI中文摘要

自适应实验能够高效估计因果效应,但现有方法并非为带删失的生存数据而设计,这类数据中事件时间仅被部分观测到(例如癌症试验中存在受试者退出时的总生存期)。本文开发了一种新的自适应实验框架,用于在右删失情况下估计因果效应。为此,我们推导了平均生存效应曲线关于治疗分配策略的半参数效率界,从而得到闭式的效率最优分配策略。该策略优先考虑事件动态和删失动态都带来高不确定性的患者分层,从而将经典的Neyman分配推广到生存分析场景。在此基础上,我们提出了自适应生存估计器(ASE),这是一种学习分配策略并序贯估计平均生存效应曲线的自适应框架。我们的框架有三个主要优势:(i)可以使用任意机器学习模型进行冗余参数(nuisance)估计;(ii)由闭式的效率最优分配策略引导;(iii)具有强理论保证,包括通过鞅中心极限定理得到的渐近正态性。我们通过多种数值实验展示了该框架相对于均匀随机化和忽略删失的基线的一致效率提升。

英文摘要

Adaptive experimentation enables efficient estimation of causal effects, but existing methods are not designed for survival data with censoring, where event times are only partially observed (e.g., overall survival in cancer trials but with dropout). In this paper, we develop a novel framework for adaptive experimentation to estimate causal effects under right censoring. For this, we derive the semiparametric efficiency bound for the average survival effect curve as a function of the treatment allocation policy and thereby obtain a closed-form efficiency-optimal allocation policy. The policy generalizes classical Neyman allocation to survival settings by prioritizing patient strata where both event and censoring dynamics induce high uncertainty. Building on this, we propose the Adaptive Survival Estimator (ASE), an adaptive framework that learns the allocation policy and estimates the average survival effect curve sequentially. Our framework has three main benefits: (i) it accommodates arbitrary machine learning models for nuisance estimation; (ii) it is guided by a closed-form efficiency-optimal allocation policy; and (iii) it admits strong theoretical guarantees, including asymptotic normality via a martingale central limit theorem. We demonstrate our framework across various numerical experiments to show consistent efficiency gains over uniform randomization and censoring-agnostic baselines.
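
For orientation, the classical Neyman allocation that the abstract says the paper generalizes can be written in a few lines. The sketch below covers only the textbook two-arm, uncensored case, where the treatment probability is proportional to the outcome standard deviation in each arm. It is not the paper's survival-specific policy, and the function names are hypothetical.

```python
def neyman_allocation(sd_treated: float, sd_control: float) -> float:
    """Treatment probability that minimizes the variance of a difference in means."""
    if sd_treated < 0 or sd_control < 0:
        raise ValueError("standard deviations must be non-negative")
    total = sd_treated + sd_control
    if total == 0:
        return 0.5  # no noise in either arm: any split is equally good
    return sd_treated / total

def difference_in_means_variance(p: float, sd_treated: float, sd_control: float, n: int) -> float:
    """Variance of the difference-in-means estimator when a fraction p of n units is treated."""
    return sd_treated ** 2 / (n * p) + sd_control ** 2 / (n * (1 - p))

# The noisier arm receives more units than under a 50/50 split.
p_star = neyman_allocation(3.0, 1.0)
var_neyman = difference_in_means_variance(p_star, 3.0, 1.0, n=100)
var_uniform = difference_in_means_variance(0.5, 3.0, 1.0, n=100)
```

In an adaptive design the standard deviations are unknown and would be re-estimated as data arrive; the sketch takes them as given.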

2605.18454 2026-05-19 cs.LG cs.AI cs.SC 版本更新

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

能说话的调度:一种可解释的程序化强化学习框架

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

发表机构 * Eindhoven University of Technology, Eindhoven, the Netherlands Centrum Wiskunde & Informatica, Amsterdam, the Netherlands

AI总结 本文提出了一种可解释的程序化强化学习框架ProRL,通过人类可读且可编辑的程序化策略实现高效调度,解决了传统深度强化学习在透明性和计算效率方面的不足。

详情
AI中文摘要

深度强化学习(DRL)最近涌现出作为求解组合优化问题(如作业车间调度)的有希望的方法。然而,DRL学习的策略通常由深度神经网络(DNNs)表示,其不透明的神经架构和不可解释的策略决策可能引起人类决策者的关键信任和可用性问题。此外,DNNs的计算需求还会进一步阻碍在资源受限环境中实际部署。在本工作中,我们提出ProRL,一种新颖的可解释程序化强化学习框架,能够通过人类可读且可编辑的程序化策略实现高性能调度(即程序)。我们首先介绍了一种用于调度的领域特定语言(DSL-S)来表示调度策略为结构化程序。ProRL然后通过局部搜索探索由DSL-S定义的程序空间,以识别不完整的程序,这些程序随后通过贝叶斯优化学习其参数。ProRL学习选择哪种调度启发式规则,因此它自然地整合了已在工业场景中使用的现有启发式方法。在广泛使用的基准实例上的实验表明,ProRL在现有启发式方法和DRL基线方面表现出色。此外,ProRL在强约束计算资源下表现良好,例如仅使用100个episode进行训练。我们的代码可在https://github.com/HcPlu/ProRL上获得。

英文摘要

Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at https://github.com/HcPlu/ProRL.

2605.18449 2026-05-19 cs.LG cs.AI 版本更新

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

用强化学习建模客户轨迹以获得实际零售洞察

Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所)

AI总结 本文提出了一种基于智能体的建模框架,将客户轨迹预测转化为最大熵强化学习问题,以更准确地反映具有有限理性的客户行为,从而提供更精确的冲动购买率和货架交通密度估计。

Comments Proceeding of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
AI中文摘要

理解零售空间内客户移动对于优化商店布局至关重要。现实世界轨迹数据可以提供高度准确的洞察,但收集起来成本高昂且对许多零售商来说难以实现。启发式方法如旅行商问题(TSP)和概率最近邻(PNN)常被用作廉价的近似方法,但实际客户轨迹与最短路径的偏差平均为28%,突显了准确性和实用性之间的权衡。我们提出了一种基于智能体的建模框架,将客户轨迹预测视为最大熵强化学习(RL)问题,通过平衡奖励最大化与随机性来更好地反映具有有限理性的客户。使用现实世界便利商店的轨迹数据,我们证明RL生成的轨迹比TSP和PNN更接近客户行为,提供了更准确的冲动购买率和货架交通密度估计。此外,只有基于RL的预测能够为冲动产品提供与实际轨迹数据一致的重新定位决策,从而产生可比的估计利润增长。我们的工作表明,RL提供了一种实用且基于行为的替代方法,弥合了过于简化的启发式方法和数据密集型方法之间的差距,使准确的布局优化更具可及性。为了鼓励进一步研究,源代码可在GitHub上获得。

英文摘要

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

2605.18437 2026-05-19 cs.LG cs.DC 版本更新

Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach

车载边缘计算中的异构任务卸载:一种联邦元深度强化学习方法

Yaorong Huang, Jingtao Luo, Xuechao Wang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Chengdu Neusoft University(成都东软学院)

AI总结 本文提出了一种联邦元深度强化学习框架FedMAGS,用于解决车载边缘计算中异构任务卸载问题,通过图注意力网络捕捉DAG依赖关系,序列到序列策略生成结构化卸载决策,并利用联邦元学习实现跨分布式MEC服务器的快速适应。

详情
AI中文摘要

车载边缘计算(VEC)通过将计算密集型任务卸载到附近的边缘服务器,使延迟敏感的车载应用成为可能。然而,现实中的车载工作负载通常被建模为具有复杂依赖结构的异构有向无环图(DAG)任务,这使得联合卸载和资源分配极具挑战性。此外,分布式MEC部署在协同训练基于学习的策略时会引发隐私问题。本文提出了一种联邦元深度强化学习框架,结合GAT-Seq2Seq建模(FedMAGS),用于车载边缘计算系统中的异构任务卸载。所提出的方法利用图注意力网络捕捉DAG依赖关系,基于序列到序列的策略生成结构化卸载决策,并利用联邦元学习实现跨分布式MEC服务器的快速适应,而无需共享原始数据。大量模拟表明,FedMAGS在收敛速度、执行延迟和可扩展性方面均优于现有最先进的基线方法。此外,联邦设计在保护数据隐私的同时减少了通信开销,使该框架非常适合动态和大规模的VEC环境。

英文摘要

Vehicular edge computing (VEC) enables latency-sensitive vehicular applications by offloading computation-intensive tasks to nearby edge servers. However, real-world vehicular workloads are typically modeled as heterogeneous directed acyclic graph (DAG) tasks with complex dependency structures, making joint offloading and resource allocation highly challenging. Moreover, distributed MEC deployment raises privacy concerns when collaboratively training learning-based policies. In this paper, we propose a Federated Meta Deep Reinforcement Learning framework with GAT-Seq2Seq modeling (FedMAGS) for heterogeneous task offloading in VEC systems. The proposed approach leverages Graph Attention Networks to capture DAG dependencies, a Seq2Seq-based policy to generate structured offloading decisions, and federated meta-learning to enable fast adaptation across distributed MEC servers without sharing raw data. Extensive simulations demonstrate that FedMAGS achieves faster convergence, lower execution delay, and better scalability compared with state-of-the-art baselines. In addition, the federated design preserves data privacy while reducing communication overhead, making the framework well suited for dynamic and large-scale VEC environments.

2605.18430 2026-05-19 cs.LG 版本更新

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

Text2CAD-Bench: 一个用于基于LLM的文本到参数化CAD生成的基准

Liang Wang, Heng Meng, Zekai Xiang, Jin Liu, Pingyi Zhou, Litao Chen, Yongqiang Tang

发表机构 * School of Computer Science, Wuhan University, Wuhan 430000, Hubei, China(武汉大学计算机科学学院) Spatial Design Intelligence Lab, BitInf Ltd., Shanghai 200003, China(BitInf Ltd.空间设计智能实验室) College of Computer and Information Engineering, Nanjing Tech University, Nanjing 211800, Jiangsu, China(南京工业大学计算机与信息工程学院) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China(中国科学院自动化研究所多模态人工智能系统国家重点实验室)

AI总结 本文提出Text2CAD-Bench,首个系统评估文本到CAD在几何复杂度和应用多样性方面的基准,发现当前模型在基本几何上表现良好,但在复杂拓扑和高级功能上表现下降。

详情
AI中文摘要

文本到CAD生成旨在从自然语言创建参数化CAD模型,使快速原型设计和直观设计流程成为可能。然而,现有基准主要关注基本原始体和简单的草图-拉伸序列,缺乏现实应用中必需的高级功能,并仅涵盖传统机械部件。我们引入Text2CAD-Bench,首个系统评估文本到CAD在几何复杂度和应用多样性方面的基准。我们的基准包含600个由人类整理的例子,涵盖四个层次:L1-L2涵盖基本几何和标准特征,L3引入复杂拓扑和自由曲面,L4扩展到机械部件之外的现实领域。每个示例配对双风格提示--几何描述模仿非专家用户,以及程序序列对齐专家级规范。评估主流通用LLM和领域特定模型,发现当前模型在基本几何上表现良好,但在复杂拓扑和高级功能上表现下降。我们发布此基准以推动文本到CAD研究的发展。

英文摘要

Text-to-CAD generation aims to create parametric CAD models from natural language, enabling rapid prototyping and intuitive design workflows. However, existing benchmarks focus on basic primitives and simple sketch-extrude sequences, lacking advanced features essential for real-world applications and covering only traditional mechanical parts. We introduce Text2CAD-Bench, the first benchmark systematically evaluating text-to-CAD across geometric complexity and application diversity. Our benchmark comprises 600 human-curated examples spanning four levels: L1-L2 cover fundamental geometry with standard features, L3 introduces complex topology and freeform surfaces, and L4 extends to real-world domains beyond mechanical parts. Each example pairs dual-style prompts -- geometric descriptions mimicking non-expert users, and procedural sequences aligned with expert-level conventions. Evaluating mainstream general LLMs and domain-specific models, we find that current models perform reasonably on basic geometry but degrade substantially on complex topology and advanced features. We release our benchmark to drive progress in text-to-CAD research.

2605.18425 2026-05-19 cs.LG math.ST stat.TH 版本更新

Generative Adversarial Learning from Deterministic Processes

从确定性过程生成对抗学习

Joris C. Kühl, Hanno Gottschalk

发表机构 * Institute of Mathematics, Technical University of Berlin(柏林技术大学数学研究所)

AI总结 本文研究了生成对抗网络在非独立同分布数据中的成功应用,证明了通过无限维生成对抗学习模型可以从单个确定性时间序列中学习混沌动力系统不变分布,并给出了收敛速率。

Comments 37 pages, 3 figures

详情
AI中文摘要

物理人工智能正被成功应用于不遵循传统独立同分布(i.i.d.)样本范式的数据。事实上,物理人工智能常常在完全非随机的数据上进行训练,这些数据来源于湍流等混沌动力系统。我们旨在以生成对抗网络(GANs)为例来解释这些方法在实证上的成功,GANs在i.i.d.假设下的统计学习理论已被较好地理解。我们证明,使用生成对抗学习(GAL)的无限维模型,可以从系统状态或其测量值的单条确定性演化的时间序列中学习足够混沌的动力系统的不变分布,并以Jensen-Shannon散度给出收敛到解的显式速率。

英文摘要

Physical AI is being successfully applied to data which does not follow the traditional paradigm of independent and identically distributed (i.i.d.) samples. In fact, physical AI is often trained on data which is not random at all, and is instead derived from chaotic dynamical systems like turbulence. We aim to explain the empirical success of these methods using the example of generative adversarial networks (GANs), whose statistical learning theory under the i.i.d. assumption is generally well understood. We prove that it is possible, using an infinite-dimensional model of generative adversarial learning (GAL), to learn the invariant distribution of a sufficiently chaotic dynamical system from a single deterministically evolving time series of its states or measurements thereof, and give explicit rates for the convergence to the solution in terms of the Jensen-Shannon divergence.
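
The convergence rates in the abstract are stated in terms of the Jensen-Shannon divergence. As a reminder of what that quantity measures, here is a minimal sketch for discrete distributions using natural logarithms. The paper works with distributions of dynamical systems, so this finite version is only an illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: the average KL divergence to the mixture m = (p + q) / 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Identical distributions give 0; disjoint supports give the maximum value, log 2.
same = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the KL divergence, this quantity is symmetric and bounded, which is one reason it is a common yardstick for GAN-style objectives.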

2605.18422 2026-05-19 stat.ML cs.LG math.ST stat.TH 版本更新

Generalized Functional ANOVA in Closed-Form: A Unified View of Additive Explanations

广义函数ANOVA的闭式表达:加性解释的统一视角

Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes

发表机构 * EDF R&D, SINCLAIR Lab(EDF研究院,SINCLAIR实验室) Université de Toulouse, ANITI(图卢兹大学,ANITI) Université de Toulouse Sorbonne Université(图卢兹大学,索邦大学) Universidad Medellin(麦德林大学) INRIA Regalia(INRIA Regalia团队)

AI总结 本文提出了一种闭式表达的广义函数ANOVA方法,提供了一种统一的加性解释框架,能够处理依赖输入情况下的模型预测分解问题。

Comments 34 pages, 23 Figures, 101 equations, 8 Tables

详情
AI中文摘要

函数ANOVA,或Hoeffding分解,提供了一个原理性的框架用于可解释性,通过将模型预测分解为主效应和高阶交互作用。对于独立输入,这种经典分解是显式的。它与SHAP值、广义加性模型和正交多项式展开密切相关,因此构成了加性可解释性的重要工具。然而,在更一般和现实的依赖设置中,获得可处理的表示并从数据中估计分解仍然具有挑战性。在本文中,我们针对连续输入解决了这个问题。通过结合Hilbert空间方法与广义函数ANOVA,我们构建了一个显式的Riesz基分解,使得分解计算变得容易。我们的方法恢复了经典独立情况及其相关的正交分解。基于此表示,我们提出了一种简单但强大的算法,能够在模型无关的设置下从数据样本中估计分解,并通过与几种最先进的解释方法进行实证比较,展示了该方法的威力。

英文摘要

The functional ANOVA, or Hoeffding decomposition, provides a principled framework for interpretability by decomposing a model prediction into main effects and higher-order interactions. For independent inputs, this classical decomposition is explicit. It is closely connected to SHAP values, generalized additive models, and orthogonal polynomial expansions, and therefore constitutes a fundamental tool for additive explainability. In the more general and realistic dependent setting, however, obtaining a tractable representation and estimating the decomposition from data remain challenging. In this work, we address this problem for continuous inputs. By combining Hilbert space methods with the generalized functional ANOVA, we build an explicit Riesz basis for the decomposition, which makes the decomposition easy to compute. Our formulation recovers the classical independent case and its associated orthogonal decomposition. Building on this representation, we propose a simple but mighty algorithm to estimate the decomposition from a data sample in a model-agnostic setting and we compare it empirically with several state-of-the-art explanation methods, demonstrating the power of the approach.

2605.18387 2026-05-19 cs.LG cs.AI 版本更新

Graph Hierarchical Recurrence for Long-Range Generalization

图层次递归用于长距离泛化

Stefano Carotti, Marco Pacini, Alessio Gravina, Davide Bacciu, Bruno Lepri, Sebastiano Bontorin

发表机构 * Department of Computer Science, University of Trento(特伦托大学计算机科学系) Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Department of Computer Science, University of Pisa(比萨大学计算机科学系)

AI总结 本文提出了一种名为图层次递归(GHR)的新框架,通过在输入图和通过池化获得的层次抽象上联合操作,解决了图神经网络和图转换器在长距离相关性捕捉任务中的限制,并在多个长距离基准测试中表现出色,参数效率高。

详情
AI中文摘要

图神经网络(GNNs)和图Transformer(GTs)已成为图学习的基本范式,结合了深度模型的表示学习能力与归纳偏置带来的样本效率。尽管其有效性已得到广泛认可,但大量研究表明,这些模型在需要捕捉图中远距离区域之间相关性的任务中仍面临根本性限制。为了解决这一问题,我们引入了图层次递归(GHR),这是一种新的框架,同时在输入图和通过池化获得的层次抽象上进行操作。我们还表明,现有模型的局限性在超范围泛化中更加明显,即测试实例涉及的交互距离比训练时观察到的更长。相比之下,尽管设计简单,GHR具有三个关键优势:在长距离依赖上表现强劲、超范围泛化能力更好、参数效率高。为了验证这些结论,我们表明在一系列长距离基准测试中,GHR始终优于现有图模型,而所用参数量最少仅为当前最先进模型的1%。这些结果为当前通过扩大架构规模来获得图基础模型的趋势提供了一个互补方向,表明仅增加模型容量可能不足以实现泛化。

英文摘要

Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.

2605.18383 2026-05-19 cs.LG 版本更新

TabH2O: A Unified Foundation Model for Tabular Prediction

TabH2O:用于表格预测的统一基础模型

Pascal Pfeiffer, Dmitry Gordeev, Mathias Müller, Laura Fink, Joan Salvà Soler, Mark Landry, Branden Murray, Marcos V. Conde, Sri Satish Ambati

发表机构 * H2O.ai

AI总结 本文提出TabH2O,一种统一的基础模型,通过上下文学习在单次前向传递中实现分类和回归。该模型基于TabICL架构进行了关键改进,包括统一训练、单阶段预训练和噪声感知预训练,从而在表格数据预测任务中表现出色。

Comments Technical Report - https://tabh2o.h2oai.com/

详情
AI中文摘要

我们提出了TabH2O,一种用于表格数据的基础模型,该模型通过上下文学习在单次前向传递中实现分类和回归。TabH2O基于TabICL架构进行了若干关键改进:(1) 统一训练,一个模型通过双头架构同时处理分类和回归,消除了对单独模型的需要,从而降低了总预训练成本;(2) 单阶段预训练,通过训练稳定性改进(有界可扩展softmax、阶段间归一化、可学习残差缩放、logit软上限)消除了多阶段课程学习的需要,使模型能够从一开始就使用完整长度序列进行训练;(3) 噪声感知预训练,合成数据集包含显式噪声维度以教导模型对无关特征具有鲁棒性。我们在TALENT基准(300个数据集)上评估了TabH2O v1(29.2M参数),其中它在6种评估方法中的平均排名为2.55,优于调优的CatBoost(4.07)、H2O AutoML(4.18)和LightGBM(5.08),与TabPFN v2.6(2.74)竞争,但落后于TabICL v2(2.12),并在分类和回归任务中81%的测试数据集上位列前三名。

英文摘要

We present TabH2O, a foundation model for tabular data that performs classification and regression in a single forward pass via in-context learning. TabH2O builds on the TabICL architecture with several key modifications: (1) unified training, a single model handles both classification and regression via a dual-head architecture, eliminating the need for separate models and reducing total pretraining cost; (2) single-stage pretraining, training stability improvements (bounded scalable softmax, inter-stage normalization, learnable residual scaling, logit soft-capping) eliminate the need for multi-stage curriculum learning, enabling training with full-length sequences from the start; and (3) noise-aware pretraining, synthetic datasets include explicit noise dimensions to teach the model robustness to irrelevant features. We evaluate TabH2O v1 (29.2M parameters) on the TALENT benchmark (300 datasets), where it achieves an average rank of 2.55 out of 6 evaluated methods, outperforming tuned CatBoost (4.07), H2O AutoML (4.18), and LightGBM (5.08), competitive with TabPFN v2.6 (2.74), and behind TabICL v2 (2.12), while placing in the top-3 on 81% of the testing datasets across classification and regression tasks.

2605.18381 2026-05-19 cs.LG 版本更新

Generating Physically Consistent Molecules with Energy-Based Models

用基于能量的模型生成物理一致的分子

Christoph Griesbacher, Lea Bogensperger, Andreas Habring, Thomas Pock

发表机构 * Graz University of Technology(格拉茨技术大学) University of Zurich(苏黎世大学)

AI总结 本文提出了一种基于能量模型(EBM)的方法EBMol,用于生成三维分子,通过学习原子可加的标量势能恢复了能量归纳偏差,从而在QM9和GEOM-Drugs数据集上实现了最先进的性能,并展示了学习的能量景观作为质量度量用于配置排序和过滤,以及通过形状引导采样实现可控生成。

详情
AI中文摘要

处于平衡状态的分子服从玻尔兹曼分布,这使得底层的能量景观成为一个有物理依据的建模目标。然而,这样的能量景观难以从数据中学习,而且一旦学到也难以采样。扩散模型和流匹配模型通过学习噪声与数据之间的时间条件分数或传输场来规避这些困难,以失去能量归纳偏置为代价换取更易处理的训练目标。我们引入EBMol,一种基于能量的模型(EBM),它通过学习原子可加的标量势来恢复这种归纳偏置,且训练过程中无需显式模拟。我们的方法采用受流启发的恢复场匹配(Restoring Field Matching)目标来近似能量景观。我们采用Mirror-Langevin算法进行采样,从而可以统一更新原子位置和原子类型,并引入并行回火(parallel tempering)以在推理时扩展计算量。EBMol是首个在QM9和GEOM-Drugs上达到最先进性能的三维分子生成EBM。此外,我们表明学习到的能量景观可以作为一种有原则的质量度量,用于对构型进行排序和过滤,并通过势能组合实现的形状引导采样以及零样本连接子(linker)设计,展示了无需重新训练的可控生成。

英文摘要

Molecules in equilibrium follow a Boltzmann distribution, making the underlying energy landscape a physically grounded modeling objective. However, such landscapes are difficult to learn from data and, once learned, hard to sample from. Diffusion and flow-matching models sidestep these difficulties by learning a time-conditional score or transport field between noise and data, losing the energy inductive bias in exchange for a more tractable training objective. We introduce EBMol, an energy-based model (EBM) that restores this inductive bias by learning an atom-additive scalar potential without explicit simulation during training. Our method employs a flow-inspired Restoring Field Matching objective to approximate the energy landscape. We adopt the Mirror-Langevin algorithm for sampling, enabling unified updates of atomic positions and types, and incorporate parallel tempering for inference-time compute scaling. EBMol is the first EBM for 3D molecular generation to achieve state-of-the-art performance on QM9 and GEOM-Drugs. Moreover, we show that the learned energy landscape serves as a principled quality metric for ranking and filtering configurations, and demonstrate controllable generation without retraining through shape-steered sampling via potential composition and zero-shot linker design.

2605.18379 2026-05-19 cs.LG 版本更新

Beyond Square Roots: Explicit Memory-Efficient Factorization for Multi-Epoch Private Learning

超越平方根:多轮差分隐私学习的显式内存高效分解

Nikita P. Kalinin, Aki Rehn, Joel Daniel Andersson, Antti Honkela, Christoph H. Lampert

发表机构 * Institute of Science and Technology Austria(奥地利科学与技术研究所) University of Helsinki(赫尔辛基大学)

AI总结 本文提出了一种统一的分解方法γ-BIFR,用于多轮差分隐私学习,该方法在低内存和低带宽情况下显著提升了RMSE、放大RMSE和隐私训练性能,同时提供了更紧的理论保证。

详情
AI中文摘要

相关噪声机制是提高差分隐私模型训练效用最具前景的方法之一,但严格的保证需要显式、可分析的分解,而实际部署需要内存效率。最近的研究开发了带状逆分解,通过利用相关矩阵的带状结构来同时满足这两个要求。带宽控制用于在迭代之间相关噪声的噪声缓冲区大小,从而控制效用和内存成本之间的权衡。现有分解强调这种权衡:DP-λCGD通过仅使用一个步骤的噪声缓冲区实现了高内存效率,但限制了其效用增益,而带状逆平方根(BISR)分解利用更大的相关窗口,在大带宽下渐近最优,但在低带宽下表现不佳。我们提出γ-BIFR,是这两种分解的统一泛化。在低内存、低带宽情况下,γ-BIFR显著提高了RMSE、放大RMSE和隐私训练性能,同时为多轮参与误差提供了更紧的理论保证。

英文摘要

Correlated-noise mechanisms are among the most promising approaches for improving the utility of differentially private model training, but rigorous guarantees require explicit, analyzable factorizations, and practical deployment requires memory efficiency. Recent works have developed banded inverse factorizations, which address both requirements by exploiting a banded structure in the correlation matrix. The bandwidth controls the size of the noise buffer used to correlate noise across iterations, and thus governs the tradeoff between utility and memory cost. Existing factorizations highlight this tradeoff: DP-$λ$CGD achieves high memory efficiency by using only a one-step noise buffer, but this limits its utility gains, while the banded inverse square root (BISR) factorization exploits larger correlation windows and is asymptotically optimal for large bandwidths but performs poorly at low bandwidths. We propose $γ$-BIFR, a unified generalization of both factorizations. In the low-memory, low-bandwidth regime, $γ$-BIFR significantly improves RMSE, amplified RMSE, and private training performance, while yielding tighter theoretical guarantees for multi-participation error in multi-epoch training.

2605.18374 2026-05-19 cs.LG cs.AI 版本更新

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

超越推理时间搜索:强化学习合成可重用求解器

Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Maryland(马里兰大学)

AI总结 本文探讨了强化学习能否将组合优化的推理成本转移到代码LLM的权重中,从而合成可重用的求解器。通过Synergistic Dependency Selection问题,研究发现强化学习能有效生成约束感知的模拟退火模板,并在多个领域展示出更高的效率和鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)通常将组合优化视为推理时间的过程,通过采样、搜索或重复提示单独解决每个实例。我们询问强化学习是否可以将部分推理成本转移到代码LLM的权重中,从而让模型为整个问题家族合成可重用的求解器。我们研究了Synergistic Dependency Selection(SDS),一种受约束的二次背包问题的受控变体,旨在暴露特定的失败模式:局部信号和严格可行性约束使贪心启发式方法具有吸引力但不可靠。在相同的框架下,Best-of-64基础模型采样在接近全局虚拟最佳求解器(VBS)的28.7%差距处饱和;代码审计显示基础模型经常检索模拟退火模板但错误实现Metropolis接受规则。我们使用可行性门控奖励和轻量结构框架对Qwen2.5-Coder-14B-Instruct进行微调,使用组相对策略优化(GRPO)。所得到的策略在99.8%的可行SDS输出中收敛到一个约束感知的模拟退火模板,达到VBS的5.0%差距,并且在生成后执行/搜索成本方面比累积Best-of-64评估便宜91倍。一次编译检查显示,每个种子的最优冻结求解器在SDS测试集上重复使用时仍然高度竞争,而额外领域评估在作业调度问题上提供了更窄但积极的证据,表明框架可以超越SDS。负消融揭示了这种配方的局限性:标准稳定器会降低性能,软可行性门控失败,结果仍对奖励归一化和领域特定设计选择敏感。

英文摘要

Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
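The abstract reports that base-model solvers often misimplement the Metropolis acceptance rule. As a reference point, the following is a minimal sketch of the textbook rule for a minimization objective. The function name `metropolis_accept` and its parameters are illustrative and are not taken from the paper's code; for a maximization objective the sign of `delta` would be flipped.

```python
import math
import random


def metropolis_accept(delta, temperature, rng=random):
    """Textbook Metropolis rule for minimization.

    delta is (candidate cost - current cost). Improving or equal moves are
    always accepted; worsening moves are accepted with probability
    exp(-delta / temperature).
    """
    if delta <= 0:
        return True
    if temperature <= 0:
        return False  # zero temperature degenerates to pure descent
    return rng.random() < math.exp(-delta / temperature)
```

A common bug is to compare against `exp(delta / temperature)` (wrong sign), which accepts every worsening move and turns the search into a random walk.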

2605.18373 2026-05-19 cs.RO cs.LG math.DS math.OC 版本更新

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

动态机器人布料折叠与高效的Koopman算子基于模型预测控制

Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco, Carme Torras

Affiliations: Istituto Italiano di Tecnologia; Institut de Robòtica i Informàtica Industrial; MaLGa Center, DIBRIS, Università degli Studi di Genova

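AI summary: This paper proposes a Koopman operator-based model predictive control method for quickly generating cloth-folding trajectories, combining physics-based simulation with efficient kernel-based Koopman operator regression to improve the efficiency and accuracy of folding tasks.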

Comments Accepted for presentation at the 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

机器人布料折叠是一项具有挑战性的任务,尤其是在动态折叠任务中,需要通过快速运动利用布料的动力学特性进行折叠。当受到这种快速运动的影响时,布料动力学的复杂性会阻碍系统识别和折叠轨迹的规划,导致在使用物理布料模型时仿真到现实的转移困难。与人类在折叠任务中表现出的灵活性相比,机器人通常使用小而刚性的衣物,要么太慢,要么太快但不精确,需要多次尝试才能获得相对良好的折叠效果。在本文中,我们通过生成快速折叠轨迹来解决这些问题,采用了一种新的模型预测控制器,结合基于物理的布料动力学仿真和高效的核基Koopman算子回归。Koopman算子回归是一种日益流行的机器学习技术,用于非线性系统识别,用于获得被折叠布料的线性模型。此类代理模型,通过高保真的物理布料仿真器的数据进行训练,可以用于合适的模型预测控制算法中,替代昂贵的非线性模型,以高效地生成由机器人执行的折叠轨迹。在模拟和真实机器人实验中,我们展示了Koopman算子基于模型提供的线性化如何能够有效地生成未见过的姿势的快速折叠轨迹,而不牺牲折叠的准确性。

英文摘要

Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.

2605.18354 2026-05-19 cs.LG 版本更新

Decoupled Conformal Optimisation: Efficient Prediction Sets via Independent Tuning and Calibration

解耦的符合优化:通过独立调优和校准实现高效的预测集

Fanyi Wu, Lihua Niu, Samuel Kaski, Michele Caprio

Affiliations: Department of Computer Science, University of Manchester; UKRI AI Centre for Doctoral Training in Decision Making for Complex Systems; Department of Computer Science, Aalto University; ELLIS Institute, Finland

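AI summary: This paper proposes Decoupled Conformal Optimisation (DCO), which separates tuning from calibration so that the same held-out data need not serve both purposes, reducing prediction-set size or interval width while retaining finite-sample marginal conformal coverage.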

Comments 33 pages, 6 figures, accepted by ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情
AI中文摘要

贝叶斯符合优化方法通常使用相同的一组验证数据来搜索高效的预测集并验证覆盖或风险。这种耦合对于高概率风险控制保证是自然的,但在目标是标准有限样本边际符合覆盖时并非必要。我们提出了解耦的符合优化(DCO),一种训练-调优-校准的设计原则,使用独立的调优拆分进行以效率为导向的结构选择,使用新鲜的校准拆分进行最终的符合分位数。在调优结构的条件下,标准拆分-符合交换性为任何候选类提供有限样本边际覆盖,无需置信参数或多测试校正。DCO因此针对与PAC式方法不同的有限样本保证:边际符合覆盖而不是高概率风险控制。在耦合风险界的一致性假设下,这两种方法最终会收敛到相同的总体阈值。在分类和回归基准上,包括ImageNet-A、CIFAR-100、糖尿病、加州住房和混凝土,DCO紧密跟踪名义覆盖水平,同时通常比PAC式校准减少平均预测集大小或区间宽度。例如,在ImageNet-A上,平均集大小从26.52减少到25.26,95百分位数集大小从58.95减少到53.73;在糖尿病上,平均区间宽度从2.098减少到1.914。

英文摘要

Bayesian conformal optimisation methods often use the same held-out data both to search for efficient prediction sets and to certify coverage or risk. This coupling is natural for high-probability risk-control guarantees, but it is not necessary when the target is standard finite-sample marginal conformal coverage. We propose Decoupled Conformal Optimisation (DCO), a train-tune-calibrate design principle that uses an independent tuning split for efficiency-oriented structural selection and a fresh calibration split for the final conformal quantile. Conditional on the tuned structure, standard split-conformal exchangeability yields finite-sample marginal coverage for any candidate class, without a confidence parameter or multiple-testing correction. DCO therefore targets a different finite-sample guarantee from PAC-style methods: marginal conformal coverage rather than high-probability risk control. Under consistency assumptions on the coupled risk bound, the two approaches nevertheless converge to the same population threshold. Across classification and regression benchmarks, including ImageNet-A, CIFAR-100, Diabetes, California Housing, and Concrete, DCO tracks the nominal coverage level closely while often reducing average prediction-set size or interval width relative to PAC-style calibration. On ImageNet-A, for example, the average set size decreases from $26.52$ to $25.26$ and the 95th-percentile set size from $58.95$ to $53.73$; on Diabetes, the average interval width decreases from $2.098$ to $1.914$.
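The calibration step that DCO relies on is the standard split-conformal quantile. The sketch below shows that step only, applied to scores from a fresh calibration split. The tuning stage and the choice of score function are out of scope, and the helper names `conformal_quantile` and `prediction_interval` are hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Return the split-conformal threshold at miscoverage level alpha.

    With n calibration scores, the threshold is the k-th smallest score,
    where k = ceil((n + 1) * (1 - alpha)). If k > n, the threshold is
    infinite, i.e. the prediction set must contain everything.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_interval(point_prediction, threshold):
    """Interval implied by the absolute-residual score |y - prediction|."""
    return (point_prediction - threshold, point_prediction + threshold)
```

Because the threshold is computed on data that played no part in selecting the structure, no confidence parameter or multiple-testing correction enters this step.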

2605.18345 2026-05-19 quant-ph cs.LG 版本更新

Hybrid Quantum-Classical Neural Architecture Search

混合量子-经典神经网络架构搜索

Alberto Marchisio, Muhammad Kashif, Nouhaila Innan, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD); Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute

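AI summary: This paper studies the foundations of hybrid quantum-classical neural architecture search, discusses how NAS extends to quantum and hybrid settings, and presents FLOPs-aware search as an important direction for building efficient, deployable HQNNs.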

详情
AI中文摘要

混合量子-经典神经网络(HQNNs)正成为噪声中等规模量子(NISQ)时代量子机器学习的实用方法,因为它们结合了经典学习组件和参数化量子电路在一个端到端可训练的框架中。然而,其性能和效率高度依赖于架构选择,如数据编码、电路结构、测量设计以及经典和量子模块之间的耦合。这使得手动设计变得越来越困难,尤其是在考虑硬件限制和资源约束时。本文研究了HQNNs和神经架构搜索(NAS)的基础,讨论了NAS如何扩展到量子和混合设置,并展示了FLOPs感知搜索(其中FLOPs作为计算复杂性的代理)作为构建不仅准确而且计算高效且可实际部署的HQNN的重要方向。

英文摘要

Hybrid quantum-classical neural networks (HQNNs) are emerging as a practical approach for quantum machine learning in the noisy intermediate-scale quantum (NISQ) era, as they combine classical learning components with parameterized quantum circuits in an end-to-end trainable framework. However, their performance and efficiency depend strongly on architectural choices such as data encoding, circuit structure, measurement design, and the coupling between classical and quantum modules. This makes manual design increasingly difficult, especially when hardware limitations and resource constraints must also be taken into account. In this paper, we study the foundations of HQNNs and neural architecture search (NAS), discuss how NAS extends to quantum and hybrid settings, and demonstrate FLOPs-aware search (where FLOPs serve as a proxy for computational complexity) as an important hardware-aware direction for building HQNNs that are not only accurate but also computationally efficient and practically deployable.

2605.18338 2026-05-19 stat.AP cs.LG 版本更新

Robust Player-Conditional Champion Ranking for League of Legends: Style Similarity, Mastery Priors, and Archetype-Constrained Discovery

《英雄联盟中稳健的玩家条件冠军排名:风格相似性、熟练度先验知识和范式约束发现》

Min Heo, Pranav Kadiyam, Prasun Panthi

Affiliations: Wabash College; Arizona State University

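AI summary: This paper proposes a robust, player-conditional champion ranking method for League of Legends that combines style similarity, mastery priors, and archetype constraints to address champion recommendation.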

Comments 11 pages, 3 figures

详情
AI中文摘要

在多人在线战斗竞技场游戏中,冠军推荐通常被非正式地视为元游戏强度、个人舒适度或全局胜率的问题。我们正式将《英雄联盟》中的冠军推荐建模为一个可解释的、玩家条件的排名问题,该问题在稀疏、嘈杂和非平稳的行为数据下进行。所提出的框架结合了四个信息源:人口强度代理、玩家风格相似性、直接和间接熟练度先验知识以及范式级的保护措施。该方法使用稳健的中位数/MAD标准化、对数转换用于偏斜事件计数、近期加权的玩家风格向量、熟练度加权的冠军池向量、加权余弦相似度、排名缩放的得分组件以及k-means++聚类用于粗略的范式支持。实现原型使用Python/Pandas建模层、Supabase支持的存储以及面向网页的推荐接口。与黑箱监督胜利预测系统不同,所提出的方法返回分解的推荐评分,可以作为预期性能代理、拟合、熟练度和范式兼容性的检查。包含一个单人案例研究,针对玩家标识符DIVINERAINRACCON的100场比赛历史进行端到端的合理性检查。因此,本文是一项方法和系统贡献:它指定了一个可重复、模块化和可审计的冠军推荐器,并通过时间训练-测试分割、下一冠军恢复、校准分析和消融研究提供了未来大规模评估的验证协议。

英文摘要

Champion recommendation in multiplayer online battle arena games is usually framed informally as a problem of metagame strength, personal comfort, or global win rate. We formalize champion recommendation in League of Legends as an interpretable, player-conditional ranking problem under sparse, noisy, and non-stationary behavioral data. The proposed framework combines four information sources: a population-strength proxy, player-style similarity, direct and indirect mastery priors, and archetype-level guardrails. The method uses robust median/MAD normalization, logarithmic transforms for skewed event counts, recency-weighted player style vectors, mastery-weighted champion-pool vectors, weighted cosine similarity, rank-scaled score components, and k-means++ clustering for coarse archetype support. The implemented prototype uses a Python/Pandas modeling layer, Supabase-backed storage, and a web-facing recommendation interface. Unlike black-box supervised win-prediction systems, the proposed method returns decomposed recommendation scores that can be inspected as expected-performance proxy, fit, mastery, and archetype compatibility. A single-player case study on a 100-game history for the player identifier DIVINERAINRACCON is included as an end-to-end sanity check. The manuscript is therefore a methods and systems contribution: it specifies a reproducible, modular, and auditable champion recommender and gives a validation protocol for future large-scale evaluation through temporal train-test splits, next-champion recovery, calibration analysis, and ablation studies.
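Two of the building blocks named in the abstract, median/MAD normalization and weighted cosine similarity, can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The function names, the handling of the zero-MAD case, and the absence of a consistency constant are assumptions, not details from the paper.

```python
import math
import statistics


def robust_z(values):
    """Median/MAD standardization: (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case (assumed handling): no spread around the median.
        return [0.0 for _ in values]
    return [(v - med) / mad for v in values]


def weighted_cosine(a, b, weights):
    """Cosine similarity under a per-feature non-negative weighting."""
    num = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return num / (norm_a * norm_b)
```

Unlike mean and standard deviation, the median and MAD are barely moved by a single extreme game, which is the reason robust scaling suits sparse, noisy match histories.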

2605.18333 2026-05-19 quant-ph cs.LG 版本更新

QLIF-CAST: Quantum Leaky-Integrate-and-Fire for Time-Series Weather Forecasting

QLIF-CAST:用于时间序列天气预报的量子泄漏积分-放电神经网络

Alberto Marchisio, Aayan Ebrahim, Nouhaila Innan, Muhammad Kashif, Muhammad Shafique

Affiliations: eBrain Lab, Division of Engineering, New York University Abu Dhabi; Center for Quantum and Topological Systems, NYUAD Research Institute, New York University Abu Dhabi

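AI summary: This paper proposes QLIF-CAST, which applies a Quantum Leaky Integrate-and-Fire spiking neural network to short-term multivariate weather forecasting; its quantum neuronal dynamics lower prediction error, and the model strikes a favorable balance between training time and accuracy.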

详情
AI中文摘要

准确且高效的时序预测仍然是经典和量子神经架构在多变量环境设置中的挑战性问题。本文将量子泄漏积分-放电(QLIF)脉冲神经网络适应于时序回归任务,特别是短期多变量天气预报。我们扩展了QLIF的应用范围,证明其适用于连续值预测问题。QLIF-CAST模型将神经元激发状态编码为单量子比特的量子叠加态,由Rx旋转门和T1弛豫衰减驱动,并嵌入在混合量子-经典递归架构中。我们进行了两项不同的评估。首先,与参数匹配的经典LIF基线在多变量天气数据集上的受控比较显示,QLIF-CAST在MSE和MAE上分别降低了15.4%和4.4%,证明量子神经动态在预测误差上优于经典等效模型。其次,在空气质量与风速基准上与最先进的量子LSTM(QLSTM)和量子神经网络(QNN)模型的跨领域比较显示,QLIF-CAST在训练时间内最多减少了94%,在速度-误差权衡空间中占据独特位置。在IBM Marrakesh(156量子比特QPU)上的硬件验证确认了电路执行的可靠性,仅存在1.2%的平均偏差。

英文摘要

Accurate and efficient time-series forecasting remains a challenging problem for both classical and quantum neural architectures, particularly in multivariate environmental settings. This work adapts the Quantum Leaky Integrate-and-Fire (QLIF) spiking neural network for time-series regression tasks, specifically short-term multivariate weather forecasting. We extend QLIF beyond classification and demonstrate its applicability to continuous-valued prediction problems. The QLIF-CAST model encodes neuron excitation states as single-qubit quantum superpositions, driven by Rx rotation gates and T1 relaxation decay, and is embedded within a hybrid quantum-classical recurrent architecture. We conduct two distinct evaluations. First, a controlled comparison against a parameter-matched classical LIF baseline on a multivariate weather dataset shows that QLIF-CAST achieves 15.4% lower MSE and 4.4% lower MAE, demonstrating that quantum neuronal dynamics reduce prediction error over classical equivalents. Second, a cross-domain comparative analysis with state-of-the-art quantum LSTM (QLSTM) and quantum neural network (QNN) models on air quality and wind speed benchmarks reveals that QLIF-CAST converges in up to 94% less training time, occupying a distinct position in the speed-error trade-off space. Hardware verification on IBM Marrakesh (156-qubit QPU) confirms reliable circuit execution with only 1.2% average deviation from simulation.

2605.18331 2026-05-19 cs.LG 版本更新

Prune, Update and Trim: Robust Structured Pruning for Large Language Models

剪枝、更新与裁剪:大型语言模型的鲁棒结构剪枝

Diego Coello de Portugal Mecke, Tom Hanika, Lars Schmidt-Thieme

Affiliations: ISMLL & DARC VWFS University of Hildesheim; ISMLL University of Hildesheim

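AI summary: This paper proposes Putri, which improves post-training pruning of large language models by updating the unpruned weights, pruning FFN layers sequentially, and removing individual attention heads, enabling efficient pruning at extreme sparsity ratios.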

详情
AI中文摘要

大型语言模型(LLMs)近年来经历了显著的增长和开发。然而,进行LLMs的推理仍然成本高昂,尤其是在长上下文推理或资源受限的设备上。这促使开发新的后训练剪枝(PTP)方法。这些方法通过移除模型参数的大量部分来降低LLMs的要求。被丢弃的权重根据其对模型性能的影响进行选择。当前的PTP方法通过移除FFN层中信息较少的隐藏节点和最不重要的注意力层来剪枝模型。我们提出Putri,一种PTP方法,引入了三个改进:首先,更新未剪枝的FFN权重以补偿引入的剪枝误差;其次,按顺序剪枝FFN层,考虑之前层的更新;第三,而不是移除完整的注意力层,我们移除单个注意力头。我们扩展了这种方法,使其能够处理分组查询注意力。总之,Putri是一种保持简单但表现卓越的结构剪枝方法。在多个模型上进行剪枝实验,涵盖广泛的稀疏率范围和不同的数据集,验证了Putri的通用性。值得注意的是,我们证明,与以前的方法不同,Putri可以在极端稀疏率下剪枝LLMs。代码可在:https://github.com/Coello-dev/Putri 获取。

英文摘要

Large Language Models (LLMs) have experienced significant growth and development in recent years. However, performing inference on LLMs remains costly, especially for long-context inference or on resource-constrained devices. This motivates the development of new post-training pruning (PTP) methods. These methods reduce LLMs' requirements by removing a substantial part of the model's parameters. The discarded weights are selected depending on their impact on the model's performance. Current PTP methods prune models by removing the less informative hidden nodes from the FFN layers and the least important attention layers. We propose Putri, a PTP method that introduces three changes to the state of the art. First, we update the unpruned weights of the FFN to compensate for the introduced pruning error. Second, the FFN layers are pruned sequentially, taking into account the updates made to the previous layers. Third, instead of removing full attention layers, we remove individual attention heads. We extend this method so that it can also handle Grouped-Query Attention. In summary, Putri is a structured pruning method that remains simple while showing SOTA performance. Pruning experiments on multiple models, across a wide range of sparsity ratios and on different datasets, validate the generality of Putri. Notably, we demonstrate that, unlike previous methods, Putri can prune LLMs at extreme sparsity ratios. The code is available at: https://github.com/Coello-dev/Putri.

2605.18320 2026-05-19 cs.LG cs.AI 版本更新

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

ISEP: 通过随机策略优化实现离线强化学习的隐式支持扩展

Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji

Affiliations: The Chinese University of Hong Kong, Shenzhen Longgang

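AI summary: This paper proposes ISEP, which achieves implicit support expansion in offline reinforcement learning through stochastic policy optimization, addressing the difficulty conventional methods have in discovering optimal behaviors under strict safety constraints. Its core contribution is to make policy improvement navigable through value-function interpolation and a stochastic action selection strategy.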

详情
AI中文摘要

离线强化学习方法通常强制严格的约束以确保安全;然而这种刚性往往阻止了在行为策略即时支持之外发现最优行为。为了解决这个问题,我们提出了通过随机策略优化实现的隐式支持扩展(ISEP),该方法利用在分布数据和策略样本之间插值的价值函数,以隐式方式扩展可行动作支持。这种机制“密集化”高奖励区域,为策略改进创建可导航路径,同时在理论上保证价值误差的有界性。然而,优化此扩展支持会创建多模态景观,标准确定性平均会导致模式崩溃和无效动作。ISEP通过随机动作选择策略缓解了这一问题,通过随机交替保守克隆和乐观扩展信号来优化策略。我们通过使用条件流匹配利用分类器免费引导,将此框架实例化为ISEP-FM,以有效捕捉插值的价值信号。

英文摘要

Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM, using Conditional Flow Matching with classifier-free guidance to effectively capture the interpolated value signal.

2605.18319 2026-05-19 cs.LG cs.DM math.AG math.CO 版本更新

The Symmetries of Three-Layer ReLU Networks

三层ReLU网络的对称性

Johanna Marie Gegenfurtner, Moritz Grillo, Guido Montúfar

Affiliations: Technical University of Denmark; Max Planck Institute for Mathematics in the Sciences; UCLA and MPI MiS

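AI summary: This paper develops a framework for analyzing parameter symmetries of ReLU networks, gives a complete characterization of the generic parameter fibers of three-layer bottleneck architectures, and presents a polynomial-time algorithm for deciding whether two parameters are functionally equivalent.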

详情
AI中文摘要

我们开发了一个分析深度ReLU网络参数对称性的框架,并获得了三层层状架构通用参数纤维的完整刻画。我们的方法为这些纤维提供了显式的半代数描述,并给出了一个多项式时间算法来决定两个参数的功能等价性。这些对称性包括来自层组合的离散和连续变换,并取决于更深的层是否隐藏或保留先前层的几何结构。最后,我们证明了一些这些对称性在梯度流中诱导局部守恒定律,而其他则不。

英文摘要

We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers and yields a polynomial-time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Finally, we show that some of these symmetries induce local conservation laws along gradient flow, while others do not.
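For readers new to parameter symmetries, the two classic examples in ReLU networks are permuting hidden units (discrete) and rescaling a unit's incoming weights by c > 0 while dividing its outgoing weight by c (continuous). The sketch below checks both on a tiny one-hidden-layer network with scalar input and output. It illustrates only these well-known symmetries, not the paper's characterization for three-layer bottleneck architectures, and all names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def forward(x, w_in, b, w_out):
    """Compute sum_j w_out[j] * relu(w_in[j] * x + b[j])."""
    return sum(v * relu(w * x + bj) for w, bj, v in zip(w_in, b, w_out))


def rescale_unit(w_in, b, w_out, j, c):
    """Scale unit j's incoming weight and bias by c > 0, outgoing weight by 1/c.

    Since relu(c * z) == c * relu(z) for c > 0, the network function is unchanged.
    """
    w_in, b, w_out = list(w_in), list(b), list(w_out)
    w_in[j] *= c
    b[j] *= c
    w_out[j] /= c
    return w_in, b, w_out


def permute_units(w_in, b, w_out, order):
    """Reorder hidden units; the sum over units does not depend on the order."""
    return ([w_in[i] for i in order],
            [b[i] for i in order],
            [w_out[i] for i in order])
```

Two parameter vectors related by such transformations lie in the same fiber, that is, they define the same function.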

2605.18316 2026-05-19 cs.LG cs.GR 版本更新

Dynamic Elliptical Graph Factor Models via Riemannian Optimization with Geodesic Temporal Regularization

通过黎曼优化与测地时间正则化进行动态椭圆图因子模型

Chuansen Peng, Xiaojing Shen

Affiliations: School of Mathematics, Sichuan University

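AI summary: This paper proposes Degfm, a dynamic estimation method on the Grassmann manifold that combines a low-rank-plus-diagonal structure with an elliptical graph factor model to address temporal coherence and respect for Riemannian geometry in time-varying graph inference, and validates it on synthetic and real-world datasets.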

详情
AI中文摘要

从高维节点观测推断时间变化的图结构是神经科学、金融、气候学等领域中的基本问题。该问题有两个内在挑战:在连续观测窗口中保持潜在图的时空一致性,以及尊重对称正定流形的内在黎曼几何,这是一个曲面,其测地结构与欧几里得空间根本不同。本文提出了一种在Grassmann流形上进行动态估计的方法(Degfm),这是一种新颖的算法,共同解决这两个挑战。我们将时间变化的精度矩阵序列建模为低秩加对角结构,由潜在的椭圆图因子模型所驱动,这大大减少了有效参数数量,并在具有挑战性的小样本情况下实现了可靠的估计。通过在Grassmann流形上定义黎曼测地惩罚,强制执行时间一致性,确保估计的图轨迹在内在几何上而非环境欧几里得空间上是平滑的。为了解决由此产生的非凸优化问题,我们推导出一个高效的黎曼梯度下降算法,该算法在每次迭代中都尊重流形结构,并严格建立了其收敛到平稳点的收敛性。在合成基准和真实世界数据集上的广泛实验表明,Degfm在所有评估指标上都优于最先进的基线方法,证实了所提框架的实用性。

英文摘要

Inferring time-varying graph structures from high-dimensional nodal observations is a fundamental problem arising in neuroscience, finance, climatology, and beyond. Two intrinsic challenges govern this problem: maintaining the temporal coherence of the latent graph across successive observation windows, and respecting the intrinsic Riemannian geometry of the symmetric positive definite manifold on which precision matrices naturally reside, a curved space whose geodesic structure departs fundamentally from that of the ambient Euclidean space. In this paper we propose dynamic estimation on the Grassmann manifold with a factor model (Degfm), a novel algorithm that jointly addresses both challenges. We model the time-varying precision matrix sequence as a low-rank-plus-diagonal (LRaD) structure governed by a latent elliptical graph factor model, which drastically reduces the effective parameter count and enables reliable estimation in the challenging small-sample regime. Temporal coherence is enforced through a Riemannian geodesic penalty defined on the Grassmann manifold, ensuring that the estimated graph trajectory is smooth with respect to the intrinsic geometry rather than the ambient Euclidean space. To solve the resulting non-convex optimization problem over Grassmann-manifold-valued sequences subject to the LRaD constraint, we derive an efficient Riemannian gradient descent algorithm that respects the manifold structure at every iterate and rigorously establish its convergence to a stationary point. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that Degfm consistently outperforms state-of-the-art baselines across all evaluation metrics, confirming the practical effectiveness of the proposed framework.

2605.18309 2026-05-19 cs.LG cs.AI 版本更新

Alignment Dynamics in LLM Fine-Tuning

在LLM微调中的对齐动力学

Yuhan Huang, Huanran Chen, Yinpeng Dong

Affiliations: Shanghai Qi Zhi Institute & University of Tokyo; College of AI, Tsinghua University

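AI summary: This paper studies the dynamics of alignment during LLM fine-tuning. It introduces a tractable alignment score and derives its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Decomposing the alignment update into two competing components, a Rebound Force and a Driving Force, explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens that reversal. The framework also predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the Driving Force upon re-exposure, leading to faster re-alignment.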

详情
AI中文摘要

尽管大型语言模型(LLMs)通过监督微调和人类反馈强化学习实现了强大的对齐,但在后续微调中对齐往往容易崩溃。现有的解释要么将对齐脆弱性归因于梯度几何,要么将其描述为模型输出的分布转移,但很少有研究能提供一个统一的框架,将参数空间的学习动态与函数空间的对齐行为联系起来。在本文中,我们引入了一个可计算的对齐评分,并推导了其在微调过程中的闭式更新公式,从而建立了对齐动态的统一框架。我们的分析将对齐更新分解为两个竞争成分:一种由当前对齐状态和模型分布狭窄性共同决定的“反弹力”,以及一种由训练分布与条件后验对齐和非对齐完成的后验对齐程度决定的“驱动力”。这种分解解释了为何先前的对齐可能被后续微调逆转,以及为何更狭窄的后验结构会增强这种逆转。此外,我们的框架预测了“复习强化效应”:先前的对齐会在重新暴露时留下潜在的后验印记,从而增强驱动力,导致更快的重新对齐。我们通过安全对齐、新兴不一致和情感设置验证了这些预测,展示了在重新暴露下一致的对齐逆转和加速的重新对齐。此外,安全对齐的受控实验确认了预测的反弹强度与后验狭窄性之间的依赖关系。这些结果共同提供了一个统一的动态视角,说明在LLM微调过程中对齐是如何被破坏和重新激活的。

英文摘要

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.

2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO 版本更新

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer: 通过端口-哈密顿生成动力学构建一个物理驱动的世界模型

Xueyu Luan, Chenwei Shi

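AI summary: This paper proposes PH-Dreamer, a physics-driven world model based on the Port-Hamiltonian framework. Three synergistic mechanisms improve world models built on recurrent state-space architectures, giving more compact, physically structured representations and higher internal-simulator fidelity while reducing latent phase-space volume, energy consumption, and mean squared jerk.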

Comments 12 pages, 3 figures

详情
AI中文摘要

基于递归状态空间架构构建的世界模型能够实现高效的潜在想象,但仍然缺乏物理结构,导致动力学违反守恒和耗散原理。我们引入了一个统一的端口-哈密顿框架,通过三种协同机制来解决这一问题。首先,我们将隐含的物理先验嵌入到递归转换中,通过将投影的潜在演变建模为受流动和耗散控制的能量路由,使投影的PH相空间偏向于更紧凑且物理结构化的表示。其次,我们开发了一个具有运动学意识的能量世界模型,该模型从本体感觉观察估计哈密顿量和功率平衡,提供了一个明确的物理信号用于热力学推理。第三,利用这些能量梯度,我们建立了基于能量的Actor-Critic,利用拉格朗日乘数来正则化策略优化,使其朝着更低的能量和更平滑的控制方向发展。在视觉控制基准测试中,该范式不仅实现了更优的渐近回报,还通过在想象奖励和真实奖励之间建立更紧密且方差更低的对齐关系,提高了内部模拟器的保真度,同时将潜在相空间体积减少了4.18-8.41%,能量消耗降低了高达7.80%,平均加速度平方降低了高达9.38%。

英文摘要

World models built on recurrent state-space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase-space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.
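For reference, the textbook input-state-output port-Hamiltonian form that underlies the "flow and dissipation" and "power balance" language is written out below. The symbols ($x$ state, $H$ Hamiltonian, $J$ interconnection, $R$ dissipation, $G$ input map, $u$ input, $y$ output) follow the standard convention and are assumptions here; the paper's latent parameterization may differ.

```latex
\begin{align}
\dot{x} &= \bigl(J(x) - R(x)\bigr)\,\nabla H(x) + G(x)\,u,
  & y &= G(x)^{\top}\,\nabla H(x), \\
J(x) &= -J(x)^{\top},
  & R(x) &= R(x)^{\top} \succeq 0, \\
\frac{dH}{dt} &= \nabla H^{\top}\dot{x}
  = -\,\nabla H^{\top} R\,\nabla H + y^{\top} u \;\le\; y^{\top} u .
\end{align}
```

The skew-symmetric term $J$ contributes nothing to $dH/dt$, so stored energy can only grow through the supplied power $y^{\top}u$ and is otherwise dissipated by $R$.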

2605.18298 2026-05-19 cs.AI cs.HC cs.LG 版本更新

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

DARE-EEG: 一种用于挖掘双对齐表示的EEG基础模型

Yang Shao, Peiliang Gong, Qun Dai, Daoqiang Zhang

Affiliations: College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

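AI summary: This paper proposes DARE-EEG, a self-supervised foundation model pre-trained with dual-aligned representation learning so that EEG encoders learn representations invariant to incomplete observations. Contrastive learning and momentum updates provide semantic stability, and a conv-linear-probing strategy adapts the model to heterogeneous electrode configurations and sampling rates. Experiments show strong performance on EEG benchmarks.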

Comments 22 pages, 10 pages of main text + 12 pages of appendices

详情
AI中文摘要

通过在大规模EEG数据上进行掩码重建预训练,基础模型已成为在多样化脑机接口应用中学习通用神经表示的有前景范式。然而,一个关键但被忽视的挑战是EEG编码器必须学习对不完整观测不变的表示——当不同掩码视图的同一信号有最小重叠时,现有方法无法将它们约束到一致的潜在子空间,导致转移性下降。为此,我们提出DARE-EEG,一种自监督基础模型,通过预训练期间的双对齐表示学习显式强制掩码不变性。具体而言,我们引入掩码对齐,通过对比学习约束同一EEG样本多个掩码视图的表示,补充锚点对齐,将掩码表示对齐到动量更新的完整特征以实现语义稳定性。此外,我们提出卷积-线性探针,一种参数高效策略,通过解耦频谱-空间投影适应异构电极配置和采样率。在多样化的EEG基准测试中,广泛实验表明DARE-EEG在准确性表现上始终领先,同时保持相对较低的参数复杂度和优于现有方法的跨数据集可移植性。此外,DARE-EEG有助于有效发现和利用EEG中的丰富潜在表示。

英文摘要

Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations: when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment, which constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment, which aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art accuracy while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG helps to discover and exploit the rich potential representations in EEG.

2605.18281 2026-05-19 cs.LG 版本更新

Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling

时间任务多样性:非平稳性下的归纳偏置

Afiq Abdillah Effiezal Aswadi, Oliver Britton, Ross Baker, Matthew Farrugia-Roberts

Affiliations: University of Oxford

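AI summary: This study examines how changing the task distribution over training time affects the inductive biases of deep learning models in synthetic sequence modelling, and finds that temporal diversity of the task distribution strengthens the bias towards generalisation over memorisation.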

Comments Presented at Technical AI Safety Conference (TAIS), Oxford, May 2026. Code available at https://github.com/matomatical/temporal-task-diversity

详情
AI中文摘要

现代深度学习科学常常假设神经网络从固定的数据分布中学习。然而,许多实际重要的学习问题涉及在训练过程中数据分布发生变化的情况。这种非平稳性如何影响深度学习对具有不同结构、泛化性和安全性属性的模型的归纳偏置?一个研究归纳偏置的有成效的测试平台是在上下文线性回归序列建模中,其中小型变压器根据训练任务分布的多样性表现出显著不同的泛化模式。在本文中,我们探讨了在训练时间多样化任务分布的影响,发现这种时间多样性导致对泛化而非记忆的偏置增加。

英文摘要

Modern deep learning science often assumes that neural networks learn from a fixed data distribution. However, many practically important learning problems involve data distributions that change throughout training. How does such non-stationarity impact the inductive biases of deep learning towards models with different structural, generalisation, and safety properties? A fruitful testbed for studying inductive bias is in-context linear regression sequence modelling, where small transformers display strikingly different generalisation patterns depending on the diversity of the (fixed) training task distribution. In this paper, we explore the effect of diversifying the task distribution across training time, finding that such temporal diversity leads to an increased bias towards generalisation over memorisation.

2605.18276 2026-05-19 stat.ML cs.LG 版本更新

Geometric Dictionary Learning of Dynamical Systems with Optimal Transport

通过最优传输的几何字典学习动力系统

Thibaut Germain, Sami Chemlal, Rémi Flamary, Vladimir R. Kostic, Karim Lounici

Affiliations: CMAP, Ecole Polytechnique; Istituto Italiano di Tecnologia & University of Novi Sad

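AI summary: This paper proposes DOODL, a framework that uses geometric dictionary learning to learn a low-dimensional manifold in spectral operator space, enabling efficient representation of complex dynamical systems and interpretable operator estimation.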

详情
AI中文摘要

通过算子理论表示学习动力系统提供了一个强大的框架,用于分析复杂动态,因为诸如特征值和不变结构等谱量编码了特征时间尺度和长期行为。然而,动力算子通常独立地为每个系统估计,阻止了发现相关动态中的共享结构。为了解决这一限制,我们提出相关动力系统位于谱算子空间中的低维流形附近。基于这一假设,我们引入DOODL(Dynamical OperatOr Dictionary Learning),一个框架,学习一组特征谱动态的字典,其组合近似该流形并产生紧凑、可解释的个体系统嵌入。除了表征学习外,DOODL通过将估计限制在学习的算子流形上,使从短且部分观测轨迹中快速且可解释地估计算子成为可能。在metastable Langevin动力学和湍流等离子体模拟中的实验表明,DOODL能够扩展到高度复杂的多尺度区域,同时捕捉支配动态的特征谱结构,而不是仅仅拟合轨迹,在具有挑战性的低数据区域中,其误差比独立算子估计方法低一个到两个数量级。

英文摘要

Learning dynamical systems through operator-theoretic representations provides a powerful framework for analyzing complex dynamics, as spectral quantities such as eigenvalues and invariant structures encode characteristic time scales and long-term behavior. However, dynamical operators are typically estimated independently for each system, preventing the discovery of shared structure across related dynamics. To address this limitation, we posit that related dynamical systems lie near a low-dimensional manifold in spectral operator space. Based on this hypothesis, we introduce DOODL (Dynamical OperatOr Dictionary Learning), a framework that learns a dictionary of characteristic spectral dynamics whose combinations approximate this manifold and yield compact, interpretable embeddings of individual systems. Beyond representation learning, DOODL enables fast and interpretable operator estimation from short and partially observed trajectories by constraining the estimation to the learned operator manifold. Experiments on metastable Langevin dynamics and turbulent plasma simulations demonstrate that DOODL scales to highly complex multiscale regimes while capturing characteristic spectral structure governing the dynamics rather than merely fitting trajectories, achieving errors one to two orders of magnitude lower than independent operator estimation methods in challenging low-data regimes.

2605.18251 2026-05-19 eess.SP cs.LG q-bio.NC 版本更新

Subject-Specific Analysis of Self-Initiated Attention Shifts from EEG with Controlled Internal and External Attention Conditions

基于EEG的受试者特异性自我启动注意力转移的分析:受控内部和外部注意力条件

Yuwen Zeng, Dengzhe Hou, Zhang Zhang, Sai Sun, Yongsong Huang, Chia-huei Tseng, Satoshi Shioiri

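Affiliations * Advanced Institute of Convergence Knowledge Informatics, Tohoku University, Japan; Graduate School of Information Sciences, Tohoku University, Japan; Center for Data-Driven Science and Artificial Intelligence, Tohoku University, Japan; Research Institute of Electrical Communication, Tohoku University, Japan

AI summary This paper studies whether preparatory EEG activity distinguishes self-initiated from externally instructed attention shifts, using machine-learning classification and SHAP feature attribution, and finds reliable within-subject classification that points to subject-specific discriminative information under controlled experimental conditions.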

详情
AI中文摘要

自我启动的注意力转移在自愿行为中起关键作用,但由于缺乏显式的时序标记而难以研究。尽管之前的研究所探讨了其神经相关性,但尚不清楚多维脑电图(EEG)特征如何在可解释的计算框架中贡献于其表征。在本研究中,我们基于之前的工作开发的实验范式,实现了在相同视觉刺激下的任务受限自我启动转移与外部指导转移的受控比较。在此设置中,我们探讨了准备性EEG活动是否能区分这两种类型的注意力转移。我们采用基于机器学习的方法,进行了两种互补的分析:(1)以性能为导向的频率特异性地形模式评估,以及(2)使用SHapley Additive exPlanations(SHAP)的模型基于特征归因分析。这些分析提供了对感兴趣区域跨频谱特征如何贡献于模型行为的结构化视图。我们的结果表明,具有可靠受试者内分类性能,表明准备性EEG活动在此范式中包含受试者特异性判别信息。分析显示,高频带和前额区域对模型决策有显著贡献,尽管由于高频EEG信号中可能存在的非神经伪影影响,这种贡献应谨慎解释。总体而言,本文强调了可解释机器学习在受控实验条件下分析受试者特异性EEG信号模式的价值,具有在个性化和异步脑机接口系统中的潜在应用。

英文摘要

Self-initiated attention shifts play a critical role in voluntary behavior but are difficult to study due to the absence of explicit temporal markers. While previous studies have examined their neural correlates, it remains unclear how multi-dimensional electroencephalography (EEG) features contribute to their characterization within an interpretable computational framework. In this study, we build on an experimental paradigm developed in our previous work, which enables controlled comparison between task-constrained self-initiated shifts and externally instructed shifts under identical visual stimulation. Within this setting, we investigate whether preparatory EEG activity can distinguish these two types of attention shifts. We adopt a machine learning-based approach and conduct two complementary analyses: (1) a performance-oriented assessment of frequency-specific topographic patterns, and (2) a model-based feature attribution analysis using SHapley Additive exPlanations (SHAP). These analyses provide a structured view of how spectral features across regions of interest contribute to model behavior. Our results demonstrate reliable within-subject classification performance, indicating that preparatory EEG activity contains subject-specific discriminative information within this paradigm. The analysis shows that higher-frequency bands and frontal regions contribute strongly to model decisions, although such contributions should be interpreted cautiously due to the potential influence of non-neural artifacts in high-frequency EEG signals. Overall, this work highlights the value of interpretable machine learning for analyzing subject-specific EEG signal patterns in a controlled experimental setting, with potential applications in personalized and asynchronous brain-machine interface systems.

2605.18246 2026-05-19 cs.LG cs.AI 版本更新

Privacy Preserving Reinforcement Learning with One-Sided Feedback

具有单侧反馈的隐私保护强化学习

Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan

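Affiliations * Nanyang Technological University; National University of Singapore; Cornell SC Johnson College of Business

AI summary This paper studies reinforcement learning in multi-dimensional continuous state and action spaces where the agent receives only partial observations of the state and, at each time step, reward information for only a subset of the state-action space. It proposes POOL, a privacy-preserving RL algorithm whose sample complexity bound is shown to match the known lower bounds for non-private RL, indicating that strong privacy guarantees are compatible with high learning efficiency.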

Comments Accepted at IJCAI-ECAI 2026

详情
AI中文摘要

我们研究了在多维连续状态和动作空间中具有单侧反馈的强化学习(RL)。在此设置中,智能体仅接收状态的部分观测,并在每个时间步仅获得状态-动作空间子集的奖励信息。这种设置在学习效率和隐私保护方面带来了重大挑战。为了解决这些挑战,我们提出了POOL,一种新颖的隐私保护RL算法。我们对POOL进行了全面的理论分析,推导出一个样本复杂度界,该界与已知的非隐私RL下界相匹配。其中,E_rho表示隐私参数,H是时间范围,alpha是最优性差距参数。我们的研究结果表明,可以在保持高学习效率的同时实现强隐私保障,这标志着在具有单侧反馈的多维环境中实现实用的隐私感知RL迈出重要一步。

英文摘要

We study reinforcement learning (RL) in multi-dimensional continuous state and action spaces with one-sided feedback, where the agent receives partial observations of the state and obtains reward information for only a subset of the state-action space at each time step. This setting introduces substantial challenges in both learning efficiency and privacy preservation. To address these challenges, we propose POOL, a novel privacy-preserving RL algorithm. We conduct a comprehensive theoretical analysis of POOL, deriving a sample complexity bound that matches the known lower bounds for non-private RL. Here, E_rho denotes the privacy parameter, H is the time horizon, and alpha is the optimality-gap parameter. Our findings show that it is possible to enforce strong privacy guarantees while maintaining high learning efficiency, marking a significant step toward practical, privacy-aware RL in multi-dimensional environments with one-sided feedback.

2605.18229 2026-05-19 cs.LG cs.AI 版本更新

Are Sparse Autoencoder Benchmarks Reliable?

稀疏自编码基准测试是否可靠?

David Chanin

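Affiliations * Decode Research; MATS; UCL

AI summary This study audits the reliability of sparse autoencoder (SAE) benchmarks and finds that two metrics fail multiple lenses while the others are noisier and less discriminative than assumed, indicating that better SAE benchmarks are needed.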

详情
AI中文摘要

稀疏自编码(SAEs)是大型语言模型的核心可解释性工具,其进展依赖于能够可靠区分更好和更差SAE的基准测试。我们通过三种互补的视角审计了SAEBench中SAE质量指标:固定SAE上的重新播种噪声、合成SAE上的真实相关性以及训练轨迹的可区分性。我们发现,两个指标,即目标探测扰动(TPP)和虚假相关性消除(SCR),在它们的典型设置下未能通过多个视角,不应用于评估SAE。其他指标显示出更高的重新播种噪声和更低的可区分性,比领域假设的要差。sae-probes变体的k-稀疏探测是我们在测试中发现最可靠的指标,但即使sae-probes也难以区分同一体系结构的不同变体。我们的结果表明,领域需要更好的SAE基准测试。

英文摘要

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph 版本更新

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

SIREM: 语音引导的MRI重建与学习采样

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

Affiliations * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institut für Informationsverarbeitung, Leibniz Universität Hannover; Department of Radiology, Harvard Medical School and Massachusetts General Hospital

AI summary This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior, exploiting the correlation between vocal-tract configurations and the produced acoustics to predict part of the image content, and reconstructs anatomically plausible structure at higher throughput than iterative methods.

详情
AI中文摘要

实时磁共振成像(rtMRI)在语音生产中的应用能够非侵入性地可视化动态声带运动,对语音科学和临床评估具有价值。然而,rtMRI本质上受到空间分辨率、时间分辨率和获取速度之间的权衡限制,常常导致k空间测量不足和重建质量下降。我们提出SIREM,一种利用同步语音作为跨模态先验的MRI重建框架。核心思想是语音期间的声带配置与产生的声音学相关,使图像部分内容可从音频预测。SIREM将每帧建模为音频驱动组件和MRI驱动组件的融合,通过空间加权图。音频分支从语音预测发音器相关结构,而MRI分支从测量的k空间数据重建互补内容。我们进一步引入了可学习的软加权轮廓,使螺旋臂的使用与语音引导融合的交互研究可微分。这产生了一个统一的多模态公式,结合了音频驱动预测、MRI重建和采样适应。我们在USC语音rtMRI基准上评估了SIREM,与标准基线(包括栅格、基于小波的压缩感知和总变分)进行比较。SIREM引入了一种语音引导的重建范式,在比迭代方法高得多的吞吐量下运行,同时保持解剖上合理的声带结构。这些结果为多模态语音引导的rtMRI重建建立了初步基准,并突显了同步语音作为快速重建辅助先验的潜力。源代码可在https://github.com/mdhasanai/SIREM获取。

英文摘要

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM

2605.18204 2026-05-19 stat.ML cs.LG 版本更新

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

前向学习离散扩散:学习如何更快地噪声去噪声

Grigory Bartosh, Teodora Pandeva, Sushrut Karmalkar, Javier Zazo

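Affiliations * University of Amsterdam; Microsoft Research, Cambridge

AI summary This paper proposes Forward-Learned Discrete Diffusion (FLDD), which introduces a learnable forward (noising) process to reduce the gap between the target and model distributions and enable few-step generation. The method adopts a non-Markovian formulation with learnable marginal and posterior distributions, so the generative process stays factorized while matching the target defined by the noising process. Experiments show that, for the same number of sampling steps, FLDD produces higher-quality samples than conventional discrete diffusion models.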

详情
AI中文摘要

离散扩散模型是一类强大的生成模型,在许多领域表现出色。然而,为了效率,离散扩散通常用因子化分布参数化生成(反向)过程,这使得模型难以在少量步骤内学习目标过程,并需要长且计算成本高的采样过程。为减少目标与模型分布之间的差距并实现少步生成,我们提出前向学习离散扩散(FLDD),引入可学习的前向(噪声)过程。不同于固定马尔可夫前向链,我们采用非马尔可夫形式,结合可学习的边缘和后验分布。这使生成过程保持因子化,同时匹配由噪声过程定义的目标。我们通过标准变分目标端到端训练所有参数。在各种基准测试中,实验表明,对于给定的采样步数,我们的方法生成的样本质量优于使用相同反向参数化的传统离散扩散模型。

英文摘要

Discrete diffusion models are a powerful class of generative models with strong performance across many domains. For efficiency, however, discrete diffusion typically parameterizes the generative (reverse) process with factorized distributions, which makes it difficult for the model to learn the target process in a small number of steps and necessitates a long, computationally expensive sampling procedure. To reduce the gap between the target and model distributions and enable few-step generation, we propose Forward-Learned Discrete Diffusion (FLDD), which introduces discrete diffusion with a learnable forward (noising) process. Rather than fixing a Markovian forward chain, we adopt a non-Markovian formulation with learnable marginal and posterior distributions. This allows the generative process to remain factorized while matching the target defined by the noising process. We train all parameters end-to-end under the standard variational objective. Experiments on various benchmarks show that, for a given number of sampling steps, our approach produces higher-quality samples than conventional discrete diffusion models using the same reverse parameterization.

2605.18202 2026-05-19 cs.LG cs.AI 版本更新

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

简洁且逻辑一致的神经符号概念模型的符合集

Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini, Stefano Teso

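Affiliations * Department of Information Engineering and Computer Science, University of Trento, Italy; CIMeC, University of Trento, Rovereto, Italy

AI summary This paper proposes COCOCO, a framework that integrates conformal prediction to address overconfident label and concept predictions in neuro-symbolic concept-based models, satisfying three desiderata (consistency, coverage, and conciseness) and improving model reliability.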

详情
AI中文摘要

神经符号概念模型(NeSy-CBMs)是一类将神经网络与符号推理相结合的架构,用于在高风险应用中提高可靠性。它们通过从输入中提取高层概念,然后在给定的逻辑约束下推断任务标签。然而,其标签和概念预测可能过于自信,使利益相关者难以判断何时可以信任模型的决策。本文通过整合符合预测(CP)框架,提供严格的分布无关覆盖保证,正式化了三个要求——一致性、覆盖性和简洁性,证明现有方法至少在一项上不足。然后引入COCOCO,一种后处理框架,联合符合概念和标签,并通过单个推断-反推修订步骤进行协调。COCOCO满足所有三个要求,保留分布无关覆盖,对不完美的知识具有鲁棒性,并支持用户指定的大小预算。在8个数据集上的实验显示,COCOCO在性能和集合大小方面优于竞争对手和自然基线。

英文摘要

Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from these compatibly with given logical constraints. Yet, their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model's decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata -- consistency, coverage, and conciseness -- that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge and supports user-specified size budgets. Our experiments on 8 data sets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.
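For readers unfamiliar with the distribution-free coverage guarantee mentioned above, the following is a minimal sketch of standard split conformal prediction for a classifier. It is not COCOCO: there is no joint treatment of concepts and labels and no deduction-abduction revision step, and the function names and the score `1 - p[y]` are illustrative choices.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data for target coverage 1 - alpha.

    The nonconformity score is 1 minus the probability assigned to the true label.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    # With too few calibration points the only valid set is "all labels".
    return float("inf") if k > n else scores[k - 1]

def prediction_set(probs, threshold):
    """Return every label whose score does not exceed the calibrated threshold."""
    return {label for label, p in enumerate(probs) if 1.0 - p <= threshold}
```

Under exchangeability of calibration and test points, the returned set contains the true label with probability at least 1 - alpha; conciseness then corresponds to keeping these sets small.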

2605.18190 2026-05-19 cs.LG cs.CV 版本更新

Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network

双速率扩散:通过交错重-轻网络加速扩散模型

Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek, Thomas Mensink, Tim Salimans

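Affiliations * Google DeepMind Amsterdam; University of Amsterdam

AI summary This paper proposes Dual-Rate Diffusion, which accelerates diffusion model inference by interleaving a sparsely evaluated high-capacity context encoder with a lightweight denoising model while preserving sample quality; on ImageNet benchmarks it matches standard baselines at 2-4 times lower computational cost.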

详情
AI中文摘要

扩散模型在生成性能上达到最先进的水平,但在推理过程中由于重复评估重的神经网络而面临高昂的计算成本。在本文中,我们提出了双速率扩散,一种通过交错执行高容量的上下文编码器和轻量高效的去噪模型来加速采样的方法。上下文编码器被稀疏评估以提取高维特征,这些特征在每一步都被轻量去噪模型有效重用,以高效地细化样本。这种方法显著加速了推理过程,而不会牺牲样本质量。在ImageNet基准上,双速率扩散在性能上与标准基线相匹配,同时将计算成本降低了2-4倍。此外,我们证明了我们的方法与蒸馏技术,如动量匹配蒸馏,兼容,从而在少步生成中进一步提高效率。

英文摘要

Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy high-capacity context encoder and a light efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are effectively reused by the light denoising model at every step to refine the sample efficiently. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of $2$-$4$. Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
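The interleaving described above can be sketched as a sampling loop in which the heavy encoder runs only every few steps and its output is reused by the light model in between. Everything here is a toy assumption: `dual_rate_sample`, the `heavy_every` schedule, and the two stand-in networks are hypothetical and only illustrate the control flow, not the paper's architecture or sampler.

```python
def dual_rate_sample(x, num_steps, heavy_every, heavy_encoder, light_denoiser):
    """Run num_steps denoising steps, refreshing the context every heavy_every steps."""
    context = None
    heavy_calls = 0
    for step in range(num_steps):
        t = 1.0 - step / num_steps  # noise level, moving from 1 toward 0
        if step % heavy_every == 0:
            context = heavy_encoder(x, t)  # expensive, evaluated sparsely
            heavy_calls += 1
        x = light_denoiser(x, t, context)  # cheap, evaluated at every step
    return x, heavy_calls

def toy_heavy_encoder(x, t):
    """Stand-in for the high-capacity context encoder: returns the sample mean."""
    return sum(x) / len(x)

def toy_light_denoiser(x, t, context):
    """Stand-in for the light denoiser: moves each coordinate halfway to the context."""
    return [xi + 0.5 * (context - xi) for xi in x]
```

With `num_steps=12` and `heavy_every=4`, the heavy network is called 3 times instead of 12, which is where the cost saving in this sketch comes from.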

2605.18188 2026-05-19 cs.LG 版本更新

UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction

UTOPYA:一种用于物理信息异常检测和时间序列预测的多模态深度学习框架

Robson W. S. Pessoa, Julien Amblard, Alessandra Russo, Idelfonso B. R. Nogueira

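Affiliations * Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU); Department of Computing, Imperial College London

AI summary This paper proposes UTOPYA, a framework that fuses eight data modalities through FiLM-conditioned cross-modal attention and gated fusion to jointly address anomaly detection, time-series prediction, and phase classification in batch distillation, with performance further supported by a physics-informed regularisation scheme and curriculum learning.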

详情
AI中文摘要

批次过程中的异常检测受到瞬态动态、稀少故障标签和依赖单一模态传感器数据的限制。本文介绍了UTOPYA(统一时间观测用于物理信息异常检测和时间序列预测),一种具有15.2M参数的多模态框架,通过特征-wise线性调制(FiLM)条件交叉模态注意力和门控融合,共同解决批次蒸馏中的异常检测、时间序列预测和相分类问题。本文引入的物理信息正则化方案强制时间平滑性和热力学单调性,而课程学习则按物理难度顺序引入训练样本。在Arweiler等人(2026)的119次实验多模态批次蒸馏数据集上,UTOPYA在窗口级别测试中达到0.832和0.874的AUROC,显著优于四个外部基线(PCA、自动编码器、隔离森林和LSTM自动编码器)在相同条件下的表现(+0.147窗口级别AUROC超过最佳基线)。对15种架构配置的多模态消融研究显示,通过FiLM条件的静态上下文是关键使能器,使实验级别多信号AUROC提高+0.145(从0.729到0.874)。此外,对14种设计选择的训练消融研究发现,包括实例归一化、Mixup、集成、测试时增强和随机权重平均在内的几种广泛采用的技巧在数据稀少的设置中未能提升或主动降低泛化能力。这些负面结果揭示了平滑基于正则化和异常检测之间的根本矛盾,为多模态过程监控部署提供了实际指导。

英文摘要

Anomaly detection in batch processes is hindered by transient dynamics, scarce fault labels, and reliance on single-modality sensor data. This work introduces UTOPYA (Unified Temporal Observation for Physics-Informed Anomaly Detection and Time-Series Prediction), a 15.2M-parameter multimodal framework that jointly addresses anomaly detection, time-series prediction, and phase classification in batch distillation by fusing eight data modalities through Feature-wise Linear Modulation (FiLM) conditioned cross-modal attention and gated fusion. A physics-informed regularisation scheme introduced in this work enforces temporal smoothness and thermodynamic monotonicity, while curriculum learning introduces training samples in order of physical difficulty. On the 119-experiment multimodal batch distillation dataset of Arweiler et al. (2026), UTOPYA achieves a window-level test AUROC of 0.832 and 0.874 under multi-signal experiment-level scoring, substantially outperforming four external baselines (PCA, autoencoder, Isolation Forest, and LSTM autoencoder) evaluated under identical conditions (+0.147 window-level AUROC over the best baseline). A multimodal ablation over 15 architectural configurations shows that static context via FiLM conditioning is the key enabler, lifting experiment-level multi-signal AUROC by +0.145 over the unimodal baseline (0.729 to 0.874). Separately, a training ablation across 14 design choices reveals that several widely-adopted techniques, including instance normalisation, Mixup, ensembling, test-time augmentation, and stochastic weight averaging, fail to improve or actively degrade generalisation in this data-scarce setting. These negative results expose a fundamental tension between smoothing-based regularisation and anomaly detection, providing practical guidance for multimodal process monitoring deployment.
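Feature-wise Linear Modulation, which the abstract identifies as the key enabler, scales and shifts each feature using values computed from a conditioning vector. The sketch below shows only that generic operation in plain Python; the function names and the single linear layers producing gamma and beta are assumptions, and it does not reproduce UTOPYA's cross-modal attention or gated fusion.

```python
def linear(weights, bias, vec):
    """Apply a dense layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: output_i = gamma_i(context) * features_i + beta_i(context)."""
    gamma = linear(w_gamma, b_gamma, context)  # per-feature scale from the context
    beta = linear(w_beta, b_beta, context)     # per-feature shift from the context
    return [g * f + b for g, f, b in zip(gamma, features, beta)]
```

In a setting like the one described, `context` would plausibly hold static experiment metadata and `features` a time-window embedding, so the same sensor reading can be interpreted differently depending on the static conditions.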

2605.18180 2026-05-19 stat.ML cs.LG 版本更新

Canonical Regularisation of Wide Feature-Learning Neural Networks

宽特征学习神经网络的规范正则化

George Whittle, Pranav Vaidhyanathan, Juliusz Ziomek, Natalia Ares, Maike A. Osborne

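Affiliations * Department of Engineering Science, University of Oxford

AI summary This paper studies the regulariser implied by gradient flow training in wide feature-learning neural networks, shows that ridge regularisation, well understood in the kernel regime, distorts the inductive bias in the feature-learning regime, and proposes arc ridge as a scalable surrogate that extends canonical regularisation to the feature-learning regime.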

详情
AI中文摘要

宽神经网络在特征学习范式中推动了现代深度学习的发展,但它们的研究远少于核范式中的网络。我们考虑了这两个范式之间一个关键但研究不足的差异:梯度流训练所隐含的正则化和先验。这种规范正则化性质在核范式网络中已被广泛研究——在所有无限全局极小点中,梯度流精确选择消失的岭解——并支撑了著名的NN-GP对应关系,精确允许在训练过程中建模噪声。然而,我们证明在特征学习范式网络中,岭正则化会扭曲梯度流的诱导偏差,即使在正则化趋于零的极限下也是如此。在训练过程中,岭正则化会扭曲网络的诱导偏差,尤其对预训练网络造成损害,因为隐含的先验信息是有信息的。我们通过将规范正则化作为一种无关范式函数空间能量和提升函数来公理化,这在核范式中唯一识别岭解,并且关键地扩展到特征学习范式。通过研究特征学习网络的黎曼几何,我们从框架中推导出黎曼几何岭,将岭扩展到特征学习范式。相应地,我们证明规范函数空间先验是一个黎曼-高斯过程,扩展了更熟悉的高斯过程。作为实际贡献,我们提出了弧岭作为最小最大鲁棒、可扩展的替代方案,揭示了早停和规范正则化在学习范式中的深刻关系。最后,我们在图像处理和NLP迁移学习问题上展示了我们的理论后果。

英文摘要

Wide neural networks in the feature-learning regime drive modern deep learning, and yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well-studied in kernel regime networks -- of all the infinite global minima, gradient flow selects exactly the vanishing ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with particular damage done to pretrained networks where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime, and crucially generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.

2605.18174 2026-05-19 cs.LG cs.DC math.OC stat.ML 版本更新

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Ringmaster LMO: 异步线性最小化Oracle动量方法

Abdurakhmon Sadiev, Artavazd Maranjyan, Ivan Ilin, Peter Richtárik

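AI summary This paper proposes Ringmaster LMO, an asynchronous Linear Minimization Oracle based momentum method for unconstrained stochastic nonconvex optimization. It extends the delay-thresholding mechanism of Ringmaster ASGD to LMO-based updates for heterogeneous distributed systems, and experiments show its advantages grow as system heterogeneity increases.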

详情
AI中文摘要

Muon最近作为一种强大的替代AdamW方法出现,展现出大规模预训练的良好结果和矩阵结构更新在实践中可能更快的证据。然而,Muon以及更一般的线性最小化Oracle(LMO)方法通常用于同步方式。这在异构分布式系统中存在问题,因为工人完成梯度计算的速度不同,同步训练必须反复等待较慢的工人。本文引入Ringmaster LMO,一种用于无约束随机非凸优化的异步LMO基于动量方法。我们的方法基于Ringmaster ASGD的延迟阈值思想。对于SGD类型方法,Ringmaster ASGD通过丢弃过于陈旧的梯度实现最优时间复杂度。Ringmaster LMO将这一机制扩展到一般LMO更新。我们建立了在广义$(L_0, L_1)$-平滑条件下的收敛保证,并进一步开发了参数无关变体,具有递减步长和自适应延迟阈值。最后,我们将我们的迭代保证转换为在异构工人计算时间下的时间复杂度界限。在经典欧几里得平滑设置中,这些界限恢复了Ringmaster ASGD的最优时间复杂度。在随机二次问题和NanoChat语言模型预训练中的实验表明,Ringmaster LMO的优势随着系统异构性增加而增强,并且该方法在同步和异步基线方法中表现更优。

英文摘要

Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
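The delay-thresholding idea can be illustrated with a toy parameter server that drops gradients computed on a model that is too many updates old and otherwise applies a momentum step through an LMO. This is a sketch under stated assumptions: the class `AsyncLMOServer`, the rule "discard when delay >= max_delay", and the sign-based LMO over the l_inf ball are chosen here for brevity and are not claimed to match the paper's algorithm or the LMO used by Muon.

```python
def sign_lmo(vec):
    """LMO over the unit l_inf ball: argmin over ||d||_inf <= 1 of <vec, d> is -sign(vec)."""
    return [0.0 if v == 0 else (-1.0 if v > 0 else 1.0) for v in vec]

class AsyncLMOServer:
    """Toy server applying momentum LMO updates with a staleness threshold (hypothetical)."""

    def __init__(self, x0, lr, beta, max_delay):
        self.x = list(x0)
        self.m = [0.0] * len(x0)
        self.lr, self.beta, self.max_delay = lr, beta, max_delay
        self.iteration = 0  # number of updates applied so far

    def receive(self, grad, computed_at):
        """Apply grad unless it was computed too many updates ago; return True if applied."""
        delay = self.iteration - computed_at
        if delay >= self.max_delay:
            return False  # overly stale gradient is discarded
        self.m = [self.beta * m + (1.0 - self.beta) * g for m, g in zip(self.m, grad)]
        direction = sign_lmo(self.m)
        self.x = [x + self.lr * d for x, d in zip(self.x, direction)]
        self.iteration += 1
        return True
```

A slow worker that keeps sending gradients computed at iteration 0 will have them applied only while fewer than `max_delay` updates have happened since; after that its contributions are ignored until it fetches a fresh model.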

2605.18170 2026-05-19 eess.SP cs.CE cs.LG 版本更新

Buffer-Parameterized Machine Learning Surrogate Models for Cross-Technology Signal Integrity Analysis and Optimization

基于缓冲参数的机器学习替代模型用于跨技术信号完整性分析与优化

Julian Withöft, Werner John, Emre Ecik, Ralf Brüning, Jürgen Götze

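Affiliations * Information Processing Lab, Faculty for Electrical Engineering and Information Technology, TU Dortmund; Pyramide2525 EMC Technology Center Paderborn, Zuken GmbH

AI summary This paper proposes buffer-parameterized machine learning surrogate models that handle cross-technology variations without retraining by treating IC buffer characteristics as dynamic model inputs alongside PCB parameters, making signal integrity analysis and optimization more efficient.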

Comments 12 pages, 16 figures, 7 tables. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

印刷电路板(PCB)互连中的信号完整性(SI)分析由于集成电路(IC)缓冲技术的多样性、操作条件的变化和制造公差而变得更加复杂。现有的机器学习(ML)替代模型用于预测SI指标,如内眼轮廓、眼高(EH)、眼宽(EW)和瞬态波形特征,通常依赖于固定的缓冲参数,需要为每次技术转换生成新的数据并重新训练,成本高昂。本文介绍了一种缓冲参数化的ML替代建模方法,能够处理跨技术变化而无需重新训练,通过将IC缓冲特性(例如时钟频率、供电电压、上升/下降时间、抖动和内部电阻和电容)作为动态模型输入,与PCB参数相结合。为了确定此高维空间的最佳替代架构,进行了全面的基准研究,比较了基于树的方法(RFR/GBM)、核方法(SVR/KRR)、高斯过程回归(GPR)和神经网络。随后,该框架在具有44个设计参数的复杂互连上进行了验证。结果表明,各向异性GPR在低数据量情况下表现优异,而神经网络在大数据集上显著优于其他模型。最后,通过跨技术设计空间探索和优化场景展示了ML替代模型的实用价值,证明了与模拟相比,眼罩合规检查的计算速度大幅提高。

英文摘要

Signal integrity (SI) analysis in printed circuit board (PCB) interconnects faces increasing complexity due to diverse integrated circuit (IC) buffer technologies, varying operating conditions, and manufacturing tolerances. Existing machine learning (ML) surrogate models for predicting SI metrics such as the inner eye contour, eye-height (EH), eye-width (EW), and transient waveform features typically rely on fixed buffer parameters, requiring costly new data generation and retraining cycles for every technology shift. This paper introduces a buffer-parameterized ML surrogate modeling methodology capable of handling cross-technology variations without retraining by treating IC buffer characteristics, e.g., clock frequency, supply voltage, rise/fall times, jitter, and internal resistors and capacitors, as dynamic model inputs alongside PCB parameters. To identify the optimal surrogate architecture for this high-dimensional space, a comprehensive benchmarking study compares tree-based methods (RFR/GBM), kernel methods (SVR/KRR), Gaussian process regression (GPR), and neural networks. The framework is subsequently validated on a complex interconnect with 44 design parameters. Results show that while anisotropic GPR excels in low-data regimes, neural networks heavily outperform other models on large datasets. Finally, the practical value of the ML surrogate models is demonstrated through a cross-technology design space exploration and optimization scenario, showcasing massive computational speedups for eye mask compliance checking compared to simulation.

2605.18165 2026-05-19 cs.LG 版本更新

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

Elastic-dLLM: Diffusion LLMs的弹性上下文压缩与增强

Junyi Wu, Tianchen Zhao, Shaoqiu Zhang, Linfeng Zhang, Guohao Dai, Yu Wang

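Affiliations * Tsinghua University; Shanghai Jiao Tong University; Infinigence AI

AI summary This paper addresses computational redundancy in diffusion large language models (dLLMs) and proposes position-preserving [MASK] token compression and terminal-aware augmentation to accelerate decoding and enable long-context extension.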

详情
AI中文摘要

与自回归模型生成一个token一次不同,dLLMs通过联合去噪一批[MASK] tokens并每一步采样一个或多个token;尽管这允许并行解码,但由于被掩码token的大量批大小,这个过程会带来显著的计算成本。我们观察到,大部分成本用于重复处理前面的上下文和许多[MASK] tokens的相同特征表示,表明存在相当大的计算冗余。在本工作中,我们从[MASK] tokens的角度重新审视dLLM的冗余性。通过系统分析,我们验证了[MASK] tokens的冗余性并揭示了它们在提供结构信息中的关键作用。基于这些发现,我们提出了位置保持的[MASK] token压缩和终端感知增强。通过压缩冗余的[MASK]计算,该方法加速了解码,并进一步为受有限输入长度约束的完整序列dLLMs(如LLaDA-8B-Instruct和LLaDA-1.5)提供了自然的上下文折叠式长上下文扩展。此外,对于块dLLMs(如LLaDA2.0-mini),它通过添加受保护的终端[MASK] token来增强生成质量,且无显著开销。

英文摘要

Unlike autoregressive models, which generate one token at a time, dLLMs denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step; despite enabling parallel decoding, this process incurs substantial computational cost due to the large chunk size of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit dLLM's redundancy from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates decoding and further provides a natural extension toward context-folding-like long-context scaling under limited input-length constraints for full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5. Moreover, for block dLLMs such as LLaDA2.0-mini, it augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.

2605.16142 2026-05-19 cs.AI cs.LG 版本更新

Property-Guided LLM Program Synthesis for Planning

基于属性的LLM程序合成用于规划

André G. Pereira, Augusto B. Corrêa, Jendrik Seipp

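Affiliations * Federal University of Rio Grande do Sul; University of Oxford; Linköping University

AI summary This paper studies property-guided LLM program synthesis, which checks whether candidate programs satisfy a formally defined property and feeds concrete counterexamples back to the LLM, guiding it toward stronger programs while reducing generation and evaluation costs.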

详情
AI中文摘要

LLMs在程序合成中表现出色,能够发现超越先前解决方案的程序。然而,这些方法依赖于简单的数值评分来指示程序质量,如解决方案的值或通过的测试数量。因为评分无法指导程序为何失败,系统必须生成并评估许多候选程序,希望其中一些成功,从而增加LLM推理和评估成本。我们研究了一种不同的方法:属性引导的LLM程序合成。与评分程序后评估不同,我们检查候选程序是否满足形式定义的属性。当属性被违反时,我们提前停止评估并提供具体的反例,显示程序为何失败。这种反馈显著减少了程序生成的数量和评估成本,并可以指导LLM生成更强大的程序。我们在PDDL规划领域评估了这种方法,要求LLM合成直接启发函数:每个通过严格改进转换可达的状态都有严格改进的后继。具有这种属性的启发函数可使爬山算法直接到达目标状态。反例引导的修复循环生成一个候选程序,检查训练集上的属性,并返回第一个违反属性的案例。我们在十个规划领域上评估了这种方法,并使用分布外测试集。合成的启发函数在几乎所有测试任务中都是直接的,与最佳先前生成方法相比,我们的方法在每个领域平均生成的程序数量少七倍,无需使用搜索即可解决更多任务,并且评估候选人的计算量减少了几个数量级。只要问题允许可验证的属性,属性引导的LLM合成可以降低成本并提高程序质量。

英文摘要

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
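The property check with early stopping can be sketched on an explicit state graph. The function below is an illustrative assumption, not the authors' checker: it exempts goal states from the requirement (the abstract does not say how goals are handled), works on a generic `successors` callback rather than PDDL, and returns the first violating state as the counterexample.

```python
def find_counterexample(initial, successors, is_goal, h):
    """Return a state violating the 'direct heuristic' property, or None.

    Explores only states reachable from initial via strictly improving
    transitions (h decreases) and requires every non-goal state found this
    way to have at least one strictly improving successor.
    """
    stack, seen = [initial], {initial}
    while stack:
        state = stack.pop()
        if is_goal(state):
            continue  # assumption: goal states need no improving successor
        improving = [s for s in successors(state) if h(s) < h(state)]
        if not improving:
            return state  # stop early and report the concrete failure
        for s in improving:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return None
```

In a repair loop of the kind described, the returned state would be given back to the LLM together with the candidate heuristic, instead of only a numeric score.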

2605.16015 2026-05-19 cs.RO cs.LG 版本更新

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

通过强化学习实现四旋翼机的自适应外环控制

Vishnu Saj, Sushil Vemuri, Dileep Kalathil, Moble Benedict

发表机构 * Texas A&M University(德克萨斯农工大学)

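AI summary This paper proposes a novel adaptive control architecture that combines reinforcement learning with a Residual Dynamics Predictor to improve quadrotor control under dynamic disturbances; real-world experiments show more accurate trajectory tracking.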

详情
AI中文摘要

深度强化学习(DRL)在四旋翼飞行器控制中通常依赖于领域随机化(DR)进行仿真到现实的转移,导致过于保守的策略难以应对动态扰动。为了解决这个问题,我们提出了一种新的自适应控制架构,能够主动感知并响应即时扰动。首先,我们训练了一个最优的外环策略,然后用残差动力学预测器(RDP)替代其对地面真实扰动数据的依赖。RDP通过仅使用状态和控制动作的历史数据在线估计飞行器所受的外部力和力矩。为了实现无缝的硬件转移,我们引入了数据高效的线性校准桥和在线推力校正机制,利用仅几秒的飞行数据将模拟的潜在空间与现实对齐。在真实世界中对Crazyflie微型四旋翼的验证表明,我们的自适应控制器在严重不确定性下,包括质量变化、不对称载荷和动态悬挂载荷,均显著优于基线方法,保持了精确的轨迹跟踪性能。

英文摘要

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.

2605.15960 2026-05-19 cs.AI cs.LG 版本更新

Imperfect World Models are Exploitable

不完美的世界模型是可利用的

Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

发表机构 * University of Edinburgh(爱丁堡大学) Stanford University(斯坦福大学)

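AI summary This paper proposes a new definition of model exploitation in reinforcement learning: a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. By developing a general theory of reward hacking and model exploitation, it proves that exploitation is essentially unavoidable on large policy sets and clarifies the limits of safe planning in world models.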

Comments 17 pages, 3 figures, 2 tables; modified (fixed metadata)

详情
AI中文摘要

我们提出了一种新的强化学习中模型利用的定义。非正式地说,如果世界模型暗示一种策略应严格优于另一种策略,而环境的真实转移模型却暗示相反,则该世界模型是可利用的。我们通过类比先前对奖励黑客的描述,但发现相关的不可避免性证明无法转移到利用上。为克服这一障碍,我们发展了一种奖励黑客和模型利用的一般理论,证明在大规模策略集上利用本质上是不可避免的,并得出黑客作为特殊情况的相应结论。不幸的是,我们还发现保证在有限策略集上不可黑客的条件没有对应的防止利用的条件。因此,我们引入了一种放松的利用概念,并推导出一个安全的视野,在其中可以避免利用。总的来说,我们的结果建立了奖励黑客和模型利用之间的正式桥梁,并阐明了世界模型中安全规划的局限性。

英文摘要

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

2605.15239 2026-05-19 cs.LG 版本更新

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

通过在线策略自我蒸馏减少大语言模型安全对齐的安全税

Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

发表机构 * International Computer Science Institute(国际计算机科学研究所) Microsoft(微软) Berkeley Lab(伯克利实验室)

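AI summary This paper proposes OPSA, an on-policy self-distillation method that introduces the teacher flip rate criterion and reduces the safety tax that distributional mismatch imposes on large language models during safety alignment; experiments show that it outperforms conventional methods across model scales.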

Comments 20 pages, 5 figures

详情
AI中文摘要

安全对齐通常以牺牲推理能力为代价来提高对有害查询的鲁棒性,这种权衡被称为安全税。常见原因是分布不匹配:监督微调训练目标模型时,通常使用人类生成的安全演示、外部模型或固定自动生成的轨迹,而不是从自身策略采样的轨迹。我们识别出非策略训练不匹配是这一税的第二个来源,并研究了用于安全对齐的在线策略自我蒸馏(OPSA)。模型生成自己的轨迹,并从冻结的教师副本中接收密集的每token KL监督,该教师副本在特权安全上下文中进行条件化。由于该教师必须比采样的学生轨迹更安全,我们引入了教师翻转率:一个衡量特权上下文将不安全响应转换为安全响应频率的指标。我们使用此信号来寻找激活潜在安全推理而非仅引发安全外观演示的上下文。在两个推理模型家族和五个模型规模上,OPSA在匹配数据和全参数微调条件下,比非策略自我蒸馏和外部教师蒸馏实现了更优的安全-推理权衡,其在较小模型上获得最大收益(R1-Distill-1.5B增加8.85分,Qwen3-0.6B增加5.49分)。这些收益在不同训练集大小和自适应禁言评估中持续存在。token级分析进一步显示,OPSA将更新集中在早期合规决策token附近,提供了一种在保持通用推理的同时改进安全性的机制。

英文摘要

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
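The abstract gives no formula for teacher flip rate, so the following is only one plausible reading: among prompts whose response without the privileged context is judged unsafe, the fraction whose response with the context is judged safe. The function name and the boolean safety labels (which would come from some safety judge) are assumptions.

```python
def teacher_flip_rate(unsafe_without_ctx, safe_with_ctx):
    """Fraction of originally unsafe responses that become safe once the
    privileged safety context is added (assumed definition)."""
    flips = [safe for unsafe, safe in zip(unsafe_without_ctx, safe_with_ctx) if unsafe]
    # No unsafe baseline responses means there is nothing to flip.
    return sum(flips) / len(flips) if flips else 0.0


# Hypothetical judge labels for four prompts.
unsafe_before = [True, True, False, True]
safe_after = [True, False, True, True]
print(teacher_flip_rate(unsafe_before, safe_after))
```

Under this reading, a candidate context with a higher flip rate makes the frozen teacher more reliably safer than the student rollouts it supervises.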

2605.12765 2026-05-19 cs.LG 版本更新

Inference-Time Machine Unlearning via Gated Activation Redirection

基于门控激活重定向的推理时机器去学习

Vinícius Conte Turani, Otávio Parraga, João Vitor Boer Abitante, Kristen K. Arguello, Joana Pasquali, Ramiro N. Barros, Flavio du Pin Calmon, Christian Mattjie, Rodrigo C. Barros, Lucas S. Kupssinskü

发表机构 * MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil(MALTA机器学习理论与应用实验室,PUCRS,阿雷格里港,巴西) Harvard University(哈佛大学) Kunumi Institute, Brazil(库努米研究所,巴西)

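AI summary This paper proposes GUARD-IT, a training- and gradient-free machine unlearning method that removes the influence of a targeted forget set through input-dependent activation steering at inference time while preserving model performance, and that remains effective under quantized deployment.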

详情
AI中文摘要

大型语言模型会记住大量训练数据,这引发了隐私、版权侵犯和安全方面的担忧。机器去学习旨在在不改变模型性能的情况下移除特定遗忘集的影响,理想上近似于从头重新训练模型而不包含遗忘集。现有方法通过梯度基方法更新模型参数来实现这一目标。然而,这些更新计算成本高,导致不可逆的权重变化,并在模型量化部署时性能下降。一种最近的替代方法是激活工程,在推理期间更改激活以引导模型行为。尽管绕过了权重编辑,但朴素的激活引导会引入自身的问题,因为单一的全局引导向量对每个输入应用相同的干预,导致模型行为的意外变化。我们引入了推理时的机器去学习 via 门控激活重定向(GUARD-IT),这是一种训练和梯度自由的方法,通过在推理时依赖输入的激活引导来实现去学习。所得到的干预作为残差流中的规范保持旋转应用,不改变模型权重。在TOFU和MUSE上的实验表明,GUARD-IT在三个模型规模上匹配或超过了12种基于梯度的基线方法,是唯一一个在所有设置中同时保持效用、抑制记忆和避免灾难性崩溃的方法。GUARD-IT进一步支持无需重新训练的连续去学习,并在参数编辑方法会退化的量化场景下仍有效。

英文摘要

Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
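To make "gated" and "norm-preserving rotation" concrete, here is a generic sketch. It is not GUARD-IT's actual procedure, which the abstract does not specify: the cosine-similarity gate, the threshold, the rotation angle, and all names (`gated_rotate`, `forget_dir`, `target_dir`) are hypothetical, and the activation is assumed to be nonzero.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def gated_rotate(h, forget_dir, target_dir, angle, threshold=0.5):
    """Rotate activation `h` toward `target_dir` by `angle` radians, but only
    when `h` aligns with `forget_dir`; the vector norm is left unchanged."""
    r = norm(h)
    u = [x / r for x in h]
    # Gate (assumed form): leave inputs that do not look like forget-set content untouched.
    if dot(u, forget_dir) / norm(forget_dir) < threshold:
        return list(h)
    # Component of the target direction orthogonal to the current direction.
    proj = dot(target_dir, u)
    w = [t - proj * x for t, x in zip(target_dir, u)]
    wn = norm(w)
    if wn == 0.0:
        return list(h)  # already parallel to the target direction
    w = [x / wn for x in w]
    # Rotation within the plane spanned by u and w keeps the norm equal to r.
    return [r * (math.cos(angle) * x + math.sin(angle) * y) for x, y in zip(u, w)]


steered = gated_rotate([3.0, 4.0], [1.0, 0.0], [0.0, 1.0], angle=0.3)
print(steered, norm(steered))
```

Because the intervention is a rotation rather than an added vector, the activation magnitude seen by later layers is unchanged, and the gate leaves unrelated inputs alone.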

2605.12000 2026-05-19 cs.LG 版本更新

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

拆分差异,融合其余:可证明高效的多目标模仿

Ziyad Sheebaelhamd, Luca Viano, Volkan Cevher, Claire Vernade

发表机构 * University of Tübingen(图宾根大学) EPFL(洛桑联邦理工学院) University of Technology Nuremberg(纽伦堡技术大学)

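AI summary This paper studies multi-objective imitation learning: recovering policies on the Pareto front from demonstrations by multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation methods are ill-suited to this setting because naively aggregating conflicting expert trajectories can yield dominated policies. The authors introduce Multi-Output Augmented Behavioral Cloning (MA-BC), which systematically partitions divergent expert data while pooling state-action pairs with no observed behavior conflict. They prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that treats each expert dataset independently, establish a new lower bound showing that MA-BC is minimax optimal, and validate the algorithm empirically on diverse discrete environments and a continuous Linear Quadratic Regulator (LQR) control task.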

详情
AI中文摘要

本文研究了多目标模仿学习问题:在多目标马尔可夫决策过程(MOMDP)中,根据多个帕累托最优专家的演示数据恢复位于帕累托前沿的策略。标准模仿方法无法应对这一场景,因为简单地聚合冲突的专家轨迹可能导致被支配的策略。为此,我们引入了多输出增强行为克隆(MA-BC)算法,该算法系统地划分分歧的专家数据,同时融合无行为冲突的状态-动作对。理论上,我们证明MA-BC以比单独考虑每个专家数据集的学习器更快的统计速率收敛到帕累托最优策略。此外,我们建立了多目标模仿学习的新下界,证明MA-BC是最小最大最优的。最后,我们在多样化的离散环境和连续线性二次调节器(LQR)控制任务中经验验证了我们的算法。

英文摘要

This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.

2605.11710 2026-05-19 cs.LG cs.CV 版本更新

Unlocking Compositional Generalization in Continual Few-Shot Learning

解锁持续少样本学习中的组合泛化

Phu-Quy Nguyen-Lam, Phu-Hoa Pham, Dao Sy Duy Minh, Chi-Nguyen Tran, Huynh Trung Kiet, Long Tran-Thanh

发表机构 * Faculty of Information Technology, University of Science, Vietnam National University(信息科技学院,科学大学,越南国家大学) Department of Computer Science, University of Warwick(计算机科学系,沃里克大学)

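AI summary This paper proposes a new paradigm for continual few-shot learning that strictly decouples representation learning from compositional inference, enabling efficient generalization to novel concepts and achieving state-of-the-art performance on multiple benchmarks.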

Comments 10 pages

详情
AI中文摘要

基于对象的表示方法在少样本学习中具有关键属性:而不是将场景视为单一单元,模型可以将其分解为个体对象级别的部分,这些部分可以在不同概念之间进行匹配和比较。在实践中,这种潜力很少被实现。持续学习者要么将场景压缩成全局嵌入,要么通过部分级匹配目标进行训练,这使表示过于紧密地依赖于已见过的模式,从而无法泛化到真正的新概念。在本文中,我们识别出这种根本性的结构冲突,并开创了一种新的范式,严格解耦表示学习与组合推理。利用自监督视觉变换器(ViTs)固有的片段级语义几何,我们的框架采用双阶段策略。在训练期间,槽表示完全优化为整体类别身份,保留高度可泛化的对象级几何结构。在推理期间,保留的槽被动态组合以匹配新场景。我们证明了这种范式提供了双重结构优势:冻结的主干自然防止了表示漂移,而我们的轻量级、整体优化保持了特征对新概念转移的能力。广泛的实验验证了这种方法,在标准持续学习基准中实现了最佳的未见概念泛化和最小的遗忘。

英文摘要

Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.

2605.11617 2026-05-19 cs.LG math.ST stat.TH 版本更新

MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound

MIST:通过McDiarmid界实现可靠的流决策树用于在线类增量学习

Phu-Hoa Pham, Chi-Nguyen Tran, Nguyen Lam Phu Quy, Dao Sy Duy Minh, Huynh Trung Kiet, Long Tran-Thanh

发表机构 * Faculty of Information and Technology University of Science Vietnam National University Ho Chi Minh City(信息科技学院科学大学越南国家大学胡志明市) Department of Computer Science University of Warwick(计算机科学系沃里克大学)

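AI summary This paper proposes MIST, which addresses the unreliability of streaming decision trees in online class-incremental learning through three integrated components: a McDiarmid confidence radius, a Bayesian inheritance protocol, and KLL quantile sketches, improving robustness on non-Gaussian geometry.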

Comments 9 pages of main text, 5 figures

详情
AI中文摘要

流决策树是开放世界持续学习的自然候选者,因为它们执行局部更新,具有有界内存,并且具有静态决策边界。尽管如此,它们仍然在在线类增量学习中失败,由于两个耦合的校准问题:(i)随着类别数K的增加,其分裂标准逐渐变得不可靠;(ii)在分裂时间缺乏知识转移。这两种失败的共同根源是信息增益的范围本质上与log2 K成比例。因此,任何基于它的Hoeffding式置信半径必然随着类别数的增长而增长,使得结构上独立于K的分裂标准不可能,从而剥夺了应用流决策树进行持续学习的潜在优势。为了解决这个问题,我们提出了MIST(McDiarmid增量流树),通过三个集成组件解决这两种失败:(i)一个紧致且独立于K的McDiarmid置信半径用于Gini分裂,作为结构正则化器;(ii)一个贝叶斯继承协议,通过截断高斯矩将父统计信息投影到子节点,方差减少保证在最保守的分裂时最强;(iii)每个叶子的KLL量化图支持连续阈值评估和几何自适应的叶子预测。在标准和压力测试表格流上,MIST在近高斯基准上与全局参数方法竞争,并在非高斯几何中表现出独特鲁棒性,其中SOTA基准崩溃。

英文摘要

Streaming decision trees are natural candidates for open-world continual learning, as they perform local updates, enjoy bounded memory, and have static decision boundaries. Despite these advantages, they still fail in online class-incremental learning due to two coupled miscalibrations: (i) their split criterion grows unreliable as the class count K expands, and (ii) they lack knowledge transfer at split time. Both failures share a common root: the range of Information Gain intrinsically scales with log2 K. Consequently, any Hoeffding-style confidence radius derived from it must inevitably grow with the class count, making a K-independent split criterion structurally impossible and removing the potential benefits of applying streaming decision trees to continual learning. To fix this issue, we present MIST (McDiarmid Incremental Streaming Tree), which resolves both failures through three integrated components: (i) a tight, K-independent McDiarmid confidence radius for Gini splitting that acts as a structural regulariser; (ii) a Bayesian inheritance protocol that projects parent statistics to child nodes via truncated-Gaussian moments, with variance reduction guarantees strongest precisely when splitting is most conservative; and (iii) per-leaf KLL quantile sketches that support both continuous threshold evaluation and geometry-adaptive leaf prediction from a single data structure. On standard and stress-test tabular streams, MIST is competitive with global parametric methods on near-Gaussian benchmarks and uniquely robust on non-Gaussian geometry where SOTA benchmarks collapse.
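The scaling argument can be checked numerically. The sketch below only compares the ranges of the two impurity measures and does not reproduce the paper's McDiarmid radius: Shannon entropy of a uniform K-class distribution equals log2 K, whereas Gini impurity equals 1 - 1/K and so stays below 1 for every K.

```python
import math


def entropy(p):
    """Shannon entropy in bits; an upper bound on Information Gain at a node."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def gini(p):
    """Gini impurity; bounded by 1 regardless of the number of classes."""
    return 1.0 - sum(q * q for q in p)


for k in (2, 10, 100, 1000):
    uniform = [1.0 / k] * k
    print(k, round(entropy(uniform), 3), round(gini(uniform), 3))
```

A concentration bound whose width depends on the range of the statistic therefore widens with K for entropy-based gain but not for Gini, which is consistent with the abstract's motivation for a Gini-based split test.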

2605.11365 2026-05-19 cs.AI cs.LG stat.ML 版本更新

Causal Bias Detection in Generative Artificial Intelligence

生成人工智能中的因果偏见检测

Drago Plecko

发表机构 * Department of Statistics & Data Science(统计与数据科学系)

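AI summary This paper studies causal fairness in generative AI and derives new causal decomposition results that quantify fairness impacts along different causal pathways and from the replacement of real-world mechanisms by the generative model's mechanisms; it validates the methodology by analyzing race and gender bias in large language models.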

详情
AI中文摘要

基于人工智能构建的自动化系统越来越多地应用于高风险领域,引发了关于公平性和现实世界中存在的人口差异持续存在的关键担忧。在此背景下,因果推断提供了一个有原则的框架来思考公平性,因为它将观察到的不平等与潜在机制联系起来,并自然与人类直觉和法律上的歧视观念相一致。先前关于因果公平性的研究主要集中在标准机器学习设置中,其中决策者为结果变量Y构建单一预测机制f_Ŷ,同时继承其他协变量的因果机制。然而,生成人工智能的设置却更加复杂:生成模型可以从任意条件下对任何变量集进行采样,隐式地构建了自己对所有因果机制的看法,而不是学习单一预测函数。这种根本性的差异要求因果公平性方法论有新的发展。我们正式定义了生成人工智能中的因果公平性问题,并在统一的理论框架下将其与标准机器学习设置相结合。然后,我们推导了新的因果分解结果,使能够对不同因果路径以及现实机制被生成模型机制替代的公平性影响进行精细量化。我们建立了识别条件并引入了用于因果感兴趣的量的高效估计器,并通过分析不同数据集中的大型语言模型中的种族和性别偏见来证明了我们方法的价值。

英文摘要

Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism $f_{\widehat Y}$ for an outcome variable $Y$, while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

2605.07263 2026-05-19 eess.SP cs.AI cs.DC cs.LG stat.ML 版本更新

Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

非相干空中联邦学习的资源元素能量差

Hao Chen, Zavareh Bozorgasl

发表机构 * Signal, Communication, and Learning Lab (SCALE Lab), Department of Electrical and Computer Engineering, Boise State University(信号、通信与学习实验室(SCALE实验室),电气与计算机工程系,博伊西州立大学)

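AI summary This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. It relies on slow-timescale calibration of average channel powers but needs no instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, the authors derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension.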

Comments Preprint; under review. Code to replicate the results is available at: https://github.com/zavareh1/REED

详情
英文摘要

Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff. The rest of the abstract is in the paper.
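The energy-difference idea can be illustrated with a small Monte Carlo sketch. It is a toy model under stated assumptions, not the paper's construction: each client's fading coefficient is drawn as unit-power complex Gaussian (so the calibrated average channel power is 1), both resource elements see the same receiver-noise variance, and the names `received_energy` and `reed_estimate` are hypothetical.

```python
import math
import random


def received_energy(energies, rng, noise_var):
    """Energy |y|^2 of the superimposed signal on one resource element."""
    y = complex(0.0, 0.0)
    for e in energies:
        # Rayleigh fading: h ~ CN(0, 1), independent per client and per element.
        h = complex(rng.gauss(0, 1), rng.gauss(0, 1)) / math.sqrt(2)
        y += h * math.sqrt(e)
    y += math.sqrt(noise_var / 2) * complex(rng.gauss(0, 1), rng.gauss(0, 1))
    return abs(y) ** 2


def reed_estimate(updates, rng, noise_var=0.1):
    """Single-shot estimate of sum(updates) from two paired energy measurements."""
    pos = [max(u, 0.0) for u in updates]   # positive parts sent on element "+"
    neg = [max(-u, 0.0) for u in updates]  # negative parts sent on element "-"
    # The noise floor has the same mean on both elements and cancels in the difference.
    return received_energy(pos, rng, noise_var) - received_energy(neg, rng, noise_var)


rng = random.Random(0)
updates = [0.5, -0.2, 0.3, -0.1]  # true signed sum is 0.5
trials = 20000
print(sum(reed_estimate(updates, rng) for _ in range(trials)) / trials)
```

In this toy model each measurement has mean equal to the total transmitted energy plus the noise variance, so the difference is an unbiased but noisy estimate of the signed sum; averaging across trials stands in loosely for the diversity that the chip-diverse extension spends resources on.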

2605.06933 2026-05-19 cs.LG cs.CR cs.MA 版本更新

MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security

MAGIQ: 一种具有可证明安全性的后量子多智能体AI治理系统

Sepideh Avizheh, Tushin Mallick, Alina Oprea, Cristina Nita-Rotaru, Reihaneh Safavi-Naini

发表机构 * University of Calgary(卡尔加里大学) Northeastern University(东北大学)

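AI summary This paper presents MAGIQ, a framework for policy definition and enforcement in multi-agent AI systems built on novel, efficient, quantum-resistant cryptographic protocols; it aims to secure agent communication and access-control policies and provides traceable accountability.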

详情
AI中文摘要

我们的计算生态系统正受到两种新兴范式的转变:代理AI系统部署的增加和量子计算的进步。对于代理AI系统而言,最关键的问题之一是创建安全的治理架构,以确保代理遵循其所有者的通信和交互政策,并对其与其他代理交换的消息负责。对于量子计算而言,现有系统必须进行改造,同时必须设计新的加密机制以确保长期安全性和抗量子性。事实上,NIST建议从2030年起弃用标准公钥加密算法,包括RSA、Diffie-Hellman(DH)和椭圆曲线构造(ECC),并在2035年后禁止使用。在本文中,我们提出了MAGIQ,一种使用新型高效、抗量子的加密协议进行多智能体AI系统策略定义和执行的框架。MAGIQ(i)允许用户为智能体到智能体的会话和任务定义丰富的通信和访问控制策略预算,包括针对一对一智能体会话的全局预算;(ii)利用后量子加密原语执行这些策略;(iii)支持基于会话的策略执行,用于智能体到智能体和一对一智能体会话;(iv)通过消息归因提供智能体对其用户的责任。我们使用通用可组合性(UC)框架正式建模并证明系统的正确性和安全性。我们评估了我们框架的计算和通信开销,并将其与最先进的代理AI框架SAGA进行比较。MAGIQ是朝着后量子安全的代理AI系统解决方案迈出的第一步。

英文摘要

Our computing ecosystem is being transformed by two emerging paradigms: the increased deployment of agentic AI systems and advancements in quantum computing. With respect to agentic AI systems, one of the most critical problems is creating secure governing architectures that ensure agents follow their owners' communication and interaction policies and can be held accountable for the messages they exchange with other agents. With respect to quantum computing, existing systems must be retrofitted and new cryptographic mechanisms must be designed to ensure long-term security and quantum resistance. In fact, NIST recommends that standard public-key cryptographic algorithms, including RSA, Diffie-Hellman (DH), and elliptic-curve constructions (ECC), be deprecated starting in 2030 and disallowed after 2035. In this paper, we present MAGIQ, a framework for policy definition and enforcement in multi-agent AI systems using novel, highly efficient, quantum-resistant cryptographic protocols with proven security guarantees. MAGIQ (i) allows users to define rich communication and access-control policy budgets for agent-to-agent sessions and tasks, including global budgets for one-to-many agent sessions; (ii) enforces such policies using post-quantum cryptographic primitives; (iii) supports session-based enforcement of policies for agent-to-agent and one-to-many agent sessions; and (iv) provides accountability of agents to their users through message attribution. We formally model and prove the correctness and security of the system using the Universal Composability (UC) framework. We evaluate the computation and communication overhead of our framework and compare it with the state-of-the-art agentic AI framework SAGA. MAGIQ is a first step toward post-quantum-secure solutions for agentic AI systems.

2604.26793 2026-05-19 cs.LG eess.SP 版本更新

Super-resolution Multi-signal Direction-of-Arrival Estimation by Hankel-structured Sensing and Decomposition

基于Hankel结构传感与分解的超分辨率多信号方向到达估计

Georgios I. Orfanidis, Dimitris A. Pados, George Sklivanitis, Elizabeth S. Bentley

发表机构 * Center for Connected Autonomy and AI(连接自主与人工智能中心) Dept. of Electrical Engineering and Computer Science(电气工程与计算机科学系) Florida Atlantic University(佛罗里达大西洋大学) Air Force Research Laboratory(空军研究实验室) AFRL/RI(空军研究实验室/RI)

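AI summary This paper proposes a new framework for rapid super-resolution multi-signal direction-of-arrival estimation based on Hankel-structured sensing and data matrix decomposition of arbitrary rank. The resulting estimators are maximum-likelihood optimal under both the L2-norm and L1-norm formulations, and extensive simulations confirm strong super-resolution capability and higher resolution probability.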

详情
AI中文摘要

受现代自主系统中受限相干时间下大阵列硬件受限空间采样的传感模式启发,我们开发了一种基于Hankel结构传感和任意秩数据矩阵分解的新框架,用于快速超分辨率多信号方向到达(DoA)估计,在L2和L1范数下均达到最大似然最优。L2范数估计器在高斯白噪声中最优,L1范数估计器在独立同分布(i.i.d.)各向同性拉普拉斯噪声中最优,具有对实际中常见脉冲干扰和损坏测量的广泛鲁棒性。大量仿真表明,所提方法具有强大的超分辨率能力,要求显著更低的信噪比,并在分辨率概率上优于最近的竞争对手方法。

英文摘要

Motivated by sensing modalities in modern autonomous systems that involve hardware-constrained spatial sampling over large arrays with limited coherence time, we develop a novel framework for rapid super-resolution multi-signal direction-of-arrival (DoA) estimation based on Hankel-structured sensing and data matrix decomposition of arbitrary rank, under both the $L_2$ and $L_1$-norm formulation. The resulting $L_2$-norm estimator is shown to be maximum-likelihood optimal in white Gaussian noise. The $L_1$-norm estimator is shown to be maximum-likelihood optimal in independent, identically distributed (i.i.d.) isotropic Laplace noise, offering broad robustness to impulsive interference and corrupted measurements commonly encountered in practice. Extensive simulations demonstrate that the proposed methods exhibit powerful super-resolution capabilities, requiring significantly lower SNR and achieving substantially higher resolution probability than recent competing approaches.
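The link between the norm and the noise model has a well-known scalar analogue, shown here as an aside rather than as the paper's DoA estimator: for a constant observed in noise, the L2 fit (the mean) is the maximum-likelihood estimate under Gaussian noise, and the L1 fit (the median) is the maximum-likelihood estimate under Laplace noise. The sample values are made up for illustration.

```python
import statistics


def l2_fit(samples):
    """Minimizer of the sum of squared residuals: the sample mean."""
    return statistics.mean(samples)


def l1_fit(samples):
    """Minimizer of the sum of absolute residuals: the sample median."""
    return statistics.median(samples)


# Five clean readings near 1.0 plus one impulsive outlier.
readings = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]
print(l2_fit(readings))  # pulled far from 1.0 by the outlier
print(l1_fit(readings))  # stays near 1.0
```

The same contrast is what makes an L1-norm formulation attractive when measurements are occasionally corrupted by impulsive interference.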

2604.15950 2026-05-19 cs.LG 版本更新

TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

TwinTrack: 医学图像分割的后验多评分者校准

Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez, Philippe Meyer, Sylvain Faisan

发表机构 * ICube Laboratory, CNRS UMR-7357, University of Strasbourg(ICube实验室,CNRS UMR-7357,斯特拉斯堡大学) CLCC Institut Strauss(CLCC斯特劳斯研究所) German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing(海德堡德国癌症研究中心(DKFZ),医学图像计算部门) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital(模式分析与学习小组,放射肿瘤科,海德堡大学医院) Medical Faculty Heidelberg, Heidelberg University(海德堡医学系,海德堡大学)

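AI summary To model multi-rater uncertainty in medical image segmentation, TwinTrack calibrates ensemble segmentation probabilities post hoc to the empirical mean human response (MHR), improving probability calibration and interpretability.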

Comments Accepted for publication at MIDL 2026

详情
AI中文摘要

胰腺导管腺癌(PDAC)在增强CT中的分割本质上具有歧义性:专家之间的分歧反映的是真正的不确定性而非标注噪声。标准深度学习方法假设存在单一真实情况,产生概率输出,但在这种歧义下可能校准不良且难以解释。我们提出TwinTrack框架,通过将集成分割概率事后校准到经验平均人类响应(MHR),即将某一体素标记为肿瘤的专家标注者所占比例,来弥补这一不足。校准后的概率可直接解释为标注者分配肿瘤标签的预期比例,明确建模评分者分歧。所提出的事后校准过程简单,仅需少量多评分者校准数据集。在MICCAI 2025 CURVAS-PDACVI多评分者基准测试中,该方法在校准指标上始终优于标准方法。

英文摘要

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
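The calibration target is easy to state in code. The sketch below computes the empirical MHR from binary rater masks, flattened to one list of voxels per rater for brevity; the function name and the toy masks are illustrative, and the calibration step that maps model probabilities onto this target is not shown.

```python
def mean_human_response(masks):
    """Per-voxel fraction of raters who labeled the voxel as tumor.

    `masks` holds one equal-length list of 0/1 labels per rater.
    """
    n_raters = len(masks)
    return [sum(votes) / n_raters for votes in zip(*masks)]


# Three hypothetical raters annotating the same four voxels.
rater_masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
print(mean_human_response(rater_masks))
```

A well-calibrated output of 0.67 at the second voxel would then read as "about two in three annotators would mark this voxel as tumor".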

2604.08874 2026-05-19 cs.LG cs.AI 版本更新

A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout

面向学生退学的时序建模与反事实政策模拟的数学框架

Rafael da Silva, Jeff Eicher, Gregory Longo

发表机构 * Applied Data Science Program(应用数据科学项目) Eastern University(东部大学)

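AI summary This paper proposes a temporal modeling framework with a counterfactual policy-simulation layer for analyzing student dropout in higher education, built on LMS engagement data and administrative withdrawal records. Dropout is treated as a time-to-event outcome, and weekly risk is modeled with penalized, class-balanced logistic regression. The model attains high AUC on both the training and test sets, and ablation analyses confirm the importance of temporal engagement signals.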

Comments Approx. 20 pages, 9 figures. Code and reproducibility package available at https://github.com/rafa-rodriguess/TCM-Student-Dropout This work introduces a temporal survival framework with counterfactual policy simulation

详情
AI中文摘要

本研究提出了一种针对高等教育学生退学问题的时序建模框架,结合反事实政策模拟层,利用LMS参与数据和行政退学记录进行建模。退学被定义为在入学层面的时间到事件结局;通过在人-时期行上进行惩罚性、类别平衡逻辑回归,对每周风险进行离散时间建模。在晚期事件时间验证下,模型在训练集和测试集上分别达到0.8350和0.8405的行级AUC,整体校准可接受但最高风险分箱支持稀疏。消融分析表明性能对特征集组成敏感,突显了时间参与信号的作用。一个基于场景的政策层产生生存对比ΔS(T)在显式的触发/计划合同下:正对比被限制在冲击分支(T_policy=18:0.0102,0.0260,0.0819),而机制-aware分支为负(ΔS_mech(18)=-0.0078,ΔS_mech(38)=-0.0134)。通过性别子组分析量化了场景诱导的生存差距,通过bootstrap方法进行统计检验;对比方向稳定但较小。结果未被因果识别;它们展示了在观察数据限制下,该框架进行内部结构场景比较的能力。

英文摘要

This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person-period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $\Delta S(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($\Delta S_{\rm mech}(18)=-0.0078$, $\Delta S_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.
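In a discrete-time setting, weekly hazards convert to a survival curve through the standard product S(T) = (1 - h_1)(1 - h_2)...(1 - h_T), and a scenario contrast is the difference between two such curves at the same horizon. The sketch below uses made-up hazards and hypothetical function names; it does not reproduce the paper's trigger/schedule contract or its reported values.

```python
def survival_curve(hazards):
    """Cumulative survival S(1..T) from per-period hazards h_t."""
    s, curve = 1.0, []
    for h in hazards:
        s *= 1.0 - h  # survive period t with probability 1 - h_t
        curve.append(s)
    return curve


def survival_contrast(policy_hazards, baseline_hazards, horizon):
    """Delta S(T): survival under the policy scenario minus baseline at week T."""
    return (survival_curve(policy_hazards)[horizon - 1]
            - survival_curve(baseline_hazards)[horizon - 1])


baseline = [0.10, 0.10]  # hypothetical weekly dropout hazards
policy = [0.05, 0.10]    # hypothetical scenario that lowers week-1 risk
print(survival_contrast(policy, baseline, horizon=2))
```

A positive contrast means more students remain enrolled at week T under the scenario than under the baseline; as the abstract stresses, such contrasts compare model scenarios and are not causal estimates.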

2604.07292 2026-05-19 cs.LG 版本更新

Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability

部分可观测条件下面向控制的反应堆热工水力预测的图神经ODE数字孪生

Akzhol Almukhametov, Doyeong Lim, Rui Hu, Yang Liu

发表机构 * Department of Nuclear Engineering, Texas A&M University, College Station, TX 77843, USA(德克萨斯A&M大学核工程系,学院站,TX 77843,美国) Argonne National Laboratory, Nuclear Science and Engineering Division, USA(阿贡国家实验室,核科学与工程部,美国)

AI总结 本文提出了一种将物理信息图神经网络与神经常微分方程相结合的模型(GNN-ODE),用于在部分可观测条件下准确预测反应堆热工水力状态,该模型同时兼顾预测精度、毫秒级推理速度和对部分可观测性的鲁棒性。

详情
AI中文摘要

先进反应堆的实时监督控制需要准确预测全厂范围的热工水力状态,包括没有物理传感器的位置。满足这一需求需要兼具预测精度、毫秒级推理速度和对部分可观测性鲁棒性的替代模型。在本文中,我们提出了一种物理信息消息传递图神经网络与神经常微分方程相结合的模型(GNN-ODE),以同时满足这三项要求。我们将整个系统表示为一个有向传感器图,其边通过感知流动/传热的消息传递编码水力连通性,并通过受控神经ODE在连续时间上推进潜在动态。拓扑引导的缺失节点初始化器在滚动预测开始时重建未布置仪表的状态;此后预测完全以自回归方式进行。GNN-ODE替代模型在系统动态预测上取得了令人满意的结果。在留出的模拟瞬态上,替代模型对未布置仪表节点的平均MAE在60秒时为0.91 K,在300秒时为2.18 K,缺失节点状态重建的$R^2$最高达0.995。在单个GPU上,推理速度约为模拟时间的105倍,使得用于不确定性量化的64成员集合滚动预测成为可能。为了评估仿真到现实的迁移,我们使用逐层判别式微调,仅用30个训练序列将预训练的替代模型适配到实验设施数据。学习到的随流量变化的传热标度恢复出与既有关联式一致的雷诺数指数,表明模型学到了超越轨迹拟合的本构关系。该模型能够跟踪陡峭的功率变化瞬态,并在未布置仪表的位置给出准确的轨迹。

英文摘要

Real-time supervisory control of advanced reactors requires accurate forecasting of plant-wide thermal-hydraulic states, including locations where physical sensors are unavailable. Meeting this need calls for surrogate models that combine predictive fidelity, millisecond-scale inference, and robustness to partial observability. In this work, we present a physics-informed message-passing Graph Neural Network coupled with a Neural Ordinary Differential Equation (GNN-ODE) to address all three requirements simultaneously. We represent the whole system as a directed sensor graph whose edges encode hydraulic connectivity through flow/heat transfer-aware message passing, and we advance the latent dynamics in continuous time via a controlled Neural ODE. A topology-guided missing-node initializer reconstructs uninstrumented states at rollout start; prediction then proceeds fully autoregressively. The GNN-ODE surrogate achieves satisfactory results for the system dynamics prediction. On held-out simulation transients, the surrogate achieves an average MAE of 0.91 K at 60 s and 2.18 K at 300 s for uninstrumented nodes, with $R^2$ up to 0.995 for missing-node state reconstruction. Inference runs approximately 105 times faster than simulated time on a single GPU, enabling 64-member ensemble rollouts for uncertainty quantification. To assess sim-to-real transfer, we adapt the pretrained surrogate to experimental facility data using layerwise discriminative fine-tuning with only 30 training sequences. The learned flow-dependent heat-transfer scaling recovers a Reynolds-number exponent consistent with established correlations, indicating constitutive learning beyond trajectory fitting. The model tracks a steep power change transient and produces accurate trajectories at uninstrumented locations.

2604.02184 2026-05-19 cs.LG 版本更新

Neural-network methods for two-dimensional finite-source reflector design

用于二维有限源反射器设计的神经网络方法

Roel Hacking, Lisa Kusch, Koondanibha Mitra, Martijn Anthonissen, Wilbert IJzerman

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) Signify(Signify公司)

AI总结 本文提出了一种基于神经网络的二维有限源反射器设计方法,通过直接变量变换损失和基于网格的损失函数优化反射器高度,实现了高精度的远场分布控制,并在多个基准测试中展示了比传统反卷积方法更高的精度和速度。

Comments 25 pages, 12 figures, 2 tables. Submitted to Machine Learning: Science and Technology

详情
AI中文摘要

我们解决了将有限扩展光源发出的光转换为指定远场分布的二维反射器设计逆问题。反射器高度由神经网络表示,并通过两个目标函数进行优化:一个基于闭式逆射线映射的直接变量变换损失,以及一个将目标单元映射回光源、适用于不连续光源的基于网格的损失。梯度通过自动微分计算,并使用稳健的拟牛顿方法进行最小化。作为基线,我们改造了一种基于简化有限源近似的反卷积流程:从通量平衡中恢复一维单调映射,通过积分因子ODE求解将其转换为反射器,并嵌入带有非负性裁剪和射线追踪反馈的修改版Van Cittert迭代中。在涵盖连续和不连续光源以及最小高度约束的四个基准测试中,精度通过射线追踪得到的归一化平均绝对误差来衡量。在两个主要基准测试中,神经方法在一块NVIDIA RTX 4090 GPU上几秒钟内达到约2e-5和5e-5的误差,而反卷积基线在数百秒后仍为4e-3和5e-2。结果表明,神经方法不仅更精确,而且速度明显更快,同时仍支持实际的高度约束。我们还讨论了通过迭代校正方案扩展到旋转对称和全三维反射器设计的可能性。

英文摘要

We address the inverse problem of designing two-dimensional reflectors that transform light from a finite, extended source into a prescribed far-field distribution. The reflector height is represented by a neural network and optimized with two objective functions: a direct change-of-variables loss based on the closed-form inverse ray map, and a mesh-based loss that maps target cells back to the source and remains usable for discontinuous sources. Gradients are computed by automatic differentiation, and the objectives are minimized with a robust quasi-Newton method. As a baseline, we adapt a deconvolution pipeline built on a simplified finite-source approximation: a one-dimensional monotone map is recovered from flux balance, converted to a reflector by an integrating-factor ODE solve, and embedded in a modified Van Cittert iteration with nonnegativity clipping and ray-traced feedback. Across four benchmarks, covering continuous and discontinuous sources and minimum-height constraints, accuracy is measured by ray-traced normalized mean absolute error. On the two main benchmarks, the neural method reaches errors of about 2e-5 and 5e-5 within a few seconds on one NVIDIA RTX 4090 GPU, compared with 4e-3 and 5e-2 for the deconvolution baseline after several hundred seconds. The results show that the neural formulation is both more accurate and substantially faster, while still supporting practical height constraints. We also discuss extensions to rotationally symmetric and full three-dimensional reflector design through iterative correction schemes.

2603.23194 2026-05-19 cs.GR cs.CV cs.LG 版本更新

PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

PhysSkin: 通过自监督神经蒙皮实现实时且可泛化的基于物理的动画

Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui

发表机构 * State Key Laboratory of CAD&CG(CAD与计算机图形学国家重点实验室) BIGAI Shenzhen University(深圳大学) University of British Columbia(不列颠哥伦比亚大学)

AI总结 本文提出PhysSkin框架,通过自监督学习策略实现可泛化到多样3D形状和离散化形式的实时基于物理的动画,其核心方法是神经蒙皮场自动编码器和物理感知的学习策略。

Comments Accepted by CVPR 2026 Highlight. Project Page: https://zju3dv.github.io/PhysSkin/

详情
AI中文摘要

实现能够在多样3D形状和离散化形式之间泛化的实时基于物理的动画仍然是一个基本挑战。我们引入PhysSkin,一个物理信息框架来应对这一挑战。受线性混合蒙皮(Linear Blend Skinning)的启发,我们学习连续蒙皮场作为基函数,将运动子空间坐标提升为全空间变形,其中子空间由手柄变换定义。为了生成无网格、与离散化无关、物理一致且能在多样3D形状间良好泛化的蒙皮场,PhysSkin采用一种新的神经蒙皮场自动编码器,由基于Transformer的编码器和交叉注意力解码器组成。此外,我们还提出了一种新的物理信息自监督学习策略,结合即时蒙皮场归一化和冲突感知梯度校正,从而有效平衡能量最小化、空间平滑性和正交性约束。PhysSkin在可泛化的神经蒙皮上表现出色,并实现了实时基于物理的动画。

英文摘要

Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with the subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder, which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.

2603.20216 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Locally Coherent Parallel Decoding in Diffusion Language Models

扩散语言模型中的局部连贯并行解码

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

发表机构 * IBM Research - Zurich(IBM苏黎世研究院)

AI总结 本文提出CoDiLA方法,通过引入小型辅助自回归模型来解决扩散语言模型在并行解码中的相干性问题,从而在代码生成任务中实现更高的准确性和速度。

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散语言模型(DLMs)已成为自回归(AR)模型的一种有前景的替代方案,提供亚线性生成延迟和双向能力,这对代码生成和编辑尤为有吸引力。在离散DLMs中实现亚线性延迟需要并行预测多个token。然而,标准DLMs从条件边缘分布中独立采样token,无法捕捉同时生成的token之间的联合依赖关系,因此常常导致语法不一致并破坏多token结构。在本工作中,我们引入CoDiLA(Coherent Diffusion with Local Autoregression),一种兼顾并行采样与局部依赖建模的方法。CoDiLA不强迫DLM处理细粒度语法,而是将局部解码委托给一个在扩散潜变量上运行的小型辅助AR模型。这种设计允许并行生成,同时确保块内的序列有效性,并保留DLM的核心能力,包括跨块的双向建模。我们证明,使用高度紧凑的辅助AR模型(例如0.6B参数)即可有效消除连贯性伪影,在代码生成基准上建立了准确率与速度的新帕累托前沿。

英文摘要

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

2603.17577 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

通过示范多样性从离线数据中识别潜在动作和动态

Felix Schur

发表机构 * ETH Zürich(苏黎世联邦理工学院)

AI总结 本文研究了在不观察动作的情况下从离线轨迹中恢复潜在动作和环境动态的问题,通过示范多样性假设,证明了在满足特定条件时,潜在转移和示范策略可以被唯一确定,从而为从离线强化学习数据中学习潜在动作和动态提供了新的方法。

详情
AI中文摘要

在动作未被观察的情况下,能否从离线轨迹中恢复潜在动作和环境动态?我们研究了在轨迹无动作但带有示范者身份标签的设置中这一问题。我们假设每个示范者遵循不同的策略,而环境动态在所有示范者之间是共享的,身份仅通过所选动作影响下一个观测。在这些假设下,条件下一个观测分布 $p(o_{t+1}\mid o_t,e)$ 是潜在动作条件化转移核的混合,具有示范者特定的混合权重。我们证明,这导致每个状态的可观测条件分布具有列随机非负矩阵分解。通过充分分散的策略多样性和秩条件,我们证明潜在转移和示范策略在潜在动作标签的排列下是可识别的。通过Gram行列式最小体积准则,我们将结果扩展到连续观测空间,并证明在连接的状态空间上转移映射的连续性将局部排列模糊性提升为单一全局排列。少量标记的动作数据足以消除最终的模糊性。这些结果确立了示范多样性作为从离线强化学习数据中学习潜在动作和动态的原理性可识别性来源。

英文摘要

Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.
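
The mixture structure described above can be sketched numerically. The snippet assumes a fixed state $o_t$ and a finite observation space; the matrix names `T`, `Pi`, `P` and the sizes are hypothetical. It checks only the factorization and the permutation ambiguity, not the identifiability proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_actions, n_demos = 5, 3, 4  # hypothetical sizes

def random_column_stochastic(rows, cols):
    """Nonnegative matrix whose columns each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0, keepdims=True)

# T[:, a] = p(o' | o, a): latent action-conditioned transition kernel at state o
T = random_column_stochastic(n_obs, n_actions)
# Pi[a, e] = pi_e(a | o): policy of demonstrator e at state o
Pi = random_column_stochastic(n_actions, n_demos)
# P[:, e] = p(o' | o, e) = sum_a p(o' | o, a) * pi_e(a | o)
P = T @ Pi

# Relabelling the latent actions leaves the observable P unchanged,
# which is why recovery is only possible up to permutation.
perm = np.array([2, 0, 1])
P_perm = T[:, perm] @ Pi[perm, :]
```

Only `P` would be observable from action-free data; `T` and `Pi` are the latent factors the paper aims to recover.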

2603.17041 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

When Marginals Match but Structure Fails: Covariance Fidelity in Generative Models

当边缘匹配但结构失败:生成模型中的协方差保真度

Nazia Riasat

发表机构 * North Dakota State University(北达科他州立大学)

AI总结 本文提出了一种基于协方差层面的依赖保真度评估标准,以弥补传统边缘分布匹配评估方法的不足,通过实验证明该标准能更准确地区分结构保留与结构丢失的生成模型。

Comments 44 pages, 25 figures. Extended version of paper accepted at MathAI 2026 (International Conference on Mathematics of Artificial Intelligence), March 30 - April 3, 2026

详情
AI中文摘要

生成模型正越来越多地被用作真实数据的替代品用于下游科学流程,但标准评估标准仍然集中在边缘分布匹配上。我们主张这代表了一个根本性的差距:下游推断很少是边缘操作,且一个通过所有单变量诊断的模型仍可能产生结构不可靠的合成数据。我们引入了协方差层面的依赖保真度,通过D_Sigma(P,Q) = ||Sigma_P - Sigma_Q||_F来衡量生成模型是否在超出单变量边缘之外保留数据的联合结构。三个结果正式化了这一准则。首先,边缘保真度对依赖结构没有任何约束:D_Sigma可以被任意增大,同时所有单变量边缘完全匹配。其次,协方差分歧会引起可量化的下游不稳定性,包括总体回归系数的符号反转。第三,通过Davis-Kahan型界提供对依赖敏感过程如PCA的正向稳定性保证。在三个领域,图像数据(Fashion-MNIST VAE,n = 60,000)、批量RNA-seq(TCGA-BRCA,n = 1,111)和小样本压力测试(阿尔茨海默症基因表达,n = 113)的实证验证显示,D_Sigma/delta在标准边缘诊断显示很少分离的情况下,能一致地区分结构丢弃与结构保留的生成器,确认了协方差层面保真度在跨领域和样本大小上提供了与现有评估指标正交的信息。

英文摘要

Generative models are increasingly deployed as substitutes for real data in downstream scientific workflows, yet standard evaluation criteria remain focused on marginal distribution matching. We argue that this represents a fundamental gap: downstream inference is rarely a marginal operation, and a model that passes every univariate diagnostic can still produce structurally unreliable synthetic data. We introduce covariance-level dependence fidelity, measured by D_Sigma(P,Q) = ||Sigma_P - Sigma_Q||_F, as a principled, computable criterion for evaluating whether a generative model preserves the joint structure of data beyond its univariate marginals. Three results formalise this criterion. First, marginal fidelity provides no constraint on dependence structure: D_Sigma can be made arbitrarily large while all univariate marginals match exactly. Second, covariance divergence induces quantifiable downstream instability, including sign reversals in population regression coefficients. Third, bounding D_Sigma provides positive stability guarantees for dependence-sensitive procedures such as PCA via Davis-Kahan-type bounds. Empirical validation across three domains, image data (Fashion-MNIST VAE, n = 60,000), bulk RNA-seq (TCGA-BRCA, n = 1,111), and a small-sample stress test (Alzheimer's gene expression, n = 113), shows that D_Sigma/delta consistently distinguishes structure-discarding from structure-preserving generators in cases where standard marginal diagnostics show little separation, confirming that covariance-level fidelity provides information orthogonal to existing evaluation metrics across domains and sample sizes.
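
A minimal sketch of the first two claims, using population covariance matrices instead of samples: two zero-mean bivariate Gaussians with unit variances share identical N(0, 1) marginals, yet their covariance gap is large and the population regression slope flips sign. The specific matrices and names are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def d_sigma(sigma_p, sigma_q):
    """Covariance-level dependence gap: Frobenius norm of Sigma_P - Sigma_Q."""
    return np.linalg.norm(np.asarray(sigma_p) - np.asarray(sigma_q), ord="fro")

# Same unit variances (so identical univariate marginals for zero-mean
# Gaussians), but opposite correlation.
sigma_p = np.array([[1.0, 0.9], [0.9, 1.0]])
sigma_q = np.array([[1.0, -0.9], [-0.9, 1.0]])

gap = d_sigma(sigma_p, sigma_q)

# Population slope of regressing x2 on x1: Cov(x1, x2) / Var(x1)
beta_p = sigma_p[0, 1] / sigma_p[0, 0]
beta_q = sigma_q[0, 1] / sigma_q[0, 0]
print(gap, beta_p, beta_q)
```

In practice the two matrices would be estimated from real and synthetic samples, for example with `np.cov`.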

2603.08462 2026-05-19 cs.LG 版本更新

Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

推理作为压缩:通过条件信息瓶颈统一预算强制

Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 本文将高效推理视为信息瓶颈原则下的有损压缩问题,指出将朴素信息瓶颈应用于transformers时存在的理论缺口,并引入条件信息瓶颈(CIB)原则加以解决;结合基于语言模型惊奇度的语义先验,该方法在中等压缩下提升准确率,并在激进压缩下仅带来极小的准确率下降。

详情
AI中文摘要

思维链(Chain-of-Thought, CoT)提示提高了LLM在复杂任务上的准确性,但通常会增加token使用量和推理成本。现有的“预算强制”(Budget Forcing)方法通过带启发式长度惩罚的微调来降低成本,但会同时抑制必要的推理和冗余的填充内容。我们将高效推理重新表述为信息瓶颈(Information Bottleneck, IB)原则下的有损压缩问题,并指出将朴素IB应用于transformers时的一个关键理论缺口:注意力破坏了提示、推理轨迹和响应之间的马尔可夫性质。为了解决这个问题,我们在条件信息瓶颈(Conditional Information Bottleneck, CIB)原则下对CoT生成进行建模,其中推理轨迹$Z$充当计算桥梁,只包含无法直接从提示$X$获得的、关于响应$Y$的信息。由此得到一个通用的强化学习目标:在最大化任务奖励的同时,依据推理轨迹上的先验压缩生成内容,并将常见启发式方法(如长度惩罚)作为特例(如均匀先验)纳入其中。与朴素的token计数方法不同,我们引入了一个语义先验,以语言模型下的惊奇度衡量token成本。关键的是,该先验仅被查询token级对数概率,对训练循环的额外开销可忽略不计。实证结果表明,我们的CIB目标在保留流畅性和逻辑性的同时修剪推理冗余,在中等压缩水平下提高准确率,并能以极小的准确率下降实现激进压缩。这些收益可推广到不同的模型家族和任务领域,证实CIB是一种与领域无关的CoT压缩框架。

英文摘要

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods reduce cost via fine-tuning with heuristic length penalties, suppressing both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace $Z$ acts as a computational bridge that contains only the information about the response $Y$ that is not directly accessible from the prompt $X$. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting approaches, we introduce a semantic prior that measures token cost by surprisal under a language model. Crucially, the prior is queried only for token-level log-probabilities, adding negligible overhead to the training loop. Empirically, our CIB objective prunes reasoning redundancy while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop. These gains generalize across model families and task domains, confirming CIB as a domain-agnostic CoT compression framework.
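
The link between a prior over reasoning traces and a length penalty can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact objective: the cost of a trace is taken to be its total surprisal under a prior, and the trade-off weight `beta`, the function names, and the vocabulary size are assumptions.

```python
import math

def compression_cost(token_logprobs):
    """Total surprisal (in nats) of a reasoning trace under a prior."""
    return -sum(token_logprobs)

def penalized_reward(task_reward, token_logprobs, beta):
    """Task reward minus a weighted compression cost (beta is hypothetical)."""
    return task_reward - beta * compression_cost(token_logprobs)

# Under a uniform prior every token costs log(vocab_size), so the cost is
# proportional to trace length, i.e. a plain length penalty.
vocab_size = 50_000
trace_len = 120
uniform_logprobs = [math.log(1.0 / vocab_size)] * trace_len
uniform_cost = compression_cost(uniform_logprobs)

# Under a language-model prior, predictable tokens are cheap and surprising
# tokens are expensive; these log-probabilities are made up for illustration.
semantic_reward = penalized_reward(1.0, [math.log(0.5), math.log(0.5)], beta=0.1)
```

The design point is that the penalty only needs token-level log-probabilities from the prior model.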

2603.08290 2026-05-19 cs.LG cs.AI 版本更新

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

先次后主:锐度感知最小化(SAM)的一种深度诱导隐式偏置

Chaewon Moon, Dongkuk Si, Chulhee Yun

发表机构 * Graduate School of AI, KAIST(韩国科学技术院人工智能研究生院) Mobilint, Inc.(Mobilint公司)

AI总结 该研究探讨了在训练线性可分二分类问题时,sharpness-aware minimization (SAM) 的隐式偏见,发现对于深度L=2的情况,SAM的行为与深度L=1时不同,展示了sequential feature amplification现象。

Comments Accepted to ICLR 2026, 84 pages, 35 figures

详情
AI中文摘要

我们研究了在训练L层线性对角网络时,sharpness-aware minimization (SAM) 的隐式偏见。对于线性模型(L=1),ℓ∞-SAM和ℓ2-SAM都能恢复ℓ2最大间隔分类器,与梯度下降(GD)一致。然而,对于深度L=2,行为发生剧烈变化——即使在单例数据集上。对于ℓ∞-SAM,极限方向依赖于初始化,并可能收敛到零向量或任何标准基向量,与GD的极限方向形成鲜明对比。对于ℓ2-SAM,我们证明其极限方向与GD的ℓ1最大间隔解一致,但有限时间动态表现出我们称之为“顺序特征放大”的现象,即预测器最初依赖于次要坐标,然后逐渐转向更大的坐标。我们的理论分析将这种现象归因于ℓ2-SAM在扰动中应用的梯度归一化因子,该因子在早期放大次要坐标,允许主要坐标在后期主导。合成和真实数据实验验证了我们的发现。

英文摘要

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
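
The gradient normalization factor in the $\ell_2$-SAM perturbation can be made concrete with a small sketch. It assumes a depth-2 diagonal linear network with predictor $u \odot v$, logistic loss on a single example, and the standard SAM update (ascend by $\rho\, g/\lVert g\rVert_2$, then descend using the gradient at the perturbed point). The parameterization, step sizes, and names are assumptions and may differ from the paper's setup.

```python
import numpy as np

def loss_and_grad(u, v, x, y):
    """Logistic loss of the predictor u*v on one example (x, y), with gradients."""
    margin = y * np.dot(u * v, x)
    loss = np.logaddexp(0.0, -margin)      # log(1 + exp(-margin))
    s = -y / (1.0 + np.exp(margin))        # d loss / d <u*v, x>
    return loss, s * v * x, s * u * x      # grad w.r.t. u, grad w.r.t. v

def sam_l2_step(u, v, x, y, lr=0.1, rho=0.05):
    """One l2-SAM step: perturb along the normalized gradient, then descend."""
    _, gu, gv = loss_and_grad(u, v, x, y)
    norm = np.sqrt(np.sum(gu**2) + np.sum(gv**2)) + 1e-12
    eu, ev = rho * gu / norm, rho * gv / norm          # normalized perturbation
    _, gu_adv, gv_adv = loss_and_grad(u + eu, v + ev, x, y)
    return u - lr * gu_adv, v - lr * gv_adv, (eu, ev)

# Single-example dataset with one major and one minor coordinate (made up).
x, y = np.array([3.0, 1.0]), 1.0
u0 = np.array([0.5, 0.5])
v0 = np.array([0.5, 0.5])
u1, v1, (eu, ev) = sam_l2_step(u0, v0, x, y)
```

The sketch shows only the update rule; it makes no claim to reproduce the sequential feature amplification reported in the paper.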

2603.06984 2026-05-19 stat.ML cs.AI cs.GT cs.LG cs.SI 版本更新

Masking Causality and Conditional Dependence

掩盖因果关系与条件依赖

Zou Yang, Sophia Xiao, Bijan Mazaheri

发表机构 * Thayer School of Engineering(塞耶工程学院) Dartmouth College(达特茅斯学院)

AI总结 本文研究了通过平均约束来强制条件独立性的问题,发现这种约束在监管层面无法满足分层要求,而在优化者层面却能有效隐藏依赖关系,从而指出通过观测决策的平均统计来监管直接依赖是有限的,必须在决策规则层面进行监管。

详情
AI中文摘要

许多监管和分析问题要求被禁止的变量只能通过指定的允许渠道影响决策。这是一种条件独立性要求,出现在路径特定公平性、涉密信息处理和非公开信息交易监管等场景中。此类要求可以逐层(stratum-by-stratum)执行,或者更常见(也更高效)地通过对条件效应施加单个平均约束来执行。我们从两个视角研究由此产生的执行问题。从监管者的角度,我们将因果掩盖表述为一个线性规划,并证明平均约束下的优化几乎必然产生违反逐层要求、却恰好满足平均约束的策略。掩盖带来的收益随混淆程度和结果异质性的增加而增长,而检测恰恰需要平均约束试图避免的条件独立性检验。从优化者的角度,同样的构造表明,被掩盖的策略能够恢复无约束利用的大部分回报,同时更难被检测到,因此在决策依据本身较为敏感的任何场景中都具有吸引力。综合来看,这些结果表明,通过观测决策的平均统计量来监管直接依赖在结构上存在局限,有意义的监管必须在决策规则本身的层面进行。

英文摘要

Many regulatory and analytic problems require that a prohibited variable influence a decision only through a designated allowable channel -- a conditional-independence requirement that arises in path-specific fairness, the handling of classified information, and the regulation of trading on non-public information, among other settings. Such requirements may be enforced either stratum-by-stratum or, more commonly (and more efficiently), through a single averaged constraint on the conditional effect. We study the resulting enforcement problem from two perspectives. From the regulator's side, we formulate causal masking as a linear program and show that averaged-constraint optimization almost surely produces policies that violate the stratum-wise requirement while satisfying the averaged one exactly. The gains from masking grow with confounding and outcome heterogeneity, and detection requires precisely the conditional-independence tests that average constraints aim to avoid. From the optimizer's side, the same construction shows that masked policies recover most of the reward of unconstrained exploitation while being far harder to detect, making them attractive in any setting where the basis of decisions is itself sensitive. Together, these results argue that regulating direct dependence through averaged statistics on observed decisions is structurally limited, and that meaningful enforcement must operate at the level of the decision rule itself.

2602.21707 2026-05-19 eess.IV cs.CV cs.LG math.OC 版本更新

Learning spatially adaptive sparsity level maps for arbitrary convolutional dictionaries

学习适用于任意卷积字典的空间自适应稀疏性水平图

Joshua Schulz, David Schote, Christoph Kolbitsch, Kostas Papafitsoros, Andreas Kofler

发表机构 * Physikalisch-Technische Bundesanstalt (PTB), Braunschweig and Berlin, Germany(德国联邦物理技术研究院(PTB),不伦瑞克和柏林,德国) School of Mathematical Sciences, Queen Mary University of London, UK(伦敦玛丽女王大学数学科学学院,英国)

AI总结 本文通过改进的网络设计和专门的训练策略,扩展了一种基于神经网络推断的空间自适应稀疏性水平图的图像重建方法,实现了滤波器排列不变性,并支持在推理时更换卷积字典;在低场MRI上的实验展示了使用不同字典的优势。

Comments accepted for publication at ICIP 2026; differs from previous versions after a bugfix in one of the used packages; corresponds to the final camera-ready version submitted to the conference

详情
AI中文摘要

最先进的学习型重建方法通常依赖于黑盒模块,这些模块尽管性能强大,却在可解释性和鲁棒性方面引发疑问。本文基于最近提出的一种图像重建方法,该方法通过神经网络推断的空间自适应稀疏性水平图,将数据驱动的信息嵌入到基于模型的卷积字典正则化中。通过改进的网络设计和专门的训练策略,我们扩展了该方法,使其具备滤波器排列不变性,并能够在推理时更换卷积字典。我们将该方法应用于低场MRI,并与其他几种近期的深度学习方法进行了比较,其中也包括在体(in vivo)数据,并在该数据上展示了使用不同字典的优势。我们进一步评估了该方法在分布内和分布外数据上测试时的鲁棒性。在分布外数据上测试时,所提出的方法受数据分布偏移的影响小于其他学习型方法,我们将其归因于其基于模型的重建组件降低了对训练数据的依赖。

英文摘要

State-of-the-art learned reconstruction methods often rely on black-box modules that, despite their strong performance, raise questions about their interpretability and robustness. Here, we build on a recently proposed image reconstruction method, which is based on embedding data-driven information into a model-based convolutional dictionary regularization via neural network-inferred spatially adaptive sparsity level maps. By means of improved network design and dedicated training strategies, we extend the method to achieve filter-permutation invariance as well as the possibility to change the convolutional dictionary at inference time. We apply our method to low-field MRI and compare it to several other recent deep learning-based methods, also on in vivo data, where the benefit of using a different dictionary is demonstrated. We further assess the method's robustness when tested on in- and out-of-distribution data. When tested on the latter, the proposed method suffers less from the data distribution shift compared to the other learned methods, which we attribute to its reduced reliance on training data due to its underlying model-based reconstruction component.

2602.12703 2026-05-19 cs.LG 版本更新

SWING: Unlocking Implicit Graph Representations for Graph Random Features

SWING: 解锁隐式图表示用于图随机特征

Alessandro Manenti, Avinava Dubey, Arijit Sehanobish, Cesare Alippi, Krzysztof Choromanski

发表机构 * Google Research(谷歌研究) Independent Researcher(独立研究者) Google DeepMind(谷歌DeepMind) Columbia University(哥伦比亚大学)

AI总结 SWING通过在连续空间中进行行走而非在图节点上进行行走,实现了对隐式图表示(i-graphs)中图随机特征的高效计算,其核心方法是结合随机特征和重要性采样技术的定制Gumbel-softmax采样机制,从而在不需显式图结构的情况下,提高了计算效率和精度。

详情
AI中文摘要

我们提出了SWING(Space Walks for Implicit Network Graphs,隐式网络图的空间行走),这是一类新的算法,用于在由隐式表示给出的图(i-graphs)上进行涉及图随机特征的计算,其中边权重定义为相应节点特征向量的二元函数。这类图包括若干重要例子,例如机器学习中经常使用的ε邻域图。这些方法不在图的节点上行走,而是在嵌入这些图的连续空间中行走。为了准确且高效地近似原始的组合计算,SWING采用了定制的Gumbel-softmax采样机制,并配合通过随机特征与重要性采样技术得到的线性化核。该算法本身也具有独立的研究价值。SWING依赖于本文提出的隐式定义图与傅里叶分析之间的深刻联系。SWING对加速器友好,且无需将输入图显式物化。我们对SWING进行了详细分析,并在不同类别的i-graphs上进行了充分的实验。

英文摘要

We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (i-graphs), where edge weights are defined as bivariate functions of the feature vectors of the corresponding nodes. These classes of graphs include several prominent examples, such as $ε$-neighborhood graphs, which are used on a regular basis in machine learning. Rather than conducting walks on the graphs' nodes, these methods rely on walks in the continuous spaces in which the graphs are embedded. To accurately and efficiently approximate the original combinatorial calculations, SWING applies a customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This algorithm is of independent interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide a detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.

2602.09805 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

超越准确率:分解大语言模型的推理效率

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

发表机构 * Integreat - Norwegian Centre for knowledge-driven machine learning(Integreat - 挪威知识驱动机器学习中心) UiT - The Arctic University of Norway(UiT - 挪威北极大学) University of Oslo(奥斯陆大学)

AI总结 本文提出一种无需追踪的评估协议,通过完成率、条件正确性和生成长度三个指标分解大语言模型的token效率,同时考虑任务工作量元数据进行归一化处理,并评估模型在不同任务上的推理效率和冗余问题。

Comments Preprint (under review). 29 pages, 4 figures

详情
AI中文摘要

随着推理大语言模型越来越多地通过推理、搜索和自我纠正来换取准确性,单一的准确性分数已无法说明这些token是否带来了有用的推理、从困难实例中恢复或不必要的冗长。我们介绍了一种可选追踪的评估协议,通过三个即使在封闭模型中也可用的观测指标精确分解token效率:完成率、在完成条件下正确性的条件正确性以及生成长度。当实例级工作量元数据可用时,我们进一步将生成长度归一化为声明的任务隐含工作,并将平均口头冗余与工作量依赖的扩展分离。当此类元数据不可用时,我们定义了一个可审计的求解器衍生工作量规模,并在留出自我、留出top-k和持有参考池扰动下评估其稳定性。我们在CogniLoad、GSM8K、ProofWriter和ZebraLogic上评估了14个共享开放权重模型。我们进一步在CogniLoad上评估了11个额外模型,从而能够对推理任务难度因素进行细致分析:任务长度、内在难度和干扰项密度。效率和冗余排名在所有基准对中保持稳定,比准确性排名更加稳健,同时分解了逻辑受限、上下文受限(截断驱动)和冗余受限的失败模式,这些模式在准确性每token下看起来是相同的。我们发布了评估工具包和报告模板,详细说明了LLM在推理上的低效原因。

英文摘要

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
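
One way the three observables can combine exactly is sketched below. The abstract does not state the paper's formula, so this is an assumed identity: incomplete responses are counted as incorrect, and accuracy per token equals completion rate times conditional correctness divided by mean generated length. Field names and records are hypothetical.

```python
def decompose_token_efficiency(records):
    """Split accuracy-per-token into completion rate, conditional correctness, length."""
    n = len(records)
    completed = [r for r in records if r["completed"]]
    completion_rate = len(completed) / n
    # Correctness is conditioned on completion; truncated runs count as failures.
    cond_correct = (
        sum(r["correct"] for r in completed) / len(completed) if completed else 0.0
    )
    mean_len = sum(r["tokens"] for r in records) / n
    accuracy = completion_rate * cond_correct
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": cond_correct,
        "mean_length": mean_len,
        "accuracy_per_token": accuracy / mean_len,
    }

# Four made-up responses; the last one was truncated before finishing.
records = [
    {"completed": True, "correct": True, "tokens": 100},
    {"completed": True, "correct": True, "tokens": 200},
    {"completed": True, "correct": False, "tokens": 300},
    {"completed": False, "correct": False, "tokens": 400},
]
summary = decompose_token_efficiency(records)
```

Reporting the three factors separately shows whether a low score comes from truncation, wrong answers, or verbosity, which a single accuracy-per-token number hides.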

2602.07618 2026-05-19 cs.LG stat.ML 版本更新

Neural Networks With Dense Weights Are Not Universal Approximators

具有密集权重的神经网络不是通用逼近器

Levi Rauchwerger, Stefanie Jegelka, Ron Levie

发表机构 * Princeton University, Dept of CS(普林斯顿大学计算机科学系) MIT, Dept of EECS and CSAIL(麻省理工学院电气工程与计算机科学系及计算机科学与人工智能实验室) TUM, School of CIT, MCML, MDSI(慕尼黑工业大学计算、信息与技术学院,MCML,MDSI) Technion – IIT, Faculty of Mathematics(以色列理工学院(Technion)数学系)

AI总结 研究探讨了密集神经网络的逼近能力,指出在有限的权重约束下,密集连接的神经网络无法逼近任意连续函数,从而揭示了密集层神经网络的固有局限性,推动了稀疏连接在实现真正通用性中的必要性。

详情
AI中文摘要

我们研究了密集神经网络的逼近能力。虽然通用逼近定理表明,如果对权重值没有限制,足够大的架构可以逼近任意连续函数,但我们证明密集神经网络并不具备这种普遍性。我们的论证基于一种模型压缩方法,结合弱正则性引理与将前馈网络解释为消息传递图神经网络的解释。我们考虑具有自然权重、输入和输出维度约束的ReLU神经网络,这建模了一种密集连接的概念。在此设置中,我们展示了存在无法被此类网络逼近的Lipschitz连续函数。这突显了密集层神经网络的固有局限性,并推动了稀疏连接作为实现真正通用性的必要成分的使用。

英文摘要

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

2602.06866 2026-05-19 cs.LG 版本更新

T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

T-STAR: 一种用于有桩共享微出行短期概率需求预测的上下文感知Transformer框架

Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh

发表机构 * Transport and Planning, Delft University of Technology(代尔夫特理工大学交通与规划)

AI总结 本文提出T-STAR框架,通过两级结构分离一致需求模式和短期波动,提升短期概率需求预测的准确性,实验表明其在确定性和概率性准确性上均优于现有方法,且具备良好的时空鲁棒性。

Comments This work has been submitted to Transportation Research Part C

详情
AI中文摘要

可靠的短期需求预测对于管理共享微出行服务、确保响应迅速且以用户为中心的运营至关重要。本文介绍了T-STAR(Two-stage Spatial and Temporal Adaptive contextual Representation),一种新的基于Transformer的概率框架,旨在以15分钟的分辨率预测站点级共享单车需求。T-STAR通过分层的两阶段结构,将稳定的需求模式与短期波动分离,从而应对高分辨率预测中的关键挑战。第一阶段捕捉粗粒度的小时级需求模式;第二阶段通过整合高频、局部化的输入(包括近期波动以及相连地铁服务的实时需求变化)来提高预测精度,以反映短期需求的时间偏移。两个阶段均采用时间序列Transformer模型生成概率预测。基于华盛顿特区Capital Bikeshare数据的大量实验表明,T-STAR在确定性和概率性准确度上均优于现有方法。该模型在不同站点和时间段上表现出很强的时空鲁棒性。零样本预测实验进一步展示了T-STAR无需重新训练即可迁移到此前未见过的服务区域的能力。这些结果凸显了该框架在提供细粒度、可靠且具备不确定性感知的短期需求预测方面的潜力,这类预测可无缝集成到面向出行者的多模式出行规划中,并提升共享微出行服务的实时运营能力。

英文摘要

Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.'s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR's ability to transfer to previously unseen service areas without retraining. These results underscore the framework's potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.

2602.05172 2026-05-19 stat.ML cs.LG math.ST stat.TH 版本更新

Finite-Particle Rates for Regularized Stein Variational Gradient Descent

正则化Stein变分梯度下降的有限粒子收敛率

Ye He, Krishnakumar Balasubramanian, Sayan Banerjee, Promit Ghosal

发表机构 * Department of Mathematics, Georgia Institute of Technology(佐治亚理工学院数学系) Department of Statistics, University of California, Davis(加州大学戴维斯分校统计系) Department of Statistics and Operations Research, University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校统计与运筹学系) Department of Statistics, University of Chicago(芝加哥大学统计系)

AI总结 本文研究了正则化Stein变分梯度下降(R-SVGD)算法的有限粒子收敛率。该算法通过应用预解式(resolvent)型预条件器来校正SVGD的常数阶偏差;文中推导了时间平均经验测度的非渐近界,并在目标满足W₁I条件时,对一大类光滑核函数证明了W₁收敛。

详情
AI中文摘要

我们推导了He等人(2024)提出的正则化Stein变分梯度下降(R-SVGD)算法的有限粒子收敛率,该算法通过在核化Wasserstein梯度上应用预解式(resolvent)型预条件器来校正SVGD的常数阶偏差。对于由此得到的相互作用N粒子系统,我们建立了时间平均(退火)经验测度的显式非渐近界,说明其在真正的(非核化)Fisher信息意义下收敛,并且在目标满足W₁I条件时,对一大类光滑核函数有相应的W₁收敛。我们的分析涵盖了连续时间和离散时间动力学,并给出了正则化参数、步长和平均时间范围的有原则的调参规则,这些规则量化了近似Wasserstein梯度流与控制有限粒子估计误差之间的权衡。

英文摘要

We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024), which corrects the constant-order bias of SVGD by applying a resolvent-type preconditioner to the kernelized Wasserstein gradient. For the resulting interacting $N$-particle system, we establish explicit non-asymptotic bounds for time-averaged (annealed) empirical measures, illustrating convergence in the true (non-kernelized) Fisher information and, under a $\mathrm{W}_1\mathrm{I}$ condition on the target, corresponding $\mathrm{W}_1$ convergence for a large class of smooth kernels. Our analysis covers both continuous- and discrete-time dynamics and yields principled tuning rules for the regularization parameter, step size, and averaging horizon that quantify the trade-off between approximating the Wasserstein gradient flow and controlling finite-particle estimation error.

2602.03797 2026-05-19 cs.LG 版本更新

Manifold Random Features

流形随机特征

Ananya Parashar, Derek Long, Dwaipayan Saha, Krzysztof Choromanski

发表机构 * Department of Industrial Engineering and Operations Research(工业工程与运筹学系) Columbia University(哥伦比亚大学) Google DeepMind(谷歌DeepMind)

AI总结 本文提出了一种新的方法,通过离散化流形和最近引入的图随机特征(GRFs)技术,学习流形上的连续场,从而近似一般流形上定义的双变量函数(特别是核函数)。该方法提供了正且有界的特征,对于准确且低方差的近似至关重要。

详情
AI中文摘要

我们提出了一种新的范式,用于创建随机特征以近似在一般流形上定义的双变量函数(特别是核函数)。这种新的机制称为流形随机特征(MRFs),利用流形的离散化和最近引入的图随机特征(GRFs)技术来学习流形上的连续场。这些场用于找到在一般情况下无法解析推导的连续近似机制。MRFs提供正且有界的特征,这是准确、低方差近似的关键属性。我们展示了GRFs在离散图对象上定义与用于正则核的连续随机特征之间的深刻渐近联系。作为我们方法的副产品,我们重新发现最近引入的高斯核近似机制,特别是用于改进线性注意力Transformer,通过考虑简单的图随机游走并绕过原始复杂的数学计算。我们还补充了我们的算法的严格理论分析,并通过详尽的实验研究进行了验证。

英文摘要

We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show a deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we re-discover a recently introduced mechanism of Gaussian kernel approximation, applied in particular to improve linear-attention Transformers, by considering simple random walks on graphs and by-passing the original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify it in thorough experimental studies.

2602.03664 2026-05-19 cs.AI cs.LG 版本更新

Mitigating Conversational Inertia in Multi-Turn Agents

缓解多轮代理中的对话惯性

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院) University of Rochester, Rochester, NY, USA(罗切斯特大学)

AI总结 本文研究了多轮代理中对话惯性问题,提出通过上下文偏好学习来校准模型偏好,以减少惯性并提升性能。

Comments ICML2026

详情
AI中文摘要

大型语言模型在获得适当演示时表现出色,但在多轮代理场景中,LLM错误地模仿自身之前的响应作为少样本示例。通过注意力分析,我们识别出对话惯性现象,即模型对先前响应表现出强烈的对角注意力,这与模仿偏差有关,限制了探索。这揭示了将少样本LLM转化为代理时的张力:更长的上下文丰富了环境反馈以供利用,但也加剧了对话惯性,从而损害探索。我们的关键见解是,对于相同状态,生成时使用更长上下文的动作表现出更强的惯性,这使得可以在没有环境奖励的情况下构建偏好对。基于此,我们提出上下文偏好学习,以校准模型偏好,使模型更倾向于选择低惯性响应而非高惯性响应。我们进一步提供了推理时的上下文管理策略,以平衡探索与利用。在八个代理环境和一个深度研究场景中的实验结果验证了我们的框架能够减少对话惯性并实现性能提升。

英文摘要

Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multi-turn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

2601.19667 2026-05-19 cs.CL cs.AI cs.IR cs.LG 版本更新

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL:面向生物医学实体链接的合成上下文增强

Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier

发表机构 * Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Limics(索邦大学、法国国家健康与医学研究院(Inserm)、索邦巴黎北大学、Limics) Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris(巴黎公立医院集团Tenon医院内科) Barcelona Supercomputing Center, Barcelona, Spain(巴塞罗那超级计算中心,西班牙巴塞罗那)

AI总结 SynCABEL通过利用大型语言模型生成丰富的上下文合成训练示例,解决了监督式生物医学实体链接中专家标注数据稀缺的问题,并在三个多语言基准上实现了新的最先进的结果。

Comments 7 pages, 5 figures

详情
AI中文摘要

我们提出了SynCABEL(Synthetic Contextualized Augmentation for Biomedical Entity Linking),一个旨在解决监督式生物医学实体链接(BEL)核心瓶颈的框架:专家标注训练数据的稀缺性。SynCABEL利用大型语言模型为目标知识库中的所有候选概念生成上下文丰富的合成训练示例,在无需人工标注的情况下提供广泛的监督。我们证明,当与仅解码器(decoder-only)模型和引导推理结合时,SynCABEL在三个广泛使用的多语言基准上取得了新的最先进结果:MedMentions(英语)、QUAERO(法语)和SPACCC(西班牙语)。在数据效率评估中,我们表明SynCABEL在标注数据最多减少60%的情况下即可达到完全人工监督的性能,显著减少了对劳动密集且昂贵的专家标注的依赖。最后,考虑到基于精确代码匹配的标准评估常因本体冗余而低估临床上有效的预测,我们引入了LLM-as-a-judge协议。该分析表明SynCABEL显著提高了临床有效预测的比例。我们的合成数据集、模型和代码已发布,以支持可重复性和未来研究。

英文摘要

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

2601.16880 2026-05-19 cs.LG cs.IT math.IT 版本更新

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

深度网络中最小权重扰动的理论及其在低秩激活后门攻击中的应用

Bethan Evans, Jared Tanner

发表机构 * Department of Mathematics, University of Oxford, Oxford, UK(牛津大学数学系)

AI总结 本文推导了深度网络实现指定输出变化所需的最小范数权重扰动,并讨论了决定其大小的因素;同时将结果应用于由精度修改激活的后门攻击,给出了可证明的压缩阈值(低于该阈值时此类攻击无法成功),并通过实验表明低秩压缩可以在保持全精度准确率的同时可靠地激活潜在后门。

详情
AI中文摘要

本文推导了深度网络实现指定输出变化所需的最小范数权重扰动,并讨论了决定其大小的因素。这些单层精确公式与更一般的、基于多层Lipschitz常数的鲁棒性保证进行了对比;观察到两者处于同一数量级,这表明它们所提供的保证效力相近。这些结果被应用于由精度修改激活的后门攻击,给出了可证明的压缩阈值(低于该阈值时此类攻击无法成功),并通过实验表明低秩压缩可以在保持全精度准确率的同时可靠地激活潜在后门。这些表达式揭示了反向传播的间隔(margin)如何决定逐层敏感性,并为与所需输出变化一致的最小参数更新提供了可验证的保证。

英文摘要

The minimal norm weight perturbations of DNNs required to achieve a specified change in output are derived, and the factors determining their size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz-constant-based robustness guarantees; both are observed to be of the same order, which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and it is shown empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.

2601.14330 2026-05-19 cs.CV cs.LG 版本更新

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

LURE: 用于扩散模型多概念重新唤醒的潜在空间解阻

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

发表机构 * Sichuan University(四川大学) The Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学) Yonsei University(延世大学)

AI总结 本文提出LURE方法,通过重建潜在空间和引导采样轨迹,实现多概念的高保真重新唤醒,解决了现有方法在多概念场景下的梯度冲突和特征纠缠问题。

详情
AI中文摘要

概念擦除旨在抑制扩散模型中的敏感内容,但最近的研究表明,被擦除的概念仍可能被重新唤醒,揭示了擦除方法的脆弱性。现有重新唤醒方法主要依赖于提示级优化来操控采样轨迹,忽略了其他生成因素,限制了对底层动态的全面理解。在本文中,我们将生成过程建模为一个隐式函数,以实现对多个因素的全面理论分析,包括文本条件、模型参数和潜在状态。我们理论证明,扰动每个因素可以重新唤醒被擦除的概念。基于这一见解,我们提出了一种新的概念重新唤醒方法:用于概念重新唤醒的潜在空间解阻(LURE),通过重建潜在空间并引导采样轨迹来重新唤醒被擦除的概念。具体而言,我们的语义重新绑定机制通过将去噪预测与目标分布对齐来重建潜在空间,以重新建立断裂的文本-视觉关联。然而,在多概念场景中,朴素的重建会导致梯度冲突和特征纠缠。为了解决这个问题,我们引入了梯度场正交化,强制特征正交以防止相互干扰。此外,我们的潜在语义识别引导采样(LSIS)通过后验密度验证确保重新唤醒过程的稳定性。广泛的实验表明,LURE能够在多种擦除任务和方法中同时实现多个被擦除概念的高保真重新唤醒。

英文摘要

Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

2601.09495 2026-05-19 cs.LG 版本更新

Parallelizable memory recurrent units

可并行化的记忆递归单元

Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion

发表机构 * Montefiore Institute, University of Liege(列日大学蒙特菲奥里研究所)

AI总结 本文提出了一种结合非线性递归网络持久记忆能力和状态空间模型并行计算能力的新递归神经网络——记忆递归单元(MRUs),通过多稳态机制实现持久记忆,同时避免瞬态动态以提高效率,并展示了其在长时序依赖任务中的有效性。

Comments 19 pages, 12 figures. This work has been the subject of patent applications (Numbers: EP26151077 and EP26175248.9)

详情
AI中文摘要

随着大规模并行处理单元的出现,可并行性已成为新序列模型的理想属性。在训练过程中能够沿序列长度并行处理序列,是Transformer架构兴起的主要原因之一。然而,Transformer在序列生成方面效率低下,因为它们需要在每个生成步骤重新处理所有先前的时间步。最近,状态空间模型(SSMs)作为一种更高效的替代方案出现。这类新的递归神经网络(RNNs)保留了RNN的高效更新,同时通过去除非线性动态(或递归)获得了并行化能力。SSMs通过高效训练可能非常大的网络,可以达到最先进的性能,但其表示能力仍然有限。特别是,由于其单稳态性,SSMs无法表现出持久记忆,即无限期保留信息的能力。在本文中,我们介绍了一种新的RNN家族:记忆递归单元(MRUs),它们结合了非线性RNN的持久记忆能力与SSMs的可并行计算能力。这些单元利用多稳态作为持久记忆的来源,同时去除瞬态动态以实现高效计算。我们随后推导出一个具体实现作为概念验证:双稳态记忆递归单元(BMRU)。这种新的RNN与并行扫描算法兼容。我们表明BMRU在具有长期依赖的任务中表现良好,并且可以与状态空间模型结合,构建可并行化、兼具瞬态动态和持久记忆的混合网络。

英文摘要

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.

2512.13506 2026-05-19 cs.LG stat.ML 版本更新

Learning under Distributional Drift: Prequential Reproducibility as an Intrinsic Statistical Resource

在分布漂移下学习:序贯预测(prequential)可再现性作为内在统计资源

Sofiya Zaichyk

发表机构 * Innovative Defense Technologies (IDT)(创新防御技术(IDT))

AI总结 本文研究了在分布漂移下学习的问题,提出了一种内在的漂移预算$C_T$,用于量化数据分布沿实际学习者-环境轨迹的累积信息几何运动,以Fisher-Rao距离衡量。该预算将外生环境变化与学习者动作引起的反馈分离,从而给出基于速率的序贯预测(prequential)可再现性刻画。文章证明了漂移反馈上界,并建立了匹配的下界,表明对平均Fisher-Rao运动率的依赖是紧的。此外,还证明了信息论意义上的不可区分性结果,并通过实验表明适当选择的监控通道可以保留与风险相关的漂移信号。

Comments Revised: Added additional experiment. Clarified lower bound

详情
AI中文摘要

分布漂移下的统计学习仍然缺乏充分的刻画,尤其是在闭环设置中,学习本身会改变数据生成规律。我们引入了一个内在的漂移预算$C_T$,用于量化数据分布沿实际学习者-环境轨迹的累积信息几何运动,以Fisher-Rao距离衡量。该预算将外生环境变化与由学习者动作引起的、对策略敏感的反馈分离。这给出了基于速率的序贯预测(prequential)可再现性刻画:当用实际数据流上的性能来预测下一步分布下的一步超前性能时,漂移的贡献通过平均运动率$C_T/T$体现,而不是仅通过累积漂移体现。我们证明了一个阶为$T^{-1/2}+C_T/T$的漂移反馈上界(至多相差受控的二阶余项),并在一个典范正则子类上为同一序贯预测可再现性差距建立了匹配的紧性下界。因此,对平均Fisher-Rao运动率的依赖在常数意义下是紧的:$C_T/T$足以用于上界控制,并且在正则困难子类上不可避免。我们进一步证明了一个信息论意义上的不可区分性结果,表明一步超前目标上$C/T$阶的效应未必能仅从实际性能流中识别。最后,我们表明固定的监控通道会诱导出收缩的可观测Fisher运动;包括一个模型误设的真实数据反馈设置在内的实验表明,当内在数据生成规律不可得时,适当选择的通道可以保留与风险相关的漂移信号。由此得到的理论将外生漂移、自适应数据分析和执行性(performative)反馈视为沿同一学习者-环境轨迹的Fisher-Rao运动的不同来源。

英文摘要

Statistical learning under distributional drift remains poorly characterized, especially in closed-loop settings where learning alters the data-generating law. We introduce an intrinsic drift budget $C_T$ that quantifies cumulative information-geometric motion of the data distribution along the realized learner-environment trajectory, measured in Fisher-Rao distance. The budget separates exogenous environmental change from policy-sensitive feedback induced by the learner's actions. This gives a rate-based characterization of prequential reproducibility: when performance on the realized stream is used to predict one-step-ahead performance under the next distribution, the drift contribution enters through the average motion rate $C_T/T$, not through cumulative drift alone. We prove a drift-feedback bound of order $T^{-1/2}+C_T/T$, up to controlled second-order remainder terms, and establish a matching sharpness lower bound for the same prequential reproducibility gap on a canonical regular subclass. Thus the dependence on the average Fisher-Rao motion rate is tight up to constants: $C_T/T$ is sufficient for upper control and unavoidable on regular hard subclasses. We further prove an information-theoretic indistinguishability result showing that order-$C/T$ effects on the one-step-ahead target need not be identifiable from the realized performance stream alone. Finally, we show that fixed monitoring channels induce contracted observable Fisher motion, and experiments, including a misspecified real-data feedback setting, indicate that appropriately chosen channels can retain risk-relevant drift signal when the intrinsic data-generating law is unavailable. The resulting theory treats exogenous drift, adaptive data analysis, and performative feedback as different sources of Fisher-Rao motion along the same learner-environment trajectory.
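As a hedged illustration of how the stated rate reads, write $\Delta_T$ for the prequential reproducibility gap. The symbol $\Delta_T$, the labels on the two terms, and the two regimes in the comments are placeholders chosen here for illustration, not the paper's notation or results. Under that assumption, the upper bound from the abstract has the schematic form:

```latex
\Delta_T \;\lesssim\;
  \underbrace{T^{-1/2}}_{\text{statistical term}}
  \;+\;
  \underbrace{\frac{C_T}{T}}_{\text{average Fisher-Rao motion rate}}
  \qquad \text{(up to controlled second-order remainders)}.
% Two illustrative regimes, read directly off the two terms:
%   C_T = O(\sqrt{T})  =>  C_T/T = O(T^{-1/2}): drift does not change the order of the bound.
%   C_T = cT, c > 0    =>  C_T/T = c: the bound does not vanish as T grows.
```

Read this way, the bound vanishes only when the cumulative budget grows sublinearly, $C_T = o(T)$, which is one way to see why the abstract emphasizes the average rate rather than cumulative drift alone.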

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT 版本更新

Two-Dimensional Quantization for Geometry-Aware Audio Coding

二维量化用于几何感知的音频编码

Tal Shuster, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel(内盖夫本-古里安大学电气与计算机工程学院,以色列贝尔谢巴)

AI总结 本文提出了一种二维量化方法Q2D2,通过将特征对投影到结构化的2D网格上,提高了音频压缩效率,同时保持了最先进的重建质量。

Comments Accepted to ICML 2026

详情
AI中文摘要

最近的神经音频编解码器在重建质量上取得了显著成就,通常依赖于残差向量量化(RVQ)、向量量化(VQ)和有限标量量化(FSQ)等量化方法。然而,这些量化技术限制了潜在空间的几何结构,使特征之间的相关性捕捉变得更加困难,导致表示学习、代码本利用和令牌速率的效率低下。在本文中,我们引入了二维量化(Q2D2),一种将特征对投影到结构化2D网格(如六边形、菱形或矩形铺砌)并量化到最近网格值的量化方案,从而生成由网格级别乘积定义的隐式代码本,其代码本大小与传统方法相当。尽管其简单的几何公式,Q2D2在音频压缩效率方面有所提升,具有低令牌速率和高代码本利用率,同时保持了最先进的重建质量。具体而言,Q2D2在语音、音频和音乐领域广泛实验中,在各种客观和主观重建度量上实现了具有竞争力甚至更优的性能。全面的消融研究进一步证实了我们设计选择的有效性。

英文摘要

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization, while maintaining state-of-the-art reconstruction quality. Specifically, compared to state-of-the-art models, Q2D2 achieves competitive or superior performance on various objective and subjective reconstruction metrics across extensive experiments in the speech, audio and music domains. Comprehensive ablation studies further confirm the effectiveness of our design choices.
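The pair-to-grid idea described in the abstract can be sketched in a few lines. This is a minimal illustration of the rectangular-tiling case only, not the authors' implementation: the function name `quantize_pairs_rect`, the `tanh` bounding step, the uniform grid on [-1, 1], and the way the code index is formed are all assumptions made for the example.

```python
import numpy as np

def quantize_pairs_rect(z, levels_x=5, levels_y=5):
    """Quantize consecutive feature pairs to a rectangular 2D grid.

    Illustrative sketch only: the bounding function, grid range, and index
    layout are assumptions, not details taken from the Q2D2 paper.
    """
    z = np.asarray(z, dtype=np.float64)
    if z.shape[-1] % 2 != 0:
        raise ValueError("last dimension must be even to form feature pairs")
    # Bound features to (-1, 1) and group the last axis into (x, y) pairs.
    pairs = np.tanh(z).reshape(*z.shape[:-1], -1, 2)
    grid_x = np.linspace(-1.0, 1.0, levels_x)
    grid_y = np.linspace(-1.0, 1.0, levels_y)
    # Nearest grid level along each axis of the rectangular tiling.
    ix = np.abs(pairs[..., 0, None] - grid_x).argmin(axis=-1)
    iy = np.abs(pairs[..., 1, None] - grid_y).argmin(axis=-1)
    quantized = np.stack([grid_x[ix], grid_y[iy]], axis=-1).reshape(z.shape)
    # Implicit codebook: one code per grid cell, levels_x * levels_y in total.
    codes = ix * levels_y + iy
    return quantized, codes

q, codes = quantize_pairs_rect(np.array([[0.0, 10.0, -10.0, 0.1]]))
```

With the default 5 x 5 grid, each pair maps to one of 25 codes, which is the "product of grid levels" codebook size mentioned in the abstract. A hexagonal or rhombic tiling would need a joint nearest-point search over both coordinates instead of the per-axis rounding used here.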

2511.11654 2026-05-19 cs.LG cs.AI cs.MA 版本更新

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa(亚马逊Alexa) Indian Institute of Science(印度科学研究院)

AI总结 本文研究了多智能体强化学习在交通信号控制中的收敛性问题,通过随机逼近方法分析学习动态,并证明了在特定条件下该算法能够收敛。

Comments 14 pages 2 figures

详情
AI中文摘要

快速城市化导致城市如班加罗尔面临严重的交通拥堵,使得高效的交通信号控制(TSC)变得至关重要。多智能体强化学习(MARL)作为一种减少平均通勤延误的有希望策略,通常将每个交通信号视为一个独立的智能体使用Q学习进行建模。尽管先前的工作Prashant L A等人已经证明了这种方法的有效性,但在交通控制背景下对这种算法稳定性及收敛性进行严谨理论分析的研究尚未开展。本文通过专注于该多智能体算法的理论基础,填补了这一空白。我们研究了在合作性TSC任务中使用独立学习者固有的收敛问题。利用随机逼近方法,我们正式分析了学习动态。本文的主要贡献是证明了特定的交通控制多智能体强化学习算法在给定条件下能够收敛,扩展了从单智能体收敛证明中异步价值迭代的结论。

英文摘要

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work by Prashant L A et al. has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is a proof that the specific multi-agent reinforcement learning algorithm for traffic control converges under the given conditions, extending single-agent convergence proofs for asynchronous value iteration.

2511.07288 2026-05-19 cs.LG cs.AI 版本更新

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度Actor-Critic稳定化实现离策略模仿学习

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa(亚马逊Alexa) Indian Institute of Science(印度科学研究院)

AI总结 本文提出一种结合离策略(off-policy)学习的对抗模仿学习算法,通过基于双Q网络的稳定化和无需奖励函数推断的价值学习来提高样本效率,从而以更少的样本匹配专家行为。

Comments 14 pages and 4 images

详情
AI中文摘要

使用强化学习(RL)学习复杂策略常常受到不稳定性和收敛缓慢的阻碍,而奖励工程的困难进一步加剧了这一问题。基于专家演示的模仿学习(IL)绕过了对奖励的依赖。然而,以生成对抗模仿学习(GAIL,Ho等人)为代表的最先进IL方法存在严重的样本效率低下问题。这是其底层同策略(on-policy)算法(如TRPO,Schulman等人)的直接后果。在本文中,我们介绍了一种对抗模仿学习算法,该算法引入离策略(off-policy)学习以提高样本效率。通过将离策略框架与辅助技术相结合(在本文中具体为基于双Q网络的稳定化和无需奖励函数推断的价值学习),我们展示了稳健匹配专家行为所需样本量的减少。

英文摘要

Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques (specifically, in this case, double-Q-network-based stabilization and value learning without reward function inference), we demonstrate a reduction in the samples required to robustly match expert behavior.

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML 版本更新

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于以几何方式记忆;原因尚不清楚

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

发表机构 * Machine Learning Department & Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA Google Research, NY, USA

AI总结 研究探讨了深度序列模型中原子事实的存储机制,发现几何记忆能编码全局关系,即使在训练中未共现的实体间也能建立联系,挑战了传统关联记忆的观点。

Comments Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

深度序列模型被认为主要以关联记忆的形式存储原子事实,即对共现实体的暴力查找。我们识别出一种截然不同的原子事实存储形式,称为几何记忆。在这种形式下,模型合成了嵌入,这些嵌入编码了所有实体之间新的全局关系,包括训练中未共现的实体。这种存储形式十分强大:例如,我们展示了它如何将涉及$\ell$重复合的困难推理任务转化为易于学习的一步导航任务。从这一现象中,我们提炼出神经嵌入几何中难以解释的基本方面。我们认为,这种几何(而非对局部关联的查找)的出现,不能简单归因于典型的监督、架构或优化压力。反直觉的是,即使这种几何比暴力查找更复杂,它仍然会被学到。随后,通过分析与Node2Vec的联系,我们展示了这种几何源于一种谱偏置(spectral bias);与主流理论相反,即使缺乏各种压力,这种偏置也确实会自然产生。这一分析还向从业者指出,让Transformer记忆更具几何性存在明显的提升空间。我们希望参数化记忆的几何视角能促使研究者重新审视在知识获取、容量、发现和遗忘(unlearning)等领域中默认的直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, indeed arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

2510.24208 2026-05-19 cs.CL cs.LG 版本更新

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

超越神经不兼容:通过潜在语义对齐实现语言模型中的跨尺度知识转移

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

发表机构 * Monash University(莫纳什大学) Technical University of Munich(慕尼黑工业大学) Chongqing University(重庆大学)

AI总结 本文提出SemAlign方法,通过潜在语义对齐实现跨尺度知识转移,解决了不同架构和参数化模型间参数重用受限的问题,通过激活值作为转移介质,利用语义分解与重组稳定地实现知识迁移。

Comments an early-stage version

详情
AI中文摘要

语言模型(LMs)在其参数中编码了大量知识,但如何以细粒度方式转移此类知识,即参数化知识转移(PKT)仍不明确。核心挑战是当源模型和目标模型在架构和参数化上存在差异时,如何实现有效的、高效的跨尺度转移,这使得直接参数重用受到神经不兼容的限制。在本文中,我们识别出潜在语义对齐是跨尺度知识转移的关键前提。与直接移动层参数不同,我们的方法使用激活值作为转移介质。SemAlign包含两个阶段:一个层归因阶段,用于归因任务相关的源层并为每个目标层选择恰好一个源层;一个语义对齐阶段,通过逐层配对并优化目标模型,利用源侧语义监督。对齐通过语义分解和重组在潜在空间中进行。在浅层到深层的转移过程中,只有前沿目标层是可训练的。层目标通过匹配中心化的词-词关系几何与对齐的监督残差来监督该层的残差贡献,而输出KL保持源级预测行为。因此,转移介质既不是参数块也不是绝对的隐藏状态,而是由配对源层监督诱导的目标空间残差几何。在四个基准测试中的评估证实了SemAlign的有效性,进一步分析确认语义分解和重组为跨尺度知识转移提供了一个稳定的机制。

英文摘要

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, making direct parameter reuse strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. SemAlign has two stages: a layer attribution stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a semantic alignment stage that pairs them layer by layer and optimizes the target with source-side semantic supervision. The alignment is carried out in latent space through semantic decomposition and recomposition. During the shallow-to-deep transfer, only the frontier target layer is trainable. The layer objective supervises the residual contribution of that layer by matching centered token-token relation geometry against an aligned supervisory residual, while output KL preserves source-level predictive behavior. The transferred medium is therefore neither a parameter block nor an absolute hidden state, but target-space residual geometry induced by paired source-layer supervision. Evaluations on four benchmarks demonstrate the efficacy of SemAlign, and further analysis confirms that semantic decomposition and recomposition provide a stable mechanism for cross-scale knowledge transfer.

2510.08141 2026-05-19 cs.LG 版本更新

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

SCOPE-RL: 稳定和定量控制强化学习后训练中的策略熵

Chen Wang, Zhaochun Li, Jionghao Bai, Hexuan Deng, Ge Lan, Yue Wang

发表机构 * College of Software, Nankai University(南开大学软件学院) Zhongguancun Academy(中关村学院) Beijing Institute of Technology(北京理工大学) Zhejiang University(浙江大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出SCOPE-RL框架,通过温度自适应的正样本构造正则化项,稳定并定量控制强化学习后训练中的策略熵,实验表明其在Pass@1和Pass@$k$任务上优于现有基线方法。

详情
AI中文摘要

强化学习(RL)是大型语言模型(LLMs)后训练的关键范式,但广泛使用的组相对策略优化(GRPO)常面临熵崩溃问题:探索迅速消失,策略过早收敛,样本多样性下降,最终损害训练效果。现有解决方案,包括熵奖励和基于裁剪的方法,很少能将熵保持在稳定的探索区间内,且常引入熵振荡或奖励退化。在本文中,我们识别出熵动态中此前被忽视的不对称性:在高温度采样下,正样本和负样本对策略熵有相反的影响。具体而言,高温度正样本促进熵增长,而负样本抑制熵增长。我们为此现象提供了理论解释:当策略更新过程中熵下降时,在正样本更新下熵对温度的导数严格为正,表明高温度正样本可以抵消熵衰减,从而减缓甚至可能逆转熵崩溃。受此启发,我们提出了SCOPE-RL,通过由温度自适应正样本构造的正则化项,实现稳定且定量的熵控制。大量实验表明,SCOPE-RL在Pass@1和Pass@$k$上均稳定优于强RL基线方法。我们的结果提供了证据,表明摆脱熵崩溃可以提高推理性能,同时也表明这种收益是非单调的:推理LLMs的RL后训练存在最优的探索水平。

英文摘要

Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, ultimately harming training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce oscillatory entropy or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to temperature is strictly positive under positive-sample updates, indicating that high-temperature positive samples can counteract entropy decay, thereby slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a stable and quantitative entropy control framework through a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic, with an optimal level of exploration for RL post-training in reasoning LLMs.

2510.04930 2026-05-19 cs.LG 版本更新

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

平等梯度下降:一种加速 Grokking 的简单方法

Ali Saheb Pasand, Elvis Dohmatob

发表机构 * McGill University(麦吉尔大学) Mila Institute(Mila研究院) Concordia University(康科迪亚大学)

AI总结 本文提出平等梯度下降(EGD)方法,通过规范化梯度使所有主方向的动态以相同速度演化,从而加速模型的 Grokking 过程,消除测试性能的停滞现象。

详情
AI中文摘要

Grokking 是这样一种现象:与在训练早期即达到峰值的训练性能不同,模型的测试/泛化性能会在任意多个训练周期内停滞,然后突然跃升至通常接近完美的水平。在实践中,缩短此类平台期是有益的,即让学习过程"更快地 Grok"。在本工作中,我们提供了对 Grokking 的新见解。首先,我们通过实证和理论表明,(随机)梯度下降沿梯度不同主方向(即奇异方向)的速度不对称可以诱发 Grokking。然后,我们提出了一种简单的修改:对梯度进行归一化,使所有主方向上的动态以完全相同的速度演化。接着,我们证明这种被称为平等梯度下降(EGD)的修改方法可以被视为一种精心修改的自然梯度下降,并且能显著更快地 Grok。事实上,在某些情况下,停滞被完全消除。最后,我们在模加法和稀疏奇偶等经典算术问题上实证表明,我们的方法消除了这些问题上被广泛观察和深入研究的平台期。

English abstract

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps, usually to near-perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we show empirically that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
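
One way to read "all principal directions evolve at the same speed" is sketched below. This is an assumption, not the paper's stated update: the gradient matrix is decomposed by SVD and every nonzero singular value is replaced by 1, so the step has the same magnitude along each singular direction. The function name `egalitarian_step`, the learning rate, and the toy gradient are hypothetical.

```python
import numpy as np

def egalitarian_step(weights, grad, lr=0.1):
    """One update in which every singular direction of `grad` moves equally fast.

    Assumption: G = U S V^T is replaced by U V^T (nonzero singular values set to 1).
    The paper may define its normalization differently.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    keep = s > 1e-12  # drop numerically zero directions
    normalized = u[:, keep] @ vt[keep, :]
    return weights - lr * normalized

rng = np.random.default_rng(0)
# Toy gradient whose singular values span several orders of magnitude.
grad = rng.normal(size=(4, 3)) @ np.diag([10.0, 1.0, 0.01])
w = np.zeros((4, 3))
w_new = egalitarian_step(w, grad, lr=0.1)

print(np.linalg.svd(grad, compute_uv=False))              # very unequal speeds
print(np.linalg.svd((w - w_new) / 0.1, compute_uv=False))  # equal speeds
```

Plain gradient descent would move fastest along the direction with the largest singular value; the normalized step removes that asymmetry at the cost of an SVD per update.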

2510.02590 2026-05-19 cs.LG version update

Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

在可以的时候使用在线网络:迈向快速且稳定的强化学习

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

Affiliations * Technical University of Darmstadt; German Research Center for AI (DFKI); Robotics Institute Germany (RIG); University of Würzburg

AI summary This paper proposes a new update rule that computes the target from the minimum estimate between the target network and the online network, improving value-function learning and yielding faster and more stable reinforcement learning.

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

Details
AI-generated Chinese abstract

在深度强化学习(RL)中,使用目标网络来估计价值函数是一种流行的方法。虽然有效,但目标网络仍是一种折中方案,它在保持稳定性的同时牺牲了缓慢移动的目标,从而延迟了学习。相反,使用在线网络作为强化目标在直觉上很有吸引力,但众所周知会导致不稳定的学。在本文中,我们旨在结合两者的优势,通过引入一种新的更新规则,该规则通过目标网络和在线网络之间的最小估计来计算目标,从而得到我们的方法MINTO。通过这种简单而有效的修改,我们证明MINTO能够通过缓解使用在线网络进行强化时的潜在过估计偏差,从而实现更快且更稳定的价值函数学习。值得注意的是,MINTO可以无缝集成到广泛的价值基础和演员-评论家算法中,成本极低。我们对MINTO在多种基准上的进行了广泛评估,涵盖了在线和离线RL以及离散和连续动作空间。在所有基准上,MINTO都一致地提高了性能,展示了其广泛的应用性和有效性。

English abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
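
The update rule described above can be sketched for a single Q-learning transition as follows. The abstract only states that the target uses the minimum estimate between the target and online networks; where exactly the minimum is applied is an assumption here (it is taken between the two greedy next-state values), and the function name and the numbers are hypothetical.

```python
def minto_target(reward, gamma, done, q_target_next, q_online_next):
    """Bootstrapped target using the smaller of the two networks' estimates.

    Assumption: each estimate is the greedy (max over actions) value of the
    next state, and the minimum is taken between those two scalars.
    """
    v_target = max(q_target_next)  # estimate from the slowly updated target network
    v_online = max(q_online_next)  # estimate from the online network
    bootstrap = min(v_target, v_online)
    return reward + (0.0 if done else gamma * bootstrap)

# Target network says 2.0, online network says 3.0: the smaller value is used.
y = minto_target(reward=1.0, gamma=0.99, done=False,
                 q_target_next=[0.5, 2.0], q_online_next=[3.0, 1.0])
print(y)
```

Taking the minimum means the online network is used whenever its estimate is the lower one, which is one way to limit overestimation while still following fresher values.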

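2509.23183 2026-05-19 cs.LG cs.NI version update

ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse

ZeroSiam: 一种高效的非对称方法用于测试时熵优化而不发生崩溃

Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen

Affiliations * Nanyang Technological University; Joint WeBank-NTU Research Institute on Fintech; South China University of Technology

AI summary This paper proposes ZeroSiam, an efficient asymmetric architecture for test-time entropy minimization. It prevents collapse through asymmetric divergence alignment, implemented with a learnable predictor and a stop-gradient operator. Experiments and theory show that it prevents collapse and regularizes biased learning signals, improving performance and remaining stable, especially on collapse-prone small models.

Details
AI-generated Chinese abstract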

测试时熵最小化有助于适应新环境并激励模型的推理能力,在推理过程中允许模型通过自身预测实时进化和改进,从而实现有竞争力的性能。然而,纯粹的熵最小化可能会偏好不可推广的捷径,如放大logit范数并驱动所有预测到主导类别以减少熵,从而导致崩溃解(例如,恒定的一热输出),这些解仅通过简单的方式最小化目标函数而没有有意义的学习。在本文中,我们揭示了非对称性作为防止崩溃的关键机制,并引入了ZeroSiam——一种专门针对测试时熵最小化的高效非对称孪生架构。ZeroSiam通过非对称发散对齐来防止崩溃,这一过程通过在分类器之前使用可学习预测器和stop-gradient操作符高效实现。我们提供了实证和理论证据表明,ZeroSiam不仅能够防止崩溃,还能正则化偏见学习信号,即使在没有崩溃的情况下也能提升性能。尽管其简单性,广泛的结果显示,ZeroSiam在使用可忽略开销的情况下,比先前的方法更稳定,展示了其在视觉适应和大语言模型推理任务中的有效性,包括在具有挑战性的测试场景和多样化的模型中,特别是易崩溃的微型模型上。

English abstract

Test-time entropy minimization helps adapt a model to novel environments and incentivizes its reasoning capability, unleashing the model's potential during inference by allowing it to evolve and improve in real time using its own predictions, and achieving promising performance. However, pure entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we reveal asymmetry as a key mechanism for collapse prevention and introduce ZeroSiam--an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetric divergence alignment, efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. We provide empirical and theoretical evidence that ZeroSiam not only prevents collapse, but also regularizes biased learning signals, enhancing performance even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam performs more stably than prior methods with negligible overhead, demonstrating efficacy on both vision adaptation and large language model reasoning tasks across challenging test scenarios and diverse models, including particularly collapse-prone tiny models.
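
The "inflating the logit norm" shortcut mentioned above can be seen in a few lines. The sketch below is an illustration only, with hypothetical logits, and does not implement ZeroSiam: multiplying the logits by a constant leaves the predicted class unchanged but drives the entropy toward zero, so the objective improves without any meaningful learning.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [1.2, 0.7, -0.3]  # hypothetical classifier outputs
results = []
for scale in (1.0, 4.0, 16.0):
    probs = softmax([scale * z for z in logits])
    predicted = probs.index(max(probs))  # predicted class never changes
    results.append((predicted, entropy(probs)))
    print(scale, predicted, round(results[-1][1], 4))
```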

2509.23068 2026-05-19 stat.ML cs.LG version update

Sparse Deep Additive Model with Interactions: Enhancing Interpretability and Predictability

稀疏深度加法模型与交互:增强可解释性和预测性

Yi-Ting Hung, Li-Hsiang Lin, Vince D. Calhoun

Affiliations * Department of Mathematics and Statistics; Georgia State University; Tri-institutional Center for Translational Research in Neuroimaging and Data Science

AI summary This paper proposes the Sparse Deep Additive Model with Interactions (SDAMI), which combines sparse feature selection with deep subnetworks and uses a three-stage strategy to improve interpretability and predictive performance in high-dimensional regression.

Details
AI-generated Chinese abstract

近年来深度学习的进步突显了需要能够从少量样本中学习、处理高维特征并保持可解释性的个性化模型。为此,我们提出了稀疏深度加法模型与交互(SDAMI)框架,该框架结合了以稀疏性驱动的特征选择与深度子网络以实现灵活的功能近似。SDAMI的核心是效应足迹原理,该原理认为高阶交互会在构成变量上留下可检测的边际痕迹,从而无需穷尽搜索即可发现它们。SDAMI通过三阶段策略执行这一原理:(1)筛选足迹变量,(2)通过组Lasso分离主效应与交互,(3)使用专用深度子网络建模组件。理论分析证实,足迹仅在测度零对称条件下消失,而这些条件在实践中极为罕见,从而确保了一致的交互恢复。广泛模拟显示,SDAMI能够成功识别出基于遗传的基线方法根本无法识别的纯交互,以接近零的假阳性率恢复复杂的效应结构。这些结果将SDAMI定位为一种原理上适用于高维回归的可解释框架。

English abstract

Recent advances in deep learning highlight the need for personalized models that can learn from small samples, handle high-dimensional features, and remain interpretable. To address this, we propose the Sparse Deep Additive Model with Interactions (SDAMI), a framework that combines sparsity-driven feature selection with deep subnetworks for flexible function approximation. Central to SDAMI is the Effect Footprint principle, which posits that higher-order interactions leave detectable marginal traces on constituent variables, enabling their discovery without exhaustive search. SDAMI executes this principle through a three-stage strategy: (1) screening for footprint variables, (2) disentangling main effects from interactions via group lasso, and (3) modeling components with dedicated deep subnetworks. Theoretical analysis confirms that footprints vanish only under measure-zero symmetry conditions that are rare in practice, ensuring consistent interaction recovery. Extensive simulations demonstrate that SDAMI successfully identifies pure interactions that heredity-based baselines fundamentally miss, recovering complex effect structures with near-zero false positive rates. Together, these results position SDAMI as a principled framework for interpretable high-dimensional regression.
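
Stage (2) above relies on the group lasso, whose key building block is group-wise soft-thresholding: a whole group of coefficients is either shrunk together or set to zero together. The sketch below shows only that standard proximal step; how SDAMI forms its groups or tunes the penalty is not specified in the abstract, so the groups and the threshold are hypothetical.

```python
import math

def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2.

    Groups with a small Euclidean norm are zeroed out as a unit; larger
    groups are shrunk toward zero but keep their direction.
    """
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]

strong = group_soft_threshold([3.0, 4.0], 1.0)  # norm 5, scaled by 0.8
weak = group_soft_threshold([0.3, 0.4], 1.0)    # norm 0.5, removed entirely
print(strong, weak)
```

Because a group is kept or dropped as a whole, a main effect or an interaction term represented by several coefficients is selected or discarded as one unit.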

2509.22459 2026-05-19 stat.ML cs.LG version update

Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)

通用逆向蒸馏用于匹配模型与真实数据监督(无GANs)

Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

Affiliations * Applied AI Institute; MIRAI BRAIn Lab; AI Foundation lab; MBZUAI; AXXX

AI summary This paper proposes RealUID, a framework that seamlessly incorporates real data into inverse distillation without GANs, giving a unified distillation method for all matching models that covers flow matching and diffusion models and extends to their variants.

Details
AI-generated Chinese abstract

尽管生成质量优异,现代扩散、流及其他匹配模型在推理时速度较慢,因为它们需要许多迭代生成步骤。最近的蒸馏方法通过在预训练教师模型指导下训练高效的单步生成器来解决这个问题。然而,这些方法通常局限于特定框架,例如仅限于扩散或仅限于流模型。此外,这些方法原本是数据无关的,为了利用真实数据,需要使用额外的复杂对抗训练和额外的判别器模型。在本文中,我们提出了RealUID,一种适用于所有匹配模型的通用蒸馏框架,能够无缝地将真实数据整合到蒸馏过程中而无需GANs。我们的RealUID方法提供了一个简单的理论基础,涵盖了流匹配和扩散模型之前的蒸馏方法,并可扩展到其变种,如桥接匹配和随机插值。代码可在https://github.com/David-cripto/RealUID中找到。

English abstract

While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this problem by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are originally data-free, and to benefit from real data they require additional, complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and can also be extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found at https://github.com/David-cripto/RealUID.

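2509.06984 2026-05-19 cs.LG cs.AI version update

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA: 在缺失模态约束下联邦微调基础模型的实用方法

Lishan Yang, Wei Emma Zhang, Nam Kha Nguygen, Po Hu, Yanjun Shu, Weitong Chen, Mong Yuan Sim

Affiliations * Adelaide University; Central China Normal University; Harbin Institute of Technology

AI summary This paper proposes FediLoRA, a lightweight federated LoRA aggregation framework that addresses missing modalities in heterogeneous federated learning. By combining simple averaging with structured editing, it improves both global and personalized models and achieves strong performance on multiple general-domain and medical-domain benchmark datasets.

Comments 8 pages, 7 figures

Details
AI-generated Chinese abstract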

联邦学习与LoRA微调提供了一种高效且隐私友好的解决方案,使机构能够协作利用其大规模数据集来训练VLLMs。然而,参与机构通常拥有异质计算资源,导致LoRA秩不平衡,这对有效协作构成重大挑战。此外,医疗和交通等现实应用领域常因用户错误或设备故障导致缺失模态,这显著降低了联邦设置中的全局模型性能。到目前为止,没有先前工作同时解决了联邦VLLMs中的这两个挑战。为了解决这些问题,我们提出FediLoRA,一种轻量级的联邦LoRA聚合框架,有效减轻了异构环境中的缺失模态影响。FediLoRA受到观察的启发,即简单平均和结构化编辑可以同时受益于全局和个性化模型。我们的方法在多个通用领域和医疗领域基准数据集上实现了强大性能。此外,在医疗数据上的额外实验进一步证明,FediLoRA适合实际应用部署场景。我们的代码已发布在https://github.com/gotobcn8/FediLoRA。

English abstract

Federated Learning with LoRA fine-tuning offers an efficient and privacy-aware solution for institutions to collaboratively leverage their large datasets to train VLLMs. However, participating institutions often possess heterogeneous computational resources, resulting in imbalanced LoRA ranks, which pose a major challenge for effective collaboration. In addition, real-world applications in domains such as healthcare and transportation frequently suffer from missing modalities due to user mistakes or device failures, which significantly degrade global model performance in federated settings. To the best of our knowledge, no prior work has addressed these two challenges simultaneously in federated VLLMs. To tackle these issues, we propose FediLoRA, a lightweight federated LoRA aggregation framework that effectively mitigates the impact of missing modalities in heterogeneous environments. FediLoRA is explicitly motivated by the observation that simple averaging and structured editing can jointly benefit both global and personalized models. Our approach achieves strong performance across multiple general-domain and medical-domain benchmark datasets. Additional experiments on healthcare data further demonstrate that FediLoRA is well suited for practical, real-world deployment scenarios. Our code is released at https://github.com/gotobcn8/FediLoRA.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG version update

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 基于KL引导的剪枝感知联邦学习用于人重识别

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

Affiliations * Media IC and System Lab, the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University

AI summary This paper proposes FedKLPR, a framework that uses KL-divergence-guided training, unstructured pruning, and cross-round recovery to address statistical heterogeneity and communication overhead in federated learning for person re-identification; experiments show substantial communication savings with competitive accuracy.

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

Details
AI-generated Chinese abstract

人重识别(re-ID)是智能监控和公共安全中的基本任务。联邦学习(FL)提供了一种隐私保护的协同模型训练范式,无需集中数据收集。然而,由于非独立同分布(non-IID)客户端数据导致的统计异质性和频繁传输大规模模型带来的通信开销,将FL应用于现实世界中的re-ID系统仍然具有挑战性。为了解决这些挑战,我们提出了FedKLPR,一种轻量且通信高效的联邦学习框架用于人重识别。FedKLPR包含三个关键组件。首先,KL散度引导训练,包括KL散度正则化损失(KLL)和KL散度聚合权重(KLAW),用于缓解统计异质性和在非IID设置下提高收敛稳定性。其次,引入无结构剪枝以减少通信开销,并提出剪枝率聚合权重(PRAW)以衡量剪枝后客户端参数的相对重要性。与KLAW结合,PRAW形成KL散度-剪枝权重聚合(KLPWA),使在异构数据分布下能够有效聚合剪枝后的本地模型。第三,跨轮次恢复(CRR)适应性地控制剪枝跨通信轮次以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明,FedKLPR在保持竞争性准确性的同时实现了显著的通信节省。与现有最先进方法相比,FedKLPR在ResNet-50上将通信成本减少了40%--42%,并实现了更优异的总体性能。

English abstract

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40% to 42% on ResNet-50 while achieving better overall performance.

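2508.16663 2026-05-19 cs.CV cs.AI cs.LG version update

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

The Loupe: 一种用于增强视觉变换器中判别特征的插件式注意力模块

Naren Sengodan

Affiliations * Jain University

AI summary This paper proposes The Loupe, a lightweight plug-and-play spatial gating module inserted at an intermediate feature stage of a Vision Transformer. A small CNN predicts a single-channel spatial mask that reweights feature activations, and the model is trained end to end with a cross-entropy objective and an l1 sparsity term, improving fine-grained visual classification.

Details
AI-generated Chinese abstract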

细粒度视觉分类(FGVC)要求模型关注于细微的、与任务相关的区域,而非广泛的物体上下文。我们提出了The Loupe,一种轻量级的插件式空间门控模块,用于层次化的视觉变换器。该模块在中间特征阶段插入,使用小CNN预测单通道空间掩码,并在端到端训练中使用交叉熵目标和l1稀疏项对特征激活进行加权。在CUB-200-2011数据集上,The Loupe将Swin-Base的准确率从88.36%提升至91.72%,将Swin-Tiny的准确率从85.14%提升至88.61%,且仅增加0.1%的参数。消融实验表明,改进依赖于插入点和稀疏正则化器,表明受控的空间门控比朴素的多尺度遮蔽在此设置下更有效。定性结果表明,学习到的掩码通常与判别鸟类部分对齐,尽管该模块不是部分级监督的替代品,在遮挡或细粒度内部分差异时可能会失效。

English abstract

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often align with discriminative bird parts, although the module is not a substitute for part-level supervision and can fail under occlusion or fine-grained intra-part differences.

2508.15878 2026-05-19 cs.LO cs.AI cs.CL cs.LG version update

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

Lean 与理论计算机科学的交汇:形式-非形式对中可扩展的定理证明挑战合成

Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, Mrinmaya Sachan

Affiliations * D-CHAB, ETH Zurich, Zurich, Switzerland. D-INFK, ETH Zurich, Zurich, Switzerland. ETH AI Center, Zurich, Switzerland. University of Pennsylvania, PA, USA. Independent Researcher.

AI summary This paper proposes theoretical computer science as a scalable source of rigorous proof problems: algorithmic definitions are used to generate large numbers of challenging theorem-proof pairs automatically. It demonstrates the approach on Busy Beaver and Mixed Boolean Arithmetic problems and reveals the limitations of automated theorem proving on complex problems.

Comments Accepted to AI4MATH@ICML2025

Details
AI-generated Chinese abstract

形式定理证明(FTP)已成为评估大语言模型推理能力的关键基础,使大规模自动验证数学证明成为可能。然而,进展受到有限数据集的限制,因为手动编纂成本高且缺乏具有验证形式-非形式对应关系的挑战性问题。我们提出利用理论计算机科学(TCS)作为可扩展的严谨证明问题来源,其中算法定义能够自动生成任意多的挑战性定理-证明对。我们在此两个TCS领域中展示了这种方法:Busy Beaver问题,涉及证明图灵机停止行为的界限,以及混合布尔算术问题,结合了逻辑和算术推理。我们的框架自动合成具有并行形式(Lean4)和非形式(Markdown)规范的问题,创建了一个可扩展的生成验证证明挑战的流水线。对前沿模型的评估揭示了自动定理证明的显著差距:尽管DeepSeekProver-V2-671B在Busy Beaver问题上达到57.5%的成功率,但在混合布尔算术问题上仅达到12%。这些结果突显了即使对于计算上易于验证的问题,长形式证明生成的难度,展示了TCS领域在推动自动推理研究中的价值。

English abstract

Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
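
As a small illustration of why Mixed Boolean Arithmetic statements are cheap to check yet nontrivial to prove, the sketch below tests a well-known identity, x + y = (x XOR y) + 2 (x AND y), on random inputs. The identity is a textbook example and is not taken from the paper's benchmark; random testing only checks it numerically and is no substitute for a Lean proof.

```python
import random

def mba_add(x, y):
    """x + y rewritten as a mixed Boolean-arithmetic expression.

    XOR adds the bits without carries; AND marks the positions that
    produce a carry, which is worth twice as much.
    """
    return (x ^ y) + 2 * (x & y)

random.seed(0)
for _ in range(1000):
    x, y = random.getrandbits(32), random.getrandbits(32)
    assert mba_add(x, y) == x + y

print(mba_add(13, 7))  # 13 ^ 7 = 10 and 13 & 7 = 5, so 10 + 2 * 5 = 20
```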

2508.14769 2026-05-19 cs.LG cs.DC version update

Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data

边缘设备上的联邦蒸馏:非iid数据的高效客户端过滤

Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana

Affiliations * 1 Embedded Systems Division, Silicon Austria Labs, Graz, Austria; 2 Department of Computer Science, University of Innsbruck, Austria; 3 TU-Graz SAL DES Lab, Silicon Austria Labs, Graz, Austria; 4 Institute of Visual Computing, Graz University of Technology, Austria; 5 Institute of Information Technology, University of Klagenfurt, Austria

AI summary This paper proposes EdgeFD, an efficient federated distillation method for edge devices. Clients use a KMeans-based density ratio estimator to filter in-distribution and out-of-distribution proxy data, which reduces computational complexity and improves the quality of knowledge sharing under non-IID data distributions.

Comments This paper was accepted at the International Conference on Federated Learning Technologies and Applications, 2025. The final version is available at IEEE Xplore

Details
AI-generated Chinese abstract

联邦蒸馏作为一种有前途的协同机器学习方法,通过交换模型输出(软日志)而不是完整模型参数,相较于传统联邦学习提供了增强的隐私保护和减少的通信开销。然而,现有方法采用复杂的选择性知识共享策略,要求客户端通过计算昂贵的统计密度比估计器来识别分布内代理数据。此外,服务器端对模糊知识的过滤引入了延迟。为了解决这些挑战,我们提出了一个鲁棒且资源高效的EdgeFD方法,该方法减少了客户端侧密度比估计的复杂性并消除了服务器端过滤的需要。EdgeFD引入了一个高效的KMeans基于的密度比估计器,用于在客户端上有效过滤分布内和分布外的代理数据,显著提高了知识共享的质量。我们评估了EdgeFD在多样化的实际场景中的表现,包括强非iid、弱非iid和iid数据分布,无需在服务器上预训练教师模型进行知识蒸馏。实验结果表明,EdgeFD优于最先进的方法,在异构和挑战性条件下仍能持续达到接近iid场景的准确率。KMeans基于的估计器显著减少的计算开销适用于在资源受限的边缘设备上部署,从而增强了联邦蒸馏的可扩展性和实际应用性。代码已在线提供以供复现。

English abstract

Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.

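2508.08080 2026-05-19 cs.LG cs.NE stat.AP version update

Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles

符号分位数回归:用于条件分位数的可解释预测

Cas Oude Hoekstra, Floris den Hengst

Affiliations * Independent researcher; Vrije Universiteit Amsterdam

AI summary This paper proposes Symbolic Quantile Regression (SQR), a method for predicting conditional quantiles and explaining how predictive variables affect the outcome. An airline fuel usage case study that compares models predicting extreme and central outcomes demonstrates its usefulness in high-stakes applications.

Details
Journal ref
Transactions on Machine Learning Research, May 2026, https://openreview.net/pdf?id=x9OYbyPJOG
AI-generated Chinese abstract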

符号回归(SR)是一种生成可解释或白盒预测模型的已知框架。尽管SR已被成功应用于创建结果平均值的可解释估计,但目前尚不清楚如何利用SR来估计目标变量分布其他点处变量之间的关系。例如,中位数或极值的估计提供了预测变量如何影响结果的更全面图景,并在高风险、安全关键应用领域是必要的。本文介绍了符号量化回归(SQR),一种利用SR预测条件量化的做法。在广泛的评估中,我们发现SQR在透明模型上表现优于,并且在不牺牲透明性的情况下与强大的黑盒基线模型表现相当。我们还展示了如何利用SQR通过比较预测极值和中央结果的模型来解释目标分布的差异。我们得出结论,SQR适用于预测条件量化并理解不同分位数下的有趣特征影响。

English abstract

Symbolic Regression (SR) is a well-established framework for generating interpretable or white-box predictive models. Although SR has been successfully applied to create interpretable estimates of the average of the outcome, it is currently not well understood how it can be used to estimate the relationship between variables at other points in the distribution of the target variable. Such estimates of e.g. the median or an extreme value provide a fuller picture of how predictive variables affect the outcome and are necessary in high-stakes, safety-critical application domains. This study introduces Symbolic Quantile Regression (SQR), an approach to predict conditional quantiles with SR. In an extensive evaluation, we find that SQR outperforms transparent models and performs comparably to a strong black-box baseline without compromising transparency. We also show how SQR can be used to explain differences in the target distribution by comparing models that predict extreme and central outcomes in an airline fuel usage case study. We conclude that SQR is suitable for predicting conditional quantiles and understanding interesting feature influences at varying quantiles.
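
Conditional quantiles are usually fitted by minimizing the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically. Whether SQR uses exactly this objective is an assumption, since the abstract does not say; the sketch below only shows the standard loss and checks on made-up data that the best constant prediction moves from the median toward the upper tail as the quantile level rises.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction costs tau per unit, over-prediction costs (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

y = [1.0, 2.0, 3.0, 4.0, 10.0]      # hypothetical outcomes
candidates = (2.0, 3.0, 4.0, 10.0)  # constant predictions to compare

def best_constant(tau):
    """Candidate with the lowest pinball loss at level tau."""
    return min(candidates, key=lambda q: pinball_loss(y, [q] * len(y), tau))

print(best_constant(0.5), best_constant(0.9))
```

In a symbolic-regression setting the constant prediction would be replaced by a candidate expression evaluated on the features, with the same loss used to score it.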

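2508.00901 2026-05-19 cs.LG cs.CL version update

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

在单层变换器中可证明的知识获取与提取

Ruichen Xu, Kexin Chen

AI summary This paper studies the mechanisms of knowledge acquisition and extraction in one-layer transformers. Through theoretical analysis and experiments, it characterizes how knowledge storage during pre-training relates to extraction after fine-tuning, and why low-rank fine-tuning can recover pre-trained factual knowledge.

Details
AI-generated Chinese abstract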

大型语言模型在预训练过程中可能获得事实性知识,但在微调后却无法可靠地使用这些知识。尽管有越来越多的实证证据表明MLP层存储事实关联,并且微调影响事实回忆,但连接下一个标记预训练、知识存储和后微调提取的训练动态机制仍然理解有限。我们研究了这个问题,使用了一个简化的一层变换器,包含自注意力和MLP模块,通过下一个标记预测进行训练,随后在问答数据上进行微调。在适当的正则性条件下,我们首先证明模型在学习结构化注意力模式和关系特定的特征方向时达到接近最优的预训练损失,从而提供了一个事实性知识获取的机制。然后我们展示微调可以将问答提示格式转化为触发预训练关系特征的手段,使模型能够提取在微调过程中未被重新访问的事实。我们的分析给出了知识提取的关联覆盖特征化:微调不需要重新访问每一个存储的主体-答案对,但必须覆盖足够的潜在关系-模板方向,通过这些方向在预训练中编码了事实。因此,提取随着预训练的多重性和微调的覆盖度而提高,但随着关系-模板宇宙的增长而变得更加困难。相反,不足的覆盖度会导致失败状态,其中事实可能被存储但仍然无法访问,提供了一个简化的幻觉机制。该理论适用于全和低秩微调,为为什么当关系覆盖度足够时低秩适应可以恢复预训练的事实知识提供了见解。在合成数据和基于PopQA的GPT-2/Llama模型上的实验支持了预测的趋势。

English abstract

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

2507.17798 2026-05-19 cs.LG version update

Wasserstein GAN-Based Precipitation Downscaling with Optimal Transport for Enhancing Perceptual Realism

基于Wasserstein GAN与最优传输的降水降尺度以增强感知真实性

Kenta Shiraishi, Yuka Muto, Atsushi Okazaki, Shunji Kotsuki

Affiliations * Graduate School of Science and Engineering, Chiba University; Center for Environmental Remote Sensing, Chiba University; Institute for Advanced Academic Research, Chiba University; Research Institute of Disaster Medicine, Chiba University

AI summary This paper proposes precipitation downscaling with a Wasserstein GAN and an optimal transport cost to improve the perceptual realism of precipitation predictions. Although the WGAN scores slightly lower on conventional evaluation metrics, it generates visually more realistic precipitation fields, and its critic helps identify unrealistic outputs and potential artifacts in the reference data.

Details
Journal ref
Progress in Earth and Planetary Science, 13, 29, 2026
AI-generated Chinese abstract

高分辨率(HR)降水预测对于减少静止和局部强降雨造成的损害至关重要;然而,使用过程驱动的数值天气预测模型进行HR降水预测仍然具有挑战性。本研究提出利用Wasserstein生成对抗网络(WGAN)结合最优传输成本进行降水下scaling。与传统神经网络使用均方误差训练不同,WGAN能够生成具有精细结构的视觉上逼真的降水场,尽管WGAN在传统评估指标上略逊。WGAN学习的批评者与人类感知现实性密切相关。基于案例的分析表明,批评者分数的显著差异有助于识别不真实的WGAN输出和参考数据中的潜在伪影。这些发现表明,WGAN框架不仅提高了降水下scaling的感知现实性,还为评估和质量控制降水数据集提供了新的视角。

English abstract

High-resolution (HR) precipitation prediction is essential for reducing damage from stationary and localized heavy rainfall; however, HR precipitation forecasting with process-driven numerical weather prediction models remains challenging. This study proposes using a Wasserstein Generative Adversarial Network (WGAN) to perform precipitation downscaling with an optimal transport cost. In contrast to a conventional neural network trained with mean squared error, the WGAN generated visually realistic precipitation fields with fine-scale structures, even though it exhibited slightly lower performance on conventional evaluation metrics. The learned critic of the WGAN correlated well with human perceptual realism. Case-based analysis revealed that large discrepancies in critic scores can help identify both unrealistic WGAN outputs and potential artifacts in the reference data. These findings suggest that the WGAN framework not only improves perceptual realism in precipitation downscaling but also offers a new perspective for evaluating and quality-controlling precipitation datasets.

2507.05482 2026-05-19 cs.LG stat.ML version update

Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions

Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis

Affiliations * Department of Computer Science, University of Geneva

AI summary This paper proposes Stein Diffusion Guidance, a training-free posterior correction method for sampling beyond high-density regions. It combines stochastic optimal control with Stein variational inference and introduces a new theoretical bound and a running cost functional to enable effective guidance in low-density regions.

Comments Revised version accepted to the ICML 2026 main track; prior version accepted to two ICLR 2026 workshops: ReALM-GEN and DeLTa

Details
English abstract

Training-free diffusion guidance offers a flexible framework for leveraging off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low-density regions. Our experiments on diverse image-guidance tasks and on challenging small-ligand sampling for protein docking suggest that SDG consistently outperforms standard training-free guidance methods and highlight its potential for broader posterior sampling problems beyond high-density regimes.

2507.01533 2026-05-19 math.NA cs.LG cs.NA math.PR 版本更新

Consistency of Learned Sparse Grid Quadrature Rules using NeuralODEs

利用神经ODEs的学得稀疏网格求积规则的一致性

Hanno Gottschalk, Emil Partow, Tobias J. Riedlinger

发表机构 * Technische Universität Berlin(柏林技术大学) Ludwig-Maximilians-Universität München(慕尼黑路德维希-马克西米利安大学) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 本文研究了利用神经ODE学习的稀疏网格求积规则的一致性问题,通过分析学习得到的传输映射与Clenshaw-Curtis稀疏网格求积的组合,给出了一般目标和乘积目标两种情形下的求积速率,并证明两种情形下的估计器均为PAC一致。

Comments 39 pages, 8 figures

详情
AI中文摘要

我们证明了最近提出的一种方案的一致性,该方案通过将学习得到的传输映射与易处理的乘积源分布上的Clenshaw--Curtis稀疏网格求积相组合来计算期望值。我们的分析基于这样一个结构性事实:将一个具有混合正则性$C^k_{\mathrm{mix}}$的函数(其快速求积速率为$m^{-k}(\log m)^{(d-1)(k+1)}$)与一个$C^1$微分同胚复合,只有当该微分同胚在坐标置换意义下是对角的时候,才能保证复合后的函数仍属于$C^k_{\mathrm{mix}}$。因此,快速速率仅适用于乘积目标,分析分为两种情形。在任意目标的一般情形下,我们将传输映射学习为由最大似然训练的$\mathrm{ReLU}^{k+1}$神经ODE的时刻1流。所得到的流位于各向同性空间$C^k$中,给出速率$m^{-k/d}(\log m)^{(d-1)(k/d+1)}$;提高密度光滑度$k$及与之匹配的激活阶数$k+1$可以缓解维度灾难,但代价是优化更加困难。在乘积目标的对角情形下,Knothe--Rosenblatt映射本身是对角的,我们通过经验分位数传输逐点估计它,这是一种轻量级的替代方法,可以恢复完整的混合正则性速率。在两种情形中,所得到的LtI估计器都是PAC(probably approximately correct)一致的:当样本量$n$和求积预算$m$都趋于无穷时,数值积分以高概率以任意精度逼近真实值。

英文摘要

We prove consistency of a recently proposed scheme that evaluates expected values by composing a learned transport map with Clenshaw--Curtis sparse-grid quadrature on a tractable product source. Our analysis hinges on the structural fact that composition of a $C^k_{\mathrm{mix}}$-regular function -- which carries the fast quadrature rate $m^{-k}(\log m)^{(d-1)(k+1)}$ -- with a $C^1$-diffeomorphism can only be guaranteed to be $C^k_{\mathrm{mix}}$ itself, if the diffeomorphism is diagonal up to a permutation of coordinates. The fast rate is therefore available exclusively for product targets, and the analysis splits into two regimes. In the general regime of arbitrary targets, we learn the transport as the time-one flow of a $\mathrm{ReLU}^{k+1}$-neural ODE trained by maximum likelihood. The resulting flow lies in the isotropic space $C^k$ and yields the rate $m^{-k/d}(\log m)^{(d-1)(k/d+1)}$, with raising the density smoothness $k$ and the matched activation order $k+1$ mitigating the curse of dimensionality at the cost of harder optimization. In the diagonal regime of product targets, the Knothe--Rosenblatt map is itself diagonal and we estimate it pointwise via empirical quantile transport, a lightweight alternative that recovers the full mixed-regularity rate. In both regimes, the resulting LtI estimator is PAC (probably approximately correct) consistent. With high probability the numerical integral approximates the true value to arbitrary accuracy as both the sample size $n$ and the quadrature budget $m$ tend to infinity.
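To get a feel for the gap between the two rates quoted above, the following sketch evaluates both expressions numerically. It only plugs numbers into the two formulas from the abstract; the constants hidden in the actual bounds are unknown here and are assumed to be 1, and the function names are illustrative.

```python
import math

def mixed_rate(m, k, d):
    """m^{-k} (log m)^{(d-1)(k+1)}: rate for C^k_mix integrands (constant omitted)."""
    return m ** (-k) * math.log(m) ** ((d - 1) * (k + 1))

def isotropic_rate(m, k, d):
    """m^{-k/d} (log m)^{(d-1)(k/d+1)}: rate for isotropic C^k integrands (constant omitted)."""
    return m ** (-k / d) * math.log(m) ** ((d - 1) * (k / d + 1))

def compare(k, d, budgets):
    """Return (m, mixed, isotropic) triples for each quadrature budget m."""
    return [(m, mixed_rate(m, k, d), isotropic_rate(m, k, d)) for m in budgets]
```

For example, `compare(4, 3, [10**4, 10**6, 10**8])` shows the mixed-regularity expression shrinking far faster with the budget `m`, while for `d = 1` the two expressions coincide.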

2506.16042 2026-05-19 cs.AI cs.LG cs.OS 版本更新

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

OSWorld-Human: 评估计算机使用代理的效率基准

Reyna Abhyankar, Qi Qi, Yiying Zhang

发表机构 * OpenAI Anthropic Google DeepMind ByteDance(字节跳动) Agent S2 GTA1 Lei Jedi

AI总结 本文研究了计算机使用代理在OSWorld基准上的时间性能,发现大模型调用导致高延迟,并构建了包含人类轨迹的OSWorld Human数据集,评估发现最佳代理仍需更多步骤。

详情
AI中文摘要

生成式AI正被用于解决涉及桌面应用的多种计算机使用任务。最先进的系统仅专注于提高在主流基准上的准确性。然而,对于人类通常只需几分钟即可完成的任务,这些系统的端到端延迟极高(例如数十分钟),因而实际上无法使用。为了理解其原因并指导未来计算机代理的发展,我们首次研究了计算机使用代理在OSWorld基准上的时间性能。我们发现,用于规划、反思和判断的大模型调用占总延迟的主要部分,并且随着代理使用更多步骤完成任务,后续每一步的耗时可达任务开始时步骤的3倍。我们随后构建了OSWorld Human,即原始OSWorld数据集的手动标注版本,其中包含每个任务由人类确定的轨迹。我们使用OSWorld Human评估了16个代理的效率,发现即使是最佳代理,所用步骤数也是必要步骤数的2.7-4.3倍。

英文摘要

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld Human and find that even the best agents take 2.7-4.3x more steps than necessary.

2506.15588 2026-05-19 cs.LG 版本更新

Memory-Efficient Differentially Private Training with Gradient Random Projection

内存高效的差分隐私训练与梯度随机投影

Alex Mulrooney, Devansh Gupta, James Flemings, Huanyu Zhang, Murali Annavaram, Meisam Razaviyayn, Xinwei Zhang

发表机构 * University of Delaware(特拉华大学) University of Southern California(南加州大学) Meta(Meta公司) Amazon(亚马逊)

AI总结 本文提出DP-GRAPE方法,通过随机高斯矩阵替代SVD子空间,减少内存使用并保持与一阶DP方法相当的效用,同时消除了昂贵的SVD计算需求,显著提升内存效率和模型性能。

详情
AI中文摘要

差分隐私(DP)在神经网络训练中保护敏感数据,但标准方法如DP-Adam由于每个样本梯度裁剪导致高内存开销,限制了可扩展性。我们引入DP-GRAPE(梯度随机投影),一种差分隐私训练方法,显著减少内存使用,同时保持与一阶DP方法相当的效用。DP-GRAPE的灵感来自我们发现隐私化使梯度奇异值谱变平,使基于SVD的投影(如GaLore(Zhao等人,2024))变得不必要的。因此,DP-GRAPE采用三个关键组件:(1)随机高斯矩阵替代基于SVD的子空间;(2)在投影后对梯度进行隐私化;(3)在反向传播期间应用投影。这些贡献消除了昂贵的SVD计算需求,实现了显著的内存节省,并提高了效用。尽管在较低维子空间中运行,我们的理论分析显示,DP-GRAPE在隐私-效用权衡上与DP-SGD相当。我们的广泛实验证明,DP-GRAPE可以显著减少DP训练的内存足迹,而不牺牲准确性和训练时间。特别是,DP-GRAPE在预训练视觉Transformer时将内存使用减少超过63%,在微调RoBERTa-Large时减少超过70%,同时实现相似性能。我们进一步证明,DP-GRAPE能够扩展到微调大型模型,如具有67亿参数的OPT,这是DP-Adam因内存限制而无法处理的规模。我们的代码可在https://github.com/alexmul1114/DP_GRAPE获得。

英文摘要

Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.
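Components (1) and (2) of the abstract can be sketched in a few lines. This is a minimal illustration, assuming per-sample gradients are already available as flat vectors; it does not show component (3), projection during backpropagation, nor any privacy accounting. The function name, the `N(0, 1/r)` scaling of the projection matrix, and the map back to the full space are assumptions, not details taken from the paper.

```python
import math
import random

def dp_random_projection_step(per_sample_grads, r, clip_norm, noise_mult, rng):
    """Project per-sample gradients, clip, add Gaussian noise, and map back.

    per_sample_grads: list of n gradient vectors, each of length d.
    Returns a length-d noisy average gradient estimate (illustrative only).
    """
    n, d = len(per_sample_grads), len(per_sample_grads[0])
    # (1) Random Gaussian projection matrix P (d x r), entries N(0, 1/r).
    P = [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]
    total = [0.0] * r
    for g in per_sample_grads:
        # Project the gradient into the r-dimensional subspace: z = P^T g.
        z = [sum(g[i] * P[i][j] for i in range(d)) for j in range(r)]
        # (2) Privatize after projection: clip the projected vector's norm.
        norm = math.sqrt(sum(v * v for v in z))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for j in range(r):
            total[j] += z[j] * scale
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noisy = [(t + rng.gauss(0.0, noise_mult * clip_norm)) / n for t in total]
    # Map the noisy low-dimensional average back to the full space: P noisy.
    return [sum(P[i][j] * noisy[j] for j in range(r)) for i in range(d)]
```

Because clipping and noise are applied to `r`-dimensional vectors rather than `d`-dimensional ones, the per-sample state that must be held in memory shrinks from `n * d` to `n * r` numbers in this sketch.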

2506.08244 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Algebraic Priors for Approximately Equivariant Networks

代数先验用于近似等变网络

Riccardo Ali, Pietro Liò, Jamie Vicary

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种基于群表示理论的无参数方法,通过辅助损失将群的正则表示作为归纳偏置施加于近似等变网络,实验表明该方法在若干情形下与专门设计的模型相当或更优,包括针对无限群的模型。

详情
AI中文摘要

等变神经网络通过群作用来纳入对称性,将其作为归纳偏置以提高性能。现有方法要么在潜在空间上学习等变作用,要么设计在构造上即等变的架构。这些方法通常能取得良好的经验结果,但可能涉及架构特定的约束、大量参数和高计算成本。我们以一种基于群表示理论的无参数方法挑战复杂等变架构的范式。我们证明,对于有限群上的等变编码器,潜在空间几乎必然对每个线性无关的数据轨道包含该群正则表示的一个副本,并通过若干实证研究对此加以探讨。利用这一基础性的代数洞察,我们通过辅助损失将群的正则表示作为归纳偏置,不增加任何可学习参数。我们的广泛评估表明,该方法在若干情形下与专门设计的模型相当或更优,即使是针对无限群的模型。我们进一步通过消融研究验证了选择正则表示的合理性,表明其始终优于定义表示和平凡表示基线。

英文摘要

Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance. Existing methods learn an equivariant action on the latent space, or design architectures that are equivariant by construction. These approaches often deliver strong empirical results but can involve architecture-specific constraints, large parameter counts, and high computational cost. We challenge the paradigm of complex equivariant architectures with a parameter-free approach grounded in group representation theory. We prove that for an equivariant encoder over a finite group, the latent space must almost surely contain one copy of its regular representation for each linearly independent data orbit, which we explore with a number of empirical studies. Leveraging this foundational algebraic insight, we impose the group's regular representation as an inductive bias via an auxiliary loss, adding no learnable parameters. Our extensive evaluation shows that this method matches or outperforms specialized models in several cases, even those for infinite groups. We further validate our choice of the regular representation through an ablation study, showing it consistently outperforms defining and trivial group representation baselines.
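For readers unfamiliar with the regular representation, here is a small sketch that builds it for the cyclic group Z_n as permutation matrices and provides a plain matrix product for checking the homomorphism property. The abstract does not specify the auxiliary loss, so none is shown; the function names are illustrative.

```python
def regular_representation(n):
    """Permutation matrices rho(g) of the cyclic group Z_n acting on itself.

    rho(g)[i][j] is 1 exactly when i == (g + j) mod n, so rho(g) shifts
    the basis vector e_j to e_{(g + j) mod n}.
    """
    return [
        [[1 if i == (g + j) % n else 0 for j in range(n)] for i in range(n)]
        for g in range(n)
    ]

def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
```

With `rho = regular_representation(4)`, the product `matmul(rho[1], rho[2])` equals `rho[3]`, and the trace of `rho[g]` is `n` for the identity and `0` otherwise, which is the character of the regular representation.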

2506.04170 2026-05-19 quant-ph cond-mat.stat-mech cs.LG hep-lat hep-th 版本更新

Estimation of the reduced density matrix and entanglement entropies using autoregressive networks

利用自回归网络估计约化密度矩阵和纠缠熵

Piotr Białas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski

发表机构 * Institute of Applied Computer Science(应用计算机科学研究所) Institute of Theoretical Physics(理论物理学研究所) Doctoral School of Exact and Natural Sciences(精确与自然科学研究博士学院)

AI总结 本文提出将自回归神经网络应用于量子自旋链的蒙特卡罗模拟,利用其与经典二维自旋系统的对应关系,直接估算约化密度矩阵元素,并以Ising链为例计算了由最多5个自旋构成的区间的基态von Neumann和Rényi二分纠缠熵的连续极限。

Comments 9 pages, 7 figures

详情
Journal ref
Acta Physica Polonica B, Vol. 56 (2025), No. 12
AI中文摘要

我们提出利用量子自旋链与经典二维自旋系统的对应关系,将自回归神经网络应用于量子自旋链的蒙特卡罗模拟。我们使用一个能够估计相继自旋条件概率的神经网络层次结构,直接计算约化密度矩阵的元素。以Ising链为例,我们计算了由最多5个自旋构成的区间的基态von Neumann和Rényi二分纠缠熵的连续极限。我们证明,对于固定的时间离散化和晶格体积,我们的架构只需一次训练即可估计所有所需的矩阵元素。我们的方法可应用于其他类型的自旋链(可能带有缺陷),也可用于估计非零温度热态的纠缠熵。

英文摘要

We present an application of autoregressive neural networks to Monte Carlo simulations of quantum spin chains using the correspondence with classical two-dimensional spin systems. We use a hierarchy of neural networks capable of estimating conditional probabilities of consecutive spins to evaluate elements of reduced density matrices directly. Using the Ising chain as an example, we calculate the continuum limit of the ground state's von Neumann and Rényi bipartite entanglement entropies of an interval built of up to 5 spins. We demonstrate that our architecture is able to estimate all the needed matrix elements with just a single training for a fixed time discretization and lattice volume. Our method can be applied to other types of spin chains, possibly with defects, as well as to estimating entanglement entropies of thermal states at non-zero temperature.

2505.24438 2026-05-19 cs.LG 版本更新

Weisfeiler and Leman Follow the Arrow of Time: Expressive Power of Message Passing in Temporal Event Graphs

Weisfeiler和Leman跟随时间之箭:时间事件图中消息传递的表达能力

Franziska Heeg, Jonas Sauer, Petra Mutzel, Ingo Scholtes

发表机构 * Chair of Machine Learning for Complex Networks(复杂网络机器学习教授席) Center for AI and Data Science (CAIDAS)(人工智能与数据科学中心(CAIDAS)) University of Würzburg(维尔茨堡大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) University of Karlsruhe(卡尔斯鲁厄大学) Institute for Computer Science 1(计算机科学研究所1) University of Bonn(波恩大学)

AI总结 研究探讨了时间事件图中消息传递方法的表达能力,提出了一种基于一致事件图同构的扩展Weisfeiler-Leman算法,以区分非同构的时间图。

详情
AI中文摘要

时间图的一个重要特征是时间箭头如何影响其因果拓扑,即哪些节点可能通过时间尊重的路径因果地相互影响。由此产生的模式常被时间图神经网络(TGNNs)忽视。为了正式分析TGNNs的表达能力,我们缺乏一个将图同构扩展到时间图的一般化方法,以完全捕捉其因果拓扑。针对这一缺口,我们引入了一致事件图同构的概念,该概念利用了时间图中时间尊重路径的时间展开表示。我们比较了这一定义与现有时间图同构的概念。我们展示了并突出了我们方法的优势,并开发了一个时间图的Weisfeiler-Leman算法的扩展,以启发式地区分非同构的时间图。基于这一理论基础,我们推导出一种新的消息传递方案,用于时间图神经网络,该方案在时间图的事件图表示上运行。实验评估显示,我们的方法在时间图分类实验中表现良好。

英文摘要

An important characteristic of temporal graphs is how the directed arrow of time influences their causal topology, i.e., which nodes can possibly influence each other causally via time-respecting paths. The resulting patterns are often neglected by temporal graph neural networks (TGNNs). To formally analyze the expressive power of TGNNs, we lack a generalization of graph isomorphism to temporal graphs that fully captures their causal topology. Addressing this gap, we introduce the notion of consistent event graph isomorphism, which utilizes a time-unfolded representation of time-respecting paths in temporal graphs. We compare this definition with existing notions of temporal graph isomorphisms. We illustrate and highlight the advantages of our approach and develop a temporal generalization of the Weisfeiler-Leman algorithm to heuristically distinguish non-isomorphic temporal graphs. Building on this theoretical foundation, we derive a novel message passing scheme for temporal graph neural networks that operates on the event graph representation of temporal graphs. An experimental evaluation shows that our approach performs well in a temporal graph classification experiment.
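As background for the temporal generalization described above, the following is a minimal sketch of the classical (static) 1-dimensional Weisfeiler-Leman color refinement. It is not the temporal variant from the paper. Colors are kept as uncompressed nested tuples so that fingerprints of different graphs are directly comparable, which is only practical for tiny graphs and few rounds; the function name is illustrative.

```python
def wl_fingerprint(adj, rounds=3):
    """Static 1-WL color refinement on an undirected graph.

    adj: dict mapping each node to a list of its neighbours.
    Returns the sorted list of final node colors. Two graphs with different
    fingerprints are certainly non-isomorphic; equal fingerprints are
    inconclusive, which is why the test is only a heuristic.
    """
    colors = {v: () for v in adj}
    for _ in range(rounds):
        # New color = (old color, sorted multiset of neighbour colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return sorted(colors.values())
```

A triangle and a three-node path receive different fingerprints, whereas a six-cycle and two disjoint triangles receive the same one, a standard example of non-isomorphic graphs that 1-WL cannot separate.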

2505.21893 2026-05-19 cs.LG cs.AI 版本更新

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

SIPO: 用于对齐扩散模型的人类偏好优化的稳定与改进方法

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

发表机构 * Shanghai Science and Intelligence Institute, Shanghai, China(上海科学与智能研究所) Fudan University, Shanghai, China(复旦大学) Australian Institute for Machine Learning, The University of Adelaide(澳大利亚机器学习研究所,阿德莱德大学)

AI总结 本研究提出SIPO框架,通过时间步感知的重要性重新加权和梯度稳定技术,解决扩散模型对齐中训练不稳定和策略偏差问题,提升了对齐效果和稳定性。

Comments This version supplements with more detailed content on reasoning and proof, additional experimental results, and ablation studies

详情
AI中文摘要

偏好学习作为一种有效技术,已被广泛用于将扩散模型与人类偏好对齐在视觉生成中。然而,现有对齐方法如Diffusion-DPO面临两个根本性挑战:由于各个时间步的高梯度方差导致的训练不稳定以及由于优化数据与策略模型分布之间的差异引起的策略偏差。我们的第一项贡献是对不同时间步的扩散轨迹进行系统分析,发现不稳定性主要源于早期时间步的低重要性权重。为了解决这些问题,我们提出了SIPO,即一种用于将扩散模型与人类偏好对齐的稳定和改进的偏好优化框架。具体而言,引入了一个关键梯度,即DPO-C&M,通过裁剪和屏蔽无信息的时间步来稳定训练。随后,采用时间步感知的重要性重新加权范式以缓解策略偏差并在对齐过程中强调信息更新。在各种基线模型上进行的广泛实验,包括图像生成模型SD1.5、SDXL和视频生成模型CogVideoX-2B/5B、Wan2.1-1.3B,表明我们的SIPO在稳定训练和性能方面均优于现有对齐方法。总体而言,这些结果表明了时间步感知对齐的重要性,并为改进扩散模型的偏好优化提供了有价值的指导。

英文摘要

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variance across timesteps and high parameter sensitivity, and off-policy bias arising from the discrepancy between the optimization data and the policy model's distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved Preference Optimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, i.e., DPO-C&M, is introduced to stabilize training by clipping and masking uninformative timesteps. This is followed by a timestep-aware importance-reweighting paradigm to mitigate off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including the image generation models SD1.5 and SDXL and the video generation models CogVideoX-2B/5B and Wan2.1-1.3B, demonstrate that SIPO consistently promotes stabilized training and outperforms existing alignment methods, even those with meticulously adjusted parameters. Overall, these results suggest the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in aligning diffusion models.

2505.20218 2026-05-19 cs.LG 版本更新

Fine-grained List-wise Alignment for Generative Medication Recommendation

细粒度列表级对齐用于生成性药物推荐

Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出FLAME框架,通过细粒度列表级对齐方法,利用大语言模型生成药物列表,以提高药物推荐的准确性和安全性,同时考虑药物间的相互作用和潜在不良反应。

Comments NeurIPS 2025 Spotlight

详情
AI中文摘要

准确且安全的药物推荐对于有效的临床决策至关重要,尤其是在多病共存的情况下。然而,现有系统依赖于逐点预测范式,忽略了药物间的协同效应和潜在的不良药物-药物相互作用(DDIs)。我们提出FLAME,一种面向大语言模型(LLMs)的细粒度列表级对齐框架,能够逐个药物地生成药物列表。FLAME将推荐建模为一个顺序决策过程,每一步添加或移除一种药物。为了提供细粒度的学习信号,我们设计了带有基于势函数的奖励塑造的逐步组相对策略优化(GRPO),显式建模DDIs并优化每种药物对整体处方的贡献。此外,FLAME通过将结构化临床知识和协作信息整合到LLMs的表示空间中来增强患者建模。在基准数据集上的实验表明,FLAME实现了最先进的性能,具有更高的准确性、可控的安全性-准确性权衡,以及在多样化临床场景中的强泛化能力。我们的代码可在https://github.com/cxfann/Flame获取。

英文摘要

Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
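The abstract mentions potential-based reward shaping for the step-wise add/remove process. Here is a minimal sketch of the generic shaping rule r'_t = r_t + γΦ(s_{t+1}) − Φ(s_t). The potential used below (minus the number of interacting drug pairs in the current list) is purely hypothetical and is not taken from FLAME; both function names are illustrative.

```python
def shaped_rewards(rewards, potentials, gamma=1.0):
    """Apply r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t) along one trajectory.

    rewards: r_0 .. r_{T-1}.
    potentials: Phi(s_0) .. Phi(s_T), i.e. one more entry than rewards.
    """
    assert len(potentials) == len(rewards) + 1
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]

def ddi_potential(drug_set, interacting_pairs):
    """Hypothetical potential: minus the number of interacting pairs present."""
    return -sum(1 for a, b in interacting_pairs if a in drug_set and b in drug_set)
```

With `gamma = 1` the shaped rewards telescope: their sum equals the original return plus `Phi(s_T) - Phi(s_0)`, so intermediate steps get a denser signal without changing which complete drug lists are preferred.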

2505.17138 2026-05-19 cs.LG cs.AI 版本更新

RAP: Runtime Adaptive Pruning for LLM Inference

RAP: 用于大语言模型推理的运行时自适应剪枝

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

发表机构 * Faculty of Science and Technology, University of Macau, Macau, China(澳门大学科学与技术学院) School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China(电子科技大学信息与软件工程学院)

AI总结 本文提出RAP,一种基于强化学习的弹性剪枝框架,通过动态调整压缩策略来适应运行时内存变化和异构KV缓存需求,首次在推理过程中同时考虑模型权重和KV缓存。

详情
AI中文摘要

大语言模型(LLMs)在语言理解和生成方面表现出色,但其巨大的计算和内存需求限制了部署。压缩提供了一种潜在的解决方案来缓解这些约束。然而,大多数现有方法依赖于固定的启发式方法,因此无法适应运行时内存变化或来自多样化用户请求的异构KV缓存需求。为了解决这些限制,我们提出了RAP,一种由强化学习(RL)驱动的弹性剪枝框架,能够以运行时感知的方式动态调整压缩策略。具体而言,RAP动态跟踪实际执行过程中模型参数与KV缓存之间的演变比例。认识到前馈网络(FFNs)包含大部分参数,而参数轻量的注意力层主导KV缓存的形成,RL代理只保留那些在当前内存预算内最大化效用的组件,基于即时的工作负载和设备状态。广泛的实验结果表明,RAP优于最先进的基线方法,标志着首次在推理过程中同时考虑模型权重和KV缓存。

英文摘要

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.

2504.03035 2026-05-19 stat.ML cs.LG math.PR math.ST stat.ME stat.TH 版本更新

High-dimensional ridge regression with random features for non-identically distributed data with a variance profile

具有随机特征的高维岭回归:非同分布数据的方差轮廓

Issa-Mbenard Dabo, Jérémie Bigot

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Institut de mathématiques de Bordeaux(波尔多数学研究所)

AI总结 本文研究了在非同分布数据下,使用随机特征的高维岭回归,通过方差轮廓模型分析训练和测试风险的渐近等价,并揭示了异质方差轮廓对泛化性能的影响。

详情
AI中文摘要

随机特征岭回归通常在高维情形下基于同质采样模型$x_i=Σ^{1/2}x_i'$进行分析,其中向量$x_i'$的各分量独立同分布,且所有样本共享同一个协方差矩阵$Σ$。本文超越了这一设定,通过方差轮廓模型研究非同分布数据,其中训练和测试协变量具有依赖于行的对角协方差矩阵$Σ_i=\mathrm{diag}(γ_{i1}^2,\ldots,γ_{ip}^2)$和$\widetilde{Σ}_i=\mathrm{diag}(\tilde{γ}_{i1}^2,\ldots,\tilde{γ}_{ip}^2)$。我们的主要贡献是推导了当$n$、$p$和$m$按比例增长时,具有随机特征的岭回归的训练和测试风险的渐近等价量。第一组等价量是通过将线性加混沌近似与交通概率(traffic probability)论证相结合得到的,而第二组是确定性的,由算子值自由概率经对角线上的融合(amalgamation over the diagonal)论证得到。这些等价量在数值实验中是精确的。它们还揭示了异质方差轮廓(包括受MNIST启发的混合型轮廓)如何改变泛化性能,并在岭参数较小时表现出双下降行为。

英文摘要

Random feature ridge regression is often analyzed in the high-dimensional regime under the homogeneous sampling model $x_i=Σ^{1/2}x_i'$, where the vectors $x_i'$ have iid entries and the same covariance matrix $Σ$ is shared by all samples. In this paper, we move beyond this setting and study non-identically distributed data through a variance-profile model in which the training and test covariates have row-dependent diagonal covariance matrices $Σ_i=\mathrm{diag}(γ_{i1}^2,\ldots,γ_{ip}^2)$ and $\widetilde{Σ}_i=\mathrm{diag}(\tilde{γ}_{i1}^2,\ldots,\tilde{γ}_{ip}^2)$. Our main contribution is the derivation of asymptotic equivalents for the training and test risks of ridge regression with random features when $n$, $p$, and $m$ grow proportionally. The first set of equivalents is obtained by combining the linear-plus-chaos approximation with traffic-probability arguments, whereas the second set is deterministic and follows from operator-valued free probability through an amalgamation-over-the-diagonal argument. These equivalents are sharp in numerical experiments. They also reveal how heterogeneous variance profiles, including mixture-type profiles inspired by MNIST, can modify generalization and exhibit double-descent behavior when the ridge parameter is small.

2502.20969 2026-05-19 cs.DC cs.LG 版本更新

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

TeleRAG: 通过前瞻性检索实现高效的检索增强生成推理

Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Luis Ceze, Baris Kasikci

发表机构 * Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA(华盛顿大学保罗·G·艾伦计算机科学与工程学院,西雅图,华盛顿州,美国) Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA(哈佛大学约翰·A·保罗森工程与应用科学学院,剑桥,马萨诸塞州,美国) Shanghai Jiao Tong University, Shanghai, China(上海交通大学,上海,中国)

AI总结 本文提出TeleRAG,一种通过前瞻性检索机制减少延迟并提高吞吐量的高效检索增强生成推理系统,该系统在有限的GPU内存下实现了更高的性能和可扩展性。

详情
AI中文摘要

检索增强生成(RAG)通过外部数据源扩展大型语言模型(LLMs),以提高事实正确性和领域覆盖范围。现代RAG流水线依赖于大型数据存储,这带来了显著的系统挑战:在GPU内存有限时,实现高吞吐量和低延迟非常困难。为了解决这些挑战,我们提出了TeleRAG,一种高效的推理系统,该系统在最小的GPU内存需求下减少延迟并提高吞吐量。TeleRAG的核心创新是前瞻性检索,这是一种预取机制,可以预测所需的数据并将它们从CPU传输到GPU,与LLM生成同时进行。此外,TeleRAG采用预取调度器和缓存感知调度器,以支持高效的多GPU推理,且具有最小的开销。评估显示,TeleRAG在单查询情况下实现了高达1.53倍的端到端延迟减少,在批量处理时平均吞吐量提高了1.83倍,并在吞吐量方面表现出良好的可扩展性。这证实了TeleRAG在更快、更内存高效的RAG应用部署中的实用价值。

英文摘要

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, creating a significant system challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. To address these challenges, we propose TeleRAG, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that predicts required data and transfers them from CPU to GPU in parallel with LLM generation. In addition, TeleRAG adopts a prefetching scheduler and a cache-aware scheduler to support efficient multi-GPU inference with minimal overhead. Evaluations show TeleRAG achieves up to a 1.53x average end-to-end latency reduction (single-query) and 1.83x higher average throughput (batched), as well as good scalability in throughput. This confirms the practical utility of TeleRAG for faster and more memory-efficient deployments of RAG applications.

2502.02463 2026-05-19 stat.ML cs.LG 版本更新

Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation

分布变换器:通过实时先验适应实现快速近似贝叶斯推断

George Whittle, Juliusz Ziomek, Jacob Rawling, Michael A. Osborne

发表机构 * Mind Foundry Ltd(Mind Foundry有限公司)

AI总结 本文提出分布变换器,一种能够学习任意分布到分布映射的新型架构,通过实时先验适应实现快速近似贝叶斯推断,显著降低计算时间并达到与现有方法相当或更优的对数似然性能。

Comments Spotlight acceptance at ICML 2026

详情
AI中文摘要

尽管贝叶斯推断为不确定性下的推理提供了原则性框架,但精确后验计算的不可行性限制了其广泛应用,因而必须使用近似推断。然而,现有方法通常计算成本高,或在先验变化时需要昂贵的重新训练,这限制了它们的实用性,尤其是在实时传感器融合等序贯推断问题中。为了解决这些挑战,我们引入了分布变换器(Distribution Transformer),一种能够学习任意分布到分布映射的新型架构。我们的方法可以被训练为在给定某个数据集的条件下将先验映射到对应的后验,从而执行近似贝叶斯推断。该架构将先验分布表示为(具有通用逼近能力的)高斯混合模型(GMM),并将其变换为后验的GMM表示。GMM的各分量通过自注意力相互关注,并通过交叉注意力关注数据点。我们证明,分布变换器既保持了改变先验的灵活性,又显著减少了计算时间(从分钟级降至毫秒级),同时在序贯推断、量子系统参数推断以及带超先验的高斯过程预测后验推断等任务中,取得了与现有近似推断方法相当或更优的对数似然性能。

英文摘要

While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer -- a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset -- thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain flexibility to vary the prior and significantly reduce computation times -- from minutes to milliseconds -- while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.

2309.01243 2026-05-19 cs.CR cs.LG 版本更新

The Normal Distributions Indistinguishability Spectrum and its Application to Privacy-Preserving Machine Learning

正态分布不可区分光谱及其在隐私保护机器学习中的应用

Yu Wei, Yun Lu, Malik Magdon-Ismail, Vassilis Zikas

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Victoria(维多利亚大学) Rensselaer Polytechnic Institute(伦斯勒理工学院)

AI总结 本文研究了任何输出具有高斯分布的算法的隐私性,提出了一种名为正态分布不可区分光谱(NDIS)的通用引理,用于计算任意两个多元高斯分布之间的hockey-stick散度,并将其应用于随机投影等算法的隐私证明,从而实现更高效的隐私保护机制。

详情
AI中文摘要

我们研究了任何输出具有高斯分布的算法的隐私性。这项工作受到在多个有用(ML)应用中广泛使用此类算法,以及在数据上添加高斯噪声(如DP-SGD)之外相对较少关注隐私保护学习的启发。什么是任何具有多元高斯输出的算法的DP?我们通过一个通用引理来回答这个问题,该引理称为正态分布不可区分光谱(NDIS),用于计算任意两个多元高斯分布之间的hockey-stick散度δ,参数化为隐私参数ε。为了展示其实际影响,我们证明了NDIS引理的几个性质。这些性质形成了一组结果工具,可用于为任何高斯输出算法提供更简单的隐私证明。作为我们工具包的一个应用示例,我们证明了随机投影(RP)隐私的更紧参数化,并由此获得一个更节省噪声的DP机制。除了随机投影,NDIS可以用于将任何具有`sensitivity`(我们定义)的高斯输出算法提升为高斯输出的DP机制。该机制增强了现有算法中的随机性,使得可以将机制的隐私描述为单对高斯分布之间的IS,然后通过NDIS进行分析。最后,我们利用NDIS与广义χ²分布CDF之间的联系(具有高效的实证估计器)来提出一个用于高斯输出算法白盒审计的工具。

英文摘要

We investigate the privacy of any algorithm whose outputs have Gaussian distribution. This work is motivated by the prevalence of such algorithms in several useful (ML) applications, and the comparatively little research that focuses on privacy-preserving learning outside of adding Gaussian noise to the data (such as DP-SGD). What is the DP of any algorithm with multivariate Gaussian output? We answer the above research question with a general lemma which we call Normal Distributions Indistinguishability Spectrum (NDIS), a closed-form analytic computation of the hockey-stick divergence $δ$ between an arbitrary pair of multivariate Gaussians, parameterized by privacy parameter $ε$. To show its practical implications, we prove several properties of our NDIS lemma. These properties form a toolbox of results which lead to potentially easier privacy proofs for any Gaussian-output algorithm. As an example application of our toolbox, we prove a tighter parametrisation of the privacy of random projection (RP), and obtain from it a more noise-frugal DP mechanism. Beyond random projection, NDIS can be used to lift any Gaussian-output algorithm with a 'sensitivity' (which we define) to a Gaussian-output DP mechanism. The mechanism boosts the existing randomness in the algorithm, so that one can describe the mechanism's privacy as the IS between a single pair of Gaussians, which can then be analyzed via NDIS. Lastly, we leverage the connections between NDIS and the CDF of the generalized $χ^2$ distribution (which has efficient empirical estimators) to present a tool for white-box auditing of Gaussian-output algorithms.
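To make the hockey-stick divergence $δ(ε)$ concrete, the sketch below evaluates it in the simplest special case: two univariate Gaussians with the same variance and different means, where the closed form is the one familiar from the analytic Gaussian mechanism. This is not the general NDIS lemma for arbitrary multivariate Gaussians; the function names are illustrative.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hockey_stick_equal_variance(mu_gap, sigma, eps):
    """delta(eps) between N(0, sigma^2) and N(mu_gap, sigma^2), with mu_gap > 0.

    Special case only: delta = Phi(a - b) - exp(eps) * Phi(-a - b),
    where a = mu_gap / (2 sigma) and b = eps * sigma / mu_gap.
    """
    a = mu_gap / (2.0 * sigma)
    b = eps * sigma / mu_gap
    return std_normal_cdf(a - b) - math.exp(eps) * std_normal_cdf(-a - b)
```

At `eps = 0` the value reduces to the total variation distance between the two Gaussians, and it decreases as `eps` grows, which is the trade-off curve that a privacy analysis reads off.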

2307.08643 2026-05-19 cs.LG stat.ML 版本更新

Corruptions of Supervised Learning Problems: Typology and Mitigations

监督学习问题中的数据损坏:类型与缓解方法

Laura Iacovissi, Nan Lu, Robert C. Williamson

发表机构 * Tübingen AI Center, University of Tübingen(图宾根人工智能中心,图宾根大学)

AI总结 本文提出了一种通用的数据损坏理论,通过马尔可夫核分析底层概率分布的变化,统一了不同类型的损坏模型,并探讨了针对各种损坏类型的缓解方法。

Comments 73 pages. To be published in Journal of Machine Learning Research 27 (2026) 1-73

详情
AI中文摘要

数据损坏在数据收集中普遍存在。尽管已有大量研究,现有文献主要集中在特定设置和学习场景,缺乏对损坏建模和缓解的统一视角。本文提出了一种通用的损坏理论,涵盖对监督学习问题的所有修改,包括模型类和损失的变化。我们着眼于通过马尔可夫核对底层概率分布所作的改变,这一方法带来了三个新的机会。首先,它使我们能够构建一个新颖且可证明穷尽的损坏框架,区分不同类型的损坏,从而统一现有模型并建立一致的术语。其次,通过比较干净场景与损坏场景下的贝叶斯风险,它便于系统地分析损坏对学习任务的影响;值得注意的是,标签损坏只影响损失函数,而属性损坏还会影响假设类。第三,基于这些结果,我们研究了各种损坏类型的缓解方法。我们扩展了现有的针对标签损坏的损失修正方法,以处理依赖性损坏类型。我们的发现表明,有必要将这一经典的损坏修正学习框架推广到要求更弱的新范式,以涵盖更多损坏类型。我们给出了这种范式,以及属性损坏和联合损坏情形下的损失修正公式。

英文摘要

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize this classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.

2305.18578 2026-05-19 stat.ME cs.LG stat.ML 版本更新

Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models

快速自适应三元分割:一种适用于隐马尔可夫模型的高效解码过程

Alexandre Mösching, Housen Li, Axel Munk

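Affiliations: Nonclinical Biostatistics, F. Hoffmann-La Roche, Switzerland; Institute for Mathematical Stochastics, Cluster of Excellence “Multiscale Bioimaging: from Molecular Machines to Networks of Excitable Cells”; Georg-August-Universität Göttingen, Germany

AI summary: This paper proposes Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer decoding procedure whose complexity is polylogarithmic in the sequence length and cubic in the state-space size, making it suited to large-scale hidden Markov models with relatively few states. It approximately maximizes local likelihood scores through an adaptive search; simulations show speedups over Viterbi and PMAP, and the paper also reports a precision analysis.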

Details
Journal ref
Journal of Computational and Graphical Statistics, 35(2), 865-879, 2026
AI Chinese abstract

隐马尔可夫模型(HMMs)由一个不可观测的马尔可夫链和一个可观测的过程组成——隐藏链的噪声版本。从噪声观测中解码原始信号是几乎所有基于HMM的数据分析的主要目标。现有的解码算法,如维特比算法和点最大后验(PMAP)算法,其计算复杂度在最坏情况下是观测序列长度的线性函数,或隐藏链状态空间大小的亚二次函数。我们提出了快速自适应三元分割(QATS),一种分治策略,其计算复杂度在序列长度上为多项对数,在状态空间大小上为三次方,因此特别适用于具有相对较少状态的大规模HMM。它还提出了一种有效的数据存储方法,即特定的累积和。本质上,估计的状态序列在所有最多三个段的局部路径中最大化局部似然得分,并且是可接受的。最大化仅通过自适应搜索过程近似进行。我们的模拟展示了QATS相比维特比和PMAP的速度提升,以及精度分析。QATS的实现可在GitHub上的R包QATS中找到。

English abstract

Hidden Markov models (HMMs) are characterized by an unobservable Markov chain and an observable process, a noisy version of the hidden chain. Decoding the original signal from the noisy observations is one of the main goals in nearly all HMM-based data analyses. Existing decoding algorithms such as Viterbi and the pointwise maximum a posteriori (PMAP) algorithm have computational complexity at best linear in the length of the observed sequence, and sub-quadratic in the size of the state space of the hidden chain. We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure with computational complexity polylogarithmic in the length of the sequence, and cubic in the size of the state space, hence particularly suited for large-scale HMMs with relatively few states. It also suggests an effective way of storing the data as specific cumulative sums. In essence, the estimated sequence of states sequentially maximizes local likelihood scores among all local paths with at most three segments, while remaining admissible. The maximization is performed only approximately using an adaptive search procedure. Our simulations demonstrate the speedups offered by QATS in comparison to Viterbi and PMAP, along with a precision analysis. An implementation of QATS is available in the R package QATS on GitHub.
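The remark about storing the data as cumulative sums can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that the stored quantities are per-state running sums of log emission densities, so that the emission log-likelihood of any segment held in one state is a difference of two stored values. The Gaussian emission model and the names `build_cumsums` and `segment_loglik` are hypothetical and are not taken from the QATS package.

```python
import numpy as np

def build_cumsums(y, means, sigma=1.0):
    """Per-state cumulative sums of Gaussian log-densities (assumed emission model).

    Returns S of shape (m, n + 1) with S[s, k] = sum over i < k of log f_s(y[i]).
    """
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float)
    # log N(y_i; mean_s, sigma^2) for every state s and observation i
    logf = (-0.5 * ((y[None, :] - means[:, None]) / sigma) ** 2
            - np.log(sigma * np.sqrt(2.0 * np.pi)))
    S = np.zeros((means.size, y.size + 1))
    S[:, 1:] = np.cumsum(logf, axis=1)
    return S

def segment_loglik(S, state, left, right):
    """Emission log-likelihood of y[left:right] under a single state, in O(1)."""
    return S[state, right] - S[state, left]
```

After a single linear pass to build `S`, scoring a candidate segment costs constant time, which is the kind of property a search over segment boundaries would rely on.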

2605.18147 2026-05-19 cs.LG version update

Foundation Models for Credit Risk Prediction: A Game Changer?

信贷风险预测的基础模型:变革性突破?

Bart Baesens, Andreas Goethals, Stefan Lessmann, Simon De Vos, Cristián Bravo, David Martens, Victor Medina-Olivares, Christophe Mues, Maria Oskarsdóttir, Seppe vanden Broucke, Tim Verdonck, Wouter Verbeke

Affiliations: Faculty of Economics and Business, KU Leuven, Belgium; School of Business and Economics, Humboldt University of Berlin, Germany; Department of Statistical and Actuarial Sciences, Western University, Canada; Department of Engineering Management, University of Antwerp, Belgium; Business School, University of Edinburgh, United Kingdom; Business School, University of Southampton, United Kingdom; School of Mathematical Sciences, University of Southampton, United Kingdom; Department of Business Informatics and Operations Management, Ghent University, Belgium; Department of Mathematics, University of Antwerp, Belgium; Department of Mathematics, KU Leuven, Belgium

AI summary: This paper studies foundation models for credit risk prediction, examines whether they improve predictive performance in small-data settings, and benchmarks tabular foundation models against a broad set of competitors, finding that they generally perform best on PD and LGD modeling tasks.

Details
AI Chinese abstract

预测模型在信贷风险管理中发挥着关键作用,通过准确估计违约概率和损失来指导关键决策。大量研究引入了新的建模技术,并通过大规模基准研究巩固了最先进的方法。如今,梯度提升模型配以SHAP解释器已成为准标准,但风险模型的持续改进仍是首要任务。同时,人工智能的快速进展,尤其是大型语言模型,已颠覆了预测建模范式。基础模型通过在广泛领域数据集上预训练,利用先验知识表现出色。尽管在自然语言处理和计算机视觉中广泛应用,但针对表格数据的基础模型才刚刚出现。我们推测,在小数据设置中,如中小企业贷款或专门化的公司投资组合中,使用非领域数据进行预训练可能特别有益,并可能帮助解决长期存在的挑战,包括低违约率投资组合和类别不平衡问题。本文将最近提出的方法与广泛竞争对手进行基准测试,包括已建立和先进的机器学习技术,在PD和LGD建模两个核心任务中进行评估。我们的评估涵盖了各种数据集、性能指标和实验条件。我们发现,表格基础模型在各种数据集和任务中表现最佳。此外,当数据集规模减小时,它们在预测性能上提供了显著改进。这些结果令人印象深刻,因为模型在即开即用的情况下进行测试,无需超参数调优,确保了易用性和降低了计算成本。

English abstract

Predictive models play a pivotal role in credit risk management, guiding critical decisions through accurate estimation of default probabilities and losses. Extensive research has introduced new modeling techniques, complemented by large-scale benchmarking studies consolidating the state-of-the-art. Today, quasi-standards such as gradient-boosting models paired with SHAP explainers have emerged, yet continuous improvement of risk models remains a top priority. Concurrently, rapid advancements in AI, most notably large language models, have disrupted predictive modeling paradigms. Foundation models, pretrained on extensive datasets from diverse domains, have demonstrated remarkable performance by leveraging prior knowledge. While prevalent in natural language processing and computer vision, foundation models for tabular data have only recently emerged. We conjecture that pretraining on out-of-domain data is particularly beneficial in small-data settings, such as SME lending or specialized corporate portfolios, and may help address longstanding challenges including low default portfolios and class imbalance. This paper benchmarks recently proposed tabular foundation models against a broad set of competitors, including established and advanced machine learning techniques, across two core tasks: PD and LGD modeling. Our evaluation encompasses various datasets, performance indicators, and experimental conditions. We find that tabular foundation models generally perform best across datasets and tasks. Moreover, they offer significant improvement in predictive performance as dataset size shrinks. These results are remarkable given that the models are tested out-of-the-box, without hyperparameter tuning, ensuring ease of use and mitigating computational costs.

2605.18082 2026-05-19 cs.LG version update

pyforce-1.0.0: Python Framework for data-driven model Order Reduction of multi-physiCs problEms

pyforce-1.0.0: 用于多物理问题数据驱动模型降阶的Python框架

Stefano Riva, Yantao Luo, Carolina Introini, Antonio Cammi

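Affiliations: Department of Energy, Nuclear Engineering Division, Politecnico di Milano; Department of Mechanical and Nuclear Engineering and Emirates Nuclear Technology Center, Khalifa University

AI summary: This paper presents pyforce 1.0.0, a Python framework implementing data-driven reduced-order modelling techniques for multi-physics problems, mainly in nuclear engineering. Its algorithms reduce model complexity, search for optimal sensor positions, and integrate real measurements to improve knowledge of the physical system.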

Comments Github Repo: https://github.com/ERMETE-Lab/ROSE-pyforce

Details
AI Chinese abstract

pyforce是一个实现数据驱动降阶建模技术的Python包,主要用于多物理问题的应用,主要集中在核工程领域。该包是ROSE(用于多物理问题的数据驱动降阶建模)的一部分:数学算法旨在减少多物理模型的复杂性(用于核反应堆应用),寻找最优传感器位置,并整合真实测量以提高对物理系统的认识。与之前的基于dolfinx包的原始实现(v0.6.0)相比,pyforce 1.0.0完全重写,使用pyvista作为网格导入、积分计算和结果可视化后端;此外,函数存储为numpy数组,提高了包的易用性。这一选择允许pyforce与任何能够导出VTK格式结果的软件求解器一起使用。

English abstract

pyforce is a Python package implementing data-driven reduced-order modelling techniques for multi-physics problems, mainly in nuclear engineering. The package is part of ROSE (Reduced Order modelling with data-driven techniques for multi-phySics problEms), a collection of mathematical algorithms aimed at reducing the complexity of multi-physics models (for nuclear reactor applications), searching for optimal sensor positions, and integrating real measurements to improve knowledge of the physical system. Compared with the original implementation based on the dolfinx package (v0.6.0), version 1.0.0 of pyforce has been completely rewritten using pyvista as the backend for mesh import, integral computation, and visualisation of results; in addition, functions are stored as numpy arrays, which makes the package easier to use. This choice allows pyforce to be used with any solver able to export results in VTK format.

2605.18079 2026-05-19 cs.LG cs.CC cs.CL version update

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

低精度softmax变换器的表达能力(摘要)链式思维

Moritz Brösamle, Stephan Eckstein

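Affiliations: Department of Mathematics, University of Tübingen, Germany

AI summary: This paper studies the expressive power of low-precision softmax transformers with Chain-of-Thought. It constructs hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines, converts them into equivalent softmax transformers, and analyzes how efficiently the recently proposed summarized CoT paradigm simulates Turing machines.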

Comments Accepted to ICML 2026

Details
AI Chinese abstract

现有的变换器表达性结果通常依赖于hardmax注意力、高精度和其它架构修改,这些修改将它们与实际使用的模型脱节。我们通过分析具有softmax注意力和激活值及注意力权重四舍五入的标准变换器解码器,同时允许深度和宽度以对数方式增长于上下文长度,来弥合这一差距。作为中间步骤,我们构造了具有三元激活和良好分离注意力分数的硬max变换器,利用链式思维(CoT)模拟图灵机。这使我们能够将构造转换为等效的softmax变换器,而无需先前方法所需的不现实的参数规模或激活精度。使用相同的技术,我们分析了最近提出的总结Co T范式,并展示其在模拟图灵机时更加高效,模型大小以空间界而非时间界缩放。我们通过在数独推理任务上验证我们的结果,并发现其比先前的高精度结果更符合可学习性。我们的代码可在https://github.com/moritzbroe/transformer-expressivity上获得。

English abstract

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.

2605.18078 2026-05-19 cs.LG version update

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

通过对手感知盆地入口进行多智能体策略梯度的均衡选择

Yevhen Shcherbinin, Arina Redina, Maxim Kalpin, Vlad Kochetov

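Affiliations: Bloomsbury Technology; London School of Economics and Political Science; University of Bristol; Johannes Kepler University Linz; Odesa Polytechnic National University

AI summary: This paper studies equilibrium selection for multi-agent policy-gradient methods, which converge only locally near stable Nash equilibria. It shows that an opponent-aware (peer-learning) correction raises the probability of entering the basin of a target equilibrium set, and reports experiments showing increased entry into cooperative basins.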

Details
AI Chinese abstract

多智能体策略梯度方法已被证明能够局部收敛到稳定的纳什均衡。然而,局部收敛并不决定最终达到哪一个均衡。本文通过相对于由外部标准(如收益支配)选择的目标均衡集的盆地入口概率来研究这一问题。对于有限展开的元Meta-MAPG,我们证明更新可以分解为普通的策略梯度加上自身学习和同伴学习的修正,其中包含受控的采样噪声和有限展开偏差。我们识别出同伴学习修正作为主要的均衡选择机制:在局部对齐条件下,进入目标稳定纳什集的认证吸引区域的概率相对于普通的策略梯度会增加。由于持续的修正可能会改变原始游戏的零更新点,进入盆地后对修正进行退火可以恢复普通的策略梯度动态,并继承局部稳定的纳什收敛保证。在 stag hunt、迭代囚徒困境和初步的神经策略协调环境中的实验支持了这一盆地入口观点,显示在同伴意识更新下合作盆地的进入概率增加。

English abstract

Multi-agent policy-gradient methods have been shown to converge locally near stable Nash equilibria. Local convergence, however, does not determine which equilibrium is reached. We study this question through basin-entry probability with respect to a target set of equilibria selected by an external criterion, such as payoff dominance. For finite-unroll Meta-MAPG, we show that the update decomposes into ordinary policy gradient plus own-learning and peer-learning corrections, with controlled sampling noise and finite-unroll bias. We identify the peer-learning correction as the main equilibrium-selection mechanism: under a local alignment condition, the probability of entering the certified attraction region of the target stable-Nash set increases, relative to ordinary policy gradient. Because persistent correction may shift zero-update points of the original game, annealing the correction after entering the basin recovers ordinary policy-gradient dynamics and inherits local stable-Nash convergence guarantees. Experiments in Stag Hunt, iterated Prisoner's Dilemma, and preliminary neural-policy coordination environments support this basin-entry view, showing increased entry into cooperative basins under peer-aware updates.
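The point that local convergence does not determine which equilibrium is reached can be seen in a small sketch of ordinary (uncorrected) policy gradient in a Stag Hunt. Everything here is an illustrative assumption: the payoff values, the sigmoid policy parameterization, the use of exact expected gradients instead of sampled ones, and the function name `run_policy_gradient`. It does not implement Meta-MAPG or the peer-learning correction.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def run_policy_gradient(p_init, steps=2000, lr=1.0):
    """Exact policy-gradient ascent for two symmetric players in a 2x2 Stag Hunt.

    Assumed payoffs for each player: stag/stag = 4, stag/hare = 0, hare/anything = 3.
    Each player chooses stag with probability sigmoid(theta).
    """
    theta1 = theta2 = math.log(p_init / (1.0 - p_init))
    for _ in range(steps):
        p1, p2 = sigmoid(theta1), sigmoid(theta2)
        # Expected payoff of player 1 is 4*p1*p2 + 3*(1 - p1); the factor
        # p1*(1 - p1) is the derivative of the sigmoid (chain rule).
        g1 = (4.0 * p2 - 3.0) * p1 * (1.0 - p1)
        g2 = (4.0 * p1 - 3.0) * p2 * (1.0 - p2)
        theta1 += lr * g1
        theta2 += lr * g2
    return sigmoid(theta1), sigmoid(theta2)
```

Under these assumed payoffs the gradient changes sign at a stag probability of 0.75, so a start at 0.9 drifts toward the payoff-dominant all-stag equilibrium while a start at 0.5 drifts toward all-hare. The basin-entry question in the abstract is about raising the probability of starting, or ending up, on the favourable side of such a boundary.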

2605.18069 2026-05-19 stat.ML cs.LG math.PR math.ST stat.TH version update

Wasserstein bounds for denoising diffusion probabilistic models via the Föllmer process

通过Föllmer过程研究去噪扩散概率模型的Wasserstein界限

Yuta Koike

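Affiliations: Graduate School of Mathematical Sciences, University of Tokyo; CREST, Japan Science and Technology Agency

AI summary: This paper studies sampling error bounds for denoising diffusion probabilistic models (DDPMs) in the 2-Wasserstein distance, with three contributions: (i) sharp upper bounds under general Lipschitz-type conditions and a broad class of variance schedules, including the cosine schedule; (ii) a proof that the same Lipschitz-type conditions imply a logarithmic Sobolev inequality and hence a quadratic transportation cost inequality; and (iii) a demonstration that, for general log-concave target distributions, the optimal Wasserstein error bound remains attainable even without a quadratic transportation cost inequality for the target.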

Comments 45 pages

Details
AI Chinese abstract

本文研究了去噪扩散概率模型(DDPMs)在2-Wasserstein距离下的采样误差界限。我们的贡献有三个方面。 (i) 在一般Lipschitz型条件和广泛方差调度(包括余弦调度)下,我们建立了最优的上界,该上界在维度和步骤数上都是最优的,并恢复了文献中已获得的几个最优误差界限。 (ii) 我们证明了相同的Lipschitz型条件,涵盖了通常施加于(学习的)得分函数的条件,蕴含对数Sobolev不等式以及DDPM的二次运输成本不等式。因此,在现有工作的覆盖设置中,最优的Wasserstein界限(在对数因子范围内)可以从最近在Kullback-Leibler散度下的最优误差界限中推导出来。 (iii) 我们展示了对于一般的对数凹目标分布,即使没有目标的二次运输成本不等式,最优的Wasserstein误差界限仍可达到。我们的分析基于将DDPM采样器视为Föllmer过程的离散化,而不是传统的反向Ornstein-Uhlenbeck过程。

English abstract

This paper studies sampling error bounds for denoising diffusion probabilistic models (DDPMs) in the 2-Wasserstein distance. Our contributions are threefold. (i) Under general Lipschitz-type conditions on the score function and for a broad class of variance schedules, including the cosine schedule, we establish sharp upper bounds that are optimal in both the dimension and the number of steps, and recover several sharp error bounds previously obtained in the literature. (ii) We prove that the same Lipschitz-type conditions, which encompass those commonly imposed on the (learned) score, imply a logarithmic Sobolev inequality and hence a quadratic transportation cost inequality for the DDPM. As a consequence, in settings covered by existing work, an optimal Wasserstein bound, up to a logarithmic factor, follows from the recently obtained sharp error bound in the Kullback-Leibler divergence under geometric-type variance schedules. (iii) We show that for general log-concave target distributions, the optimal Wasserstein error bound remains attainable even without a quadratic transportation cost inequality for the target. Our analysis is based on viewing the DDPM sampler as a discretization of the Föllmer process rather than the conventional reverse Ornstein-Uhlenbeck process.
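For readers unfamiliar with the cosine schedule mentioned in contribution (i), the following is a minimal sketch of its commonly used form, as popularized by the improved-DDPM work of Nichol and Dhariwal. The offset `s = 0.008` and the clipping value `max_beta = 0.999` are the usual defaults and are assumptions here; the abstract does not say which parameterization the paper analyzes.

```python
import numpy as np

def cosine_beta_schedule(num_steps, s=0.008, max_beta=0.999):
    """Per-step variances beta_1..beta_T from a cosine-shaped alpha-bar curve."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]                        # cumulative signal level, alpha_bar[0] = 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)        # clipping avoids beta = 1 at the last step
```

The schedule is defined through the cumulative product `alpha_bar`, and the individual step variances are recovered from ratios of consecutive values.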

2605.18068 2026-05-19 cs.LG cs.AI version update

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

通过缓解过压缩来改进时空残差误差传播

Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu

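Affiliations: ETRO Department, Vrije Universiteit Brussel; imec

AI summary: This paper proposes Teger, a module that improves error-correlated autoregressive forecasting through a spatial curvature-aware graph rewiring mechanism, yielding better Continuous Ranked Probability Score (CRPS) in spatio-temporal forecasting.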

Details
AI Chinese abstract

残差误差传播仍然是递归模型中的基本问题,其中小的预测不准确会随时间累积并降低长周期性能。准确建模此类残差的相关结构对于概率多变量时间序列预测中的可靠不确定性量化至关重要。尽管最近的时间序列深度模型能够高效参数化时间变化的同期相关性,但它们通常假设误差的时序独立性,并忽略了观测网络中的空间相关性。在本文中,我们引入Teger,一个结构化的不确定性模块,克服了误差相关自回归预测中的空间和时间限制。Teger提出了一种空间曲率感知的图重排机制,明确加强了由离散Forman曲率识别出的信息瓶颈边。该组件被集成到低秩加对角协方差头中,通过Woodbury恒等式保持可推断性。Teger是backbone无关的,仅需任何自回归编码器产生的潜在状态。我们提供了Teger的理论证据,并在四个现实世界的时空数据集上实验评估了它在LSTM、Transformer和xLSTM backbone上的表现,显示了连续排名概率得分的一致改进。我们进一步提供了将曲率感知重排与(i)过压缩缓解、(ii)改进的谱连接性、(iii)减少有效电阻以及(iv)改进的协方差校准界联系起来的正式理论分析。

English abstract

Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate time-series forecasting. While recent deep time-series models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spatial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism that explicitly strengthens information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical support for Teger and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) over-squashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds.
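The role of the Woodbury identity for a low-rank-plus-diagonal covariance can be sketched as follows. This is a generic illustration, not Teger's code: it assumes a covariance of the form diag(d) + U Uᵀ with positive `d`, and the function name `woodbury_solve` is hypothetical.

```python
import numpy as np

def woodbury_solve(d, U, x):
    """Solve (diag(d) + U @ U.T) z = x using the Woodbury identity.

    Only an r x r linear system is solved, where r = U.shape[1] is the rank
    of the low-rank term, instead of an n x n one.
    """
    d_inv_x = x / d
    d_inv_U = U / d[:, None]
    small = np.eye(U.shape[1]) + U.T @ d_inv_U   # r x r capacitance matrix
    return d_inv_x - d_inv_U @ np.linalg.solve(small, U.T @ d_inv_x)
```

With n series and rank r much smaller than n, the dominant cost drops from cubic in n to roughly n·r² plus r³, which is what keeps inference tractable for a covariance head of this shape.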

2605.18063 2026-05-19 cs.CV cs.LG version update

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

MixCount数据集:弥合开放词汇物体计数的数据缺口

Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua

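Affiliations: EPFL; University of Oxford; Northwestern University

AI summary: This paper presents the MixCount dataset, built with an automatic generation pipeline to address the shortage of data for mixed-object scenes in open-vocabulary object counting, and reports substantial gains on real-world benchmarks.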

Comments Co-first authors. Dataset and project page https://corentindumery.github.io/projects/mixcount.html

Details
AI Chinese abstract

物体计数是一个基础的视觉任务,已有超过十年的专门研究,但最先进的模型在混合物体设置中仍系统性地失败,这在工业检测和产品分拣等现实应用中占主导地位。我们证明,这一差距主要是由现有训练和评估数据的限制造成的:真实的计数数据集标注成本过高且存在标签噪声,而现有的合成替代方案缺乏多样性和现实感。我们通过MixCount数据集和基准来解决这一问题,该数据集旨在针对当前计数模型的失败模式。为了克服构建和标注此类数据的高成本,我们开发了一种自动生成管道,能够大规模合成图像、细粒度文本描述和像素级计数注释,消除了此前数据集中的标注模糊性。在MixCount上评估最先进的计数模型会暴露混合物体设置下的严重退化。更重要的是,将这些模型在我们的合成数据上训练,在真实世界基准上取得了显著提升,将FSC-147的MAE降低了20.14%,在PairTally上降低了18.3%。这些结果确立了MixCount作为细粒度计数的基准和训练数据集,并证明了我们的管道能够产生实际上无限的标注数据,从而解决了计数模型中长期存在的瓶颈问题。

English abstract

Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.

2605.18055 2026-05-19 cs.LG cs.AI version update

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG: 通过图结构的潜在扩散对齐实现基础模型表示以空间基因表达预测

Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao, Xuyang Liu, Xin Guo, Yuan Qi, Yuan Cheng

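Affiliations: Shanghai Academy of Artificial Intelligence for Science, Shanghai, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; Incubation Institute, Fudan University, Shanghai, China

AI summary: This paper proposes FLAG, a framework that uses graph-based latent diffusion alignment to preserve gene coordination and spatial distribution in spatial gene expression prediction. It introduces the notion of a Gene Dimension Curse and addresses it with a spatial graph encoder and gene foundation model alignment, improving structural consistency and gene-gene fidelity.

Comments 9 pages for main text, 3 pages for references, 19 pages for appendix. Accepted by ICML 2026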

Details
AI Chinese abstract

从常规的H&E染色预测空间基因表达能够实现大规模分子谱分析,但当前模型将此任务视为孤立的点wise任务,从而忽略了诸如基因协调和空间分布等关键生物结构。为保持这些关系,我们引入FLAG,一种基于扩散的框架,将此任务重新定义为结构分布建模。同时,我们识别出关键的基因维度诅咒,即联合建模基因表达及其空间相互作用在高维空间中失效,而FLAG通过整合空间图编码器以实现拓扑一致性,并利用基因基础模型(GFM)对齐以在生成过程中保持基因-基因的保真度。为严格评估模型性能,我们提出了一组新的结构评估度量标准,包括基因结构相关性(GSC)和空间结构相关性(SSC)。我们的实验表明,FLAG在传统准确性(PCC/MSE)方面具有高度竞争力,同时在捕捉基因-基因和基因-空间关系时实现了显著增强的结构保真度。代码可在https://github.com/darkflash03/FLAG上获取。

English abstract

Predicting spatial gene expression from routine H&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce FLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical Gene Dimension Curse, where jointly modeling gene expression and its spatial interactions fails in high-dimensional spaces; FLAG addresses this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.

2605.18053 2026-05-19 cs.LG cs.CL cs.CR cs.PF version update

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

保护几乎就是一切:结构保护在全局受限的KV淘汰中占据主导地位

Gabriel Garcia

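Affiliations: Independent Researcher

AI summary: This paper studies KV cache eviction under a shared, globally capped decode-time harness. It finds that structural protection is the key factor in preserving quality: reserving cache at prompt boundaries recovers most of the reference-ceiling quality, and the effect holds across different models.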

Comments 38 pages, 6 figures, 25 tables (includes one longtable). Code and figure regeneration scripts: https://github.com/gpgabriel25/KVCacheBoundaryProtection

Details
AI Chinese abstract

我们研究了在共享全局受限解码时间Harness下的KV缓存淘汰问题。七种策略(LRU、H2O、SnapKV、StreamingLLM、Ada-KV、QUEST、Random)共享一个提示边界漏洞:在没有结构保护的情况下,它们在六个纯Transformer模型上几乎降级到质量为零(F1≤0.064)。保留每个边界10%的缓存可在七个LongBench模型上恢复69-90%的C=2048参考天花板质量(13%保留率);一个十模型面板覆盖68-98%。一个注意力质量试点(Qwen2.5-3B,N=30)表明原因:位置0的sink占据约75%的前缀质量,而其他边界token接近约0.41倍的均匀期望,因此注意力评分器保留sink但仍会丢弃结构关键token。有保护的情况下,简化评分隔离变体在K=32时与LRU TOST等价(Δ=0.02);在K=8时,注意力策略对彼此收敛但仍比LRU高0.011-0.021 F1(在C=256和C=512时)。忠实的Ada-KV/QUEST在Mistral-7B和Phi-3.5上添加约0.03-0.04 F1超过简化变体。在Qwen3-4B上的NIAH-32K领域转移试点(解码vs.预填,C∈{512,2048})显示保护提升几乎相同(比率0.99-1.00)。在64K时,保护有所帮助但恢复有限;忠实的每头评分在Gemma-3-4B上仅在模型已支持强64K检索而不淘汰时才能在6.3%保留率下达到全缓存天花板。总体而言:保护占主导;一旦边界被保护,评分差异变得次要;每头分配进一步带来小幅提升。

English abstract

We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1 $\leq 0.064$). Reserving 10% of cache at each boundary recovers 69-90% of the $C = 2{,}048$ reference-ceiling quality on seven LongBench models at $C = 256$ (13% retention); a ten-model panel spans 68-98%. An attention-mass pilot (Qwen2.5-3B, $N = 30$) suggests why: the position-0 sink holds $\sim 75\%$ of prefix mass, while other boundary tokens sit near $\sim 0.41\times$ uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K = 32$ ($\Delta = 0.02$); at $K = 8$, attention policies pairwise converge yet beat LRU by 0.011-0.021 F1 across $C = 256$ and $C = 512$. Faithful Ada-KV/QUEST add $\sim 0.03$-$0.04$ F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs. prefill, $C \in \{512, 2048\}$) shows near-identical protection lifts (ratio 0.99-1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
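The idea of structural protection can be sketched as a selection rule that reserves part of the cache budget before any scoring policy is applied. This is a hypothetical reading, not the paper's harness: it assumes that "each boundary" means the first and last positions of the prompt, that the reserve is `int(protect_frac * cap)` positions per boundary, and the function name `select_kept_positions` is invented for illustration.

```python
def select_kept_positions(scores, prompt_len, cap, protect_frac=0.10):
    """Choose which cache positions to keep under a global cap.

    scores: one importance score per cached position, from any scoring policy.
    Protected positions (assumed reading of "boundary"): the first and last
    k prompt positions, where k = int(protect_frac * cap).
    """
    n = len(scores)
    if n <= cap:
        return list(range(n))
    k = int(protect_frac * cap)
    head = set(range(min(k, prompt_len)))
    tail = set(range(max(prompt_len - k, 0), prompt_len))
    protected = head | tail
    # Fill the remaining budget with the highest-scoring unprotected positions.
    rest = sorted((i for i in range(n) if i not in protected),
                  key=lambda i: scores[i], reverse=True)
    budget = max(cap - len(protected), 0)
    return sorted(protected | set(rest[:budget]))
```

The scoring policy only competes for the unreserved part of the budget, so swapping one scorer for another changes which middle positions survive but never evicts the protected boundary positions.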

2605.18040 2026-05-19 stat.ML cs.LG math.PR version update

A note on connections between the Föllmer process and the denoising diffusion probabilistic model

关于Föllmer过程与去噪扩散概率模型之间联系的注记

Yuta Koike

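Affiliations: Graduate School of Mathematical Sciences, University of Tokyo; CREST, Japan Science and Technology Agency

AI summary: This note explores connections between the Föllmer process and the denoising diffusion probabilistic model (DDPM), showing that discretized Föllmer processes give natural hyper-parameter settings for the DDPM sampler and systematically recovering existing results on DDPM sampling error bounds.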

Comments 32 pages

Details
AI Chinese abstract

Föllmer过程是一种在时间1处具有预指定分布的布朗运动。该过程可以被解释为去噪扩散概率模型(DDPM)的逆随机微分方程(SDE)的'增强'时间压缩版本。尽管这一事实已间接用于通过逆SDE的离散化分析DDPM采样误差,但Föllmer过程的直接离散化与DDPM采样器之间的联系尚未被充分探讨。本文旨在澄清这一点,并回顾现有工作中相关的结果。我们证明离散化的Föllmer过程可以作为DDPM采样器的自然超参数设置。此外,这使我们能够系统地恢复最先进的DDPM采样误差界结果,并稍作改进。

English abstract

The Föllmer process is a Brownian motion conditioned to have a pre-specified distribution at time 1. This process can be interpreted as an "augmented" time-compressed version of the reverse stochastic differential equation (SDE) for the denoising diffusion probabilistic model (DDPM). While this fact has been indirectly used to analyze DDPM sampling errors via discretization of the reverse SDE, connections between direct discretization of the Föllmer process and the DDPM sampler have not yet been fully explored. This note aims to clarify this point while surveying relevant results from existing work. We show that discretized Föllmer processes give natural hyper-parameter settings of the DDPM sampler. Moreover, this allows us to systematically recover state-of-the-art results on DDPM sampling error bounds with slight improvements.

2605.18035 2026-05-19 cs.AI cs.LG version update

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

零阶硬阈值化中方差减少的新见解:缓解梯度误差和扩张性矛盾

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

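Affiliations: IASM, Harbin Institute of Technology; Mohamed bin Zayed University of Artificial Intelligence; School of Artificial Intelligence, Jilin University

AI summary: This paper proposes a generalized variance-reduced zeroth-order hard-thresholding algorithm. By accounting for the role of variance, it mitigates the conflict between zeroth-order gradients and the hard-thresholding operator, removing the restriction on the number of random directions and improving convergence rates and applicability.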

Comments Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Details
Journal ref
International Conference on Learning Representations (ICLR), 2024
AI Chinese abstract

硬阈值化是机器学习中用于解决ℓ0约束优化问题的重要算法类型。然而,在某些情况下,目标函数的真实梯度可能难以获取,通常可以通过零阶(ZO)方法进行近似。到目前为止,SZOHT算法是唯一能够处理ℓ0稀疏性约束的ZO梯度算法。不幸的是,由于零阶梯度的偏差与硬阈值操作的扩张性之间存在固有的矛盾,SZOHT在ZO梯度的随机方向数量上存在明显的限制。本文通过考虑方差的作用,提供了一种新的方差减少见解:缓解零阶梯度与硬阈值操作之间的独特矛盾。在此视角下,我们提出了一种通用的方差减少零阶硬阈值化算法以及在标准假设下的通用收敛性分析。理论结果表明,新算法消除了对随机方向数量的限制,相较于SZOHT,具有改进的收敛速度和更广泛的应用范围。最后,我们通过岭回归问题以及黑盒对抗攻击问题展示了本方法的实用性。

English abstract

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$-constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, in which case it can normally be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is so far the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance-reduced ZO hard-thresholding algorithm as well as a generalized convergence analysis under standard assumptions. The theoretical results demonstrate that the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
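The two ingredients that the abstract says are in tension, a zeroth-order gradient estimate and the hard-thresholding operator, can be sketched generically as below. This is a plain, non-variance-reduced illustration under stated assumptions: random unit directions, a two-point finite difference with smoothing radius `mu`, and hypothetical function names. It is neither SZOHT nor the paper's variance-reduced algorithm.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    if k > 0:
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out[idx] = x[idx]
    return out

def zo_gradient(f, x, num_directions, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate averaged over random unit directions."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    fx = f(x)
    grad = np.zeros(d)
    for _ in range(num_directions):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        grad += (f(x + mu * u) - fx) / mu * u
    # The factor d compensates for E[u u^T] = I / d on the unit sphere.
    return d * grad / num_directions

def zo_hard_thresholding_step(f, x, k, lr, num_directions, rng=None):
    """One step: zeroth-order gradient descent followed by projection onto k-sparse vectors."""
    return hard_threshold(x - lr * zo_gradient(f, x, num_directions, rng=rng), k)
```

With few random directions the estimate is noisy, and the thresholding step can amplify that error by switching the support; this interaction is the conflict the abstract refers to.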

2605.18033 2026-05-19 cond-mat.mtrl-sci cs.LG physics.app-ph version update

Real-time Multi-instrument Autonomous Discovery of Novel Phase-change Memory Materials

实时多仪器自主发现新型相变存储器材料

Chih-Yu Lee, Haotong Liang, Ryan Kim, Austin McDannald, Carlos A Rios Ocampo, A. Gilad Kusne, Ichiro Takeuchi

Affiliations: Department of Materials Science and Engineering, University of Maryland, College Park, MD, USA; Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, MD, USA; Institute for Research in Electronics and Applied Physics, University of Maryland, College Park, Maryland, USA; Maryland Quantum Materials Center, University of Maryland, College Park, MD, USA

AI summary: This paper presents a real-time multi-instrument autonomous discovery framework that performs structural property mapping and functional property optimization simultaneously in a closed loop, applies it to the discovery of new phase-change memory materials, and reports a seven-fold speed-up.

Comments 25 pages, 5 figures

Details
AI Chinese abstract

自主实验室能够实现自动化实验执行、数据分析和决策制定。主要挑战在于整合来自多种仪器的异构且不同步的数据流。传统的不确定合成过程-结构-性质关系(SPSPR)学习过程通常依赖于数据完全收集后的实验后分析,而不是在实时实验中进行,且决策在不同表征设备之间独立进行。在此,我们展示了多仪器自主发现(MAD)框架——通过闭环方式同时进行结构属性映射和功能属性优化。作为示例,我们将MAD应用于相变存储器(PCM)材料,特别是Mn-Sb-Te三元体系,这此前未被探索的PCM材料系统。采用多输出模型通过共区域化核将X射线衍射(XRD)和电导率测量数据同时合并。输出概率后验和不确定性量化有助于利用共享知识进行决策,同时不同任务的目标不同。我们旨在通过非负矩阵因子分解(NMF)最大化晶体结构分布的知识,同时并行寻找具有最大电阻值的组成,这是PCM的重要性能指标。利用MAD,我们发现了有前途的电PCMs,并在25次闭环迭代内确定了SPSPR,对应于七倍的速度提升。该框架为大规模自主设施开辟了新的研究路径,其中未来实验可以并行运行,而不是独立运行。

English abstract

Autonomous labs enable the integration of automated experiment execution, data analysis and decision making. The main challenge remains the integration of diverse data streams from multiple instruments, where the data are often heterogeneous and unsynchronized. The standard process for learning undetermined synthesis-process-structure-property relationships (SPSPR) usually relies on post-experiment analysis after the data are fully collected, not during live experiments, and decision making is carried out independently across characterization equipment. Here, we demonstrate the Multi-instrument Autonomous Discovery (MAD) framework, which combines structural property mapping and functional property optimization simultaneously in a closed-loop manner. As an example, we applied MAD to phase-change memory (PCM) materials, in particular to the Mn-Sb-Te ternary, a previously unexplored materials system for PCM. A multi-output model is employed to merge data from x-ray diffraction (XRD) and electrical resistance measurements simultaneously through a co-regionalization kernel that models the relationship between them. The output probabilistic posterior and uncertainty quantification facilitate decision making with shared knowledge, even though the goals differ across tasks. We aimed to maximize the knowledge of crystal structure distribution using non-negative matrix factorization (NMF) while, in parallel, finding the composition with the maximum resistance value, an important figure of merit for PCM. Leveraging MAD, we found promising electrical PCMs and identified the SPSPR within 25 closed-loop iterations, corresponding to a seven-fold speed-up. The framework opens a new path of study in large-scale autonomous facilities, where future experiments can be run jointly in parallel rather than independently.
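One common way to build a co-regionalization kernel for a multi-output model is the intrinsic coregionalization model, sketched below. The abstract does not say which construction the authors use, so the choice of this model, the RBF input kernel, and the names `rbf_kernel` and `icm_kernel` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix over the rows of X (e.g. compositions)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def icm_kernel(X, W, kappa, lengthscale=1.0):
    """Joint covariance over all (task, input) pairs, in task-major order.

    B = W W^T + diag(kappa) is the task covariance; entry B[i, j] scales how
    strongly task i and task j (e.g. a structure task and a resistance task)
    share information at nearby inputs.
    """
    B = W @ W.T + np.diag(kappa)
    return np.kron(B, rbf_kernel(X, lengthscale))
```

The off-diagonal blocks of the resulting matrix are what let a measurement from one instrument reduce posterior uncertainty about the quantity measured by the other.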

2605.18028 2026-05-19 cs.LG cs.AI 版本更新

FedSDR: Federated Self-Distillation with Rectification

FedSDR: 带校正的联邦自我蒸馏

Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu, You Song

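Affiliations: Beijing University of Aeronautics; Zhejiang University; Shandong University; Stevens Institute of Technology

AI summary: This paper proposes FedSDR, a refined federated self-distillation method that introduces a dual-stream mechanism to address data distribution mismatch and hallucination in federated fine-tuning, improving the model's accuracy and consistency.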

Comments Accepted by ICML 2026

Details
AI Chinese abstract

大规模语言模型的联邦微调面临严重的统计异质性。然而,现有模型级防御方法往往忽视了根本原因:内在的数据分布不匹配。在本文中,我们首先建立了联邦自我蒸馏(FedSD)作为基本且有力的策略。通过将客户端表示投影到一个平滑的

English abstract

Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed "model-understanding space," FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox: unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.

2605.18022 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

揭示记忆与泛化共存:在带有标签噪声的算术任务中的案例研究

Linyu Liu, Pinyan Lu

发表机构 * Taylor Lab, Huawei Technologies Co., Ltd.(华为技术有限公司泰勒实验室) Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics(上海财经大学计算与经济交叉研究重点实验室)

AI总结 本文研究了在高过参数化模型中如何同时记忆噪声标签和泛化,通过模运算任务中的实验发现,适当优化和模型配置下大模型泛化能力更强,噪声标签被更快记忆,而过参数化模型内部形成泛化结构,但输出被拟合噪声标签的需求所抑制。通过频率方法提取内部结构可实现高准确率,提出任务无关方法将网络分为泛化和记忆组件,尽管该子网络提升泛化能力,但相比频率提取方法仍有局限,表明泛化结构分布于神经元中,需要新工具来检索过参数化网络中的可泛化知识。

Comments 27 pages, 32 figures

详情
AI中文摘要

高度过参数化的模型可以同时记忆噪声标签并良好泛化,但这些行为如何共存仍不清楚。本文通过重度标签噪声下的模运算任务研究其内在机制。通过在两层神经网络上的广泛实验发现,在适当的优化和模型配置下,更大的模型泛化能力更强,而噪声标签比干净数据被更快地记忆。过参数化模型内部形成泛化结构,但其在输出中的表达被拟合噪声标签的需求所抑制。值得注意的是,即使在80%的标签噪声下,通过基于频率的方法提取内部结构也可实现接近完美的测试准确率。我们进一步提出一种任务无关的方法将网络分为泛化和记忆组件。尽管该子网络提升泛化能力,但相比频率提取方法仍有局限,表明泛化结构分布于神经元中,需要新工具来检索过参数化网络中的可泛化知识。

英文摘要

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
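To make the experimental setting concrete, the following is a minimal sketch of how a modular arithmetic dataset with heavy label noise could be built. It assumes the task is addition modulo a prime `p` and that a corrupted label is replaced by a uniformly random wrong class; the abstract does not specify either detail, and the function name `noisy_modular_dataset` is hypothetical.

```python
import random

def noisy_modular_dataset(p, noise_rate, seed=0):
    """Return (a, b, clean_label, observed_label) tuples for (a + b) mod p.

    Assumption: each label is corrupted independently with probability
    `noise_rate`, and a corrupted label is a uniformly random wrong class.
    """
    rng = random.Random(seed)
    data = []
    for a in range(p):
        for b in range(p):
            clean = (a + b) % p
            observed = clean
            if rng.random() < noise_rate:
                # Pick any class except the correct one.
                observed = rng.choice([c for c in range(p) if c != clean])
            data.append((a, b, clean, observed))
    return data

data = noisy_modular_dataset(p=13, noise_rate=0.8)
flipped = sum(1 for _, _, clean, obs in data if clean != obs)
print(f"{flipped}/{len(data)} labels corrupted")
```

A model trained on `observed_label` and evaluated against `clean_label` on held-out pairs would separate memorization (fitting the corrupted labels) from generalization (recovering the modular structure).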

2605.18020 2026-05-19 cs.LG 版本更新

Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation

通过效用约束随机聚合改进理性参与的联邦学习

M Yashwanth, Arunabh Singh, Ashok Nayak, Sai Kiran Bulusu, Anirban Chakraborty

发表机构 * Indian Institute of Science(印度科学研究所) Indian Institute of Technology Bombay(印度理工学院孟买分校) IIIT Hyderabad(海得拉巴国际信息技术学院)

AI总结 本文提出FedUCA框架,通过形式化服务器作为优化器的角色,旨在通过维持客户端参与来最大化全局模型性能,从而提高客户端参与度和全局模型性能。

Comments Federated Learning, Rational Clients, Endogenous Participation, and Aggregation

详情
AI中文摘要

联邦学习(FL)算法隐含假设客户端在服务器请求下被动地分享本地模型更新以配合服务器端的协调。然而,这忽略了现实世界跨机构环境中一个重要的方面:客户端通常是理性的代理,可能会优先考虑本地模型性能等效用而非全局模型的性能。在统计异质性显著的设置中,理性客户端可能会退出联邦如果感知到的合作利益未能满足其本地效用阈值。此类退出会降低全局模型性能并可能导致联邦训练过程的崩溃。在本文中,我们引入FedUCA(通过效用约束随机聚合改进理性参与的联邦学习),一个框架,形式化了服务器作为优化器的角色,旨在通过维持客户端参与来最大化全局模型性能。我们通过在标准数据集上的广泛实验验证了我们的框架,证明通过优先考虑参与可行性,FedUCA实现了显著更高的客户端保留率,从而实现了更优的全局模型性能。

英文摘要

Federated Learning (FL) algorithms implicitly assume that clients passively comply with server-side orchestration by sharing local model updates upon server request. However, this overlooks an important aspect of real-world cross-silo environments: clients are often rational agents who may prioritize their own utilities, such as local model performance, over that of the global model. In settings with significant statistical heterogeneity, rational clients may opt out of the federation if the perceived benefits of collaboration fail to meet their local utility thresholds. Such attrition degrades the global model performance and can lead to the collapse of the federated training process. In this work, we introduce FedUCA (Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation), a framework that formalizes the server's role as an optimizer seeking to maximize global model performance by sustaining client participation. We substantiate our framework through extensive experiments on standard datasets, demonstrating that by prioritizing participation feasibility, FedUCA achieves significantly higher client retention and, consequently, superior global model performance.

2605.18015 2026-05-19 cs.LG cs.DB cs.SE 版本更新

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

LogRouter: 一种自适应的两级LLM路由用于大数据系统中的日志问题解答

Mert Coskuner, Merve Zeybel, Melik Mert Dolan

发表机构 * TUBITAK BILGEM(土耳其科学技术研究理事会信息学与信息安全高级技术研究中心)

AI总结 本文提出LogRouter,一种自适应两级LLM路由系统,用于在大数据系统中实现日志问题解答,通过结合PySpark-based Drain3数据摄入管道、GPU加速的嵌入和Apache Druid和PostgreSQL with pgvector的双索引存储,实现高效的日志查询处理。

详情
AI中文摘要

在自托管、资源受限的环境中,生产日志分析需要以自然语言访问大规模日志流,而无需承担将每个查询都交给大型语言模型处理的成本。我们提出了LogRouter,一个部署在TUBITAK BILGEM国家大数据平台上的端到端日志问答系统,结合了基于PySpark的Drain3数据摄入管道、GPU加速的嵌入以及Apache Druid和PostgreSQL with pgvector的双索引存储。一个两级成本感知路由器将每个查询分派到四条执行路径之一:直接响应、Druid关键词搜索、带SQL生成的模板查找和pgvector语义检索,同时二级路由器为语义路径选择14B级或32B级生成器。一个专用的代码LLM(coder LLM)处理文本到SQL生成。我们在四个LogHub数据集(Linux、Apache、Windows和Mac;共70个问题)上评估了该系统,分别在在线完整管道配置和隔离生成器的离线配置下进行测试。路由器在各数据集上的平均准确率为88.4%,在Linux上为94.7%。完整管道的平均ROUGE-1为0.373,BERTScore为0.879,RAGAS Faithfulness为0.779,端到端延迟为18.6秒。在公平的离线比较中,路由系统相对Fixed-32B基线将平均延迟减少了55%(46.3秒 vs. 102.1秒),同时答案正确性(Answer Correctness)的差距保持在5.8分以内,并在所有数据集上超过Fixed-14B基线的RAGAS Faithfulness。因此,成本感知的路由是生产日志QA的实用机制:路由以不到一半的延迟恢复了始终使用32B配置的大部分质量,且L1关键词词汇表使路由决策具有高精度,而无需使用学习分类器。

英文摘要

Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.
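The two-level dispatch described above can be illustrated with a small sketch. The four Level-1 paths and the 14B/32B choice come from the abstract; everything else here is an assumption made for illustration, including the keyword sets in `L1_VOCAB`, the word-count rule in `route_level2`, and all function names. The deployed system's actual vocabulary and Level-2 criterion are not given in this listing.

```python
import re

# Hypothetical Level-1 keyword vocabulary: path name -> trigger words.
L1_VOCAB = {
    "direct": {"hello", "help", "thanks"},
    "druid_keyword": {"grep", "contains", "find"},
    "template_sql": {"count", "many", "average"},
}

def route_level1(query):
    """Pick an execution path by keyword lookup; default to semantic retrieval."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    for path, keywords in L1_VOCAB.items():
        if tokens & keywords:
            return path
    return "semantic"

def route_level2(query, long_query_words=20):
    """Hypothetical rule: send long queries to the larger generator."""
    return "32B" if len(query.split()) > long_query_words else "14B"

def dispatch(query):
    """Return (path, generator); only the semantic path needs a generator."""
    path = route_level1(query)
    generator = route_level2(query) if path == "semantic" else None
    return path, generator

print(dispatch("How many errors occurred per host?"))
print(dispatch("Why did the sshd service keep restarting?"))
```

Because Level 1 is a plain set intersection, no learned classifier is involved in the routing decision, which matches the design point the abstract emphasizes.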

2605.18012 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SAS: Semantic-aware Sampling for Generative Dataset Distillation

SAS: 面向生成式数据集蒸馏的语义感知采样

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

发表机构 * Hokkaido University(北海道大学) University of Toronto(多伦多大学) The University of Tokyo(东京大学)

AI总结 本文提出了一种语义感知的数据集蒸馏方法,通过利用CLIP作为语义先验,设计三个语义评分函数来量化类别相关性、类别间分离性和集合内多样性,从而生成紧凑且语义区分度高的数据集。

Comments Published as a journal paper in IEEE OJSP

详情
AI中文摘要

深度神经网络在广泛的任务中取得了显著的性能,但这种成功往往伴随着由于大规模训练数据带来的巨大计算和存储成本。数据集蒸馏通过构建紧凑且信息丰富的数据集,以实现高效的模型训练同时保持下游性能。然而,大多数现有方法主要强调匹配数据分布或下游训练统计,对蒸馏数据中高阶语义信息的保留有限。在本文中,我们引入了语义感知的视角进行数据集蒸馏,通过利用对比语言-图像预训练(CLIP)作为语义先验进行后采样。我们的目标是获得不仅紧凑而且语义上类别区分度高且多样化的蒸馏数据集。为此,我们设计了三个语义评分函数,以量化预训练语义空间中的类别相关性、类别间分离性和集合内多样性。基于现有蒸馏方法生成的图像池,我们进一步开发了一种两阶段策略进行有效的采样:第一阶段过滤语义区分度高的样本以形成可靠的候选集,第二阶段进行动态多样性感知选择以减少冗余并保持语义覆盖。在多个数据集、图像池和下游模型上的广泛实验显示了一致的性能提升,突显了在数据集蒸馏中整合语义信息的有效性。

英文摘要

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.

2605.18008 2026-05-19 cs.LG stat.ML 版本更新

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

域移情况下不确定性可靠性研究:面向光电容积脉搏波描记(PPG)中数据驱动血压估计的探讨

Mohammad Moulaeifard, Ciaran Bench, Philip J. Aston, Nils Strodthoff

发表机构 * AI4Health Department, University of Oldenburg(奥尔登堡大学AI4Health部门) Department of Data Science and AI, National Physical Laboratory(国家物理实验室数据科学与人工智能部门) School of Mathematics and Physics, University of Surrey(萨里大学数学与物理学院)

AI总结 本文研究了在域移情况下,基于深度学习、利用光电容积脉搏波描记(PPG)信号进行血压估计的不确定性可靠性,比较了深度集成和蒙特卡洛Dropout方法,并探讨了不确定性校准的重要性。

Comments 23 pages, 2 figures

详情
AI中文摘要

不确定性量化(UQ)对于安全关键领域如医疗至关重要,但很少在现实的分布外(OOD)条件下进行评估。本文评估了基于深度学习、利用光电容积脉搏波描记(PPG)信号进行血压(BP)估计的预测性能和不确定性可靠性,分别在分布内(ID)和分布外(OOD)设置下进行。使用在PulseDB上训练的XResNet1D-50模型在四个外部数据集上进行测试,比较了深度集成(DE)和蒙特卡洛Dropout(MCD)方法,并使用高斯负对数似然(GNLL)和均方误差(MSE)损失函数,可选地通过共形预测(CP)、温度缩放(TS)和保序回归(IR)进行事后再校准。我们的关键发现如下:(1)在域移情况下,DE比MCD提供更强的预测鲁棒性,这种优势主要在外部域移情况下显现。(2)经过再校准的基于GNLL的方法在不确定性校准方面表现最佳(例如,GNLL+DE+CP用于收缩压(SBP),GNLL+DE+TS用于舒张压(DBP)),而基于MSE的不确定性需要再校准才能实用。(3)在各种设置中,CP和TS提供了最一致的增益,IR在某些情况下仍然具有竞争力。总体而言,我们的结果表明,基于DE的方法在域移下的预测性能最为稳健,GNLL在原生UQ中最强,而再校准对于使基于MSE的不确定性实用化至关重要。这些发现突显了在外部数据上联合评估预测准确性和校准的重要性,以实现可信的无袖带血压估计。

英文摘要

Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation.
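As a rough illustration of two ingredients named above, the sketch below implements the Gaussian negative log-likelihood and one common form of split conformal recalibration for regression, in which predicted standard deviations are rescaled by a quantile of normalized residuals. This is a textbook variant chosen for illustration, not necessarily the exact procedure used in the study. The function names and the toy calibration numbers are made up.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def conformal_scale(cal_y, cal_mu, cal_sigma, alpha=0.1):
    """Split-conformal quantile of normalized residuals |y - mu| / sigma."""
    scores = sorted(abs(y - m) / s for y, m, s in zip(cal_y, cal_mu, cal_sigma))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return scores[k - 1] if k <= n else math.inf

def predict_interval(mu, sigma, q):
    """Recalibrated interval: the predicted sigma is rescaled by q."""
    return mu - q * sigma, mu + q * sigma

# Toy calibration set (made-up numbers, in mmHg).
cal_y = [118.0, 131.0, 125.0, 140.0, 109.0, 122.0, 135.0, 128.0, 115.0, 120.0]
cal_mu = [120.0, 128.0, 126.0, 133.0, 112.0, 121.0, 130.0, 129.0, 119.0, 124.0]
cal_sigma = [4.0] * 10
q = conformal_scale(cal_y, cal_mu, cal_sigma, alpha=0.2)
low, high = predict_interval(mu=125.0, sigma=4.0, q=q)
print(q, low, high)
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least `1 - alpha`; under domain shift that guarantee no longer holds automatically, which is why the abstract stresses evaluation on external data.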

2605.18005 2026-05-19 cs.LG stat.ML 版本更新

Scalable Decision-Focused Learning through Cost-Sensitive Regression

通过成本敏感回归实现可扩展的决策聚焦学习

Noah Schutte, Senne Berden, Tias Guns, Krzysztof Postek, Neil Yorke-Smith

发表机构 * Delft University of Technology(代尔夫特理工大学) KU Leuven(鲁汶大学) Independent Researcher(独立研究者)

AI总结 本文提出了一种基于成本敏感多输出回归的方法,用于解决包含多个不确定参数的组合优化问题,通过引入成本敏感的损失函数组件,提高了决策聚焦学习的效率和可扩展性。

Comments 12 pages, 7 figures

详情
AI中文摘要

许多现实世界中的组合问题涉及不确定参数,这些参数可以根据上下文特征和历史数据进行预测。这些'预测后优化'或'上下文优化'问题已获得显著关注:端到端训练方法现在可以最小化下游任务成本而不是预测误差。然而,尽管这些决策聚焦学习(DFL)方法有效,但它们通常在训练过程中依赖于重复解决底层组合优化问题,这使得它们计算成本高且难以扩展。我们重新将学习问题视为一个成本敏感的多输出回归问题:多输出是因为组合问题有多个不确定参数,而成本敏感是因为下游任务成本是真正的目标。我们的技术贡献是正式化了多个损失函数组件,这些组件来自于这种重新框架:成本不敏感的归一化、决策意识的不对称惩罚过预测和欠预测,以及实例化的成本,这些成本在本地模仿真正的下游任务损失。这些组件需要每个训练数据实例零或一次求解,而训练过程中不需要进一步求解。实验表明,损失组件的组合在下游任务质量上与最先进的方法相当,同时显著更高效,使能够扩展到以前无法用DFL解决的问题规模。

英文摘要

Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These "predict-then-optimize" or "contextual optimization" problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.

2605.18004 2026-05-19 cs.LG 版本更新

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

RL4RLA: 通过课程设计和基于图的搜索教机器学习发现随机线性代数算法

Jinglong Xiong, Xiaotian Liu, Ruoxin Wang, Zihang Liu, Yefan Zhou, Yujun Yan, Yaoqing Yang

发表机构 * Pratt School of Engineering, Duke University, Durham, NC, USA(杜克大学工程学院) Department of Computer Science, Dartmouth College, Hanover, NH, USA(达特茅斯学院计算机科学系)

AI总结 本文提出RL4RLA框架,通过课程设计和基于图的搜索自动化发现可解释的符号随机线性代数算法,展示了其在重新发现最先进方法和优化算法性能方面的贡献。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 9 pages main text; 21 pages total

详情
AI中文摘要

随机线性代数(RLA)算法是一类现代数值线性代数技术,在科学计算和机器学习中扮演重要角色,已被广泛采用。然而,其发现仍主要依赖手动过程,需要深厚的专家知识和灵感。尽管强化学习(RL)提供了自动化路径,但标准方法在高性能RLA算法固有的稀疏奖励景观和广阔搜索空间中遇到困难。本文提出RL4RLA,一个通用的RL框架,自动化发现可解释、符号化的RLA算法。与黑盒方法不同,我们的方法从基本线性代数原语构建显式算法,确保可验证和可实现的表示。为了实现高效发现,我们引入:(1)数值课程,逐步增加问题难度以编码RLA领域的归纳偏差;(2)蒙特卡洛图搜索,通过识别和合并等价的部分算法优化探索。我们证明RL4RLA重新发现了最先进的方法,包括sketch-and-precondition求解器、Randomized Kaczmarz和Newton Sketch,并可针对特定的准确率、速度和稳定性之间的权衡生成算法。代码可在https://github.com/Tim-Xiong/RL4RLA获取。

英文摘要

Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to high-performing RLA algorithms. In this paper, we present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode inductive bias specific to the RLA domain; (2) Monte Carlo Graph Search, which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods, including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability. Code is available at https://github.com/Tim-Xiong/RL4RLA.
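For readers unfamiliar with the algorithms mentioned, here is a minimal pure-Python sketch of Randomized Kaczmarz, one of the methods the abstract says RL4RLA rediscovers. It follows the standard formulation (rows sampled with probability proportional to their squared norm, then a projection onto that row's hyperplane) and is not taken from the RL4RLA repository. The small test system is made up.

```python
import random

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Approximately solve a consistent system A x = b.

    Each step samples row i with probability proportional to ||A_i||^2
    and projects the iterate onto the hyperplane {x : A_i . x = b_i}.
    """
    rng = random.Random(seed)
    n = len(A[0])
    x = [0.0] * n
    row_norms = [sum(a * a for a in row) for row in A]
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=row_norms, k=1)[0]
        residual = b[i] - sum(a * xj for a, xj in zip(A[i], x))
        step = residual / row_norms[i]
        x = [xj + step * a for xj, a in zip(x, A[i])]
    return x

# Made-up overdetermined but consistent system with known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b)
print(x_hat)
```

The update uses only inner products, scalar multiplication and vector addition, which is the kind of basic linear algebra primitive from which the abstract says explicit algorithms are assembled.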

2605.17997 2026-05-19 cs.LG cs.AI cs.CV 版本更新

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

MARR: 模块自适应残差重建用于低比特后训练量化

Le Su, Xing Luo, Zhi Jin

发表机构 * Peng Cheng Laboratory(鹏城实验室)

AI总结 本文提出MARR,一种模块自适应残差重建方法,通过为每个模块分配特定的缩放系数,平衡残差相关的HA偏差和累积误差校正,从而在低比特量化中提升性能。

详情
AI中文摘要

近年来,基于残差重建的模型量化方法在低比特后训练量化(PTQ)中取得了有希望的性能,通过引入跨层残差来减少来自先前层的误差积累。然而,这些残差也可能引入额外的偏差,源于重建基于PTQ的Hessian近似(HA)假设,导致量化性能不理想。在本文中,我们分析发现,通过将残差项乘以一个缩放系数,可以提供一种直接的方法来缓解与残差强度相关的HA偏差,同时保持累积误差校正。更重要的是,我们观察到这种权衡是模块依赖性的,使单一全局残差强度不足以在不同模块之间平衡有效的校正和残差相关的偏差。基于这些观察,我们提出了模块自适应残差重建(MARR),为每个模块分配模块特定的缩放系数,以自适应地平衡累积误差校正和残差相关的HA偏差。为了避免昂贵的每模块系数搜索并获得稳定的系数估计,我们设计了一种基于比例-积分-微分(PID)的自适应更新策略,利用重建误差作为反馈,逐步细化此系数。在多个典型的大语言模型(LLMs)和视觉变换器(ViTs)上的实验表明,MARR在低比特量化(小于等于4位)中表现出色,实现了LLMs高达20.2%的性能提升,以及ViTs相对于残差重建最先进的方法高达4.6%的相对提升。代码将在接受后公开发布。

英文摘要

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.

2605.17985 2026-05-19 cs.LG cs.AI 版本更新

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD:面向物理基础模型的敏感性感知保真度压缩SVD

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

发表机构 * AI Centre, University College London(伦敦大学学院人工智能中心) University College London(伦敦大学学院) Central South University(中南大学) University of Massachusetts at Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文提出了一种新的压缩物理基础模型的方法,通过在压缩过程中显式建模损失感知的层敏感性,以保持准确性和物理保真度,实验表明在多个模型和数据集上实现了显著的压缩增益。

详情
AI中文摘要

我们提出了一种新的方法,用于压缩物理基础模型(PFMs),这是AI for Science领域的新趋势。尽管模型压缩对于减少内存使用和加速大基础模型的推理至关重要,但其在PFMs中的应用仍然不足探索,因为保持物理保真度至关重要。挑战在于物理数据的功能性质,其中偏导数编码了时空动态,并对压缩具有高度敏感性。传统压缩方法忽视了这种结构,常常导致严重的性能退化或失败。为此,我们引入了一种敏感性感知的保真度强制压缩框架,在压缩过程中显式建模输出函数空间中的损失感知层敏感性。这为压缩科学基础模型提供了一条新途径,同时保持准确性和物理保真度。实验表明,在多个模型和数据集上,相较于现有方法,取得了显著的增益,实现了更高的压缩比,同时保持准确性,在某些情况下甚至提高了几个数量级。更广泛地说,这项工作可能引领AI for Science领域高效、可部署和可持续的科学基础模型的新子领域。

英文摘要

We propose a new method for compressing physics foundation models (PFMs), an emerging trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.

2605.17968 2026-05-19 cs.LG 版本更新

Function graph transformers universally approximate operators between function spaces

函数图变换器可通用逼近函数空间之间的算子

Takashi Furuya, David Mis, Ivan Dokmanić, Maarten V. de Hoop, Matti Lassas

发表机构 * Doshisha University(同志社大学) RIKEN AIP(理化学研究所革新智能综合研究中心) Rice University(莱斯大学) University of Basel(巴塞尔大学) Simons Chair in Computational and Applied Mathematics and Earth Science(Simons计算与应用数学及地球科学讲席) University of Helsinki(赫尔辛基大学)

AI总结 本文研究了通过变换器近似函数空间之间非线性算子的问题,提出了一种基于图度量的函数图变换器,能够以单值函数形式输出,并证明其在广义非线性算子近似中的通用性。

详情
AI中文摘要

我们研究了通过变换器近似函数空间之间非线性算子的问题。我们的方法是将函数提升为支撑在其图上的测度,并利用最近引入的测度论视角来分析变换器。函数h通过其图测度γ_h表示,其中有限的token{(x_j,h(x_j))}_{j=1}^N是其经验近似。我们证明,该框架优雅地通过测度的收敛来建模离散化细化,并提供了一个自然的算子学习设置。在此框架中,我们引入了函数图变换器,即测度论变换器的一个保图子类,能够将图测度映射为图测度,也就是说,输出保持为单值函数。关键的是,这种额外的结构并不降低通用性:我们证明,所得到的保图映射可以被标准softmax自注意力层和逐点MLP的有限组合近似,从而在广泛的非线性算子类别中实现通用近似结果。与现有基于变换器的算子学习理论方法不同,测度论框架还能够处理正则化的负阶Sobolev输入(其离散化不变性特别具有挑战性),以及不同输出域上的查询点。总体而言,函数图变换器为基于变换器的算子学习提供了一个连续视角和数学工具包,明确了位置编码、图结构、正则化的作用,并保证了不同离散化之间的一致性。

英文摘要

We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function $h$ is represented by its graph measure $\gamma_h$, with finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
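The graph-measure representation can be written out explicitly. The display below is a sketch under one natural reading of the abstract: it assumes a reference probability measure $\mu$ on the input domain (a symbol not used in the abstract) and defines $\gamma_h$ as the pushforward of $\mu$ under $x \mapsto (x, h(x))$. The tokens then give an empirical measure, and refining the discretization corresponds to weak convergence, which holds for example when the $x_j$ are i.i.d. draws from $\mu$.

```latex
% Assumption: \mu is a reference probability measure on the input domain.
\gamma_h \;=\; (\mathrm{id}, h)_{\#}\,\mu,
\qquad
\gamma_h^{N} \;=\; \frac{1}{N}\sum_{j=1}^{N} \delta_{(x_j,\,h(x_j))},
\qquad
\gamma_h^{N} \rightharpoonup \gamma_h \quad \text{as } N \to \infty .
```

In this picture a measure is a graph measure exactly when it is concentrated on the graph of a single-valued function, which is the property a graph-preserving transformer has to maintain from input to output.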

2605.17958 2026-05-19 cs.LG cs.PL 版本更新

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

通过基于一致性的强化学习增强大语言模型的代码推理能力

Zhanyue Qin, Jia Feng, Yibo Lyu, Yun Peng, Dianbo Sui, Cuiyun Gao, Qing Liao

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出CodeThinker框架,通过一致性驱动的强化学习方法提升大语言模型的代码推理能力,实验表明其在多个基准测试中表现优异,显著提升了代码生成和数学推理任务的准确性。

Comments Under review

详情
AI中文摘要

代码推理指的是在给定源代码和特定输入的情况下预测程序输出的任务。它可以衡量大语言模型(LLMs)的推理能力,并且有助于下游任务,如代码生成和数学推理。现有工作已验证了强化学习在该任务上的有效性。然而,这些方法仅基于最终输出或粗粒度信号设计奖励,忽略了任务中逐步推理过程的内在一致性。因此,这些方法常常导致稀疏奖励或奖励黑客问题,限制了增强学习能力的充分发挥。为缓解这些问题,我们提出CodeThinker,一种用于代码推理的一致性驱动强化学习框架。具体而言,CodeThinker有三个关键组件:(1)一个具有逐步推理意识的模型训练模块,利用一致性追踪范式作为模板,合成捕捉逐步推理过程的训练数据;(2)一个动态束采样策略,旨在在固定采样预算下提高采样输出的质量;(3)一个一致性奖励机制,可以有效缓解奖励黑客问题。在三个流行基准测试上的实验表明,CodeThinker在多个LLMs上均取得最佳性能。例如,当部署在Qwen2.5-Coder-7B-Instruct上时,其在准确性方面比最强基线高出4.3%。我们还验证了CodeThinker在下游任务中的有效性。结果表明,在不进行额外训练的情况下,CodeThinker在覆盖17种编程语言的数学推理和代码推理任务中分别获得了平均准确率提升5.33和3.11个百分点。

英文摘要

Code reasoning refers to the task of predicting the output of a program given its source code and specific inputs. It can measure the reasoning capability of large language models (LLMs) and also benefit downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning on the task. However, these methods design rewards solely based on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process in the task. As a result, they often suffer from sparse rewards or reward hacking, which limits how much reinforcement learning can improve the model. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which utilizes a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that can effectively alleviate reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning and code reasoning tasks covering 17 programming languages, respectively.

2605.17954 2026-05-19 cs.CV cs.AI cs.LG 版本更新

A More Word-like Image Tokenization for MLLMs

一种更像单词的图像标记化方法用于多模态大语言模型

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

发表机构 * Seoul National University(首尔国立大学) Ewha Womans University(梨花女子大学)

AI总结 本文提出了一种解耦视觉标记化方法(DiVT),通过将图像块嵌入聚类为语义单元,使每个标记对应于独特的视觉概念,从而提升多模态模型的性能和效率。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
AI中文摘要

现代多模态大语言模型(MLLMs)通常保持语言模型不变,并训练一个视觉投影器,将像素映射到其嵌入空间中的标记序列,使图像能以与文本相同的形式呈现。然而,语言模型已优化以操作离散且具有语义意义的标记,而现有视觉投影器将图像转换为长流的连续且高度相关的嵌入。这导致视觉标记的行为不同于LLM最初训练以理解的单词状单元。我们提出了一种新的解耦视觉标记化(DiVT),将图像块嵌入聚类为连贯的语义单元,使得每个标记对应于一个独特的视觉概念,而不是一个刚性的网格单元。DiVT进一步根据图像复杂度调整其标记预算,提供显式的精度-计算权衡,既不修改视觉编码器也不修改语言模型。在多样化的多模态基准测试中,DiVT在显著较少的视觉标记下匹配或超越基线,展示了在有限标记预算下的鲁棒性,显著降低了内存成本和延迟,同时使视觉输入更兼容于LLM。我们的代码可在https://github.com/snuviplab/DiVT上获得。

英文摘要

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off while modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets and significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

2605.17938 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

通过镜像反学习和噪声一致偏斜训练数据归因

Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao, Yuki Mitsufuji

发表机构 * Sony AI(索尼人工智能)

AI总结 本文提出了一种基于镜像反学习和噪声一致偏斜的方法,用于提升扩散模型的训练数据归因的可靠性与鲁棒性,通过在不同数据集上显著优于现有方法,展示了其在生成实例间影响实例重叠和扩散损失比较任务中的潜力。

Comments 21 pages, 5 figures, 9 tables (includes appendix)

详情
AI中文摘要

训练数据归因(TDA)应能够促进生成模型的可解释性,并推动各种相关下游任务的发展。然而,当前的TDA方法缺乏可靠性和鲁棒性,阻碍了其在实际应用中的采用。在本文中,我们采取了关键步骤,以实现更可靠和鲁棒的扩散模型TDA。我们提出通过镜像反学习和噪声一致偏斜(MUCS)进行TDA。该方法的核心思想是使用受限的镜像梯度上升微调第二个模型,并通过一致的噪声样本测量该模型相对于原始模型的归一化偏斜。我们展示了,尽管概念上简单且通用,MUCS在三个不同的数据集上系统性地大幅优于现有方法。此外,我们研究了核心设计选择对最终性能的影响,并分析了影响实例在生成项目中的重叠以及整合TDA方法的潜力。我们相信,我们的发现可能对更一般的反学习设置以及需要比较扩散损失的任务具有更广泛的意义。

英文摘要

Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.

2605.17936 2026-05-19 cs.CL cs.LG 版本更新

Universal Adversarial Triggers

通用对抗触发器

Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng

AI总结 本文提出了一种结合词性过滤和困惑度损失函数的新技术,生成更接近自然短语的合理触发器,以提高对抗攻击的检测难度并促进鲁棒模型的发展。

详情
AI中文摘要

近期的研究表明,现代NLP模型在从情感分析到语言生成的多种任务中均受到通用对抗攻击的影响,这类攻击是一种输入无关的攻击,使用共同的触发序列攻击模型。尽管这些攻击成功,但由此生成的触发器却不合语法且不自然。我们的工作提出了一种新颖的技术,结合词性过滤和基于困惑度的损失函数,以生成更合理的触发器,这些触发器更接近自然短语。在SST数据集上的情感分析任务中,该方法生成的触发器能够将正向预测翻转为负向预测,准确率降至0.04和0.12。为了构建鲁棒模型,我们还使用生成的触发器进行对抗训练,使模型的准确率从0.12提升至0.48。我们旨在展示通过生成合理的触发器,可以使得对抗攻击难以被检测,并通过相关防御促进鲁棒模型的发展。

英文摘要

Recent works have illustrated that modern NLP models trained for diverse tasks ranging from sentiment analysis to language generation succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers generated by such attacks are ungrammatical and unnatural. Our work proposes a novel technique combining part-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.

2605.17930 2026-05-19 cs.LG 版本更新

InfoFlow: A Framework for Multi-Layer Transformer Analysis

InfoFlow: 多层Transformer分析的框架

Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

发表机构 * Department of Mathematics(数学系) National University of Singapore(新加坡国立大学) Institute for Functional Intelligent Materials(功能智能材料研究所)

AI总结 该研究通过分析多层Transformer的近似能力,揭示了其与单层Transformer的根本差异,并提出InfoFlow框架以提升多层Transformer的近似效率。

Comments 36 pages

详情
AI中文摘要

尽管近期已有研究探讨了单层Transformer架构的近似性质,但对多层设置的严谨理论理解仍然有限。本文证明多层Transformer在某些检索任务中具有与单层Transformer根本不同的近似能力:对于某些检索任务,任何单层Transformer需要至少Ω(ε^{-k})参数才能达到精度ε,其中k与序列长度T线性增长,而双层Transformer每层一个头则能以至多O(ε^{-1})参数实现相同近似精度。为理解这种分离,我们识别出多层近似背后的两种结构机制。具体而言,softmax注意力只能高效检索获得最大注意力分数的token,导致k-th最大检索的参数成本呈指数级增长(k≥2)。此外,解码耦合信息的参数成本与所检索token集合的大小成正比。受这些发现启发,我们提出了InfoFlow框架,用于多层Transformer。该框架在每个token和层跟踪可访问的输入位置集合,并为每种信息传播模式分配明确的近似率。这种抽象恢复了已知的近似界限,与训练网络的实验观察保持一致,并在目前无法直接理论分析的设置中产生具体预测。我们的结果提供了一个原则性的框架,用于分析多层Transformer的近似效率。

英文摘要

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires at least $\Omega(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O(\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.

2605.17928 2026-05-19 cs.RO cs.LG 版本更新

Transfer Learning for Customized Car Racing Environments

迁移学习用于定制化的赛车环境

Benedict Florance Arockiaraj, Richard Chang, Wesley Yee

发表机构 * seas(系统工程与科学学院)

AI总结 本文研究了迁移学习在深度强化学习中的应用,旨在通过在单一赛道上训练智能体,实现零样本迁移或进一步微调以在其他定制化赛车环境中获得更快的圈速,并比较了基于模型和非基于模型方法的性能。

详情
AI中文摘要

迁移学习是一种技术,其中模型/智能体可以利用其在一项任务中获得的知识/专长来解决另一个密切相关任务。通过本项目,我们探讨了迁移学习在深度强化学习中的应用。具体而言,我们希望利用迁移学习在OpenAI的赛车环境中实现快速圈速,通过在单一赛道上训练智能体,并通过零样本迁移或额外微调在其他定制化目标环境中进行比赛。此外,我们比较了基于模型和非基于模型方法的性能,并观察到基于模型的方法在性能上占优,并且在该环境中比非基于模型的方法收敛得更快。我们观察到迁移学习在大多数设置中不仅提升了目标领域的性能,而且在学习过程中也表现出高水平的性能能力。

英文摘要

Transfer learning, a technique in which a model or agent exploits the knowledge it gained on one task to solve another closely related task, is often used to tackle problems in deep learning. In this project, we explore transfer learning in the context of deep reinforcement learning. Specifically, we want to use transfer learning to achieve fast lap times in OpenAI's car racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that transfer learning in most setups not only boosts performance on the target domain, but also maintains high performance during learning.

2605.17923 2026-05-19 cs.DC cs.AI cs.LG 版本更新

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

AdaptiveLoad: 向高效视频扩散变换器训练迈进

Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun, Wen Huang, Wanting Xu, Jing Long, Shuai Di, Junwu Xiong

发表机构 * Tsinghua University(清华大学) Peking University(北京大学) Tianjin University(天津大学)

AI总结 本文提出AdaptiveLoad框架,通过双约束自适应负载平衡系统和融合LayerNorm-Modulate CUDA内核,解决视频生成模型中大规模视频扩散变换器(如DiT和MMDiT)训练中的计算不平衡问题,实验显示其在Wan 2.1世界模型上提升了计算效率和训练吞吐量。

详情
AI中文摘要

在视频生成模型,特别是世界模型中,训练大规模视频扩散变换器(如DiT和MMDiT)由于混合模式数据集中序列长度的极端差异,带来了显著的计算挑战。现有基于桶的数据加载策略通常依赖于“等长token”约束。这种方法未能考虑自注意力机制的二次复杂性,导致严重的负载不平衡和GPU资源利用率低下。本文提出了AdaptiveLoad,一个集成优化框架,包含两个核心组件:(1)双约束自适应负载平衡系统,通过同时限制内存消耗和计算负载(B×S^p≤M_comp)消除长序列瓶颈;(2)融合LayerNorm-Modulate CUDA内核,利用D-tile合并归约(coalesced reduction)策略提高吞吐量并缓解内存压力。实验结果表明,在Wan 2.1世界模型上,我们的方法将计算不平衡率从39%降低到18.9%,峰值VRAM利用率效率提高22.7%,并实现了整体训练吞吐量增加27.2%。

英文摘要

In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes AdaptiveLoad, an integrated optimization framework consisting of two core components: (1) a dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load ($B \times S^p \le M_{\text{comp}}$); (2) a fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39% to 18.9%, improves peak VRAM utilization efficiency by 22.7%, and achieves an overall training throughput increase of 27.2%.
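
The dual constraint in the abstract above can be illustrated with a minimal sketch. Only the compute constraint $B \times S^p \le M_{\text{comp}}$ comes from the abstract; the memory constraint is assumed here to be a plain token budget ($B \times S \le M_{\text{mem}}$), and the function name, the default exponent p = 2, and the budget values are hypothetical choices for illustration, not the authors' implementation.

```python
def max_batch_size(seq_len, token_budget, compute_budget, p=2.0):
    """Largest batch size B for a bucket of sequences of length seq_len.

    Assumed constraints (illustrative only):
      memory:  B * seq_len      <= token_budget    (assumed form)
      compute: B * seq_len ** p <= compute_budget  (form given in the abstract)
    """
    by_memory = token_budget // seq_len
    by_compute = int(compute_budget // (seq_len ** p))
    # The tighter of the two limits decides the batch size.
    return max(0, min(by_memory, by_compute))


# Hypothetical budgets: 64K tokens of memory, 2**24 units of attention compute.
for s in (256, 1024, 4096):
    print(s, max_batch_size(s, token_budget=65536, compute_budget=2**24))
```

With these made-up budgets the short bucket is limited by memory and the long buckets by compute, which is the situation an equal-token-length rule alone would not capture.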

2605.17918 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Domain Transfer Becomes Identifiable via a Single Alignment

通过单个对齐使领域转移变得可识别

Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA(电气工程与计算机科学系,俄勒冈州立大学,科瓦利斯,俄勒冈,美国)

AI总结 本文提出了一种新的方法,通过结构稀疏性条件和单个配对锚样本实现领域转移的可识别性,减少了对监督信号的依赖,并提出了高效的雅可比稀疏性正则化器以支持高维学习。

详情
AI中文摘要

领域转移(DT)将源分布映射到目标分布,并支持无监督的图像到图像翻译、单细胞分析和跨平台医学影像任务。然而,DT本质上是不明确的:推动正向映射通常不可识别,因为保持测度的自同构(MPAs)在保持边缘分布的同时改变跨领域对应关系,导致内容不一致的翻译。最近的工作表明,通过联合转移多个对应的源/目标条件分布可以消除MPAs,但标记这些条件的监督信号在实践中并不总是可用。我们开发了一种替代的DT可识别性路线。在雅可比支持图案的结构稀疏性条件下,我们证明了分布匹配与单个配对锚样本足以识别真实转移——比先前方法需要的监督更少。为了支持实际的高维学习,我们进一步提出了一种基于随机掩码有限差分的高效雅可比稀疏性正则化器,得到一个可扩展的替代品,无需显式雅可比评估。在合成和现实任务上的实验证实了理论。

英文摘要

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.

2605.17899 2026-05-19 cs.LG cs.AI q-bio.QM 版本更新

DCFold: Efficient Protein Structure Generation with Single Forward Pass

DCFold: 通过单次前向传递高效生成蛋白质结构

Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu, Hao Zhou, Wei-Ying Ma

发表机构 * Institute for AI Industry Research (AIR)(人工智能产业研究院) Department of Computer Science and Technology(计算机科学与技术系) School of Computer Science and Technology(计算机科学与技术学院) ByteDance Seed(字节跳动种子)

AI总结 本文提出DCFold,一种单步生成模型,实现了与AlphaFold3同等的精度,通过双一致性训练框架和新的时间测地匹配(TGM)调度器,在保持预测保真度的同时将推理速度提升15倍,验证了其在结构预测和结合设计基准上的有效性。

详情
AI中文摘要

AlphaFold3引入了一种基于扩散的架构,将蛋白质结构预测提升到原子级分辨率,并提高了准确性。这种最先进的性能使AlphaFold3成为多样化生成和设计任务的基础模型。然而,其迭代设计显著增加了推理时间,限制了在虚拟筛选和蛋白质设计等下游任务中的实际部署。我们提出DCFold,一种单步生成模型,实现了AlphaFold3级别的精度。我们的双一致性训练框架,结合了新的时间测地匹配(TGM)调度器,使DCFold在保持预测保真度的同时,将推理速度提升15倍。我们验证了其在结构预测和结合设计基准上的有效性。

英文摘要

AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.

2605.17898 2026-05-19 cs.LG 版本更新

Lightweight Gaussian Process Inference in C++ on Metal and CUDA

基于C++在Metal和CUDA上的轻量级高斯过程推断

Yu-Hsueh Fang

发表机构 * Department of Information Management, National Taiwan University(国立台湾大学信息管理系) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院H. Milton Stewart工业与系统工程学院)

AI总结 本文提出LightGP,一个无需依赖的C++17库,用于高斯过程回归,支持Apple Metal和NVIDIA CUDA后端,以及通过Apple Accelerate和OpenBLAS优化的CPU路径。LightGP提供了四种推断路径,覆盖从N=100到N=500,000的问题规模,并在不同硬件上实现了显著的性能提升。

详情
AI中文摘要

高斯过程(GP)推断在Python中主要由GPyTorch和GPflow等库主导,这些库基于深度学习框架,继承了它们的调度开销和依赖项足迹。我们提出了LightGP,一个无依赖的C++17库,用于GP回归,并提供Python绑定,支持Apple Metal和NVIDIA CUDA后端,以及通过Apple Accelerate和OpenBLAS优化的CPU路径。LightGP提供了四种推断路径——精确的Cholesky分解、无矩阵的共轭梯度法、稀疏变分自由能和结构化核插值(SKI)与FFT——覆盖从N=100到N=500,000的问题。在Apple M4上,LightGP CPU在精确GP推断中比GPyTorch CPU快2.6-8.7倍,在稀疏GP推断中每种规模都快1.5倍。在NVIDIA RTX 3060上,LightGP CUDA在精确GP推断中比GPyTorch CUDA快2.3-6.7倍,直到N=2048,而在N=4096时GPyTorch缩小了差距。在Metal上融合的无矩阵核-向量乘积在N=20,000时以O(N)内存实现了32倍的性能提升,而通过Accelerate vDSP加速的SKI矩阵-向量乘法在N=200,000时运行在亚毫秒级别。LightGP编译为一个单一的静态库,无外部依赖,并可通过pip install lightgp安装。

英文摘要

Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths -- exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT -- covering problems from $N{=}100$ to $N{=}500{,}000$. On an Apple M4, LightGP CPU is 2.6--8.7$\times$ faster than GPyTorch CPU for exact GP and ${\sim}1.5\times$ faster for sparse GP at every scale tested. On an NVIDIA RTX 3060, LightGP CUDA is 2.3--6.7$\times$ faster than GPyTorch CUDA for exact GP up to $N{=}2{,}048$, with GPyTorch closing the gap at $N{=}4{,}096$. A fused matrix-free kernel-vector product on Metal achieves a 32$\times$ speedup over the explicit path at $N{=}20{,}000$ with $O(N)$ memory, and an FFT-accelerated SKI matvec via Accelerate vDSP runs in sub-millisecond time at $N{=}200{,}000$. LightGP compiles as a single static library with zero external dependencies and is installable via pip install lightgp.

2605.17888 2026-05-19 physics.flu-dyn cs.LG 版本更新

Long-horizon prediction of three-dimensional wall-bounded turbulence with CTA-Swin-UNet and resolvent analysis

利用CTA-Swin-UNet和分辨率分析进行三维壁湍流长周期预测

Bo Chen, Yitong Fan, Jie Yao, Weipeng Li

发表机构 * School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China(航空航天学院,上海交通大学,上海200240,中国) School of Interdisciplinary Science, Beijing Institute of Technology, Beijing 100081, China(交叉科学学院,北京理工大学,北京100081,中国)

AI总结 本文提出了一种混合机器学习框架,通过CTA-Swin-UNet和多时间尺度融合校正策略,有效预测壁平行平面的湍流场,并通过分辨率基谱线性随机估计重构三维流场,展示了该框架在长周期自回归预测中的有效性与计算效率。

Comments 40 pages, 18 figures

详情
AI中文摘要

利用机器学习方法对三维(3D)壁湍流进行长周期预测仍是一项具有挑战性的任务,由于自回归误差的快速累积以及显著的计算成本。为解决这些挑战,我们提出了一种混合机器学习框架,其中开发了通道-时间-注意Swin-UNet(CTA-Swin-UNet)和多时间尺度融合校正(MTFC)策略,以在可控计算成本下预测壁平行平面的湍流场。然后,通过基于分辨率的谱线性随机估计(SLSE)重构三维流场,根植于预测的平面流。结果表明,CTA-Swin-UNet在单步预测和自回归滚动预测中均优于基线模型(LSTM、FNO和传统Swin-UNet),表明将CTA模块引入Swin-UNet架构是有效的。在相同的时间间隔内,CTA-Swin-UNet在约150次滚动步骤内保持稳定,而基线模型在20至50次滚动步骤内失败。引入MTFC策略后,实现了长达300次的预测周期。使用基于分辨率的SLSE重构进一步从预测的平面输入中恢复三维流结构和能量谱分布,这表明所提出的框架为三维壁湍流的长周期自回归预测提供了一种有效且计算高效的途径。

英文摘要

Long-horizon prediction of three-dimensional (3D) wall-bounded turbulence with machine-learning methods remains a challenging task, due to the rapid accumulation of autoregressive errors and the substantial computational cost. To address these challenges, we present a hybrid machine-learning framework, in which a channel-time-attention Swin-UNet (CTA-Swin-UNet) and a multi-time-scale fusion correction (MTFC) strategy are developed to predict the turbulent flow fields in a wall-parallel plane at an affordable computational cost. Then, 3D flow fields are reconstructed from the predicted planar flow via a resolvent-based spectral linear stochastic estimation (SLSE). Results show that the CTA-Swin-UNet outperforms the baseline models (LSTM, FNO and traditional Swin-UNet) in both single-step prediction and autoregressive rollouts, indicating the effectiveness of introducing the CTA module into the Swin-UNet architecture. At the same temporal interval, the CTA-Swin-UNet remains stable for approximately 150 rollout steps, while the baseline models fail within 20 to 50 rollout steps. After introducing the MTFC strategy, a longer horizon of up to 300 steps is achieved. The resolvent-based SLSE reconstruction further recovers the 3D flow structures and energy spectral distributions from the predicted planar inputs, which demonstrates that the proposed framework provides an effective and computationally efficient approach for long-horizon autoregressive prediction of 3D wall-bounded turbulence.

2605.17887 2026-05-19 cs.LG cs.AI 版本更新

Attention Sinks and Outliers in Attention Residuals

注意力沉底与注意力残差中的异常值

Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

发表机构 * Department of Computer Science, Northwestern University(西北大学计算机科学系) Department of Computer Science and Engineering, University of Michigan(密歇根大学计算机科学与工程系) Department of Statistics and Data Science, University of California Los Angeles(加州大学洛杉矶分校统计与数据科学系) Department of Electrical and Computer Engineering, University of California San Diego(加州圣地亚哥大学电气与计算机工程系) Department of Computer Science, Rutgers University-New Brunswick(新泽西州立大学鲁特学院计算机科学系) Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系) Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系)

AI总结 本文提出OASIS技术,通过层间空信号来解决注意力残差架构中注意力沉底、激活异常值以及推理稳定性下降的问题,通过双归一化设计和实验验证提升了模型的结构鲁棒性和量化鲁棒性。

详情
AI中文摘要

我们提出OASIS,一种基于层间空信号的异常值和沉底感知技术。由于AttnResidual架构引入了额外的深度归一化通道,它们提高了层间路由的灵活性,但也加剧了注意力沉底、激活异常值以及由此导致的推理稳定性和量化鲁棒性下降。OASIS通过引入基于Softmax1的空空间,并通过层间空信号将token级的空证据耦合到深度路由中,从而减少由沉底主导的路由并提高结构鲁棒性。理论上,我们证明了AttnResidual的双归一化设计加剧了沉底形成和量化脆性。实验上,我们在三个真实世界数据集上将OASIS与五个基线进行比较,并观察到在注意力沉底和后量化性能方面有持续的改进。值得注意的是,OASIS在评估设置中实现了最大无穷范数平均减少9.26%、平均峰度减少2.60%,并在W8A8下将困惑度降低了75.85%,在W4A4下将GSM8K Pass@1提高了12.42%。

英文摘要

We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
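
The "Softmax1-based null space" mentioned above is not defined in the abstract. A minimal sketch, assuming Softmax1 refers to the commonly discussed "softmax plus one" variant, $\mathrm{softmax}_1(x)_i = e^{x_i} / (1 + \sum_j e^{x_j})$, is given below. The function name is hypothetical, and how OASIS couples the leftover mass to depth routing is not shown.

```python
import math


def softmax1(scores):
    """Softmax with an implicit extra logit fixed at 0 (assumed definition).

    The outputs sum to less than 1; the missing mass acts as a "null"
    option, so a head is not forced to attend to any token.
    """
    # Include the implicit zero logit in the stabilizing maximum.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)  # exp(-m) is the rescaled "+1" term
    return [e / denom for e in exps]


weights = softmax1([-4.0, -5.0, -6.0])
null_mass = 1.0 - sum(weights)  # mass left unassigned to any token
print(weights, null_mass)
```

When all scores are strongly negative, almost all mass goes to the null option instead of being dumped onto a single token, which is the usual motivation given for this variant in discussions of attention sinks.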

2605.17879 2026-05-19 cs.DC cs.AI cs.LG 版本更新

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training

Guard:用于大规模训练的可扩展的延迟检测和节点健康管理

Guanliang Liu, Abhinandan Patni, Congzhu Lin, Zoe Zeng, Jack Wittmayer, Josh Wu, Ashvin Nihalani, Binxuan Huang, Yinghong Liu, Rory Na, Anthony Ko, Alexander Zhipa, Cong Cheng, Mi Sun, Vijay Rajakumar, Rejith George Joseph, Parthasarathy Govindarajen

发表机构 * Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country(匿名机构,匿名城市,匿名地区,匿名国家)

AI总结 本文提出Guard系统,通过在线性能监控和离线节点扫描机制,有效检测训练中的延迟节点并确保节点健康,从而提升大规模训练的效率和稳定性。

Comments Proceedings of the 9th MLSys Conference, Bellevue, WA, USA, 2026

详情
AI中文摘要

训练前沿规模的基础模型需要协调成千上万的GPU进行多月运行,其中即使微小的性能退化也会累积成显著的效率损失。现有健康检查机制,如NCCL测试或GPU烧录,主要关注功能正确性,往往无法检测到悄无声息降低系统性能的fail-slow行为。在本文中,我们提出了Guard,一个用于检测stragglers并确保大规模训练集群中节点健康的可扩展系统。Guard结合了训练期间的轻量级在线性能监控与一个离线节点扫描机制,系统地评估和认证节点在参与生产工作负载之前。这种设计使Guard能够检测到传统诊断无法捕捉的急性故障和长期运行的fail-slow行为。在大规模基础模型预训练工作负载上部署Guard,可将平均FLOPs利用率提高多达1.7倍,将运行到运行的训练步骤方差从20%降至1%,增加平均故障时间(MTTF),并显著减少操作和调试开销。这些结果表明,主动检测stragglers和系统化的节点认证对于维持稳定和高效的大型训练至关重要。

英文摘要

Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.

2605.17873 2026-05-19 cs.LG cs.AI cs.CL 版本更新

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD:针对长 Horizon 智能体的定向 hindsight 自监督学习

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

AI总结 本文提出 HINT-SD,一种针对长 Horizon 智能体的定向 hindsight 自监督学习框架,通过全轨迹 hindsight 选择失败相关的动作,并仅在目标动作跨度上应用反馈条件自监督学习,实验表明该方法在 BFCL v3 和 AppWorld 上比密集的每回合反馈基线提高了 18.80 个百分点,同时训练时间降低 2.26 倍。

详情
AI中文摘要

训练具有长 horizon 的 LLM 智能体进行强化学习具有挑战性,因为稀疏结果奖励只能表明任务是否成功,而不能指示哪些中间动作导致了结果或如何修正。最近的方法通过从回合级动作-输出信号生成奖励或文本提示,或通过反馈条件自监督学习来缓解这一问题。然而,当许多中间回合已经成功或中性时,在每个回合生成反馈效率低下,而固定或错位的反馈难以监督导致失败的动作。为此,我们提出了 HINT-SD,一种基于全轨迹 hindsight 的定向自监督学习框架,用于选择失败相关的动作,并仅在目标动作跨度上应用反馈条件自监督学习。在 BFCL v3 和 AppWorld 上的实验表明,我们的方法在比密集的每回合反馈基线提高 18.80 个百分点的同时,实现了 2.26 倍更低的训练时间,表明选择何时进行自监督学习是有效且高效的长 horizon 智能体训练的关键因素。

英文摘要

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.

2605.17862 2026-05-19 cs.LG cs.AI 版本更新

$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

f-OPD: 通过新鲜度感知控制稳定长周期在线策略蒸馏

Xianwei Chen, Shimin Zhang, Jibin Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出f-OPD框架,通过引入样本级新鲜度评分来稳定长周期在线策略蒸馏,实现性能与效率的平衡,为大规模长周期智能体训练奠定基础。

详情
AI中文摘要

在大规模语言模型中扩展在线策略蒸馏(OPD)面临根本性矛盾:异步执行是系统效率的必要条件,但结构上偏离理想的在线策略目标。为解决这一挑战,我们理论上将目标偏差分解为回放漂移和监督漂移,分别捕捉学生回放和教师上下文的陈旧性。基于此,我们引入样本级新鲜度评分,量化缓冲样本相对于在线策略目标的可靠性。受此信号引导,我们进一步提出f-OPD,一种新颖的框架,能够自适应调节陈旧样本的影响并约束异步训练下累积的策略漂移。在推理、工具使用和编码代理任务中,f-OPD在增加交互周期时,始终能够实现与同步优化相当的任务性能,同时保留异步执行的吞吐量优势。我们的结果建立了OPD中实现性能-效率权衡的第一个配方,为大规模长周期智能体训练铺平道路。

英文摘要

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.

2605.17854 2026-05-19 cs.LG 版本更新

Learning over Positive and Negative Edges with Contrastive Message Passing

通过对比信息传递学习正负边

Peter Pao-Huang, Charilaos I. Kanatsoulis, Michael Bereket, Jure Leskovec

发表机构 * Department of Computer Science(计算机科学系) Stanford University(斯坦福大学)

AI总结 本文研究了在低标签率、高同质性和高边密度设置下,负边信息对图表示学习的价值,并提出对比信息传递机制以同时利用正负边信息提升性能。

详情
AI中文摘要

传统的图学习方法沿现有边(即正边)进行信息传递来更新节点特征,但这些方法往往忽视了边的缺失(即负边)中可能有价值的信息。本文理论分析了负边在图表示中的价值,并证明在低标签率、高同质性和高边密度设置下,访问负边能提供比仅使用正边更大的信息增益。受此启发,我们引入对比信息传递(CMP),一种通用的信息传递架构,使图神经网络层能够推理正负边信息。通过在可学习权重上施加软正半定约束,我们的方法对正连接节点应用相似性保持变换,对负连接节点应用不相似性诱导变换。在不同数据条件下的模拟和真实数据集上,当负边具有信息量时,CMP在低标签设置下始终优于基线方法。

英文摘要

Conventional approaches to learning on graphs involve message passing along existing (i.e., positive) edges to update node features. However, these approaches often disregard the potentially valuable information contained in the absence of edges (i.e., negative edges). Here, we theoretically analyze the value of negative edges in graph representations and prove that in settings of low label rates, high homophily, and high edge density, access to negative edges provides significant information gain over using only positive edges. Motivated by this insight, we introduce Contrastive Message Passing (CMP), a general message passing architecture that enables graph neural network layers to reason over positive and negative edges. By imposing soft positive semidefinite constraints on the learnable weights, our approach differentially applies similarity-preserving transformations to positively connected nodes and dissimilarity-inducing transformations to negatively connected nodes. Over simulated and real datasets in varying data regimes, CMP consistently outperforms baselines in low-label settings when negative edges are informative.

2605.17850 2026-05-19 stat.ML cs.CV cs.LG cs.NA math.NA math.PR 版本更新

Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

通过路径测度的序列蒙特卡洛实现扩散模型的简单近似与无导数推理时间缩放

Chenyang Wang, Weizhong Wang, Yinuo Ren, Jose Blanchet, Yiping Lu

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China School of Mathematical Sciences, Fudan University, Shanghai, China Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL, United States Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA, United States Management Science & Engineering, Stanford University, Stanford, CA, United States

AI总结 本文提出URGE算法,一种无需梯度的推理时间缩放方法,通过路径重要性重加权提升扩散模型样本质量,同时在合成测试和扩散模型基准中表现出色,且实现简单且无梯度依赖。

Comments accepted by ICML 2026

详情
AI中文摘要

扩散生成模型越来越多地依赖于推理时间引导,通过添加漂移项或重新加权专家混合物来提高任务特定目标的样本质量。然而,大多数现有技术需要重复评估分数或梯度,引入偏差、高计算开销或两者兼有。我们引入URGE(Unbiased Resampling via Girsanov Estimation),一种无导数的推理时间缩放算法,通过Girsanov测度变换进行路径重要性重加权。与先前工作不同,URGE为每个模拟轨迹附加简单的乘法权重,并定期重新采样。无需计算基于梯度的粒子权重。我们建立了路径级和粒子级SMC之间的等价性:Girsanov路径权重允许一个向后条件期望,恢复先前的粒子级权重,保证两种方案产生相同的无偏终端分布。经验上,URGE在合成测试和扩散模型基准中优于现有推理时间引导基线,实现了更好的生成质量,同时显著更简单且完全无梯度依赖。

英文摘要

Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting a mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce URGE, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights as in previous work, URGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, URGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.
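
The "multiplicative weight per trajectory plus periodic resampling" pattern described above is the generic sequential Monte Carlo loop, sketched below in plain Python. This is not the URGE algorithm: the abstract does not give the Girsanov weight formula, so `log_increment` is a placeholder supplied by the caller, and all function names, the multinomial resampling scheme, and the fixed resampling period are assumptions made for illustration.

```python
import math
import random


def normalize(log_weights):
    """Turn log-weights into probabilities (stable against overflow)."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def resample(particles, log_weights, rng):
    """Multinomial resampling: draw particles in proportion to their weight,
    then reset all log-weights to zero."""
    probs = normalize(log_weights)
    chosen = rng.choices(particles, weights=probs, k=len(particles))
    return chosen, [0.0] * len(particles)


def run_smc(particles, step, log_increment, n_steps, resample_every, seed=0):
    """Propagate particles, accumulate multiplicative weights in log space,
    and resample every `resample_every` steps."""
    rng = random.Random(seed)
    log_w = [0.0] * len(particles)
    for t in range(1, n_steps + 1):
        moved = [step(x, rng) for x in particles]
        # Placeholder weight update; the real path weight is method-specific.
        log_w = [lw + log_increment(old, new)
                 for lw, old, new in zip(log_w, particles, moved)]
        particles = moved
        if t % resample_every == 0:
            particles, log_w = resample(particles, log_w, rng)
    return particles, log_w
```

A toy call might use a Gaussian random walk for `step` and a quadratic reward such as `lambda old, new: -0.5 * (new - 1.0) ** 2` for `log_increment`; neither requires a gradient of the reward, which is the property the abstract emphasizes.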

2605.17849 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练令牌以实现数据驱动的扩展

Zichun Yu, Chenyan Xiong

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所)

AI总结 本文提出SynPro框架,通过重新表述和重新格式化操作,帮助大语言模型更充分地利用有限的有机数据,从而在数据驱动的预训练中实现更高效的扩展。

详情
AI中文摘要

LLM预训练正从计算驱动转向数据驱动的阶段,其中可用的人类(有机)文本远远无法满足扩展需求。然而,达到数据驱动阶段并不意味着模型已充分利用其有机语料库。在本文中,我们介绍了SynPro,一个合成数据生成框架,帮助LLM更深入地学习有限的有机数据。SynPro应用两种操作,即重新表述和重新格式化,以多样化的形式呈现相同的有机源,以促进更深层次的学习,而无需引入外部信息。两个生成器通过强化学习优化,使用质量、忠实度和数据影响奖励进行优化,并在预训练平台期持续更新,以针对模型尚未吸收的内容。我们使用DCLM-Baseline的10%最优令牌(0.8B和2.2B)预训练400M和1.1B模型,反映了前沿预训练中现实的数据驱动阶段。我们的结果表明,有机数据被标准重复方法显著低估:SynPro解锁了比重复方法多3.7-5.2倍的有效令牌,甚至在1.1B规模上超过了非数据驱动的Oracle,该Oracle在等效唯一数据上训练。分析证实,忠实、模型意识的合成可以在不导致分布崩溃的情况下实现数据驱动的扩展。我们开源代码在https://github.com/cxcscmu/SynPro。

英文摘要

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs learn more thoroughly from limited organic data. SynPro applies two operations, rephrasing and reformatting, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.

2605.17833 2026-05-19 cs.LG cs.AI 版本更新

Efficient Bilevel Optimization for Meta Label Correction in Noisy Label Learning

高效的元标签校正中的双层优化

Ba Hoang Anh Nguyen, Viet Cuong Ta

发表机构 * Human-Machine Interaction Laboratory, VNU University of Engineering and Technology(人机交互实验室,越南工程与技术大学)

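AI Summary: This paper proposes EBOMLC, an efficient meta label correction method that introduces a one-step inner-loop update, a mixture upper loss, and an alignment-aware dynamic barrier to improve the training efficiency and stability of the meta model; experiments show strong performance under high noise rates.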

详情
AI中文摘要

训练深度神经网络时使用噪声标签可以降低数据标注成本,但可能会将噪声引入学习模型中。在元标签校正方法中,除了主模型外,还会训练一个额外的元模型,使用小规模干净数据集来校正大规模噪声数据集。然而,元模型的更新需要在主模型的内部步骤中计算超梯度,这会显著增加计算成本。为了提高训练效率,我们首先引入动态障碍梯度下降到标准元标签校正中。虽然这种直接扩展能够将训练过程的速度提高到大约一阶复杂度,但缺乏防止噪声信号泄漏到主模型和稳定元模型学习的机制。基于这一观察,我们提出了EBOMLC方法,其设计包含三个关键改进:一步内循环更新、混合上界损失和对齐感知的动态障碍物。在CIFAR-10和CIFAR-100上的实验结果表明,EBOMLC在高噪声率设置下优于其他基线方法,同时减少了元标签校正方法的训练时间。

英文摘要

Training a deep neural network with noisy labels can reduce data annotation cost but may introduce noise into the learned model. In meta label correction approaches, an additional meta model is trained alongside the main model, using a small, clean dataset to correct the large, noisy dataset. However, updating the meta model requires computing hypergradients at the inner step of the main model, which significantly increases the computational cost. To improve training efficiency, we first introduce dynamic barrier gradient descent into standard meta label correction. While this naive extension speeds up training to approximately first-order complexity, it lacks mechanisms to prevent the leakage of noisy signals to the main model and to stabilize the learning of the meta model. Based on this observation, we propose the EBOMLC method, which is designed with three key improvements: a one-step inner-loop update, a mixture upper loss, and an alignment-aware dynamic barrier. Empirical results on CIFAR-10 and CIFAR-100 demonstrate that EBOMLC consistently outperforms other baselines, especially under high noise rate settings, while reducing the training time of the meta label correction approach.

2605.17831 2026-05-19 cs.LG cs.DB 版本更新

Agentic Cost-Aware Query Planning with Knowledge Distillation for Big Data Analytics

具有知识蒸馏的代理成本感知查询规划用于大数据分析

Mahdi Naser-Moghadasi

发表机构 * Research Division, BrightMind AI(BrightMind AI 研究部) Texas Tech University(德克萨斯理工大学) University of Texas at Arlington(德克萨斯大学阿灵顿分校)

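AI Summary: This paper proposes an agentic query planning system that combines a rule-based teacher planner, UCB1 bandit exploration, cost-aware prediction, and knowledge distillation into a lightweight student planner. It targets the high computational cost of query optimization in big data analytics and the memory and latency constraints of resource-constrained environments; on the NYC Taxi and IMDB datasets it reduces latency by 23% relative to default planners while maintaining 94% constraint satisfaction.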

Comments 8 pages, preprint, code at https://github.com/mahdinaser/agentic-kd-planner

详情
AI中文摘要

在大数据分析中查询优化仍然计算成本很高,尤其是在资源受限的环境中,传统优化器无法满足内存和延迟约束。我们提出了一种代理查询规划系统,结合规则基教师规划器、UCB1老虎机探索、成本感知预测和知识蒸馏来构建轻量级学生规划器。我们的教师规划器使用六个关键优化策略生成SQL计划,而UCB1老虎机搜索在显式资源约束下高效地探索计划空间。随机森林成本模型预测查询延迟,根据计划特征进行成本感知决策。蒸馏的学生规划器(逻辑回归或梯度提升)学习模仿教师-老虎机决策以实现快速推理。在纽约出租车和IMDB数据集上的评估显示,与默认规划器相比,延迟减少了23%,同时保持了94%的约束满足率。学生规划器在复制最优计划方面实现了89%的准确性,推理时间快15倍。我们的单文件实现使在资源受限机器上可重复的大数据分析成为可能,并在https://github.com/mahdinaser/agentic-kd-planner上公开发布。

英文摘要

Query optimization in big data analytics remains computationally expensive, particularly for resource-constrained environments where traditional optimizers fail to satisfy memory and latency constraints. We present an agentic query planning system that combines a rule-based teacher planner, UCB1 bandit exploration, cost-aware prediction, and knowledge distillation to a lightweight student planner. Our teacher planner generates SQL plans using six key optimization strategies, while UCB1 bandit search efficiently explores the plan space under explicit resource constraints. A Random Forest cost model predicts query latency from plan features, enabling cost-aware decisions. A distilled student planner (Logistic Regression or Gradient Boosting) learns to mimic teacher-bandit decisions for fast inference. Evaluation on NYC Taxi and IMDB datasets demonstrates 23% latency reduction compared to default planners while maintaining 94% constraint satisfaction. The student planner achieves 89% accuracy in replicating optimal plans with 15x faster inference time. Our single-file implementation enables reproducible big-data analytics on resource-limited machines and is publicly available at https://github.com/mahdinaser/agentic-kd-planner.
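To make the UCB1 exploration step above concrete, here is a minimal sketch in Python. It is not the authors' implementation: the reward definition (one minus latency over budget, zero when the budget is violated), the function names `ucb1_select` and `run_bandit`, and the latency values are all assumptions chosen for illustration. Only the UCB1 index itself (empirical mean plus `sqrt(2 ln t / n)`) is the standard algorithm.

```python
import math

def ucb1_select(counts, reward_sums, total_pulls):
    """Return the index of the arm (candidate plan) to try next under UCB1."""
    # Try every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    # UCB1 index: empirical mean plus an exploration bonus that shrinks
    # as an arm is pulled more often.
    def index(arm):
        mean = reward_sums[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(total_pulls) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=index)

def run_bandit(latencies_ms, budget_ms, rounds):
    """Explore plans with fixed, hypothetical latencies under a latency budget."""
    counts = [0] * len(latencies_ms)
    sums = [0.0] * len(latencies_ms)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        # Assumed reward in [0, 1]: faster plans score higher,
        # plans that exceed the budget score 0.
        reward = max(0.0, 1.0 - latencies_ms[arm] / budget_ms)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Three hypothetical plans; the second is fastest, the third violates the budget.
pull_counts = run_bandit([120.0, 40.0, 300.0], budget_ms=200.0, rounds=200)
```

In this toy run the pulls concentrate on the fastest plan while the budget-violating plan is sampled only occasionally, which is the behavior the exploration bonus is meant to produce.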

2605.17827 2026-05-19 cs.LG cs.AI 版本更新

Content-Style Identification via Differential Independence

通过微分独立性进行内容-风格识别

Subash Timilsina, Hoang-Son Nguyen, Sagar Shrestha, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA(电气工程与计算机科学学院,俄勒冈州立大学,科瓦利斯,俄勒冈,美国)

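AI Summary: This paper proposes a new structural condition, content-style differential independence (CSDI), that enables identifiability in generative analysis even when content and style are dependent. It imposes a blockwise orthogonality constraint on the Jacobian subspaces and designs a stochastic regularizer based on numerical Jacobian approximation to support high-dimensional generative models.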

Comments 24 pages, 15 figures, ICML 2026

详情
AI中文摘要

生成分析经常将多领域观察建模为领域不变内容变量和领域特定风格变量的非线性混合。从不成对的领域中识别这两种因素可以实现域迁移和反事实数据生成等任务。先前的工作在内容和风格之间(块状)统计独立性或通过非线性混合函数的稀疏雅可比假设下建立了可识别性,但这些条件在实践中可能过于严格。在本文中,我们引入了内容-风格微分独立性(CSDI),一种替代的结构条件,要求内容和风格的微小变化在数据流形上诱导正交方向,从而在内容和风格依赖且雅可比密集时也能实现可识别性。我们通过在内容和风格相关的雅可比子空间上施加块状正交约束来操作化这一条件。为了支持高维生成模型,我们设计了一个基于数值雅可比近似的随机正则化器,从而在如高分辨率图像生成等设置中实现可扩展训练。在多个数据集上的实验验证了可识别性分析,并展示了反事实生成和域迁移的实用优势。

英文摘要

Generative analysis often models multi-domain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive in practice. In this work, we introduce content-style differential independence (CSDI), an alternative structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation.
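The blockwise orthogonality idea can be sketched with a finite-difference Jacobian. The code below is only an illustration under stated assumptions: the penalty (sum of squared inner products between content and style Jacobian columns), the helper names, and the two toy mixing functions are hypothetical, and the loop over all coordinates is deterministic, whereas the abstract describes a stochastic regularizer whose exact form is not given here.

```python
def numerical_jacobian_columns(g, z, eps=1e-5):
    """Central-difference approximation of the Jacobian columns of g at z."""
    cols = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        out_plus, out_minus = g(z_plus), g(z_minus)
        cols.append([(a - b) / (2 * eps) for a, b in zip(out_plus, out_minus)])
    return cols

def cross_block_penalty(g, z, n_content, eps=1e-5):
    """Sum of squared inner products between content and style Jacobian columns.

    The first n_content latent coordinates are treated as content, the rest as style.
    The penalty is zero exactly when the two column blocks span orthogonal subspaces.
    """
    cols = numerical_jacobian_columns(g, z, eps)
    content, style = cols[:n_content], cols[n_content:]
    penalty = 0.0
    for jc in content:
        for js in style:
            dot = sum(a * b for a, b in zip(jc, js))
            penalty += dot * dot
    return penalty

# Toy mixing functions of one content variable c and one style variable s.
def orthogonal_mix(z):
    c, s = z
    return [c + s, c - s]  # Jacobian columns (1, 1) and (1, -1) are orthogonal

def entangled_mix(z):
    c, s = z
    return [c + s, c]      # Jacobian columns (1, 1) and (1, 0) are not orthogonal

penalty_ok = cross_block_penalty(orthogonal_mix, [0.3, -0.2], n_content=1)
penalty_bad = cross_block_penalty(entangled_mix, [0.3, -0.2], n_content=1)
```

The first mixing function incurs essentially no penalty, while the second is penalized because a change in style moves the output along a direction that overlaps with the content direction.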

2605.17811 2026-05-19 cs.LG cs.AI math.OC 版本更新

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

一个模型,两种角色:共享递归变压器中的涌现专业化

Jucheng Shen, Barbara Su, Anastasios Kyrillidis

发表机构 * Rice University(莱斯大学)

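AI Summary: This study investigates whether a shared-weight recurrent Transformer can develop distinct internal roles without being partitioned into separate modules. Using the Asymmetric Input Recurrence (AIR) architecture, it finds that the model's internal states differentiate into distinct functional roles and relates this differentiation to the model's state dynamics.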

Comments 21 pages, 13 figures, 8 tables

详情
AI中文摘要

共享权重的递归Transformer能否在未被分割成独立模块的情况下发展出不同的内部角色?我们研究了不对称输入递归(AIR),这是一种最小的两状态推理架构,其中同一个Transformer模型被重复用于两种更新(沿用文献记法,称为L和H),更新规则中唯一的内置差异是编码输入在L更新中被注入,而在H更新中不被注入。在Sudoku-Extreme和Maze中,解码的rollouts揭示出一致的分裂:$z_H$表现得像一个完全承诺的提案状态,而$z_L$保留局部不确定性和不断变化的中间结构。冻结实验显示,这种分裂实际上与模型的状态动态有关:在Sudoku中,冻结$z_H$会减少$z_L$的内容变化,而冻结$z_L$会增加$z_H$的内容变化;而在Maze中,冻结任一状态都会增加另一个状态的内容变化。消融实验显示,为了诱导专业化,共享模型需要能够区分两种更新类型,要么通过输入注入的不对称性,要么通过一个单独的层级标记。机理上,注意力分析显示在Sudoku和Maze中,L更新始终比H更新更局部。这些结果表明,在两状态递归设置中,清晰的状态身份信号可以在共享参数的递归Transformer内部诱导出稳定的、相关的功能角色。代码可在https://github.com/juchengshen/air获得。

英文摘要

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $z_H$ behaves like a fully committed proposal state, whereas $z_L$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $z_H$ reduces $z_L$'s content changes whereas freezing $z_L$ increases $z_H$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at https://github.com/juchengshen/air.

2605.17808 2026-05-19 cs.LG stat.ML 版本更新

A Unified Framework for Data-Free One-Step Sampling via Wasserstein Gradient Flows

通过Wasserstein梯度流构建数据免费一步采样的统一框架

Chenguang Wang, Tianshu Yu

发表机构 * School of Data Science(数据科学学院) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

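AI Summary: This paper proposes a unified theoretical framework for data-free one-step sampling based on Wasserstein gradient flows. It shows that the velocity field induced by f-divergence objectives admits a universal form, derives a compression-elasticity identity linking divergence choice to the geometry of mass transport via a soft under-coverage functional, extends the framework to the Log-Variance divergence, and achieves one-step inference through a KDE-based implementation and a normalizing-flow route.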

详情
AI中文摘要

我们开发了一种基于Wasserstein梯度流的数据免费一步采样的统一理论框架。对于广泛的标准f-分歧度目标,我们证明诱导速度场具有通用形式V(x)=w(r(x))β(x),其中β(x)=∇log(p(x)/q(x))在不同目标中共享,而w仅由分歧度的选择决定。这种分解表明标准f-分歧度漂移共享相同的渐近目标分布p,并主要区别于如何在欠覆盖区域重新分配瞬时修复努力。为了正式化这种区别,我们推导了软欠覆盖功能的一步区域响应理论,并获得了一个将分歧度选择与质量运输进入欠覆盖区域的几何联系的压缩-弹性恒等式。我们进一步将该框架扩展到Log-Variance (LV)分歧度,分析参考分布如何改变最终的漂移结构,并提出一个实用的LV启发式替代方案用于数据免费训练。基于此理论,我们通过KDE实现该框架,并描述了互补的归一化流路线,从而在训练后实现一步推断。在多模态高斯混合基准测试中的实验结果与理论预测一致,并在这些目标上展示了有效的一步采样。

英文摘要

We develop a unified theoretical framework for data-free one-step sampling from unnormalized target distributions based on Wasserstein gradient flows. For a broad class of standard f-divergence objectives, we show that the induced velocity field admits the universal form $\mathbf{V}(x)=w(r(x))\,β(x)$, where $β(x)=\nabla \log (p(x)/q(x))$ is shared across objectives and $w$ is determined solely by the choice of divergence. This decomposition shows that standard f-divergence drifts share the same asymptotic target distribution $p$ and differ primarily in how they redistribute transient repair effort across under-covered regions. To formalize this distinction, we derive a one-step regional-response theory for a soft under-coverage functional and obtain a compression--elasticity identity that links divergence choice to the geometry of mass transport into under-covered regions. We further extend the framework beyond the f-divergence family to the Log-Variance (LV) divergence, analyze how the reference distribution alters the resulting drift structure, and motivate a practical LV-inspired surrogate for data-free training. Based on this theory, we instantiate the framework with a KDE-based implementation and describe a complementary normalizing-flow route, enabling one-step inference after training. Experiments on multimodal Gaussian-mixture benchmarks are consistent with the theoretical predictions and demonstrate effective one-step sampling on these targets.
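The universal form of the velocity field can be recovered from the standard first-variation calculation for Wasserstein gradient flows. The derivation below is a sketch under one assumption that the abstract does not state: that $r = q/p$ and the objective is $F(q) = D_f(q \,\|\, p) = \int p\, f(q/p)\, dx$. Under the opposite ratio convention the weight $w$ takes a different but equivalent form.

```latex
% Assumed convention (not stated in the abstract): r = q/p and
% F(q) = D_f(q \| p) = \int p(x)\, f(r(x))\, dx.
\begin{align*}
\frac{\delta F}{\delta q}(x) &= f'(r(x)), \\
\mathbf{V}(x) &= -\nabla \frac{\delta F}{\delta q}(x)
  = -f''(r(x))\,\nabla r(x)
  = r(x)\,f''(r(x))\,\nabla \log\frac{p(x)}{q(x)}, \\
w(r) &= r\,f''(r).
\end{align*}
% Example: f(r) = r \log r (reverse KL) gives f''(r) = 1/r, hence w(r) = 1.
```

The second equality in the velocity line uses $\nabla r = r\,\nabla \log r = -r\,\nabla \log (p/q)$. For the reverse KL case the weight is constant, so the drift reduces to $\nabla \log p - \nabla \log q$, the familiar Fokker-Planck velocity.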

2605.17806 2026-05-19 cs.LG 版本更新

AMO: Adaptive Muon Orthogonalization

AMO:自适应缪子正交化

Xinlin Zhuang, Panyi Ouyang, Yichen Li, Jiangming Shi, Yizhang Chen, Shuman Liu, Ying Qian, Weiyang Liu, Haibo Zhang, Imran Razzak

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shopee MBZUAI East China Normal University(华东师范大学) Huazhong University of Science and Technology(华中科技大学) Xiamen University(厦门大学)

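AI Summary: This paper studies the per-matrix heterogeneity of orthogonalization in the Muon optimizer and proposes Adaptive Muon Orthogonalization, which measures weight geometry to allocate the Newton-Schulz (NS) budget and improves pre-training performance.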

Comments preprint, under-review

详情
AI中文摘要

缪子最近作为一种替代AdamW的预训练优化器出现,其核心操作是通过牛顿-施鲁茨(NS)迭代实现正交化。现有缪子变体对所有参数矩阵应用统一的NS调度,忽略了正交化难度的差异及其对性能的影响。通过系统性的实证研究,我们发现这种每矩阵异质性普遍存在,主要由矩阵几何决定,其在不同操作类型、训练阶段和网络深度下动态变化。因此,统一的NS调度可能导致模型中正交化质量不均。受此启发,我们提出自适应缪子正交化(AMO),一种观察后承诺的方法,通过早期测量操作类型权重几何特性,并利用这些信号为剩余训练分配NS预算。AMO在标准、延长和连续预训练中均优于统一调度的缪子,其在Llama3.1-1.4B上平均下游性能提升+0.76,在Qwen3-1.7B上提升+0.51。

英文摘要

Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training. AMO delivers consistent improvements over uniform-schedule Muon across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average downstream performance of 12 evaluation tasks.
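To illustrate why a uniform NS schedule can yield uneven orthogonalization quality, the sketch below runs the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$ on two small matrices. This is an assumption made for clarity: Muon implementations typically use a tuned higher-order polynomial, and AMO's actual budget-allocation rule is not reproduced here. The helper names and the two test matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps):
    """Approximate the orthogonal polar factor of g with `steps` cubic NS iterations."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]  # singular values now lie in [0, 1]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

def orthogonality_error(x):
    """Frobenius norm of X X^T - I."""
    xxt = matmul(x, transpose(x))
    return math.sqrt(sum((xxt[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(len(x)) for j in range(len(x))))

# Hypothetical matrices: one well-conditioned, one with a tiny singular value.
well_conditioned = [[1.0, 0.2], [0.1, 0.9]]
ill_conditioned = [[1.0, 0.0], [0.0, 0.01]]

err_well_5 = orthogonality_error(newton_schulz(well_conditioned, 5))
err_ill_5 = orthogonality_error(newton_schulz(ill_conditioned, 5))
err_ill_40 = orthogonality_error(newton_schulz(ill_conditioned, 40))
```

With the same five steps the well-conditioned matrix is nearly orthogonal while the ill-conditioned one is still far from it, and only a larger step count closes the gap. That difference in required steps is the kind of per-matrix heterogeneity the abstract describes.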

2605.17799 2026-05-19 cs.CV cs.LG 版本更新

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

长尾分布外检测是否需要复杂的训练?从特征几何角度的重新思考

Ningkang Peng, Xuanming Chen, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

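AI Summary: This paper revisits long-tailed out-of-distribution detection and proposes simplifying detection through feature geometry, improving the Mahalanobis distance computation to raise detection performance.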

详情
AI中文摘要

长尾分布外检测通常通过专门的训练方法解决,包括引入分布外数据、回避头、对比目标、能量损失或梯度冲突控制。我们表明这些训练机制可能掩盖了一个更简单的问题:冻结的长尾表示可能已经包含有用的分布外证据,但原始Mahalanobis距离受到频率耦合特征半径和不充分支持的尾部协方差的影响。我们提出了超球面池化Mahalanobis(HPM)方法,一种后处理检测器,将特征归一化到单位球面,并用池化、岭正则化的度量替换类特定协方差,同时保持类均值作为语义锚点。在CIFAR-LT实验和ImageNet-100-LT近分布外边界分析中,HPM提高了原始Mahalanobis评分;对于先验校准经验风险最小化(PC-ERM),在CIFAR-10-LT上将AUROC从46.49提升到85.67,在CIFAR-100-LT上从50.40提升到78.35。这个简单的PC-ERM+HPM流程在CIFAR-100-LT上实现了最佳对数效率分数(LES;3.08),在显著降低训练时间成本的情况下,保留了约95%的最佳CIFAR-100-LT AUROC观测值。这些结果表明,在长尾分布外检测中应分别评估表示质量、检测器几何和训练复杂性。

英文摘要

Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.
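The three ingredients named in the abstract (unit-sphere normalization, a pooled ridge-regularized covariance, and class means as anchors) can be sketched as follows. This is a hedged illustration, not the paper's code: it is restricted to 2-D features so the matrix inverse has a closed form, and the ridge value, function names, and toy features are assumptions.

```python
import math

def normalize(v):
    """Project a feature vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fit_hpm(features_by_class, ridge=1e-2):
    """Fit class means and a pooled, ridge-regularized precision matrix (2-D features)."""
    means, residuals = {}, []
    for label, feats in features_by_class.items():
        unit = [normalize(f) for f in feats]
        mu = [sum(col) / len(unit) for col in zip(*unit)]
        means[label] = mu
        residuals += [[z[0] - mu[0], z[1] - mu[1]] for z in unit]
    n = len(residuals)
    # One covariance shared by all classes, plus ridge * I on the diagonal.
    a = sum(r[0] * r[0] for r in residuals) / n + ridge
    b = sum(r[0] * r[1] for r in residuals) / n
    d = sum(r[1] * r[1] for r in residuals) / n + ridge
    det = a * d - b * b
    precision = [[d / det, -b / det], [-b / det, a / det]]  # closed-form 2x2 inverse
    return means, precision

def hpm_score(x, means, precision):
    """Smallest squared Mahalanobis distance to any class mean (larger = more OOD-like)."""
    z = normalize(x)
    best = float("inf")
    for mu in means.values():
        dx, dy = z[0] - mu[0], z[1] - mu[1]
        dist = (dx * (precision[0][0] * dx + precision[0][1] * dy)
                + dy * (precision[1][0] * dx + precision[1][1] * dy))
        best = min(best, dist)
    return best

# Toy long-tailed data: a head class with three samples, a tail class with two.
train = {
    "head": [[2.0, 0.1], [3.0, -0.1], [1.0, 0.05]],
    "tail": [[0.1, 5.0], [-0.1, 0.5]],
}
means, precision = fit_hpm(train)
score_id_small = hpm_score([1.0, 0.02], means, precision)
score_id_large = hpm_score([10.0, 0.2], means, precision)
score_ood = hpm_score([-1.0, -1.0], means, precision)
```

Because features are normalized first, the two in-distribution queries that differ only in radius receive the same score, and the tail class borrows covariance support from the head class through pooling.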

2605.17795 2026-05-19 cs.LG cs.CV 版本更新

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

当准确性不够时:噪声标签学习与分布外检测之间的不确定性崩溃

Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

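AI Summary: This paper studies uncertainty collapse between noisy label learning and out-of-distribution detection. It presents a learner-agnostic ACC-OOD benchmark, shows that high accuracy does not guarantee OOD reliability, and proposes Virtual Margin Regularization to mitigate the problem.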

详情
AI中文摘要

噪声标签学习(LNL)通常通过封闭集分类准确率进行评估,但部署时往往需要分类器能够拒绝分布外(OOD)输入。我们提出了一种学习者无关的ACC-OOD基准,冻结LNL检查点,并在合成和真实噪声标签上评估它们,使用标准化的近/远OOD路由和事后评分。该基准揭示了一种反复出现的失败模式:高封闭集准确率不保证OOD可靠性,因为低置信度、被错误分类的分布内样本可能在噪声训练下与OOD输入占据的得分和特征区域重叠。我们称之为这种病理现象不确定性崩溃。这种结构重叠可能导致高准确率的LNL方法在标准OOD评分下失去ID错误/OOD界面的分离性。作为干预措施,我们研究了虚拟边距正则化(VMR),一种轻量级的修复探针,主要通过PSSCL展示,通过在可信ID批次上合成边界虚拟异常值并扩大能量边距。VMR在不替换主机目标或牺牲封闭集准确率的情况下,部分减少了由崩溃引起的远OOD失败。这些结果支持LNL基准,同时报告封闭集泛化、开放世界可靠性以及结构重叠诊断。

英文摘要

Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.

2605.17792 2026-05-19 cs.LG physics.geo-ph 版本更新

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

HydroAgent: 通过模拟器引导的强化学习缩小前沿大语言模型与人类专家在水文模型校准之间的差距

Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang, Anjiang Wei, Jinwoong Yoo, Yang Hong

发表机构 * Civil, Environmental, and Architectural Engineering, University of Colorado Boulder(科罗拉多大学波尔德分校土木、环境与建筑工程系) Civil Engineering and Environmental Sciences, University of Oklahoma(俄克拉荷马大学土木工程与环境科学系) Department of Computer Science, University of Oklahoma(俄克拉荷马大学计算机科学系) Civil and Environmental Engineering, Stanford University(斯坦福大学土木与环境工程系) Department of Computer Science, Stanford University(斯坦福大学计算机科学系) NASA Goddard Space Flight Center(美国国家航空航天局戈达德空间飞行中心)

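AI Summary: This paper studies whether frontier large language model (LLM) agents can replace human hydrologic modelers in hydrologic model calibration and proposes HydroAgent, which is fine-tuned with reinforcement learning with simulation feedback (RLSF) to improve adaptability and accuracy across basins.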

详情
AI中文摘要

校准分布式水文模型是操作水资源管理中的关键瓶颈——径流预测、水库调度、干旱监测、基础设施设计和洪水预测都依赖于此。每个流域都需要专家将水文图谱特征转化为高维参数向量的调整,而这种工作流程无法在不同流域之间转移。我们问:前沿大语言模型(LLM)代理能否替代人类水文模型师?如果不能,需要什么条件?我们对九个前沿LLM代理——Claude Opus 4.6/4.7、Sonnet 4.6、GPT-5/5.4/5.4-pro和Gemini 2.5-pro/3.1-pro/3-flash——在由美国国家气象局用于暴雨预报的运营CREST分布式水文模型上进行基准测试。最佳的二十轮次Nash-Sutcliffe效率(NSE)在四个保留的水文站上跨越329-40,792平方公里的范围从-0.16(GPT-5.4)到0.75(Sonnet 4.6);上限在所有三个供应商和能力层级中都保持一致,最强的模型集中在0.65-0.75范围内,除了Opus-4.7在其中一个水文站外,没有其他模型达到人类专家的参考水平。我们认为这个差距不是参数数量的问题,而是领域基础的问题。然后我们提出了HYDROAGENT,通过监督微调2,576条专家校准轨迹和使用NSE作为可验证奖励的组相对策略优化,对开放权重的Qwen3-4B进行微调——模拟器反馈的强化学习(RLSF)。对于地球系统科学,一个经过领域微调的策略,通过模拟器在环的强化学习,比扩展通用前沿模型更计算高效且物理上更忠实,而地球数据的多模态丰富性——遥感、现场时间序列和预报员叙述——使领域代理成为物理科学中人工智能发展的杠杆方向。

英文摘要

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management: streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents (Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash) on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km$^2$ ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations, i.e., reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data (remote sensing, in-situ time series, and forecaster narrative) makes domain agents a leveraged direction for AI in physical science.
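Since NSE serves both as the benchmark metric and as the reward, it may help to see the standard definition written out: one minus the ratio of the squared simulation error to the variance of the observations around their mean. The sketch below implements only that textbook formula; the function name and the sample series are illustrative and nothing here reflects how the CREST simulations are run.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0.0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - squared_error / variance

# Hypothetical streamflow observations.
obs = [1.0, 2.0, 3.0, 4.0]
nse_perfect = nash_sutcliffe_efficiency(obs, [1.0, 2.0, 3.0, 4.0])
nse_mean = nash_sutcliffe_efficiency(obs, [2.5, 2.5, 2.5, 2.5])
nse_reversed = nash_sutcliffe_efficiency(obs, [4.0, 3.0, 2.0, 1.0])
```

A perfect simulation scores 1, predicting the observed mean scores 0, and anything worse than the mean is negative, which is how a value such as -0.16 in the benchmark should be read.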

2605.17787 2026-05-19 cs.LG 版本更新

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

重新审视LLM预训练中Adam与SGD的差距:大有效学习率的作用

Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Mingyi Hong

发表机构 * University of Minnesota(明尼苏达大学)

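AI Summary: Through empirical and theoretical analysis, this paper finds that SGD underperforms in LLM pre-training largely because it cannot sustain effective learning rates comparable to Adam's. The need for large effective learning rates stems from small gradient norms and large weight-to-gradient ratios and is more pronounced at large batch sizes. With simple clipping mechanisms, SGD at large learning rates recovers most of Adam's performance, and the validation loss gap shrinks from more than 50% to about 3.5%.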

详情
AI中文摘要

人们普遍认为随机梯度下降(SGD)在预训练大型语言模型(LLMs)时比自适应优化器如Adam表现更差。然而,这一差距的根源仍不清楚。本文认为,SGD无法维持与Adam相比更大的有效学习率是导致差异的主要原因。通过分析LLM预训练动态,我们发现训练过程中梯度范数较小且权重-梯度比较大,这一现象在预训练中常见的大批次大小下更加显著,需要较大的有效学习率。然而,我们发现输出层梯度幅度在不同token类别间差异显著,且训练过程中经常出现大梯度尖峰。这些因素严重限制了SGD的可接受学习率。基于这一理解,我们展示出简单的剪枝机制能够稳定SGD在大学习率下的表现,使其恢复大部分Adam的性能。在大规模实验中,使用1B参数的LLaMA模型和1M token批次大小预训练时,大学习率SGD与Adam的验证损失差距从超过50%降至仅约3.5%。

英文摘要

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
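The abstract does not say which clipping mechanisms are used, so the following is only a minimal sketch of one common choice, global-norm gradient clipping, applied before a plain SGD step. The function names and numbers are hypothetical; the point is that a gradient spike is rescaled to a bounded norm, so a large learning rate does not translate into an unbounded update.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(weights, grads, lr, max_norm):
    """One SGD update using the clipped gradient (an assumed clipping variant)."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [w - lr * g for w, g in zip(weights, clipped)]

# A gradient spike of norm 500 is reduced to norm 1 before the update.
spike = [300.0, 400.0]
clipped_spike = clip_by_global_norm(spike, max_norm=1.0)
new_weights = sgd_step([1.0, 1.0], spike, lr=0.5, max_norm=1.0)
small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Small gradients pass through unchanged, so clipping only intervenes on the spikes that would otherwise cap the admissible learning rate.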

2605.17778 2026-05-19 math.ST cs.LG stat.ME stat.ML stat.TH 版本更新

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

自蒸馏在带噪协方差模型中的谱收缩估计器中是最优的

Radu Lecoiu, Debarghya Mukherjee, Pragya Sur

发表机构 * Department of Statistics, Harvard University(哈佛大学统计学系) Department of Mathematics & Statistics, Boston University(波士顿大学数学与统计学系)

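AI Summary: This paper studies self-distillation in spiked covariance models, proves that $s$-step self-distillation is optimal among spectral shrinkage estimators, and demonstrates its advantages over well-known estimators in statistics and machine learning.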

Comments 103 pages, 8 figures

详情
AI中文摘要

自蒸馏已经 emerged 为提高现代机器学习系统模型性能的一种有前景的技术。我们通过引入并分析一个广泛的估计器类别,即谱收缩估计器,建立了自蒸馏在带噪协方差模型中的统计基础。我们证明了对于具有s个脊的带噪协方差矩阵,s步自蒸馏在谱收缩估计器中达到最优性能,优于统计和机器学习中已知的估计器。此外,我们还显示s步是必要的,任何(s-k)步蒸馏估计器对于1 ≤ k ≤ s都是严格次优的。对于等方差协方差的特殊子类,我们证明了最优调优的岭回归在谱收缩估计器中表现最佳。我们还研究了一种联邦方法,其中多个数据中心共享谱收缩估计器,并且一个共同的服务器试图聚合它们以实现最优性能。在这种情况下,我们发现最佳的本地规则再次采用自蒸馏的形式,尽管当数据集中在单一服务器上时,它与最优规则不同。总之,我们的结果阐明了自蒸馏如何提高预测性能,并提供了一个更广泛的统计框架,将自蒸馏与经典收缩方法联系起来。

英文摘要

Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.

2605.17765 2026-05-19 cs.LG 版本更新

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

AURORA:用于医疗基础模型中几何表示学习的上下文正交化

Yuanyun Zhang, Shi Li

发表机构 * University of the Chinese Academy of Sciences(中国科学院大学) Columbia University(哥伦比亚大学)

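AI Summary: This paper proposes AURORA, a framework that orthogonalizes representations according to contextual latent geometry to address the semantic opacity of latent representations in healthcare foundation models and their instability under contextual shift, improving robustness under institutional distribution shift and predictive performance.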

详情
AI中文摘要

近年来,医疗基础模型通过大规模自监督学习实现了强大的预测性能,但其潜在表示经常将生理严重程度、干预强度、观察结构和机构工作流程整合到共享嵌入方向中。尽管在下游预测中有效,这些表示在上下文变化下仍然语义模糊且不稳定。我们引入AURORA,即通过正交化关系对齐的适应性不确定性感知表示,这是一种基于上下文潜在几何的医疗表示学习新框架。与优化单一统一嵌入流形不同,AURORA将表示分解为对应于不同上下文因素的正交语义子空间,并在每个子空间内学习关系一致性目标。这诱导出既语义解耦又几何可解释的潜在空间。在多个临床预测和检索任务中,AURORA在重建、对比和自蒸馏基线方面表现一致优于,同时显著提高了上下文解耦、邻域纯度和机构分布变化下的鲁棒性。我们的结果表明,潜在几何本身是医疗基础模型设计的重要轴线,且根据上下文语义显式结构化表示空间为传统预测压缩目标提供了补充方向。

英文摘要

Recent healthcare foundation models have achieved strong predictive performance through large-scale self-supervised learning, yet their latent representations frequently entangle physiologic severity, intervention intensity, observational structure, and institutional workflow into shared embedding directions. While effective for downstream prediction, such representations remain semantically opaque and unstable under contextual shift. We introduce AURORA, Adaptive Uncertainty-aware Representations through Orthogonalized Relational Alignment, a new framework for healthcare representation learning based on contextual latent geometry. Rather than optimizing a single unified embedding manifold, AURORA decomposes representations into orthogonal semantic subspaces corresponding to distinct contextual factors and learns relational consistency objectives within each subspace. This induces latent spaces that are both semantically disentangled and geometrically interpretable. Across multiple clinical prediction and retrieval tasks, AURORA consistently outperforms reconstruction, contrastive, and self-distillation baselines while substantially improving contextual disentanglement, neighborhood purity, and robustness under institutional distribution shift. Our results suggest that latent geometry itself constitutes an important axis of healthcare foundation model design and that explicitly structuring representation space according to contextual semantics provides a complementary direction beyond conventional predictive compression objectives.

2605.17761 2026-05-19 cs.SI cs.LG 版本更新

MV-Gate: Insider Threat Detection via Multi-View Behavioral Statistics and Semantic Modeling

MV-Gate:通过多视图行为统计与语义建模进行内部威胁检测

Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng

发表机构 * College of Cyber Security, Jinan University(济南大学网络安全学院) School of Advanced Technology, Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学先进技术学院)

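AI Summary: This paper proposes MV-Gate, a framework that integrates behavioral statistical regularities with sequence semantics to detect progressive, low-visibility insider threats, improving the robustness of insider threat detection.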

Comments Accepted by The 29th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2026)

详情
AI中文摘要

内部威胁往往通过行为统计的早期异常(如复发模式的变化或短期与长期频率的转变)而非事件语义的变化来揭示。然而,随着领域从统计建模转向日志标记和深度序列编码,这些统计线索被削弱或丢失,导致当前模型对渐进性和低可见性内部行为不敏感。本文提出MV-Gate,一种多视图行为建模框架,明确整合统计规律与序列语义。MV-Gate构建了三个对齐的行为序列:活动标记、多尺度状态信号捕捉复发模式,以及频率偏差信号描述短期与长期强度差异。一个异常感知的门控机制将这些统计视图注入注意力计算,引导编码器强调统计不规则事件。在CERT r4.2、CERT r5.2和ADFA-LD上的实验表明,MV-Gate在经典、深度学习和领域特定基线模型上取得了显著提升,特别是在渐进性和弱信号威胁方面。这些结果强调了联合建模统计和序列证据对于鲁棒内部威胁检测的必要性。

英文摘要

Insider threats often reveal early anomalies through disruptions in behavioral statistics, such as altered recurrence patterns or short- versus long-term frequency shifts, rather than through changes in event semantics. Yet, as the field has shifted from statistical modeling to log tokenization and deep sequential encoders, these statistical cues are weakened or lost, leaving current models insensitive to gradual and low-visibility insider behaviors. We propose MV-Gate, a multi-view behavior modeling framework that explicitly integrates statistical regularities with sequence semantics. MV-Gate constructs three aligned behavioral sequences: activity tokens, multi-scale status signals capturing recurrence patterns, and frequency-deviation signals describing short- versus long-term intensity differences. An anomaly-aware gating mechanism injects these statistical views into the attention computation, guiding the encoder to emphasize statistically irregular events. Experiments on CERT r4.2, CERT r5.2, and ADFA-LD show that MV-Gate achieves notable gains over classical, deep-learning, and domain-specific baselines, particularly for progressive, weak-signal threats. These results highlight the necessity of jointly modeling statistical and sequential evidence for robust insider-threat detection.

2605.17758 2026-05-19 cs.LG 版本更新

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Memisis:协调和评估表格健康数据的合成数据

Nitish Nagesh, Mahdi Bagheri, Arshia Harish Puthran, Pengbao Zhou, Muhjaazee Love, Aadi Sharma, Ian Harris, Amir M. Rahmani

发表机构 * University of California Irvine(加州大学尔湾分校)

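AI Summary: This paper presents Memisis, a tool that orchestrates and evaluates synthetic data by combining existing synthetic data tools, large language models, and state-of-the-art evaluation metrics, to support high-quality data for downstream prediction tasks and clinical decision making.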

详情
AI中文摘要

合成数据在医疗领域被广泛用于创建与原始数据相似但不涉及隐私问题的数据集。在隐私、效用和公平性方面生成和评估合成数据对于促进高质量数据的可用性以支持下游预测任务和临床决策至关重要。我们提出了Memisis,一个工具,通过利用现有的合成数据工具、大语言模型的威力以及最先进的评估指标来协调和评估合成数据。我们的工具创建了一个统一的工作流用于数据生成、验证和评估。用户可以控制训练大小、训练周期以及合成行的数量。而不是通过调整合成数据的参数,交互式代理允许用户指定其合成数据生成目标,工具将通过利用现有工具并执行必要的评估来协调工作流。在演示中,我们使用了一个开源的 schizophrenia 数据集,其中包含与种族和性别相关的受保护属性,三种不同的合成器和一个本地语言模型来协调工作流。我们观察到 CTGAN、TVAE 和 GaussianCopula 在公平性和效用指标上表现相当。工作流允许用户在数据生成和评估过程中拥有灵活性和控制。

英文摘要

Synthetic data is widely used in healthcare to create datasets that are similar to the original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility, and fairness is crucial for making high-quality data available for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models, and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation, and evaluation. Users have control over the training size, training epochs, and the number of synthetic rows to sample. Instead of exposing knobs to tune synthetic data, the interactive agent lets users specify their synthetic data generation goals, and the tool orchestrates the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open-source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers, and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE, and GaussianCopula have comparable performance across fairness and utility metrics. The workflow gives users flexibility and control over the data generation and evaluation process.

2605.17757 2026-05-19 cs.LG cs.AI cs.DC cs.PF 版本更新

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR: 2位KV缓存量化中的离线频谱协方差感知旋转

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

发表机构 * Together AI University of Sydney(悉尼大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出OSCAR方法,通过离线估计注意力感知的协方差结构,实现2位KV缓存量化的高效和准确,同时开发了可部署的系统,提升了LLM服务框架的性能和效率。

Comments 35 pages, 10 figures

详情
AI中文摘要

INT2 KV-cache量化对于长上下文LLM服务具有吸引力,但实现准确性和可部署性仍然具有挑战。简单的旋转如Hadamard变换可以减少异常值,但仍然在INT2层面失效,因为它们与下游注意力不对齐。我们提出了OSCAR,一种超低比特KV缓存量化方法,通过离线估计注意力感知的协方差结构,并利用这些结构推导出固定旋转和截断阈值用于量化。这样,KV量化就与注意力实际消耗的协方差结构对齐。更重要的是,我们不仅提供了理论依据,还开发了一个完全可部署的OSCAR系统,包含一个定制的INT2注意力内核,该内核与分页KV缓存服务和融合内核流水线保持兼容,从而无缝集成到现代LLM服务框架中,如SGLang和vLLM。我们评估了我们的方法在最近的推理模型上,使用最多32k token的推理轨迹进行跨5个任务的测试。在Qwen3-4B-Thinking-2507和Qwen3-8B上,OSCAR将BF16精度差距分别减少到3.78和1.42个点,而朴素旋转INT2几乎归零。我们进一步将OSCAR扩展到Qwen3-32B和GLM-4.7(358B参数),其中它仍然与BF16保持有效相当。在长上下文-RULER-NIAH(最多128K)上,OSCAR在Qwen3模型上保持稳健,而朴素旋转INT2崩溃。从系统层面来看,OSCAR将KV缓存内存减少约8倍,在相同内存预算下,大批次大小下吞吐量提高最多7倍,并且由于内存带宽开销减少,单批次解码速度比BF16快最多3倍。

英文摘要

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an Ultra-low-bit KV Cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our methods on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long context - RULER-NIAH up to 128K, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
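The rotate-then-quantize idea in the abstract above can be illustrated with a minimal sketch. It uses the plain Hadamard rotation that the abstract names as a baseline, not OSCAR's covariance-derived rotation, and the vector size, the outlier value and the helper names (`hadamard`, `int2_roundtrip`) are assumptions made only for this illustration.

```python
import math
import random

def hadamard(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def int2_roundtrip(v):
    """Quantize to 4 levels (2 bits) over [-max|v|, max|v|], then dequantize."""
    clip = max(abs(x) for x in v)
    if clip == 0:
        return list(v)
    step = 2.0 * clip / 3.0
    out = []
    for x in v:
        code = min(3, max(0, round((x + clip) / step)))  # integer code 0..3
        out.append(code * step - clip)
    return out

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
x[5] = 20.0  # a single outlier channel (hypothetical value)

# Direct INT2: the outlier sets the scale, so small entries are badly rounded.
err_plain = mse(x, int2_roundtrip(x))

# Rotate, quantize, rotate back. The orthonormal Hadamard matrix is its own inverse.
x_back = hadamard(int2_roundtrip(hadamard(x)))
err_rot = mse(x, x_back)

print(err_plain, err_rot)
```

The rotation spreads the outlier's energy over all coordinates, which shrinks the quantization scale. The abstract's point is that this fixed rotation alone is still not enough at INT2 on real KV caches, which motivates rotations derived from attention-aware covariance estimates.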

2605.17749 2026-05-19 cs.LG stat.ML 版本更新

Testable and Actionable Calibration for Full Swap Regret

面向完全交换遗憾的可检验且可操作的校准

Konstantina Bairaktari, Lunjia Hu, Huy L. Nguyen, Jonathan Ullman

发表机构 * Department of Computer Science, Aarhus University(奥胡斯大学计算机科学系) Khoury College of Computer Sciences, Northeastern University(东北大学Khoury计算机科学学院) Northeastern University(东北大学)

AI总结 本文提出了一种新的校准度量标准SCDL,该度量标准在不削弱任何要求的前提下,既可操作又可检验,同时具备连续性和一致性等理想特性,并通过实验验证了其在实际中的优越性能。

详情
AI中文摘要

人工智能生成的预测越来越多地影响关键任务中的决策,因此必须值得信赖。校准是衡量可信度的一种广泛使用的标准,它要求预测与真实频率相符,从而可以被当作给定结果的真实概率来使用。然而,校准的定义十分微妙,如何设计良好的校准误差度量一直是近期研究的活跃课题。第一个目标是找到可操作的校准度量,即能够告知决策者在把预测当作真实概率时所遭受的效用损失,这种损失被称为交换遗憾(swap regret)。第二个目标是找到可检验的校准度量,即校准误差可以从少量预测和结果样本中估计出来。尽管这些都是非常基本的要求,但现有的校准度量没有一个能同时完全满足这两个性质:它们要么只能界定一种较弱形式的交换遗憾,从而放松了可操作性;要么估计误差并非最优,从而放松了可检验性。我们提出了一种新的校准度量,称为软分箱校准决策损失(SCDL),并证明它在不削弱任何一项要求的前提下完全可操作,且可检验性达到近乎最优的误差率。此外,SCDL还满足连续性和一致性等其他理想性质。我们还给出了一组实验,证实SCDL相对于其他度量的理论优势在实践中带来了更好的表现。

英文摘要

AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.

2605.17745 2026-05-19 stat.ML cs.LG 版本更新

StatQAT: Statistical Quantizer Optimization for Deep Networks

StatQAT: 深度网络的统计量化优化

Mehmet Aktukmak, Daniel Huang, Ke Ding

发表机构 * Intel(英特尔)

AI总结 本文提出了一种新的统计误差分析框架,用于均匀量化和浮点量化,从理论上刻画不同量化配置下的误差行为。基于此分析,作者提出了适用于任意数据分布的迭代量化器和针对类高斯权重分布的解析量化器,从而实现适用于激活和权重的高效、低误差量化。作者将这些量化器整合到量化感知训练中,并在整数和浮点格式上进行了评估,实验表明准确性和稳定性均有提升。

详情
AI中文摘要

量化对于降低深度神经网络的计算成本和内存占用至关重要,它使低精度硬件上的高效推理成为可能。尽管均匀量化和浮点量化方案的应用日益广泛,选择最优的量化参数仍是一个关键挑战,尤其是面对训练和推理过程中遇到的多样化数据分布。本文提出了一种针对均匀量化和浮点量化的新统计误差分析框架,为不同量化配置下的误差行为提供了理论洞察。基于此分析,我们提出了面向任意数据分布的迭代量化器,以及针对类高斯权重分布的解析量化器。这些方法实现了同时适用于激活和权重的高效、低误差量化。我们将这些量化器整合到量化感知训练中,并在整数和浮点格式上进行了评估。实验表明准确性和稳定性均有提升,凸显了该方法在训练低精度神经网络方面的有效性。

英文摘要

Quantization is essential for reducing the computational cost and memory usage of deep neural networks, enabling efficient inference on low-precision hardware. Despite the growing adoption of uniform and floating-point quantization schemes, selecting optimal quantization parameters remains a key challenge, particularly for diverse data distributions encountered during training and inference. This work presents a novel statistical error analysis framework for uniform and floating-point quantization, providing theoretical insight into error behavior across quantization configurations. Building on this analysis, we propose iterative quantizers designed for arbitrary data distributions and analytic quantizers tailored for Gaussian-like weight distributions. These methods enable efficient, low-error quantization suitable for both activations and weights. We incorporate our quantizers into quantization-aware training and evaluate them across integer and floating-point formats. Experiments demonstrate improved accuracy and stability, highlighting the effectiveness of our approach for training low-precision neural networks.
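Why the choice of quantization parameters matters can be seen in a small sketch: a symmetric uniform quantizer whose clipping threshold is picked by a grid search that minimizes mean squared error on the data. This is a generic illustration, not the iterative or analytic quantizers proposed in the paper; the function names, the candidate grid and the synthetic Gaussian weights are assumptions.

```python
import random

def uniform_quantize(xs, clip, bits):
    """Symmetric uniform quantizer: clip to [-clip, clip], round to 2**bits levels."""
    levels = 2 ** bits - 1            # number of steps between the two end points
    step = 2.0 * clip / levels
    out = []
    for x in xs:
        c = max(-clip, min(clip, x))
        code = round((c + clip) / step)   # integer code in [0, levels]
        out.append(code * step - clip)    # dequantized value
    return out

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def search_clip(xs, bits, num_candidates=50):
    """Grid search over clipping thresholds; returns (best_clip, best_mse)."""
    max_abs = max(abs(x) for x in xs)
    best = None
    for i in range(1, num_candidates + 1):
        clip = max_abs * i / num_candidates
        err = mse(xs, uniform_quantize(xs, clip, bits))
        if best is None or err < best[1]:
            best = (clip, err)
    return best

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(5000)]  # synthetic Gaussian-like weights

max_abs = max(abs(w) for w in weights)
err_naive = mse(weights, uniform_quantize(weights, max_abs, 4))  # clip at the max
clip4, err4 = search_clip(weights, bits=4)                       # clip chosen by MSE

print(max_abs, err_naive)
print(clip4, err4)
```

Clipping at the largest magnitude wastes levels on rare tail values, while a tighter threshold trades a little clipping error for a finer step. A statistical error model, as described in the abstract, would aim to locate that trade-off without an exhaustive search.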

2605.17733 2026-05-19 cs.AI cs.LG 版本更新

Divergence-Suppressing Couplings for Rectified Flow

修正流的散度抑制耦合

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出了修正流的散度抑制耦合:在生成耦合时衰减学习到的速度场中的有散分量,从而减轻轨迹的弯曲与缠绕,提升生成效果。

详情
AI中文摘要

修正流(Rectified Flow)的优势建立在生成轨迹笔直或近乎笔直的自生成耦合之上。在实践中,基础流模型生成的轨迹可能弯曲并相互缠绕,由此得到的耦合也会继承这种扭曲。本文指出,这种轨迹缠绕往往与学习到的速度场中散度非零的区域相关,在这些区域中局部的膨胀或收缩会扭曲轨迹,并使粒子偏离其理想终点。据此,我们提出了修正流的散度抑制耦合,这是一种离线校正,在生成耦合时衰减学习到的速度场中的有散分量。该校正对每个耦合对只需付出一次代价,并在训练过程中被摊销,因此部署时仍使用普通的欧拉法,墙钟时间成本与标准修正流完全相同。实验表明,这一离线修改在二维合成基准和图像生成任务上都带来了一致的改进。

英文摘要

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuates the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.

2605.17729 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

面向疫情韧性胸部X光分析的领域增量学习

Danu Kim

发表机构 * Danu Kim(丹努·金)

AI总结 本文提出了一种基于回放的领域增量持续学习方法,用于在跨领域变化中保持肺炎检测的鲁棒性和一致性,通过类感知平衡回放和类感知损失实现平衡的类表示和动态重加权,实验表明该方法在领域偏移的PneumoniaMNIST数据集上达到88.66%的平均准确率,优于经验回放、微调和联合训练基线。

Comments Published in Korea Software Congress (2025)

详情
AI中文摘要

深度学习模型在肺炎检测中实现了高准确性,但其在临床领域中的泛化能力受限于成像设备、获取协议和机构条件的差异。本研究引入了一种基于回放的领域增量持续学习方法,旨在使模型能够持续适应跨领域变化而不发生灾难性遗忘。所提出的方法结合了类感知平衡回放以在受限内存中保持平衡的类表示,以及类感知损失以在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域偏移PneumoniaMNIST数据集上进行的实验表明,所提出的方法实现了88.66%的平均准确率,优于经验回放、微调和联合训练基线。这些发现突显了所提出方法在跨临床环境变化中实现稳健和一致肺炎检测的有效性。

英文摘要

Deep learning models achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.

2605.17724 2026-05-19 q-fin.TR cs.LG q-fin.CP q-fin.ST 版本更新

Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ

日内期货数据中的序列结构:LSTM与梯度提升在MNQ上的比较

Mathias Mesfin

发表机构 * Independent Researcher(独立研究者)

AI总结 本文比较了梯度提升和长短期记忆(LSTM)架构在微型E-mini纳斯达克100期货(MNQ)日内方向预测中的表现,检验了在单一品种数据集的规模下,五分钟OHLCV K线序列是否包含可利用的序列预测结构。

Comments 18 pages, 4 figures. All results based on out-of-sample walk-forward validation and permutation testing. Data: MNQ futures (2021-2025)

详情
AI中文摘要

本文比较了梯度提升和长短期记忆(LSTM)架构在微型E-mini纳斯达克100期货(MNQ)日内方向预测中的表现。受近期针对金融K线数据的基础模型研究(包括Kronos架构)的启发,我们检验了在单一品种数据集的规模下,五分钟OHLCV K线序列是否包含可利用的序列预测结构。使用2021-2025年间944个交易日的数据,我们在严格的扩展窗口前向滚动验证下,跨三个样本外时段评估了四种模型配置。目标变量是当日收盘价是否比上午10:30的开盘价高出十个点以上。没有任何配置的样本外准确率在统计上显著高于51.8%的基准率。梯度提升各变体的合并样本外准确率在50.00%到50.89%之间,LSTM为50.59%。置换检验得到的p值分别为0.135(最佳梯度提升模型)和0.515(LSTM),表明不存在统计上显著的预测优势。各前向滚动折之间特征重要性的不稳定表明模型拟合的是噪声,而非稳定的结构性信号。结果表明,四年的单一品种五分钟OHLCV数据不足以支撑可靠的基于序列机器学习的日内预测。本文的主要贡献是记录了一个受Kronos启发的架构在受限真实数据集上的评估,为序列金融机器学习所需的数据规模提供了经验下限。

英文摘要

This paper compares gradient boosting and long short-term memory (LSTM) architectures for intraday directional prediction in Micro E-Mini Nasdaq 100 futures (MNQ). Motivated by recent foundation-model research on financial candlestick data, including the Kronos architecture, we test whether five-minute OHLCV bar sequences contain exploitable sequential predictive structure at the scale of a single instrument dataset. Using 944 trading days from 2021-2025, four model configurations are evaluated under strict expanding-window walk-forward validation across three out-of-sample periods. The target variable is whether the session close exceeds the 10:30 AM open by more than ten points. No configuration produces statistically significant out-of-sample accuracy above the 51.8% base rate. Combined OOS accuracies range from 50.00% to 50.89% across gradient boosting variants, while the LSTM achieves 50.59%. Permutation tests yield p-values of 0.135 for the best gradient boosting model and 0.515 for the LSTM, indicating no statistically significant predictive edge. Feature importance instability across walk-forward folds suggests noise fitting rather than stable structural signal capture. The results indicate that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting. The primary contribution is a documented evaluation of a Kronos-inspired architecture on a constrained real-world dataset, providing an empirical lower bound on data scale requirements for sequential financial ML.
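The permutation test mentioned in the abstract can be sketched as follows. This is a generic label-shuffling test on classification accuracy, assuming binary labels; the paper's exact procedure (for example, whether it permutes within walk-forward folds) is not stated in the abstract, and the function names and synthetic data are hypothetical.

```python
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def permutation_p_value(preds, labels, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels give accuracy >= the observed one."""
    rng = random.Random(seed)
    observed = accuracy(preds, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                  # breaks any link between preds and labels
        if accuracy(preds, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)           # add-one correction keeps p > 0

rng = random.Random(1)
labels = [i % 2 for i in range(200)]           # balanced synthetic up/down labels
rng.shuffle(labels)

noise_preds = [rng.randint(0, 1) for _ in labels]   # predictions unrelated to labels
perfect_preds = list(labels)                        # predictions identical to labels

p_noise = permutation_p_value(noise_preds, labels)
p_perfect = permutation_p_value(perfect_preds, labels)
print(p_noise, p_perfect)
```

A model with no real edge should produce a p-value that is not reliably small, which is how the reported values of 0.135 and 0.515 are read in the abstract.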

2605.17718 2026-05-19 stat.ML cs.LG 版本更新

How does feature learning reshape the function space?

特征学习如何重塑函数空间?

João Lobo, Bruno Loureiro, Long Tran-Than, Fanghui Liu

发表机构 * Department of Computer Science, University of Warwick, United Kingdom(沃里克大学计算机科学系,英国) Departement d’Informatique, École Normale Supérieure, PSL & CNRS(巴黎高等师范学院信息系,PSL与CNRS) School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China(上海交通大学数学科学学院,中国) Department of Computer Science, and Centre for Discrete Mathematics and its Applications (DIMAP), University of Warwick, United Kingdom(沃里克大学计算机科学系,以及离散数学及其应用中心(DIMAP),英国)

AI总结 本文研究了特征学习如何通过梯度下降训练改变两层神经网络的函数空间,揭示了特征学习在参数空间或输入空间中的分布变换作用,以及其对函数空间谱结构的影响。

Comments 59 pages, 1 figure

详情
AI中文摘要

特征学习被广泛认为是区分神经网络与固定核方法的关键机制,但它对所诱导的函数空间的影响仍缺乏深入理解。本文精确刻画了两层神经网络的特征所张成的函数空间在梯度下降训练中如何演化。我们证明,在高维比例区域(proportional regime)中,经过一次大的梯度步之后,更新后的特征分布可以由一个依赖于目标的尖峰(spiked)高斯协方差很好地近似。这诱导出一个数据自适应的核,它重塑函数空间并改变其谱结构。我们的分析表明,特征学习可以解释为参数空间或输入空间中的分布变换,或者等价地解释为引入了一个依赖于目标的核。具体而言,它会选择性地放大与目标方向对齐的特征值,并混合主导特征函数,将最高的径向模式与一个与目标对齐的二次调和函数耦合起来。总体而言,我们的结果为早期阶段的特征学习提供了精确的函数空间视角:梯度下降并非只是对固定核进行重新缩放,而是诱导出一种数据自适应的形变,优先增强与数据中信号方向对齐的方向。

英文摘要

Feature learning is widely regarded as the key mechanism distinguishing neural networks from fixed-kernel methods, yet its impact on the induced function space remains poorly understood. In this work, we precisely characterize how the function space spanned by the features of a two-layer neural network evolves during gradient descent training. We prove that, in the high-dimensional proportional regime, after a large gradient step the post-update feature distribution is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space and modifies its spectral structure. Our analysis reveals that feature learning can be interpreted as a distributional transformation in either parameter space or input space, equivalently as the introduction of a target-dependent kernel. In particular, it selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling the top radial mode with a target-aligned quadratic harmonic. Overall, our results provide a precise function-space perspective on early-stage feature learning: rather than just rescaling a fixed kernel, gradient descent induces a data-adaptive deformation that preferentially enhances directions aligned with the signal in the data.

2605.17705 2026-05-19 stat.ML cs.LG stat.ME 版本更新

Online Conformal Prediction for Non-Exchangeable Panel Data

非可交换面板数据的在线共形预测

Daohong Tu, Kay Giesecke

发表机构 * Department of Management Science and Engineering(管理科学与工程系)

AI总结 本文为非可交换面板数据提出了一个简单的在线共形预测框架。它利用在线面板预测的一个关键特征:当需要对某个单位做出预测时,相关单位的同期结果可能已被观测到,可以充当校准面板。该方法结合基于历史的相似性权重和自适应误覆盖水平来构造预测集,从而给出逐步覆盖界和长期覆盖保证。

Comments 34 pages, 5 figures

详情
AI中文摘要

面板数据是指对多个单位随时间反复观测得到的数据,广泛出现于科学和工程领域。在这类设定中量化预测不确定性颇具挑战,因为共形预测虽然不依赖分布假设且与模型无关,但其经典形式依赖可交换性假设,而该假设在存在时间依赖性和单位异质性时不再成立。我们为非可交换面板数据提出了一个简单的在线共形框架。该方法利用了在线面板预测的一个关键特征:当需要对某个单位做出预测时,相关单位的同期结果可能已经被观测到,可以充当校准面板。在每一轮中,预测集由当前已观测的校准单位连同两个自适应量共同构成:一是基于历史的相似性权重,用于突出与目标相似的校准单位;二是自适应的误覆盖水平,每当目标的反馈被揭示时就进行更新。这种双状态设计给出了逐步覆盖界和长期覆盖保证。在实验中,跨合成和真实面板数据集,该方法通过自适应地分配区间宽度(而非统一放宽区间)改善了覆盖最差的目标单位的覆盖率。这两个状态相互补充:当目标反馈稀疏时,相似性权重保护覆盖率;随着反馈不断积累,自适应水平进一步提高覆盖率。

英文摘要

Panel data, in which multiple units are repeatedly observed over time, arise throughout science and engineering. Quantifying predictive uncertainty in such settings is challenging because conformal prediction, while distribution-free and model-agnostic, classically relies on exchangeability assumptions that fail under temporal dependence and unit heterogeneity. We propose a simple online conformal framework for non-exchangeable panel data. The method exploits a key feature of online panel prediction: when a forecast is required for one unit, contemporaneous outcomes from related units may already be observed and can serve as a calibration panel. At each round, prediction sets are formed using currently observed calibration units together with two adaptive quantities: history-based similarity weights that emphasize calibration units resembling the target, and an adaptive miscoverage level that is updated whenever target feedback is revealed. This two-state design yields a stepwise coverage bound and a long-run coverage guarantee. Empirically, across synthetic and real panel data sets, the method improves coverage on the worst-covered target units through adaptive interval-width allocation rather than uniform inflation. The two states are complementary: similarity weights protect coverage when target feedback is sparse, while the adaptive level further improves coverage as feedback accumulates.
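The two adaptive quantities described in the abstract can be sketched in a few lines. The sketch assumes a weighted empirical quantile over the calibration panel and the standard adaptive-conformal update for the miscoverage level (raise it after a covered round, lower it after a miss); the paper's exact weighting and update rules may differ. All names, the step size `gamma` and the synthetic scores are hypothetical.

```python
import math
import random

def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalized weight reaches `level`."""
    if level >= 1.0:
        return math.inf        # level 1 or above: the prediction set is everything
    if level <= 0.0:
        return -math.inf       # level 0 or below: the prediction set is empty
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    cum = 0.0
    for s, w in pairs:
        cum += w / total
        if cum >= level:
            return s
    return pairs[-1][0]

def update_alpha(alpha_t, target_alpha, covered, gamma=0.05):
    """Adaptive miscoverage update: shrink alpha (widen sets) after a miss."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

rng = random.Random(0)
target_alpha, alpha_t = 0.1, 0.1
rounds, covered_count = 2000, 0
for _ in range(rounds):
    # Hypothetical calibration panel: scores of 30 related units observed this round.
    cal_scores = [abs(rng.gauss(0.0, 1.0)) for _ in range(30)]
    # Hypothetical similarity weights (larger means more similar to the target).
    cal_weights = [rng.uniform(0.5, 1.5) for _ in range(30)]
    # The target unit is noisier than the panel, so a fixed level would under-cover.
    target_score = abs(rng.gauss(0.0, 1.5))

    radius = weighted_quantile(cal_scores, cal_weights, 1.0 - alpha_t)
    covered = target_score <= radius
    covered_count += covered
    alpha_t = update_alpha(alpha_t, target_alpha, covered)

coverage = covered_count / rounds
print(coverage, alpha_t)
```

Even though the target is deliberately mismatched with its calibration panel, the feedback-driven level pulls long-run coverage toward the nominal 90%, which mirrors the role the abstract assigns to the adaptive miscoverage level.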

2605.17704 2026-05-19 cs.LG 版本更新

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

玩具组合可解释性模型揭示早期特征空间中的中奖彩票

Alon Bebchuk, Nir Shavit

发表机构 * Tel-Aviv University(特拉维夫大学) MIT and Red Hat AI(麻省理工学院和红帽AI)

AI总结 本文借助组合式玩具模型研究中奖彩票究竟保留了什么,发现权重空间中的中奖彩票对应于特征空间中在初始化时就已接近最终编码的位置,表明彩票结构由隐藏的特征空间几何而非权重空间子网络身份决定。

详情
AI中文摘要

彩票假说认为,密集网络中包含稀疏子网络,即“中奖彩票”,将其回退到初始权重并单独重新训练后,性能可与完整模型相当。我们提出一个更偏机理的问题:中奖彩票保留的究竟是什么内部对象?我们在一个组合式、子句结构的玩具设定中开展研究,该设定允许可解释的特征空间表示,且特征之间具有定义良好的组合距离。我们表明,权重空间中的中奖彩票对应于特征空间中的前驱位置,这些位置在初始化时就已接近最终的特征通道编码。密集SGD通过结构化选择来决定这些位置的去留:邻近的位置要么收敛到最终编码,要么被淘汰,且淘汰集中发生在更拥挤的神经元上,这暗示了叠加(superposition)下的竞争。因此,一张中奖彩票是一族相互兼容的编码位置,它们共同在接近最终编码与低特征间干扰之间取得平衡。稀疏重训练常常在另一行上重新表达同一子句/模板族,因此被保留的对象是族层面的,而非微观的行身份。我们用基于特征空间距离和运动的轻量级探针验证了这一解释;在我们的设定中,这些探针在准确率和精确编码恢复两方面常常优于已有的基于权重的彩票发现方法。尽管这些发现基于玩具设定,但它们表明彩票结构由隐藏的特征空间几何决定,而非由权重空间中的子网络身份决定。

英文摘要

The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.

2605.17698 2026-05-19 cs.LG cs.MA 版本更新

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

Agent Bazaar: 在多智能体市场中实现经济对齐

Seth Karten, Cameron Crow, Chi Jin

发表机构 * Princeton University(普林斯顿大学)

AI总结 该研究提出Agent Bazaar框架,用于评估多智能体系统的经济对齐能力,通过分析两种失败模式(算法不稳定和Sybil欺骗)发现模型难以自我调节,并提出经济对齐的训练方法和EAS评分标准。

Comments 17 pages, 9 figures

详情
AI中文摘要

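将大型语言模型(LLMs)部署为自主经济代理会带来系统性风险,这些风险超出了单个模型能力失效的范围。随着代理转向直接与市场交互,其集体行为可能放大波动,并大规模地掩盖欺骗。我们提出Agent Bazaar,一个用于评估经济对齐(Economic Alignment)的多代理模拟框架,经济对齐指代理系统维持市场稳定和诚信的能力。我们识别出两种失败模式:(1)B2C市场中的算法不稳定(“崩盘”),即企业不断放大价格波动直至市场崩溃;(2)C2C市场中的Sybil欺骗(“柠檬市场”),即单个欺骗性代理控制多个相互协同的卖家身份,用欺诈性商品信息充斥市场,侵蚀信任和消费者福利。我们在这两种场景下评估了前沿模型和开放权重模型,发现模型大多无法自我调节,且失败的严重程度因模型而异,而不取决于模型规模。我们提出了经济对齐的代理框架(harness),即稳定型企业(Stabilizing Firms)和怀疑型守护者(Skeptical Guardians),它们能改善结果,但在更严苛的市场条件下仍然脆弱。为弥补这一差距,我们使用REINFORCE++和自适应课程训练代理,得到一个9B模型,其表现优于所有被评估的前沿模型和开放权重模型。我们提出了经济对齐分数(EAS),这是一个由稳定性、诚信、福利和盈利能力四个分量聚合而成的标量指标,可用于直接的跨模型比较。我们的结果表明,经济对齐与通用能力正交,并且可以通过有针对性的强化学习直接训练。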

英文摘要

The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market ("The Lemon Market"), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open-weight models across both scenarios and find that models largely fail to self-regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open-weight models. We propose the Economic Alignment Score (EAS), a 4-component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross-model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.

2605.17693 2026-05-19 cs.LG cs.AI 版本更新

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调口袋感知扩散模型

Yuan Xue, Daniel Kudenko, Megha Khosla

发表机构 * L3S Research Center(L3S研究中心) Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出DEPPA方法,基于去噪扩散策略优化,通过强化学习微调预训练的口袋感知扩散模型,以优化结合亲和力、类药性、可合成性和多样性等多种性质。

详情
AI中文摘要

口袋感知的三维生成模型加速了基于结构的药物设计,但大多数方法主要拟合训练分布,可能无法满足真实治疗药物发现所要求的多种性质。近来,基于结构的分子优化(SBMO)受到越来越多的关注,其目标是对多个指定的分子性质进行细粒度控制。本文提出DEPPA,一种新的SBMO方法,它建立在去噪扩散策略优化(Denoising Diffusion Policy Optimization)之上,通过强化学习微调预训练的口袋感知扩散模型。DEPPA能够同时优化多种性质,包括结合亲和力、类药性、可合成性和多样性。我们将预训练口袋感知扩散模型的反向去噪过程表述为多步马尔可夫决策过程,其中作为奖励信号的目标性质在最终生成的配体分子上进行评估。DEPPA在强化学习微调期间采用粗粒度的去噪调度器,以实现高效且有效的分子优化。在CrossDocked2020基准上的实验结果表明,DEPPA在结合亲和力(Vina Score -8.5 kcal/mol)、类药性和多样性方面优于基线,同时在可合成性方面表现出有竞争力的性能。源代码见 https://github.com/xy9485/DePPA 。

英文摘要

Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://github.com/xy9485/DePPA .

2605.17692 2026-05-19 cs.LG math.OC 版本更新

Exact Convex Reformulations of Linear Neural Networks via Completely Positive Lifting

通过完全正提升实现线性神经网络的精确凸改写

Karthik Prakhya, Alp Yurtsever

AI总结 本文提出了一种将深度线性神经网络的训练问题精确地转化为凸优化问题的方法,利用完全正锥上的提升空间,将非凸性全部编码在锥约束中,并建立了其与协正规划(copositive programming)的联系。

详情
AI中文摘要

我们证明,在平方损失下深度线性神经网络的训练问题可以精确地在提升空间中的广义完全正锥上重新表述。该改写形式与原非凸问题具有相同最优值,并且在提升变量中是线性的,所有非凸性都编码在锥约束中。其提升空间的维度仅取决于输入和输出维度,与网络深度和数据点数量无关,瓶颈宽度仅通过标量约束进入。构造过程是通过将多层参数化减少为双线性因子分解,将其提升为秩约束的半正定规划,通过互补性条件表达秩约束,并应用完全正提升。尽管一般情况下该形式在计算上不可行,但它给出了由线性因子分解引起的非凸性的精确锥表示,并将线性神经网络训练与半正定规划联系起来。

英文摘要

We show that the training problem of a deep linear neural network under the squared loss admits an exact convex reformulation in a lifted space over a generalized completely positive cone. The reformulation has the same optimal value as the original nonconvex problem and is linear in the lifted variables, with all nonconvexity encoded in the cone constraint. Its ambient lifted dimension depends only on the input and output dimensions, independent of the network depth and the number of data points, and the bottleneck width enters only through scalar constraints. The construction proceeds by reducing the multilayer parameterization to a bilinear factorization, lifting it to a rank-constrained semidefinite program, expressing the rank constraint via a complementarity condition, and applying a completely positive lifting. While the resulting formulation is computationally intractable in general, it gives an exact conic representation of the nonconvexity induced by linear factorization and connects linear neural network training with copositive programming.

2605.17678 2026-05-19 stat.ML cs.LG 版本更新

On Gaussian approximation for entropy-regularized Q-learning with function approximation

关于熵正则化Q学习与函数逼近的高斯近似

Artemy Rubtsov, Rahul Singh, Eric Moulines, Alexey Naumov, Sergey Samsonov

发表机构 * HSE University(俄罗斯高等经济大学) Mohamed Bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学) EPITA(EPITA研究所) Steklov Mathematical Institute of Russian Academy of Sciences(俄罗斯科学院斯捷克洛夫数学研究所)

AI总结 本文研究了熵正则化异步Q学习在高维中心极限定理下的收敛速率,通过线性函数逼近和多项式步长,建立了在凸距离下的高斯近似界,并推导了算法最后迭代的高阶矩界。

详情
AI中文摘要

本文针对带线性函数逼近、步长为多项式 $k^{-\omega}$($\omega \in (1/2,1)$)的熵正则化异步Q学习所生成的Polyak-Ruppert平均迭代,推导了高维中心极限定理中的收敛速率。假设观测到的三元组序列$(s_k,a_k,s_{k+1})_{k \geq 0}$构成一致几何遍历的马尔可夫链,并在投影软贝尔曼方程满足适当正则性条件的前提下,我们建立了凸距离下的高斯近似界,其速率为$n^{-1/4}$阶(至多相差关于$n$的多重对数因子),其中$n$是算法使用的样本数。为得到这一结果,我们将软贝尔曼递归的线性化与主导鞅项的高斯近似相结合。最后,我们推导了算法最后一次迭代的高阶矩界,这一结果可能具有独立的意义。

英文摘要

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak--Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2,1)$. Assuming that the sequence of observed triples $(s_k,a_k,s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, and under suitable regularity conditions for the projected soft Bellman equation, we establish a Gaussian approximation bound in the convex distance with rate of order $n^{-1/4}$, up to polylogarithmic factors in $n$, where $n$ is the number of samples used by the algorithm. To obtain this result, we combine a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term. Finally, we derive high-order moment bounds for the algorithm's last iterate, which might be of independent interest.
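The combination of a polynomial stepsize and Polyak-Ruppert averaging can be illustrated on a toy problem. The sketch below runs a scalar stochastic approximation recursion, not Q-learning, with stepsize k^(-omega) for omega in (1/2, 1), and keeps a running average of the iterates. The target equation, the noise model and the function name are assumptions chosen only for illustration.

```python
import random

def averaged_sa(n, omega=0.75, seed=0):
    """Toy stochastic approximation for the root of 1 - theta = 0.

    Returns (last_iterate, polyak_ruppert_average) after n noisy steps.
    """
    rng = random.Random(seed)
    theta, avg = 0.0, 0.0
    for k in range(1, n + 1):
        step = k ** (-omega)                     # polynomial stepsize k^{-omega}
        noise = rng.gauss(0.0, 1.0)              # i.i.d. noise (the paper uses Markov data)
        theta += step * (1.0 - theta + noise)    # noisy step toward the root theta* = 1
        avg += (theta - avg) / k                 # running mean of theta_1..theta_k
    return theta, avg

last, avg = averaged_sa(20000)
print(last, avg)
```

Averaging smooths out the fluctuations of the last iterate, and the distribution of the rescaled averaged error is the object that the Gaussian approximation bound in the abstract controls.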

2605.17671 2026-05-19 cs.LG cs.AI 版本更新

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

PEIRA: 通过视图回归对齐学习预测编码器

Michael Arbel, Basile Terver, Jean Ponce

发表机构 * Univ. Grenoble Alpes, Inria CNRS, Grenoble INP, LJK(格勒诺布尔大学、法国国家信息与自动化研究所、格勒诺布尔INP、LJK实验室) Ecole Normale Supérieure / PSL Inria Paris(巴黎高等师范学院/PSL 国家科学研究中心、法国国家信息与自动化研究所巴黎分部) New York University(纽约大学)

AI总结 本文提出PEIRA方法,通过显式目标函数和线性回归器对齐来实现非对比自监督学习,通过理论分析和实验验证其在ImageNet-1K和CIFAR-10上的有效性。

详情
AI中文摘要

非对比自监督学习(SSL)是预测表示学习的有效框架,但像SimSiam、BYOL、I-JEPA或DINO等流行方法依赖于自蒸馏来训练教师-学生网络,但通常不最小化明确的目标函数。我们分析了联合嵌入预测架构(JEPA)的一个变种,使用正则化的线性回归器来预测数据两个视图之间的学习表示,并完全表征其稳定性:非坍塌的稳定平衡点对齐于主导的非线性典型相关子空间,而坍塌的平衡点也可能是稳定的吸引子。受此结果启发,我们引入PEIRA,一种非对比SSL方法,其目标函数通过最优线性回归器的迹定义。我们证明其唯一稳定的平衡点是非平凡的全局最小值,并恢复相同的典型相关子空间,正则化选择有效维度。在ImageNet-1K和CIFAR-10上的实验表明,PEIRA与VICReg和LeJEPA基线具有竞争力,定性实验结果支持理论。

英文摘要

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

2605.17660 2026-05-19 math.OC cs.AI cs.LG stat.ML 版本更新

Training Infinitely Deep and Wide Transformers

训练无限深且宽的Transformer

Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré

发表机构 * Bocconi University(博科尼大学) Doshisha University, RIKEN AIP(同志社大学、RIKEN AIP) Rice University(莱斯大学) CNRS, ENS, PSL Université(法国国家科学研究中心、巴黎高等师范学院、巴黎文理研究大学)

AI总结 本文提出了一个严格的数学框架,用于分析平均场区域(mean-field regime)中Transformer基于梯度的训练动态。通过研究无限深且无限宽的Transformer的平均场模型,本文给出了训练风险的条件Wasserstein梯度的显式公式,并证明在NTK单射性假设下,当初始损失足够小时梯度流收敛到全局极小值。

详情
AI中文摘要

Transformer已成为现代机器学习中占主导地位的架构,但对其训练动态的理论理解仍然有限。本文建立了一个严格的数学框架,用于分析平均场区域(mean-field regime)中Transformer基于梯度的训练,其中深度(层数)和宽度(注意力头数)都趋于无穷。ResNet的训练可以理解为对神经ODE的控制,而Transformer的训练则对应于对神经PDE的控制,这是因为注意力机制耦合了多个token分布。我们的平均场模型包含两类测度表示:随层演化的token分布,以及每一层的注意力参数。我们建立了无限深Transformer前向传播的适定性,并用满足函数空间中ODE的流映射来刻画token的演化。利用伴随灵敏度分析,我们推导出训练风险的条件Wasserstein梯度的显式公式,其中涉及由反向ODE控制的伴随变量。我们证明了条件Wasserstein度量空间中梯度流曲线的存在性和唯一性,为基于梯度的Transformer训练奠定了严格基础。一个关键的技术贡献是给出了注意力机制的神经切线核(NTK)单射性的充分必要条件:我们证明NTK单射性等价于log-sum-exp函数在模去仿射函数意义下的线性无关性,多种token分布都满足这一条件,包括离散分布、均匀分布和高斯混合分布。在NTK单射性假设下,我们证明当初始损失足够小时,梯度流收敛到全局极小值,从而排除了优化景观中的虚假局部极小值。

英文摘要

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.

2605.17658 2026-05-19 cs.LG 版本更新

When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering

当零样本模型作弊时:通过激活引导改进年龄估计

Erik Imgrund, Pia Hanfeld, Klim Kireev, Konrad Rieck

发表机构 * BIFOLD & TU Berlin(BIFOLD与柏林工业大学)

AI总结 本文研究了基于视觉语言模型的零样本年龄估计中出现的“身份捷径”现象,并提出一种激活引导方法来提高年龄估计的准确性,将平均绝对误差最多降低25%。

详情
AI中文摘要

人们已提出多种与年龄相关的法规,以保护未成年人免受网络有害内容和互动的影响。自动年龄估计是执行此类法规的核心,而视觉语言模型(VLMs)在该任务上达到了最先进的性能。然而,我们发现基于VLM的年龄估计的零样本特性会产生一个意外的副作用,我们称之为“身份捷径”:VLM往往不是根据视觉特征估计年龄,而是识别出图中人物,并根据记忆中的知识推断其年龄。当非名人被误认为名人时,这一现象会导致严重错误的预测。它还使模型在名人图像上对噪声和对抗扰动表现出具有欺骗性的高鲁棒性,而名人图像在流行基准中占主导地位。为缓解这一问题,我们提出了一种激活引导方法,通过干预VLM的隐藏状态来抑制该捷径。该方法提高了对已记忆身份和未见身份的年龄估计准确性,在多个流行基准上将平均绝对误差最多降低25%。

英文摘要

Different age-related regulations have been proposed to protect minors from harmful content and interactions online. Automated age estimation is central to enforcing such regulations, and vision-language models (VLMs) achieve state-of-the-art performance on this task. However, we find that the zero-shot nature of VLM-based age estimation produces an unexpected side effect we call the identity shortcut: Instead of estimating age from visual features, VLMs tend to identify the depicted person and infer their age from memorized knowledge. This phenomenon leads to substantially incorrect predictions when non-celebrities are misidentified as celebrities. It also produces deceptively high robustness to noise and adversarial perturbations on celebrity images, which dominate popular benchmarks. To mitigate this, we propose an activation steering method that suppresses the shortcut by intervening on the hidden states of the VLM. This method improves age estimation accuracy for both memorized and unseen identities, reducing mean absolute error by up to 25% across popular benchmarks.
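A minimal sketch of one common form of activation steering, in which a hidden state is projected away from a single direction. The abstract does not say how the intervention is computed, so the function name `suppress_direction`, the direction `shortcut_dir`, and the strength `alpha` are illustrative assumptions, not the authors' method.

```python
import math

def suppress_direction(hidden, shortcut_dir, alpha=1.0):
    """Remove alpha times the component of `hidden` along `shortcut_dir`."""
    norm = math.sqrt(sum(v * v for v in shortcut_dir))
    unit = [v / norm for v in shortcut_dir]
    coeff = sum(h * u for h, u in zip(hidden, unit))  # signed projection length
    return [h - alpha * coeff * u for h, u in zip(hidden, unit)]

# Toy hidden state and a hypothetical "identity shortcut" direction.
hidden = [3.0, 4.0, 1.0]
steered = suppress_direction(hidden, [0.0, 2.0, 0.0])
```

With `alpha=1.0` the steered vector has no component left along the chosen direction; smaller values only attenuate it.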

2605.17653 2026-05-19 cs.LG cs.AI 版本更新

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

LLMForge: 多后端硬件感知的神经架构搜索与无限头注意力用于边缘语言模型

Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

发表机构 * Brown University(布朗大学) University of Michigan(密歇根大学) Google Research(谷歌研究)

AI总结 本文提出LLMForge,一种多后端硬件感知的神经架构搜索框架,通过无限头注意力扩展了每层注意力配置空间,并结合Forge-Former和Forge-DSE实现了高效的边缘语言模型架构搜索,最终在不同硬件子系统上获得了不同形状的架构,展示了在不同性能指标上的优化效果。

详情
AI中文摘要

参数量在十亿以下的Transformer语言模型正越来越多地部署在边缘设备上。在这些设备上,端侧推理在隐私、延迟和运行成本方面的优势受到紧张的内存带宽、能耗和散热预算的制约,这使得架构选择和特定加速器的成本成为高效推理的核心。我们提出了LLMForge,一个硬件感知的神经架构搜索(NAS)框架,其三项可组合的贡献共同使边缘语言模型的架构搜索以硬件为条件,因为不同的硬件平台会带来不同的硬件成本瓶颈。无限头注意力(IHA)将查询头数、KV组数以及每个头的查询/键维度和值维度解耦,在我们的搜索空间范围内,将可行的每层注意力配置空间相对于分组查询注意力扩大了约400倍。Forge-Former是一种基于编码器的代理模型,用于对候选架构进行排序,其表现优于MLP和随机森林基线。Forge-DSE是一个基于NSGA-II的设计空间探索引擎,它将Forge-Former与覆盖GPU、脉动阵列加速器和环形数据流边缘加速器的多后端硬件成本模型相结合。在四种不同的硬件平台上,搜索收敛到明显不同的架构,其形状与各平台的成本瓶颈相对应。在多芯片环形平台上,我们的联合搜索返回了三个位于帕累托前沿上的3亿参数规模的部署感知变体。每个变体都在FineWeb-Edu-10BT上按相同的训练配方重新训练,并与SmolLM2-360M和Qwen-0.5B架构基线进行比较。精度优先的变体取得了最低的验证损失2.798,并以更少的参数获得了有竞争力的基准性能;能耗优化的变体将每token能耗降低了40%;延迟优化的变体将TTFT和TPOT降低了43%。

英文摘要

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.

2605.17651 2026-05-19 cs.LG 版本更新

Counterfactual Explanations Under Concept Drift

概念漂移下的反事实解释

Marcin Kostrzewa, Jerzy Stefanowski, Maciej Zięba

发表机构 * Wrocław University of Science and Technology(弗罗茨瓦夫理工大学) Poznań University of Technology(波兹南理工大学)

AI总结 本文研究了在数据不断变化的环境中,如何维护反事实解释的有效性,提出了一种轻量级的更新方案以修复现有解释,保持其与原始实例的接近性。

详情
AI中文摘要

反事实解释(CFEs)提供可操作的补救建议,但大多数方法假设静态设定,即数据固定、分类器已训练完成。这一假设在不断演变的数据环境(如数据流)中不再成立,在这类环境中,在线模型会在概念漂移下反复更新。我们指出,在该设定下CFE的维护是一个此前被忽视的问题:生成时有效的解释可能随着模型的演变悄然失效,鲁棒CFE也不例外,因为它们并非为持续漂移而设计。我们提出了一种轻量级、模型无关的更新方案,利用局部采样估计有效性方向和合理性方向来修复已有的CFE,同时保持其与原始实例的接近性。在合成漂移数据流上的实验表明,最初生成的CFE会迅速失去有效性,而经过维护的CFE能够保持有效性和局部合理性,且成本低于反复重新生成。

英文摘要

Counterfactual explanations (CFEs) provide actionable recourse, but most methods assume a static framework with fixed data and a trained classifier. This assumption breaks in evolving data environments, such as data streams, where online models are repeatedly updated under concept drift. We identify CFE maintenance in this setting as a previously overlooked problem: explanations that are valid when generated may silently become invalid as the model evolves, including robust CFEs, which are not designed for continuous drift. We propose a lightweight, model-agnostic update scheme that repairs existing CFEs using local sampling to estimate validity and plausibility directions while preserving proximity to the original instance. Experiments on synthetic drifting streams show that initially created CFEs rapidly lose validity, whereas maintained CFEs preserve validity and local plausibility at a lower cost than repeated regeneration.
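A simplified sketch of repairing a counterfactual after the model has drifted. It keeps only the local-sampling and proximity ideas from the abstract: it samples around the stale CFE and returns the valid sample closest to the original instance. The plausibility term is omitted, and `repair_cfe`, `radius`, and the toy threshold models are hypothetical, not the paper's update scheme.

```python
import random

def repair_cfe(cfe, original, predict, target, radius=0.5, n_samples=200, seed=0):
    """Return `cfe` if still valid, else the sampled valid point closest to `original`."""
    if predict(cfe) == target:
        return cfe
    rng = random.Random(seed)
    best, best_dist = None, float("inf")
    for _ in range(n_samples):
        cand = [c + rng.uniform(-radius, radius) for c in cfe]
        if predict(cand) != target:
            continue                      # candidate does not flip the prediction
        dist = sum((a - b) ** 2 for a, b in zip(cand, original)) ** 0.5
        if dist < best_dist:
            best, best_dist = cand, dist
    return best  # None if no valid candidate was sampled

# Toy drift: the decision threshold on x[0] moves from 1.0 to 1.2.
old_model = lambda x: int(x[0] > 1.0)
new_model = lambda x: int(x[0] > 1.2)
original = [0.5, 0.0]
cfe = [1.1, 0.0]                          # valid for old_model, invalid after the drift
repaired = repair_cfe(cfe, original, new_model, target=1)
```

Because the search starts from the existing CFE rather than from scratch, it only needs a small local sample, which is the cost argument the abstract makes against repeated regeneration.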

2605.17642 2026-05-19 cs.LG 版本更新

TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates

TabKDE: 通过核密度估计实现简单且可扩展的表格数据生成

Meysam Alishahi, Yan Zheng, Junpeng Wang, Chin-Chia Michael Yeh, Jeff M. Phillips

发表机构 * University of Utah(犹他大学) Visa Research(Visa研究)

AI总结 本文提出了一种基于核密度估计的表格数据生成方法,能够在无需大量训练时间的情况下实现与现有方法相当的准确性和防泄漏性能,并且能够高效处理大规模数据集。

详情
AI中文摘要

表格数据生成考虑的是一个包含多个列的大型表格,每个列包含数值、类别或有时顺序值。目标是生成新的行以复制原始数据行的分布,而不仅仅是复制初始行。过去四年中,这个问题取得了巨大的进展,主要使用计算成本高昂的方法,如one-hot编码、VAE和扩散模型。本文描述了一种新的表格数据生成方法。通过使用copula变换并将分布建模为核密度估计,我们几乎可以达到先前方法在准确性和防泄漏方面的性能,但训练时间几乎可以忽略不计。我们的方法非常可扩展,并且可以在简单的笔记本电脑上处理比现有最先进方法大数个数量级的数据集。此外,由于我们使用核密度估计,我们可以将模型存储为原始数据的coreset -- 我们认为这是生成建模中的首次尝试 -- 并因此需要显著较少的空间。我们的代码可在https://github.com/tabkde/tabkde-main获取。

英文摘要

Tabular data generation considers a large table with multiple columns, each comprising numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data without simply copying those initial rows. The last four years have seen enormous progress on this problem, mostly using computationally expensive methods that employ one-hot encoding, VAEs, and diffusion. This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate, we can nearly match the accuracy and leakage-avoidance achievements of the previous methods, but with almost no training time. Our method is very scalable and can be run on a simple laptop on data sets orders of magnitude larger than those handled by the prior state of the art. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data (we believe the first for generative modeling) and, as a result, require significantly less space as well. Our code is available at https://github.com/tabkde/tabkde-main
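A rough sketch of the general idea of sampling from a kernel density estimate in copula space, for numeric columns only. It is not the TabKDE implementation: the rank-based normal-score transform, the fixed `bandwidth`, and all function names are assumptions chosen to keep the example small.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def to_normal_scores(column):
    """Rank-based copula transform: value -> standard normal score."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def from_normal_score(z, sorted_column):
    """Inverse transform: normal score -> empirical quantile of the original column."""
    u = STD_NORMAL.cdf(z)
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]

def sample_rows(table, n_new, bandwidth=0.1, seed=0):
    """Draw rows from a Gaussian KDE fitted in copula space."""
    rng = random.Random(seed)
    columns = list(zip(*table))
    scores = [to_normal_scores(col) for col in columns]
    sorted_cols = [sorted(col) for col in columns]
    new_rows = []
    for _ in range(n_new):
        i = rng.randrange(len(table))     # pick a kernel centre (an existing row)
        row = [from_normal_score(s[i] + rng.gauss(0.0, bandwidth), sc)
               for s, sc in zip(scores, sorted_cols)]
        new_rows.append(row)
    return new_rows

table = [[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 50.0], [5.0, 80.0]]
synthetic = sample_rows(table, n_new=4)
```

There is no training loop: "fitting" amounts to storing the transformed rows, which is why a KDE-based generator can be nearly free to train. In this simplified version each generated value is an observed value of its column, recombined across rows by the kernel noise.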

2605.17626 2026-05-19 cs.LG cs.SE 版本更新

Verifier-Guided Code Translation via Meta-Step Decoding

通过元步骤解码实现验证器引导的代码翻译

Tianyang Zhou, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Google(谷歌)

AI总结 本研究提出了一种元步骤解码框架DTV,通过在生成过程中整合验证器调用,提高了代码翻译的通过率,同时减少了token的使用。

Comments 31 pages, 8 figures

详情
AI中文摘要

测试时间缩放是提高大语言模型的重要机制,特别是在具有确定性验证器的任务中。代码翻译是典型例子:源程序约束有效输出,而编译器、类型检查器和行为检查提供精确的通过/失败反馈。现有方法通常在生成后才应用这些验证器,这效率低下,因为早期错误会破坏自回归上下文且很少被后续纠正。我们引入解码时间验证(DTV),一种框架将结构边界视为元步骤,用于引导解码。DTV在状态机控制器下交替生成与验证器调用,强制有效前缀,利用结构边界检查和结构感知回滚,防止错误传播并减少浪费的token。我们在C到Rust和JavaScript到TypeScript翻译上评估DTV。使用Qwen3-4B作为主要生成器,在匹配的token预算下,DTV将C到Rust的通过率从72.3%提升到82.0%,JavaScript到TypeScript的通过率从33.3%提升到46.0%,同时每案例使用更少的token;相同趋势在Gemma-4-E4B上也有所体现。在评估的匹配成本网格中,DTV在通过率与成本的权衡上优于事后验证或基于采样的缩放。这些结果表明,验证器引导的解码是代码翻译中有效利用推理时间计算的方法。

英文摘要

Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type checkers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.
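The interleaving of generation and verification described above can be illustrated with a toy control loop. This is only a sketch of the general pattern (commit a unit when the prefix verifies, discard it otherwise); `decode_with_verifier`, `propose`, and `verify_prefix` are hypothetical stand-ins, not DTV's controller, generator, or verifiers.

```python
def decode_with_verifier(propose, verify_prefix, n_units, max_retries=3):
    """Build an output unit by unit, keeping only prefixes the verifier accepts."""
    accepted = []
    verifier_calls = 0
    for position in range(n_units):
        for attempt in range(max_retries):
            candidate = propose(accepted, position, attempt)
            verifier_calls += 1
            if verify_prefix(accepted + [candidate]):
                accepted.append(candidate)   # commit at the structural boundary
                break
            # otherwise roll back: the rejected unit never enters the context
        else:
            raise RuntimeError(f"no valid candidate for unit {position}")
    return accepted, verifier_calls

# Toy stand-ins: the "generator" gets unit 1 wrong on its first attempt,
# and the "verifier" only accepts units that end with a semicolon.
def propose(prefix, position, attempt):
    if position == 1 and attempt == 0:
        return "let y = x +"
    return ["let x = 1;", "let y = x + 1;", "return y;"][position]

def verify_prefix(units):
    return all(u.endswith(";") for u in units)

units, calls = decode_with_verifier(propose, verify_prefix, n_units=3)
```

The point of checking at each boundary is that the bad second unit is discarded before any later unit is generated on top of it, so only one unit's worth of tokens is wasted.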

2605.17624 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习进行部分标注数据集上的多任务学习

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Univrses AB

AI总结 本文研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力,通过FixMatch方法和其等变扩展Dense FixMatch进行评估,在城市景观和BDD100K数据集上针对常见的目标检测和语义分割任务进行测试,发现不变和等变半监督学习在大多数情况下优于监督基线,特别是在标注样本较少时效果更佳。

Comments https://github.com/miquelmarti/DenseFixMatch

详情
AI中文摘要

我们研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力。具体而言,我们使用流行的FixMatch方法进行不变半监督学习,并采用其等变扩展Dense FixMatch。我们在Cityscapes和BDD100K数据集上评估了它们在计算机视觉中普遍的目标检测和语义分割任务中的性能。我们考虑了每个任务标注子集的不同大小以及它们之间的不同重叠情况。我们的结果表明,对于不变和等变半监督学习,大多数情况下都优于监督基线,特别是在任务中可用标注样本较少时,改进最为显著,且后者方法通常表现更好。我们的研究表明,不变/等变学习是有限标注数据下多任务学习的一个有前途的方向。

英文摘要

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
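For readers unfamiliar with FixMatch, the sketch below shows its unlabeled-data loss for image-level classification: a confident prediction on a weakly augmented view becomes a pseudo-label for the strongly augmented view. The probabilities are made-up inputs, the function name is illustrative, and the dense (per-pixel, equivariant) variant studied in the paper is not covered here.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy on strong views against confident weak-view pseudo-labels."""
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence < threshold:
            continue                      # masked out: pseudo-label not trusted
        pseudo_label = p_weak.index(confidence)
        total += -math.log(p_strong[pseudo_label])
    # Average over the whole batch; masked samples contribute zero.
    return total / len(weak_probs)

weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]     # class probabilities, weak views
strong = [[0.80, 0.10, 0.10], [0.20, 0.50, 0.30]]   # class probabilities, strong views
loss = fixmatch_unlabeled_loss(weak, strong)
```

Only the first sample passes the confidence threshold, so the second contributes nothing to the loss. In a partially labeled multi-task setting, a term of this kind would be applied per task on the images that lack labels for that task.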

2605.17620 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA:一种用于血管生成和动脉瘤编辑的模块化工具包

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

发表机构 * Visual Computing and Artificial Intelligence, Kiel University, Germany(视觉计算与人工智能研究所,基尔大学,德国) Institute for Medical Informatics and Statistics, Kiel University, Germany(医学信息学与统计研究所,基尔大学,德国) Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany(神经放射科,马格德堡大学医学学院,德国) Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany(放射学与神经放射学部门,石勒苏益格-荷尔斯泰因大学医院,德国) Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland(数学与计算机科学学院,亚当·密茨凯维奇大学,波兰)

AI总结 本文提出SynVA,一种模块化工具包,用于生成血管网格和在解剖学上一致的动脉瘤合成,通过结合新的流匹配方法和基于学习的方法,生成真实血管几何和解剖学合理的动脉瘤,同时提供大规模标注数据集以提升医疗影像分析能力。

详情
AI中文摘要

颅内动脉瘤(IAs)以不可预测的生长和破裂风险为特征,是导致中风的主要原因,可能引发致命性出血,具有高死亡率和长期残疾。随着人口老龄化,脑血管疾病的发病率和整体负担预计会增加,凸显了需要可扩展的方法来分析复杂的医疗数据并提高对这些疾病的群体层面理解的必要性。尽管数字孪生和深度学习为提高诊断、预后和治疗提供了有希望的途径,但其效果受到大规模高质量医疗数据和相应标签稀缺的限制。我们提出了SynVA,一种用于血管网格生成和解剖学一致动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新型方法生成健康血管网格与基于学习的方法生成解剖条件下的动脉瘤网格——动脉瘤是从已有的血管几何结构计算而来的,而不是孤立生成。此外,我们引入了基于生理学原理和统计先验的SynVA过程模型,用于血管和动脉瘤合成,从而能够生成大规模数据集(例如用于训练基于网格的生成模型)。为此,我们发布了包含50,000个完全标注网格样本的数据集,用于各种下游视觉任务,如语义分割。广泛的定量和定性评估证明了SynVA能够生成逼真的血管几何和解剖学合理的动脉瘤。具体而言,我们的实验表明,某些方法生成的动脉瘤形状更符合专家人类感知,而其他方法在定量相似性度量上与真实动脉瘤的重建表现更优。

英文摘要

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17613 2026-05-19 cs.AR cs.LG 版本更新

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

VeriCache: 将损失性KV缓存转换为无损LLM推理

Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang

发表机构 * University of Chicago(芝加哥大学) Tensormesh Inc.(Tensormesh公司) Samsung Semiconductor(三星半导体) Microsoft Research(微软研究院)

AI总结 本文提出VeriCache框架,通过利用压缩的KV缓存生成token并验证其与完整KV缓存的一致性,实现了与完整KV缓存解码相同输出的同时保持高解码吞吐量。

详情
AI中文摘要

随着上下文长度的增加,KV缓存的庞大体积已成为LLM服务的主要瓶颈。为此,人们提出了许多KV缓存压缩方法,如token丢弃和量化。然而,几乎所有这些方法本质上都是有损的:尽管在短输出上精度下降很小,但随着解码的token增多,其输出与完整KV缓存的输出偏离得越来越大,从而在代码生成和工具调用中导致灾难性失败。我们提出了VeriCache,这是首个既能保证输出与完整KV缓存解码相同、又能在很大程度上保留多种KV缓存压缩算法高解码吞吐量的推理框架。VeriCache使用压缩的KV缓存起草token,然后用完整的KV缓存对其进行验证。虽然这看起来只是推测解码,但VeriCache要想奏效,必须解决一个关键的系统挑战:将完整KV缓存放在GPU内存之外,并尽量降低为验证而将其换入的开销。其核心洞见有两点:(1) 压缩KV解码可以与完整KV换入并行进行,因为前者受HBM带宽限制,而后者受PCIe/网络限制;(2) 压缩的KV缓存通常会产生与完整KV缓存相似的输出,因此可以采用较长的起草步长来摊销每次完整KV换入的开销。VeriCache同时适用于长上下文解码和远程前缀缓存,通过统一的压缩器接口支持多种token丢弃和量化方法,并可与传统的推测解码组合使用。实验结果表明,VeriCache在产生完全相同输出的同时,吞吐量最高可达完整KV推理的4倍。

英文摘要

The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4x higher throughput than full-KV inference while producing identical outputs.
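The draft-then-verify loop behind this idea can be sketched with two toy next-token functions. Here `draft_next` stands in for decoding with a compressed KV cache and `full_next` for decoding with the full cache; both, along with `horizon`, are hypothetical, and the sketch ignores the swapping and overlap that the abstract identifies as the real systems problem.

```python
def draft_and_verify(prompt, draft_next, full_next, n_tokens, horizon=4):
    """Generate n_tokens that match full_next exactly, drafting with draft_next."""
    out = list(prompt)
    verify_rounds = 0
    target_len = len(prompt) + n_tokens
    while len(out) < target_len:
        # 1) Draft a block with the cheap (compressed-cache) decoder.
        draft = []
        for _ in range(min(horizon, target_len - len(out))):
            draft.append(draft_next(out + draft))
        # 2) Verify the block against the full decoder in one round.
        verify_rounds += 1
        for i, tok in enumerate(draft):
            expected = full_next(out + draft[:i])
            if tok != expected:
                draft = draft[:i] + [expected]   # keep the matching prefix, then correct
                break
        out.extend(draft)
    return out[len(prompt):], verify_rounds

# Toy decoders over integer tokens; the draft errs when the context length is a multiple of 5.
full_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if len(ctx) % 5 == 0 else (ctx[-1] + 1) % 10

tokens, rounds = draft_and_verify([0], draft_next, full_next, n_tokens=8)
```

The output equals what `full_next` alone would produce, but the full decoder is consulted in three verification rounds rather than eight sequential steps. The longer the draft tends to agree, the fewer rounds (and, in the paper's setting, full-KV swaps) are needed.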

2605.17611 2026-05-19 cs.SE cs.LG 版本更新

A Feature-Driven Framework for Software Fault Prediction

基于特征的软件故障预测框架

Ahmad Nauman Ghazi, Nagajyothi Devarapalli, Ashir Javeed, Sadi Alawadi, Fahed Alkhabbas, Khalid AlKharabsheh

发表机构 * Department of Software Engineering(软件工程系) Blekinge Institute of Technology(布莱金厄理工大学) Department of Computer Science(计算机科学系) Malmö University(马尔默大学) Al-Balqa Applied University(阿尔巴卡应用大学)

AI总结 本文研究了特征选择和参数调优对机器学习模型在软件故障预测中的性能影响,通过结合相关性特征选择、递归特征消除、互信息和L1正则化等方法,以及网格搜索、随机搜索和遗传算法等优化技术,提升了故障预测的准确性,达到了88.40%的准确率,比基线模型提高了18%。

Comments Pages 1-9, Preprint, Accepted for publication in FLICS2026

详情
AI中文摘要

软件故障预测(SFP)是软件工程中的关键任务,能够帮助早期识别模块中的故障,从而提高软件质量并降低维护成本。本研究探讨了特征选择和参数调优对机器学习(ML)模型在SFP中的综合影响。本研究评估了特征选择方法(包括基于相关性的特征选择(CFS)、递归特征消除(RFE)、互信息(MI)和L1正则化)之间的相互作用,其中使用网格搜索、随机搜索和遗传算法(GA)等超参数调优技术来优化ML算法,包括随机森林(RF)、逻辑回归(LR)和支持向量机(SVM)以实现优化的故障预测性能。CFS与GA的结合应用获得了最高准确性,达到88.40%(使用RF),比没有特征选择或调优的基线模型提高了18%。特征选择减少了维度并识别了关键属性,如加权方法每类(WMC)和对象耦合(CBO),而迭代参数调优优化了模型对这些特征集的对齐。值得注意的是,所提出的方法表现出鲁棒性,交叉验证的变异性最小(±1.0%),并且效率高,减少了单变量方法如L1正则化的训练时间。

英文摘要

Software fault prediction (SFP) is a critical task in software engineering, enabling early identification of faults in modules to improve software quality and reduce maintenance costs. This research investigates the combined effects of feature selection and parameter tuning on the performance of machine learning (ML) models for SFP. The study evaluates the interaction between feature selection methods, including correlation-based feature selection (CFS), recursive feature elimination (RFE), mutual information (MI), and L1 regularization, and hyperparameter tuning techniques such as grid search, randomized search, and genetic algorithm (GA), which are used to optimize ML algorithms including random forest (RF), logistic regression (LR), and support vector machines (SVM). The combined application of CFS and GA yielded the highest accuracy, achieving 88.40% with RF, representing an improvement of 18% over baseline models without feature selection or tuning. Feature selection reduced dimensionality and identified critical attributes such as weighted methods per class (WMC) and coupling between objects (CBO), while iterative parameter tuning optimized model alignment to these feature sets. Notably, the proposed methods demonstrated robustness, with minimal cross-validation variability (±1.0%), and efficiency, reducing training times in univariate methods such as L1 regularization.

2605.17605 2026-05-19 cs.LG 版本更新

Venom: A PyTorch Generative Modeling Toolkit

Venom:一个PyTorch生成建模工具包

Liang Yan

发表机构 * Paul G. Allen School of Computer Science & Engineering(保罗·G·艾伦计算机科学与工程学院)

AI总结 本文提出Venom,一个基于PyTorch的生成建模工具包,旨在通过统一的接口实现多种生成建模家族,提供可读、可复现的入口点以及一致的训练和采样API,便于教学、原型设计和轻量级基准测试。

Comments Preprints

详情
AI中文摘要

现代生成建模已发展为一系列相互关联但通常各自独立实现的范式,包括去噪扩散模型、基于分数的随机微分方程、流匹配、变分自编码器、归一化流、对抗模型和基于能量的模型。对于初学者来说,这种碎片化使得在单一、连贯的代码库中比较训练目标、推理过程、采样算法和条件机制变得困难。我们介绍了Venom,一个面向教学的PyTorch工具包,它在统一的、以MNIST为先的接口下实现了具有代表性的生成建模家族。Venom强调广度、可读性、可复现的入口点以及一致的训练和采样API,而不是大规模的性能工程。该软件包目前包括扩散和基于分数的模型、流匹配和一步生成器、变分自编码器、归一化流、生成对抗网络和基于能量的模型。它提供了独立的训练和采样脚本、分类器引导和无分类器引导示例、双语教程笔记本,以及便于教学、原型设计和轻量级基准测试的按模型家族划分的组织结构。

英文摘要

Modern generative modeling has grown into a broad collection of related but often separately implemented paradigms, including denoising diffusion models, score-based stochastic differential equations, flow matching, variational autoencoders, normalizing flows, adversarial models, and energy-based models. For newcomers, this fragmentation makes it difficult to compare training objectives, inference procedures, sampling algorithms, and conditioning mechanisms within a single coherent codebase. We introduce Venom, an educational PyTorch toolkit that implements representative generative modeling families under a unified, MNIST-first interface. Venom emphasizes breadth, readability, reproducible entry points, and consistent training and sampling APIs rather than large-scale performance engineering. The package currently includes diffusion and score-based models, flow matching and one-step generators, variational autoencoders, normalizing flows, generative adversarial networks, and energy-based models. It provides separate training and sampling scripts, classifier and classifier-free guidance examples, bilingual tutorial notebooks, and a model-family organization that supports teaching, prototyping, and lightweight benchmarking.

2605.17603 2026-05-19 physics.ao-ph cs.LG 版本更新

Longwang: Zero-Shot Global Spatiotemporal Precipitation Downscaling with a Latent Generative Prior

Longwang: 一种基于潜在生成先验的零样本全球时空降水降尺度方法

Yue Wang, Daniele Visioni

发表机构 * Department of Earth and Atmospheric Sciences(地球与大气科学系)

AI总结 本文提出Longwang方法,通过学习条件化的潜在生成先验和物理信息观测算子,实现从月尺度到日尺度的降水降尺度,优于传统方法在细尺度空间模式重建、时间一致性保持和极端降水强度恢复方面,并能泛化到历史气候模拟和未来气候预测。

详情
AI中文摘要

高分辨率降水信息对于气候影响评估至关重要,但全球气候模型仍然过于粗糙,无法解析关键的小尺度过程。现有的机器学习降尺度方法通常需要成对的低分辨率和高分辨率数据进行监督学习,在推理时受限于固定的区域或尺度因子,并且在物理空间中训练和运行的计算成本可能很高。本文介绍Longwang,一种用于全球时空降水降尺度的零样本潜在生成框架。Longwang学习一个以上下文为条件的潜在生成先验,并通过后验采样将其与基于物理的观测算子相结合,从而能够由月尺度、O(100 km)的输入生成日尺度、O(10 km)的降水场。在ERA5再分析数据上,Longwang在重建细尺度空间模式、保持时间一致性以及恢复极端降水强度方面优于使用无条件生成先验的标准后验采样。该框架还能进一步泛化到历史气候模拟和未来气候预估,在显著的分布偏移下依然有效。

英文摘要

High-resolution precipitation information is essential for climate impact assessment, yet global climate models remain too coarse to resolve key small-scale processes. Existing machine learning downscaling methods often require paired low- and high-resolution data for supervised learning, are tied to fixed regions or scale factors during inference, and can be computationally expensive to train and run in physical space. Here we introduce Longwang, a zero-shot latent generative framework for global spatiotemporal precipitation downscaling. Longwang learns a context-conditioned latent generative prior and combines it with a physically informed observation operator through posterior sampling, enabling daily O(10 km) precipitation fields to be generated from monthly O(100 km) inputs. On ERA5 reanalysis, Longwang outperforms standard posterior sampling with an unconditional generative prior in reconstructing fine-scale spatial patterns, preserving temporal coherence, and recovering extreme precipitation intensities. The framework further generalizes to historical climate simulations and future climate projections under substantial distribution shift.

2605.17590 2026-05-19 cs.LG math.OC 版本更新

Form and Function: Machine Unlearning as a Problem of Misaligned States

形式与功能:将机器去学习视为不一致状态的问题

Kennon Stewart

发表机构 * Second Street Labs, Detroit, MI, USA(第二街实验室,密歇根州底特律) Department of Statistics, University of Michigan, Ann Arbor, MI, USA(密歇根大学统计系,密歇根州安阿伯)

AI总结 本文提出将在线L-BFGS的机器去学习问题建模为反事实状态对齐问题,通过引入状态感知度量和反事实 oracle 模型,证明去学习不仅仅是参数修正问题,还需要与可实现的反事实优化器状态对齐。

详情
AI中文摘要

我们把在线L-BFGS的机器去学习问题建模为反事实状态对齐问题。给定一个实际事件流和一个经过删除编辑的反事实流,去学习的目标是确定在从未处理过被删除样本的情况下会产生的优化器状态。我们引入了状态感知度量,分别衡量参数误差、内存运算符误差、综合状态误差和更新方向误差。内存度量比较由o-L-BFGS内存引起的逆Hessian作用,而不是将曲率对视为有限影响。在凸性假设下,我们推导出反事实状态偏差的递归界。然后,我们评估了一个状态感知的删除干预基准,包括仅内存和仅参数的修正,与反事实 oracle 模型进行比较。这些结果表明,在线L-BFGS的去学习不仅仅是参数修正问题:它需要与可实现的反事实优化器状态对齐。

英文摘要

We formulate machine unlearning for online L-BFGS as a counterfactual state-alignment problem. Given an actual event stream and a deletion-edited counterfactual stream, the target of unlearning is the optimizer state that would have arisen had the deleted samples never been processed. We introduce state-aware metrics that separately measure parameter error, memory-operator error, combined state error, and update-direction error. The memory metric compares the inverse-Hessian actions induced by the o-L-BFGS memory, rather than treating curvature pairs as of finite influence. Under convexity assumptions, we derive a recursive bound on counterfactual state deviation. We then evaluate a state-aware benchmark of deletion interventions, including memory-only and parameter-only corrections, against a counterfactual oracle model. These results show that unlearning for online L-BFGS is not merely a parameter-correction problem: it requires alignment with a realizable counterfactual optimizer state.

2605.17582 2026-05-19 cs.LG cs.CE 版本更新

Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions, Wavelet Scattering Inputs, and Spectral-Consistency Training for Self-Similar Time Series

尺度等变生成预测:权重绑定的扩张卷积、小波散射输入和频谱一致性训练用于自相似时间序列

Andrea Morandi

发表机构 * Cisco Systems, Inc.(思科系统公司)

AI总结 本文提出了一种尺度等变生成预测方法,通过权重绑定的扩张卷积、小波散射输入和频谱一致性训练,用于自相似时间序列的生成,展示了在S&P 500日收益率上的优越性能。

详情
AI中文摘要

许多自然和工程时间序列(股票收益、气候异常、湍流速度、神经记录、数据包级网络流量)近似自相似:其时间跨度为T的分布与时间跨度为1的分布通过一个缩放指数H相关联。标准的深度生成序列模型(Transformer、扩张TCN、WaveNet家族)忽略了这一点。它们的感受野很宽,但卷积核参数在每个扩张级别上相互独立,得到的是多尺度架构,而不是尺度等变架构。我们有三项贡献。首先,我们给出了一维因果网络离散尺度等变性的精确定义,并证明二进(dyadic)伸缩与任何在各级别间共享卷积核权重的扩张卷积堆栈可交换(不计边界效应)。绑定卷积核将卷积参数预算缩减为原来的1/L(L为深度),并将自相似性作为归纳偏置固化到网络中。其次,我们将这一尺度等变WaveNet(SE-WaveNet)主干与三个携带相同先验的组件结合:单级Daubechies-4小波输入、暴露局部缩放指数的Hurst-FiLM模块,以及以|f|^{-(2H+1)}幂律频谱为目标的频谱一致性训练项。输出头是一个条件归一化流,选择它是为了保持等变性。第三,在30年的S&P 500每日对数收益率上,SE-WaveNet的样本在“Allan-Variance top-25”标的池(universe)上重现了经验缩放坍缩诊断(中位数C* = 0.020),而容量相当的普通WaveNet则不能(≥0.06)。NLL、KS校准和尾部能量距离与基线持平或优于基线,且卷积参数数量减少为原来的1/L。

英文摘要

Many natural and engineered time series -- equity returns, climate anomalies, turbulent velocities, neural recordings, packet-level network traffic -- are approximately self-similar: their horizon-$T$ distribution is tied to the horizon-$1$ distribution by one scaling exponent $H$. Standard deep generative sequence models (transformers, dilated TCNs, the WaveNet family) ignore this. Their receptive fields are wide, but kernel parameters live independently at every dilation level, yielding a multi-scale architecture, not a scale-equivariant one. We make three contributions. First, we give a precise definition of discrete scale equivariance for 1D causal networks and prove that dyadic dilation commutes (up to boundary effects) with any dilated-convolution stack whose kernel weights are shared across levels. Tying the kernel shrinks the convolutional parameter budget by an $L$-fold factor (where $L$ is depth) and hard-wires self-similarity in as an inductive bias. Second, we wrap this Scale-Equivariant WaveNet (SE-WaveNet) backbone in three components that carry the same prior: a one-level Daubechies-4 wavelet input, a Hurst-FiLM block exposing the local scaling exponent, and a spectral-consistency training term targeting the $|f|^{-(2H+1)}$ power-law spectrum. The head is a conditional normalising flow, chosen to preserve equivariance. Third, on 30 years of S&P 500 daily log-returns, SE-WaveNet samples reproduce the empirical scaling-collapse diagnostic on the Allan-Variance top-25 universe (median $\mathcal{C}^\star = 0.020$), while a vanilla WaveNet at matched capacity does not ($\geq 0.06$). NLL, KS-calibration, and tail energy distance tie or beat the baseline, with $L\times$ fewer convolutional parameters.
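A one-layer numerical illustration of the commutation property stated in the first contribution: with one kernel shared across levels, subsampling the output of a dilation-2 causal convolution gives the same result as applying the dilation-1 convolution to the subsampled input. This is a toy check under zero padding, not the paper's definition or proof; the function names and the test signal are assumptions.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - dilation*k], with zeros before t = 0."""
    return [
        sum(w * x[t - dilation * k]
            for k, w in enumerate(weights) if t - dilation * k >= 0)
        for t in range(len(x))
    ]

def downsample2(x):
    """Dyadic dilation of the time axis: keep every second sample."""
    return x[::2]

weights = [0.5, -1.0, 2.0]                 # one kernel shared by both levels
x = [float((7 * t) % 11) for t in range(32)]

lhs = downsample2(causal_dilated_conv(x, weights, dilation=2))
rhs = causal_dilated_conv(downsample2(x), weights, dilation=1)
```

Both sides sum the same terms, `weights[k] * x[2t - 2k]`, so `lhs` and `rhs` agree exactly. If each level had its own kernel, as in a standard WaveNet, the two sides would generally differ, which is the distinction the abstract draws between multi-scale and scale-equivariant architectures.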

2605.17581 2026-05-19 cond-mat.soft cs.LG 版本更新

Topological Data Analysis combined with Machine Learning for Predicting Permeability of Porous Media

拓扑数据分析结合机器学习预测多孔介质渗透率

Ebru Dagdelen, Catherin Neena Lalu, Aakash Karlekar, Manav Arora, Matthew Illingworth, Jonathan Jaquette, Linda Cummings, Lou Kondic

发表机构 * Department of Mathematical Sciences, New Jersey Institute of Technology(新泽西理工学院数学科学系) Department of Physics, New Jersey Institute of Technology(新泽西理工学院物理系)

AI总结 本研究探讨了如何利用拓扑数据分析和机器学习方法,通过提取多孔介质的结构、拓扑和网络特征,来预测其渗透率,并展示了拓扑数据分析在结合机器学习时的有效性。

详情
AI中文摘要

多孔介质中的流体流动由于其复杂性难以通过标准的解析或数值方法解决。然而,由于合成多孔介质的表示容易生成且物理实验数据日益普及,该问题非常适合结合机器学习(ML)技术的研究。我们讨论了可以从此类数据中提取的多种特征及其作为标准ML算法输入变量的用途。这些特征包括描述多孔介质几何结构的结构度量、描述连通性的拓扑度量以及通过将多孔介质建模为简化孔隙网络获得的网络度量。这些特征使能够利用机器学习技术预测所考虑的(合成)多孔材料的渗透率,其中机器学习方法还利用了单独计算的精确渗透率(真实值)。通过比较不同输入变量所得出的结果,有助于更深入地理解各种度量在基于多孔介质结构预测渗透率方面的实用性。我们特别表明,拓扑数据分析(TDA)提供了一组有用的特征,可以轻松地与机器学习结合,以获得有意义的结果。

英文摘要

Flow in porous media is difficult to address using standard analytical or numerical methods due to its complexity. However, since synthetic representations of porous media are easy to produce and data from physical experiments are becoming more widely available, the problem is well-suited to studies that include machine learning (ML) techniques. We discuss a number of features that can be extracted from such data, and their utility as input variables into a standard ML algorithm. These features include structural measures describing the geometry of the porous media, topological measures describing the connectivity, and network measures obtained by modeling the porous media as simplified pore networks. These features enable the prediction of the permeability of the considered (synthetic) porous materials using ML techniques that also leverage the separately computed exact permeability (ground truth). Comparing results obtained using different input variables helps develop a better understanding of the utility of various measures for predicting permeability based on the porous media structure. We show, in particular, that topological data analysis (TDA) provides a useful set of features that can be easily combined with ML to yield meaningful results.

2605.17575 2026-05-19 cs.LG cs.AI 版本更新

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

UniAlign:一种用于在分布偏移下鲁棒网络流量分类的模型无关框架

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Yong Cui

发表机构 * Institute for Network Sciences and Cyberspace, Tsinghua University(网络科学与网络空间研究院,清华大学) Department of Computer Science and Technology, Tsinghua University(计算机科学与技术系,清华大学)

AI总结 本文提出UniAlign,一种模型无关的框架,通过领域对齐微调和稳定模型集成提升深度学习网络流量分类模型在分布偏移下的鲁棒性,实验表明其在准确率和F1分数上均优于现有基线。

详情
AI中文摘要

网络流量分类(NTC)模型部署到真实环境后,常因网络条件变化引起的分布偏移而出现严重的性能下降。现有的鲁棒性增强方法通常与特定的模型架构或数据设置耦合,无法泛化到最先进的基于原始字节的NTC模型,或者带来显著的训练开销。在本文中,我们提出UniAlign,一种新的模型无关框架,用于提升基于深度学习的NTC模型在分布偏移下的鲁棒性。UniAlign结合了两个组件:领域对齐微调,鼓励模型在异构网络条件下学习领域不变的流量表示;稳定模型集成,通过聚合平坦损失区域内的检查点来增强推理鲁棒性。该框架可以无缝集成到现有的监督NTC模型中,既不要求特定的特征模态,也不引入非常数级的额外训练成本。我们在三个公开数据集上评估了UniAlign,这些数据集涵盖加密方案、数据采集设备和攻击行为等多种分布偏移。在两个代表性NTC模型上的实验结果表明,与标准训练相比,UniAlign将平均分类准确率提高了2.51%,平均F1分数提高了2.71%,并在准确率和F1分数上分别比最强基线高出1.45%和1.69%,同时仅需所有NTC专用基线训练时间的12.4%至53.9%。

英文摘要

Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines domain alignment fine-tuning, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with stable model ensembling, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51% and average F1 score by 2.71%, outperforming the strongest baseline by 1.45% in accuracy and 1.69% in F1 score, while requiring only 12.4%-53.9% of the training time of all NTC-specific baselines.

2605.17571 2026-05-19 cs.CV cs.LG 版本更新

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

混合专家在类增量学习中的稳定路由

Zirui Guo, Quan Cheng, Da-Wei Zhou, Lijun Zhang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院)

AI总结 本文研究了在类增量学习中混合专家模型的稳定路由问题,提出了一种稳定路由框架StaR-MoE,通过敏感性感知路由对齐和不对称容量正则化,提高了模型对新类别的适应能力和旧类别的知识保留能力。

详情
AI中文摘要

类增量学习(CIL)要求模型在学习新类别时保持先前知识。最近,结合预训练模型与混合专家(MoE)的方法在CIL中受到越来越多关注:它们通常在学习过程中扩展专家,并使用路由器分配权重。然而,现有MoE方法往往忽视了专家扩展引起的路由漂移。一旦引入新的专家,路由器可能会将样本从早期类别重新分配给新加入的专家,从而扰动已建立的专家组合,即使旧专家保持冻结。我们主张,可扩展的MoE在CIL中需要两个互补的性质:稳定的旧类路由用于知识保留和足够的容量利用用于新类适应。为此,我们提出了Stable Routing for MoE(StaR-MoE),一种用于可扩展MoE的路由级别框架。通过结合敏感性感知的路由对齐,StaR-MoE通过敏感性引导的约束将当前旧类路由行为与历史路由分布对齐。同时,StaR-MoE引入了不对称容量正则化,以鼓励有效利用扩展的专家池,而不影响类特定的路由专业化。在四个标准CIL基准上的广泛实验表明,StaR-MoE在平均准确率和最后准确率上均优于现有最先进方法,突显了稳定路由的重要性。

英文摘要

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.

2605.17570 2026-05-19 cs.LG cs.CL 版本更新

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

GRPO能有多异策略?Mu-GRPO:面向高效大语言模型强化学习

Minghao Tian, Yunfei Xie, Chen Wei

发表机构 * Rice University(莱斯大学)

AI总结 本文探讨了GRPO能够容忍多大程度的异策略(off-policy)训练,提出Mu-GRPO方法,通过减少rollout与优化之间的切换开销实现高效的LLM强化学习,并在多个数学推理基准上达到或超过标准GRPO的性能。

详情
AI中文摘要

组相对策略优化(GRPO)是近期大语言模型可验证奖励强化学习(RLVR)取得进展的关键推动因素,但它通常在低陈旧度、接近同策略的条件下训练,系统开销很大。我们提出一个简单的问题:GRPO能有多异策略(off-policy)?我们证明GRPO类算法能够容忍的rollout陈旧度远大于此前的假设,并提出Mu-GRPO,一种将训练组织为少量(例如四个)大型顺序生成-优化阶段的RL训练框架。这种设计带来较高的rollout陈旧度,同时大幅减少rollout与优化之间的切换开销。为了在陈旧数据下稳定学习,Mu-GRPO结合了放宽的裁剪(保留有用的陈旧rollout梯度)与负优势否决(在负优势响应中移除触发点之后会破坏稳定性的后缀更新)。在五个语言模型和多个数学推理基准上,Mu-GRPO的性能与标准GRPO持平或更优,同时在实际训练时间(wall-clock)上实现约2倍加速,显著改善了LLM强化学习的性能-效率权衡。

英文摘要

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.

2605.17562 2026-05-19 cs.LG cs.AI cs.HC 版本更新

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

超越准确率:EEG基础模型的鲁棒性、可解释性和表达性

Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

发表机构 * Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) Imperial College London(伦敦帝国学院)

AI总结 本文研究了EEG基础模型的鲁棒性、可解释性和表达性,通过在八个数据集上对六个EEG-FMs和一个基线深度学习模型进行基准测试,揭示了模型在不同扰动下的表现,以及其在可解释性和表达性方面的特性。

详情
AI中文摘要

EEG基础模型(EEG-FMs)此前主要以干净的分布内准确率进行评估,其鲁棒性、可解释性和表征质量基本未被考察。本研究在八个数据集上对六个EEG-FMs和一个基线深度学习模型进行基准测试,以填补这些空白。除干净准确率外,我们进行了三个层面的分析:(i)鲁棒性:我们施加测试时扰动,包括加性噪声、随机和基于区域的通道丢弃,以及区域特定的噪声注入。分析表明,没有任何单一模型在所有失效模式下都占优。最抗噪的模型在通道丢弃下属于最脆弱的一类,而当通道被直接移除而非零填充时,大部分丢弃脆弱性会消失。(ii)可解释性:我们首次将注意力感知的逐层相关性传播(AttnLRP)应用于EEG-FMs,并表明模型总体上将相关性集中在与任务相符的脑区,这与已知的神经生理学一致。然而,在扰动下归因图保持空间稳定,而预测性能下降,说明模型关注了正确的脑区,却解码了被破坏的内容。(iii)表达性:通过逐块探测,我们表明后期块在微调过程中被重新利用,而早期块已经包含任务相关信息。此外,我们证明,此前被归因于预训练表示质量低的仅训练头部(head-only)性能较差的现象,很大程度上可由池化解释;当保留token级嵌入时,EEG-FMs具有足够的表征能力。这些发现首次对EEG-FMs的鲁棒性、可解释性和表达性进行了系统评估,并指出了其开发中的关键考虑因素。

英文摘要

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.

2605.17555 2026-05-19 cs.LG cs.CV 版本更新

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

PFlow-T:面向拓扑可控生成的持续性驱动前向过程

Snigdha Chandan Khilar

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出PFlow-T,一种前向过程基于持续同调的生成模型,通过持续同调直接控制拓扑结构,在生成指定Betti数和处理分布外任务方面优于基线。

详情
AI中文摘要

当前的拓扑感知扩散模型存在架构上的不匹配:它们用高斯噪声进行破坏,却通过条件侧通道恢复结构特征。为解决此问题,我们引入PFlow-T,一种前向过程完全基于持续同调的生成模型。在PFlow-T中,时间度量的是孔洞等H1拓扑特征被破坏的程度,而不是高斯噪声的注入量。该前向过程按照特征的持续性逐步消除特征,反向网络则直接反演这种结构化破坏,一步预测干净状态。在MNIST数字0、1和8上的测试表明,PFlow-T在生成指定Betti数和处理分布外任务方面显著优于基线模型。PFlow-T是首个以持续同调作为前向过程的生成架构,不过我们注意到它目前仅限于低分辨率的像素空间代理。

英文摘要

Current topology-aware diffusion models face an architectural mismatch: they use Gaussian noise for corruption while recovering structural features through conditional side channels. To fix this, we introduce PFlow-T, a generative model that bases its forward process entirely on persistent homology. In PFlow-T, time measures the destruction of H1 topological features, like holes, rather than Gaussian noise injection. This forward process eliminates features based on their persistence. The reverse network then directly inverts this structured corruption to predict the clean state in one step. Tests on MNIST digits zero, one, and eight show PFlow-T significantly outperforms a baseline model in generating requested Betti numbers and handling out-of-distribution tasks. PFlow-T is the first generative architecture using persistent homology for the forward process, although we note it is currently limited to low-resolution pixel-space proxies.

2605.17552 2026-05-19 cs.LG 版本更新

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

Q-LocalAdam: 一种内存高效的边缘联邦学习客户端自适应优化方法

Vedant Waykole, Haroon R. Lone

发表机构 * IISER Bhopal(印度比哈尔州科学与技术研究院)

AI总结 本文提出Q-LocalAdam,一种针对边缘联邦学习中非独立同分布数据和内存限制的自适应优化方法,通过分布感知的8位量化块线性编码和对数空间编码实现内存高效优化,显著提升模型性能和并发工作负载能力。

详情
AI中文摘要

边缘设备上的联邦学习必须应对非独立同分布的客户端数据和严格的内存预算。像Adam这样的自适应优化器在数据异质性下稳定训练,但需要存储全精度动量和方差状态,通常使客户端内存开销增加三倍。这限制了在资源受限设备上可部署的模型大小和同时进行的联邦任务数量。我们实证发现,联邦Adam中的动量和方差在统计特性上存在根本差异:动量值对称且有界,而方差跨越八个数量级并具有对数正态结构。受这种不对称性启发,我们提出了Q-LocalAdam,它对动量应用分布感知的8位量化块线性编码,对方差应用对数空间编码,同时保持模型参数在全精度下。在CIFAR-10和CIFAR-100上,针对不同数据异质性(α∈{0.1, 0.5, 1.0, IID}),Q-LocalAdam在中等异质性下实现3.37倍的优化器内存减少,无精度损失,在极端异质性下(如CIFAR-100,α=0.1)实现显著提升(+5.74pp)。多种子验证确认统计显著性(p<0.01)。相比之下,朴素的均匀量化退化到随机性能,证明了分布感知设计的重要性。Q-LocalAdam在内存受限的边缘设备上无需修改联邦协议即可实现更大的模型和更多的并发工作负载。

英文摘要

Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose Q-LocalAdam, which applies distribution-aware 8-bit quantization (block-wise linear encoding for momentum and log-space encoding for variance) while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity ($α\in \{0.1, 0.5, 1.0, \text{IID}\}$), Q-LocalAdam achieves $3.37\times$ optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74pp on CIFAR-100, $α=0.1$). Multi-seed validation confirms statistical significance ($p<0.01$). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol.
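To make the asymmetry argument concrete, here is a minimal sketch of the two encodings the abstract names: a symmetric linear 8-bit code for a momentum block and a log-space 8-bit code for a variance block. The function names, the per-block max scaling, the `eps` floor and the sample values are all assumptions for illustration; the abstract does not give the paper's block size or exact codebook.

```python
import math

def quantize_momentum(block):
    """Block-wise symmetric linear 8-bit encoding (codes in [-127, 127])."""
    scale = max(abs(v) for v in block) / 127.0 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in block], scale

def dequantize_momentum(codes, scale):
    return [c * scale for c in codes]

def quantize_variance(block, eps=1e-30):
    """Log-space 8-bit encoding (codes in [0, 255]) for non-negative values."""
    logs = [math.log(max(v, eps)) for v in block]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / 255.0 or 1.0  # avoid a zero step for constant blocks
    return [round((x - lo) / step) for x in logs], lo, step

def dequantize_variance(codes, lo, step):
    return [math.exp(lo + c * step) for c in codes]

# Hypothetical optimizer state for one block
momentum = [-0.02, 0.005, 0.013, -0.001]
variance = [1e-10, 3e-7, 2e-3, 1e-2]  # spans eight orders of magnitude

m_codes, m_scale = quantize_momentum(momentum)
v_codes, v_lo, v_step = quantize_variance(variance)
m_hat = dequantize_momentum(m_codes, m_scale)
v_hat = dequantize_variance(v_codes, v_lo, v_step)

# Naive linear 8-bit quantization of the same variance block, for comparison:
# the smallest entries fall below one quantization step and collapse to zero.
lin_scale = max(variance) / 255.0
v_naive = [round(v / lin_scale) * lin_scale for v in variance]
```

In this toy block the log-space code keeps every variance entry within a few percent relative error, while the linear code maps the two smallest entries to zero, which is one plausible reading of why uniform quantization would break Adam's denominator.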

2605.17546 2026-05-19 astro-ph.IM astro-ph.GA cs.LG 版本更新

Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling

通过一步生成建模加速红移条件下的星系图像合成

Tianyue Yang, Sandro Tacchella, Xiao Xue

发表机构 * The Center for Computational Science(计算科学中心) University College London(伦敦大学学院) Cavendish Laboratory(卡文迪许实验室) Kavli Institute for Cosmology, University of Cambridge(剑桥大学卡弗里宇宙学研究所)

AI总结 本文研究了利用扩散模型和像素MeanFlow实现高效红移条件下的星系图像生成,通过对比不同模型在GalaxiesML-64数据集上的表现,发现一步生成模型在计算成本大幅降低的情况下能有效恢复星系形态统计信息,为大规模宇宙巡天和基于模拟的科学推断提供了新路径。

Comments 19 pages, 8 figures

详情
AI中文摘要

理解星系形态在宇宙时间尺度上的演化,需要能够以红移为条件生成真实星系群体的模型。本文研究了使用扩散模型和pixel-MeanFlow进行高效的红移条件生成建模,用于天体物理图像合成。我们首先回顾了基于分数的扩散模型、流匹配(Flow Matching)、一步生成模型和现代扩散采样器之间的联系。随后,我们在GalaxiesML-64数据集上评估了DDPM、DDIM、DEIS-AB2、DPM++2M和一步pixel-MeanFlow,采用基于形态的指标,包括椭率、半长轴、Sérsic指数和等照度面积。结果显示出清晰的精度-效率权衡:标准DDPM采样的分布保真度最好,但计算成本高;二阶采样器相比DDIM显著提高了效率。pixel-MeanFlow实现了单步生成,并在若干形态统计量上取得有竞争力的表现,但在细粒度结构上仍弱于多步DDPM。我们的结果表明,一步生成模型能够以低若干数量级的计算成本恢复关键的星系形态统计量,为面向大型宇宙学巡天和基于模拟的科学推断的高效条件模拟器开辟了路径。

英文摘要

Understanding galaxy morphology evolution across cosmic time requires models that can generate realistic galaxy populations conditioned on redshift. In this work, we study efficient redshift-conditioned generative modeling for astrophysical image synthesis using diffusion models and pixel-MeanFlow. We first review the connections between score-based diffusion models, Flow Matching, one-step generative models, and modern diffusion samplers. We then evaluate DDPM, DDIM, DEIS-AB2, DPM++2M, and one-step pixel-MeanFlow on the GalaxiesML-64 dataset using morphology-based metrics, including ellipticity, semi-major axis, Sérsic index, and isophotal area. Our results show a clear accuracy-efficiency trade-off: standard DDPM sampling achieves the best distributional fidelity but requires high computational cost, while second-order samplers substantially improve efficiency over DDIM. Pixel-MeanFlow enables single-step generation and achieves competitive performance on several morphology statistics, though it remains weaker than many-step DDPM for fine-grained structure. Our results demonstrate that one-step generative models can recover key galaxy morphology statistics at orders-of-magnitude lower computational cost, opening a path toward efficient conditional simulators for large cosmological surveys and simulation-based scientific inference.

2605.17530 2026-05-19 cs.CR cs.AI cs.LG cs.NI 版本更新

Few-Shot Network Intrusion Detection Using Online Triplet Mining

基于在线三元组挖掘的少样本网络入侵检测

Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Miroslav Bures, Robert Atkinson

发表机构 * Department of Electronics and Electrical Engineering, University of Strathclyde(斯特拉斯克莱德大学电子与电气工程系) Faculty of Computer and Information Sciences, Ain Shams University(艾因夏姆斯大学计算机与信息科学学院) Faculty of Electrical Engineering, Czech Technical University(捷克理工大学电气工程学院)

AI总结 本文提出利用在线三元组挖掘和KNN分类器的三元组网络,实现少样本下的有效网络入侵检测,通过对比不同三元组挖掘算法和模型设计,验证了在少量恶意样本下该方法的竞争力。

Comments Published in: MDPI Applied Sciences, 2026. Official version: https://doi.org/10.3390/app16104589 Code: https://github.com/jackwilkie/few_shot_nids_triplet_mining

详情
Journal ref
Wilkie, J.; Hindy, H.; Tachtatzis, C.; Bures, M.; Atkinson, R. Few-Shot Network Intrusion Detection Using Online Triplet Mining. Appl. Sci. 2026, 16, 4589. https://doi.org/10.3390/app16104589
AI中文摘要

网络入侵检测系统通过检测恶意网络流量在网络保护中发挥着关键作用,检测到的流量随后可由网络安全运营中心进行调查。最先进的方法利用监督机器学习训练分类模型来识别已知的网络攻击;然而,这些模型需要大型标注数据集进行训练,在较小的数据集上训练时表现不佳。为弥补这一不足,异常检测模型学习良性流量的分布,并将不符合该分布的流量标记为恶意。虽然这些方法训练时不需要恶意样本,但其误报率很高,因而难以实用。因此,当某一特定攻击类别的标注实例不足以训练有效的分类器时,网络可能特别容易受到攻击。这种情况常见于新建立的网络,或出现此前未见过的攻击类型时。为应对这一挑战,本文提出使用三元组网络,结合在线三元组挖掘和KNN分类器进行少样本分类,从而在仅用少量恶意样本训练后即可实现有效的入侵检测。我们探索了多种在线三元组挖掘算法,并通过一系列消融研究比较和评估了推理算法、优化的距离度量等模型设计选择。最终模型在少样本二分类和多分类任务上与其他最先进方法进行了比较,结果表明,当每个类别仅用10个恶意样本训练时,所提出的方法即可与现有方法相竞争。

英文摘要

Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic which can then be investigated by a cybersecurity operations centre. State-of-the-art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and show poor performance when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non-conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false-positive rates, rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few-shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state-of-the-art approaches in few-shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as few as 10 malicious samples of each class.
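As a rough illustration of the pipeline described above, the sketch below pairs one well-known online mining rule (batch-hard: hardest positive and hardest negative per anchor) with a plain KNN vote over embeddings. The abstract says several mining algorithms were explored and does not say which one was selected, so batch-hard, the Euclidean distance, the margin value and the toy 2-D embeddings are all assumptions here, not the paper's configuration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over anchors, using the hardest positive and negative in the batch."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == label]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != label]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0

def knn_predict(query, support, support_labels, k=3):
    """Majority vote among the k nearest support embeddings."""
    nearest = sorted(range(len(support)), key=lambda j: euclidean(query, support[j]))[:k]
    votes = {}
    for j in nearest:
        votes[support_labels[j]] = votes.get(support_labels[j], 0) + 1
    return max(votes, key=votes.get)

# Toy embeddings standing in for the output of a trained triplet network
support = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
support_labels = ["benign", "benign", "attack", "attack"]

well_separated_loss = batch_hard_triplet_loss(support, support_labels)
overlapping_loss = batch_hard_triplet_loss(
    [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0]], ["a", "a", "b"])
prediction = knn_predict([4.5, 5.5], support, support_labels, k=3)
```

The loss is zero once classes are separated by more than the margin, so only batches with overlapping classes contribute gradient, which is the usual motivation for mining hard triplets online rather than sampling them at random.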

2605.17528 2026-05-19 cs.LG cs.AI cs.CL 版本更新

CausalSynth: Generating Structurally Sound Synthetic Data

CausalSynth: 生成结构上合理的合成数据

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Institute of Logic and Computation, TU Wien(维也纳技术大学逻辑与计算研究所)

AI总结 本文提出CausalSynth框架,通过解耦因果结构生成与语义实现,生成既符合因果机制又语义丰富的合成数据,解决了LLM在生成合成数据时无法保证因果正确性的问题。

Comments 15 pages

详情
AI中文摘要

大型语言模型(LLMs)能够生成逼真的合成数据,但无法保证其输出符合目标领域的因果机制。我们引入CausalSynth框架,该框架将因果结构生成与语义实现解耦,生成既因果有效又语言丰富的合成数据。该框架分为三个阶段。首先,结构因果模型(SCM),即定义在有向无环图(DAG)上的一组结构方程,通过祖先采样生成因果骨架,也就是满足该DAG全局马尔可夫性质的变量赋值。其次,LLM充当受约束的实现器(realizer),作为条件翻译器将每个骨架映射为高维观测,例如临床笔记或交易日志。第三,迭代一致性验证模块通过确定性提取检测结构违规,并将有针对性的修正反馈给LLM,形成闭环优化过程。我们指出了语义后门问题,即LLM系统性地倾向于用预训练先验覆盖所施加的因果事实,并证明相对于标准拒绝采样,我们的迭代机制能够减少由此产生的选择偏差。在三个因果基准(ASIA、ALARM和MIMIC-Struct)上,CausalSynth保持了条件独立性,假阳性率接近名义水平α=0.05,并在70B参数的LLM骨干上实现了超过96%的可实现率。该框架还通过噪声保留和图切除(graph mutilation)支持有原则的干预生成和反事实生成。

英文摘要

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem (the systematic tendency of LLMs to override imposed causal facts with pre-training priors) and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $α=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
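The first phase (ancestral sampling from an SCM) and the closing remark about noise retention and graph mutilation can be sketched in a few lines. The three-variable chain, its probabilities and the function name `sample_skeleton` are invented for illustration and are not taken from the paper; only the general technique (sample each variable after its parents, keep the exogenous noise, replace an equation with a constant to intervene) is what the abstract describes.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy SCM: variable -> (parents, structural equation f(parent_values, noise)).
SCM = {
    "smoking": ((), lambda pa, u: u < 0.3),
    "cancer": (("smoking",), lambda pa, u: u < (0.4 if pa["smoking"] else 0.05)),
    "xray": (("cancer",), lambda pa, u: u < (0.9 if pa["cancer"] else 0.1)),
}

def sample_skeleton(scm, rng, interventions=None, noise=None):
    """Ancestral sampling: visit variables in topological order, parents first.

    interventions: dict of do(variable=value); the variable's equation is
                   ignored (graph mutilation) and the value is forced.
    noise:         optional exogenous noise to reuse, which is what makes a
                   counterfactual query possible (noise retention).
    """
    interventions = interventions or {}
    noise = dict(noise) if noise is not None else {}
    order = TopologicalSorter({v: set(pa) for v, (pa, _) in scm.items()}).static_order()
    values = {}
    for v in order:
        parents, equation = scm[v]
        if v not in noise:
            noise[v] = rng.random()
        if v in interventions:
            values[v] = interventions[v]
        else:
            values[v] = equation({p: values[p] for p in parents}, noise[v])
    return values, noise

rng = random.Random(0)
fixed_noise = {"smoking": 0.5, "cancer": 0.2, "xray": 0.5}

factual, kept_noise = sample_skeleton(SCM, rng, noise=fixed_noise)
counterfactual, _ = sample_skeleton(
    SCM, rng, interventions={"smoking": True}, noise=kept_noise)
```

Each returned dictionary plays the role of a causal skeleton; in the framework described above it would then be handed to the LLM realizer, a step this sketch deliberately leaves out.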

2605.17508 2026-05-19 cs.LG cs.AI 版本更新

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

BESplit: 偏差补偿分割联邦学习与证据聚合

Yuhan Xie, Chen Lyu, Jingrong Huang

发表机构 * MoE Key Laboratory of Interdisciplinary Research of Computation and Economics(计算与经济交叉研究教育部重点实验室) Shanghai University of Finance and Economics(上海财经大学)

AI总结 本文提出BESplit框架,通过证据聚合和偏差补偿协作来解决非独立同分布数据下分割联邦学习的偏差优化和收敛不稳定问题,提升了模型的准确性和效率。

详情
AI中文摘要

分割联邦学习(SFL)通过将模型分割到客户端和服务器之间实现隐私保护的协同训练。然而,在非独立同分布数据分布下,SFL常面临偏差优化和收敛不稳定的问题,而现有解决方案大多借鉴传统联邦学习的技术。在本工作中,我们发现SFL的分割架构本质上改变了客户端信息的表示和协调方式,为超越参数级聚合的偏差补偿提供了机会。基于这一见解,我们提出了BESplit,一个架构感知的框架,利用SFL内在结构来缓解非IID效应。首先,为防止偏见本地数据主导全局更新,我们引入证据聚合(EA)以基于证据不确定性对客户端贡献进行细粒度重新加权。其次,为进一步减少分布偏斜,我们开发了偏差补偿协作(BCC)以通过配对互补客户端对齐分割层表示。最后,双教师蒸馏(DTD)被纳入以同步解耦客户端和服务器模型之间的知识,使本地推理能够独立进行。在五个基准数据集上的广泛实验表明,BESplit在多样化的非IID设置下,准确率、收敛稳定性以及计算效率均优于现有最先进方法。

英文摘要

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17500 2026-05-19 cs.LG cs.CV 版本更新

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

沉默的画笔:评估AI艺术生成中的艺术风格泄露

Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande

发表机构 * TCS Research(TCS研究)

AI总结 本文研究了AI艺术生成中由于模型学习并复现艺术风格而产生的无意风格复现问题,提出了一种评估方法Art Arena,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下风格特征的重现频率。

详情
AI中文摘要

生成式文本到图像模型通常是在大规模网络爬取数据集上训练的,这些数据集包含多样化的视觉内容,如受版权保护和风格独特的艺术品,引发了关于所有权、归属和受保护视觉表达的无意重用的担忧。一个关键问题是,模型可以从这些数据中学习风格模式,并在生成输出中复现这些模式,而无需在提示中显式引用。我们称这种现象为The Silent Brush,即使在未被请求的情况下,所学的风格也会再次出现。现有的评估方法主要集中在近似重复检索或成员推断,而没有考虑到这种跨提示的无意风格复现形式。为了解决这些差距,我们首先制定了评估The Silent Brush的指导原则。然后引入Art Arena评估协议,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下其风格特征在生成输出中重现的频率。我们对广泛使用的文本到图像扩散模型,包括Stable Diffusion v1.5、Stable Diffusion XL (SDXL)和SANA-1.5进行了评估,并设计使其能够跨文本到图像生成系统通用。我们的结果表明,The Silent Brush源于艺术作品之间表示强度和交互动态的差异,导致模型生成中的不对称混合。代码和评估资源可在:https://anonymous.4open.science/r/ArtArena-EBE4获取。

英文摘要

Generative text-to-image models are typically trained on large-scale web-scraped datasets that include diverse visual content such as copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expressions. A key issue is that models can learn stylistic patterns from this data and reproduce them in generated outputs without any explicit reference in the prompt. We refer to this phenomenon as The Silent Brush, where such learned styles reappear even when they are not requested. Existing evaluation methods mainly focus on near-duplicate retrieval or membership inference and do not account for this form of unintended stylistic resurfacing across prompts. To address these gaps, we first formulate guiding principles for evaluation of The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how frequently their stylistic traits reappear in generated outputs without explicit mention in prompts. We evaluate Art Arena on widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design it to generalize across text-to-image generative systems. Our results show that The Silent Brush arises from differences in representational strength and interaction dynamics between artworks, leading to asymmetric blending in model generations. Code and evaluation resources are available at: https://anonymous.4open.science/r/ArtArena-EBE4.

2605.17499 2026-05-19 cs.LG 版本更新

T-GEMs: Text-Guided Exit Modules for Decreasing CLIP Image Encoder

T-GEMs: 用于缩减CLIP图像编码器的文本引导退出模块

Alberto Presta, Grzegorz Stefanski, Michal Byra, Krzysztof Arendt

发表机构 * Samsung AI Center Warsaw(三星AI中心华沙) Institute of Fundamental Technological Research, PAS, Warsaw(基础技术研究所,波兰科学院,华沙)

AI总结 本文提出T-GEMs文本引导退出模块,通过利用编码器中间层的语义内容分布,减少CLIP图像编码器的计算成本,同时保持跨模态理解性能。

Comments Accepted at ICASSP 2026

详情
AI中文摘要

多模态深度神经网络通过整合多种数据模态来增强深度理解。不同模态的数据通常被投影到共享的潜在空间中以计算相似度,但由于图像编码器规模庞大,且预测时对所有测试数据施加同等的计算量,这一过程十分耗费资源。早退(early exit)方法利用中间层来降低计算负载,节省时间和内存。然而,为图像-文本对这类多模态数据开发此类方法具有挑战性。本研究考察了CLIP等编码器的中间层中存在的语义内容分布,这些分布可以由文本描述推导得到。我们引入了文本引导退出模块(T-GEMs)和一种基于速率的正则化器,在保持跨模态理解性能的同时控制编码器的使用成本。

英文摘要

Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.

2605.17497 2026-05-19 cs.LG 版本更新

Self-Supervised On-Policy Distillation for Reasoning Language Models

自监督在线策略蒸馏用于推理语言模型

Zhiquan Tan, Yinrong Hong

发表机构 * Tsinghua University(清华大学) Beihang University(北航大学)

AI总结 本文提出自监督在线策略蒸馏(SSOPD)方法,通过对比正确与错误的完成过程信号,提升推理语言模型的表现,实验表明在多个基准测试中优于GRPO和OPSD基线。

详情
AI中文摘要

GRPO式的RLVR利用每个提示下的多次在线策略(on-policy)尝试来训练推理模型,但通常只通过终端奖励来使用这些尝试。我们表明,混合组包含更丰富的过程信号:正确的完成是当前策略如何解决该问题的自生成见证,而错误的完成提供了策略需要纠正之处的在线策略前缀。我们引入自监督在线策略蒸馏(SSOPD),将以最短正确完成为条件的教师分布蒸馏到最长错误完成的各个前缀上。这把组内的正确-错误对比转化为密集的过程监督,而无需外部的解题轨迹。从停时(stopping-time)的视角看,最短正确/最长错误规则可以理解为一种有限组近似,即把持续失败的轨迹朝着快速成功的动作进行编辑;提示级的前沿权重则将辅助损失集中在正确分支与错误分支共存之处。在AIME 2024、AIME 2025和HMMT 2025上,SSOPD在全部九个模型-基准组合中均优于GRPO。在Qwen3-8B上,它的宏平均Avg@12达到65.6,比GRPO高1.6个点,比以参考解为条件的OPSD基线高0.8个点。代码将发布于https://github.com/tzq1999/SSOPD。

英文摘要

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
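The selection rule in the abstract (shortest correct completion as the teacher's condition, prefixes of the longest wrong completion as distillation sites) is simple enough to sketch directly. The function name, the token-list representation and the tie-breaking by first occurrence are assumptions; the teacher forward pass, the distillation loss and the frontier weight are not specified in the abstract and are left out.

```python
def select_distillation_pair(completions):
    """Pick the witness and the student prefixes from one GRPO-style group.

    completions: list of (tokens, is_correct) for a single prompt.
    Returns None for groups that are all correct or all wrong, since the
    correct-wrong contrast only exists in mixed groups.
    """
    correct = [tokens for tokens, ok in completions if ok]
    wrong = [tokens for tokens, ok in completions if not ok]
    if not correct or not wrong:
        return None
    witness = min(correct, key=len)   # shortest correct completion
    student = max(wrong, key=len)     # longest wrong completion
    # One prefix per position of the wrong completion: the teacher, conditioned
    # on the witness, would provide a next-token target at each of these.
    prefixes = [student[:i] for i in range(len(student))]
    return witness, prefixes

# Hypothetical group of four sampled completions for one prompt
group = [
    (["a", "b", "c"], True),
    (["a", "b"], True),
    (["x", "y", "z", "w"], False),
    (["x"], False),
]
witness, prefixes = select_distillation_pair(group)
all_correct = select_distillation_pair([(["a"], True), (["b", "c"], True)])
```

Restricting the auxiliary signal to mixed groups matches the abstract's remark that the loss is concentrated where correct and wrong branches coexist, although the actual weighting scheme is not reproduced here.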

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

超越线性叠加:利用KAN-SAE在AI天气模型中发现气候特征

Minjong Cheon

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文提出KAN-SAE,一种基于Kolmogorov-Arnold网络的稀疏自编码器,通过非线性激活函数揭示天气预测模型中的气候特征,相比线性基线提升了72%的活跃特征数量和降低了20%的特征冗余。

详情
AI中文摘要

深度学习天气预测模型具有出色的预测能力,但在很大程度上仍不透明:我们对其内部如何表示物理气候现象知之甚少。基于稀疏自编码器(SAEs)的机理可解释性为分解这些表示提供了一条有原则的途径,但现有SAEs假设严格线性的特征叠加,这一约束并不适合现代Transformer中编码的高度非线性大气动力学。我们引入KAN-SAE,一种稀疏自编码器,其编码器用源自Kolmogorov-Arnold网络(KANs)的可学习逐特征B样条激活函数取代标准ReLU,使每个潜在维度都能形成自己的非线性门控曲线。应用于Sonny时,KAN-SAE发现了975个活跃特征(线性基线为566个,提升72%),特征间冗余降低20%,重建保真度相当。在没有任何气候监督的情况下,KAN-SAE识别出一个空间上集中于西欧的可解释欧洲热浪特征,以及一个经因果操控(steering)实验确认的西太平洋台风追踪特征。我们的结果表明,非线性激活对于深度学习天气预测模型的机理可解释性至关重要,能够恢复线性基线无法发现的气候特征。

英文摘要

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17486 2026-05-19 cs.RO cs.LG 版本更新

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA: 通过动态分组残差优化实现跨任务的视觉-语言-动作模型扩展

Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu

发表机构 * School of Data Science, The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)数据科学学院) Shenzhen Loop Area Institute(深圳河套学院) Zhejiang University(浙江大学) Rutgers University-New Brunswick(罗格斯大学新布朗斯维克分校) Shanghai AI Laboratory(上海人工智能实验室) Jiangxing Intelligence Technology Inc.(江行智能科技有限公司)

AI总结 本文提出DyGRO-VLA,一种通过动态分组残差优化实现跨任务视觉-语言-动作模型扩展的两阶段优化框架,旨在提升模型的泛化能力。

详情
AI中文摘要

最近在强化学习(RL)方面的进展提供了一种系统的方法来优化视觉-语言-动作(VLA)模型,推动了从轨迹模仿到任务环境中的主动学习的转变。尽管在控制精度上有所改进,大多数RL优化器仍然任务特定,这使VLA模型从通用控制器退化为过度拟合狭窄任务集的策略。在本研究中,我们深入分析了这一现象,并强调了跨任务特征表示对提高VLA模型泛化能力的重要性。受这一发现的启发,我们引入了DyGRO-VLA,一种两阶段优化框架,1)基于信息论原理有效地捕捉跨任务潜在表示,2)通过混合的RL残差动态优化策略。DyGRO-VLA使RL优化器能够在优化过程中利用任务相关的潜在信息,同时战略性地减轻对学习表示的不利干扰。我们在LIBERO、RoboTwin2基准以及现实世界中评估了我们的方法,证明了在多任务训练和分布偏移下,与强基线相比,我们的方法具有持续的改进。

英文摘要

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

2605.17465 2026-05-19 cs.LG 版本更新

TriOpt: A Scalable Algorithm for Linear Causal Discovery

TriOpt: 一种适用于线性因果发现的可扩展算法

Rafat Ashraf Joy, Elena Zheleva

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出TriOpt算法,通过整合顺序方法和连续优化方法,解决了高维线性因果发现中的可扩展性问题,实现了显著的速度提升且保持了较高的准确性。

详情
AI中文摘要

从观测数据中学习因果关系具有挑战性,因为图搜索空间随着变量数量的增加而呈超指数增长。基于顺序的方法通过首先确定拓扑顺序来缩小此空间,而连续优化方法通过将DAG学习转化为带无环性约束的可微目标来探索空间中最可能的区域。尽管这两类范式在概念上具有吸引力,但它们在高维设置中仍面临显著的可扩展性限制,限制了其实际应用。在本文中,我们提出了一种新的线性因果发现形式化方法,紧密整合这两类范式,在不牺牲准确性的情况下显著提升可扩展性。我们的方法TriOpt将问题分解为两个高效的阶段。首先,它利用Sherman-Morrison秩1 downdate(降级更新)和线性核的加法结构来恢复拓扑顺序,从而实现快速且可扩展的顺序估计。其次,在给定此顺序的情况下,我们将结构学习重新表述为一个凸的连续优化问题,完全避免了强制执行代价高昂的无环性约束。我们从理论上证明,在真实顺序下,TriOpt可以精确恢复潜在的线性DAG。在合成、半合成和真实数据集上的实验表明,TriOpt在高维情况下相对于最先进的线性因果发现方法实现了数量级的加速,同时保持了相当或更优的准确性。

英文摘要

Learning causal relations from observational data is challenging because the graph search space grows super-exponentially with the number of variables. Ordering-based methods reduce this space by first identifying the topological ordering, whereas continuous optimization methods explore most likely regions of the space by casting DAG learning as a differentiable objective with an acyclicity constraint. Despite their conceptual appeal, both paradigms face significant scalability limitations in high-dimensional settings, restricting their practical applicability. In this work, we introduce a new formulation for linear causal discovery that tightly integrates these two paradigms to achieve substantial gains in scalability without sacrificing accuracy. Our approach, TriOpt, decomposes the problem into two efficient stages. First, it recovers the topological ordering by exploiting the Sherman-Morrison rank-1 downdate together with the additive structure of linear kernels, enabling fast and scalable ordering estimation. Second, given this ordering, we reformulate structure learning as a convex continuous optimization problem that entirely avoids the need for enforcing costly acyclicity constraints. We theoretically show that, under the true ordering, TriOpt exactly recovers the underlying linear DAG. Empirically, across synthetic, semi-synthetic, and real-world datasets, TriOpt achieves orders-of-magnitude speedups over state-of-the-art linear causal discovery methods in high-dimensional regimes, while maintaining comparable or superior accuracy.
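
The abstract names the Sherman-Morrison rank-1 downdate but does not say how TriOpt applies it. As a minimal sketch of the generic identity only, the pure-Python helper below turns a known inverse $A^{-1}$ into the inverse of $A - uv^\top$ without refactorizing. The function names and the 2x2 example are illustrative assumptions, not the authors' code.

```python
def mat_vec(matrix, vector):
    """Return matrix @ vector for a list-of-lists matrix."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def vec_mat(vector, matrix):
    """Return vector^T @ matrix as a flat list."""
    cols = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(cols)]


def sherman_morrison_downdate(a_inv, u, v):
    """Inverse of (A - u v^T), given a_inv = A^{-1}.

    Identity: (A - u v^T)^{-1} = A^{-1} + (A^{-1} u)(v^T A^{-1}) / (1 - v^T A^{-1} u).
    """
    a_inv_u = mat_vec(a_inv, u)            # A^{-1} u
    v_a_inv = vec_mat(v, a_inv)            # v^T A^{-1}
    denom = 1.0 - sum(vi * wi for vi, wi in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("A - u v^T is (numerically) singular")
    n = len(a_inv)
    return [[a_inv[i][j] + a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


# Illustrative example: A = [[4, 1], [1, 3]], so A^{-1} = (1/11) * [[3, -1], [-1, 4]].
A_INV = [[3 / 11, -1 / 11], [-1 / 11, 4 / 11]]
DOWNDATED_INV = sherman_morrison_downdate(A_INV, [1.0, 0.0], [1.0, 0.0])
```

Each downdate costs O(n^2) operations instead of the O(n^3) of a fresh inversion, which is the usual reason to reach for this identity when one variable at a time is removed from a system.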

2605.17458 2026-05-19 cs.LG 版本更新

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

ClaHF:一种受人类反馈启发的强化学习框架,用于改进分类任务

Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Jiayin Wang

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院)

AI总结 本文提出ClaHF,一种受人类反馈启发的强化学习框架,用于改进文本分类任务,它将偏好建模和强化学习优化整合进分类流程,无需额外人工标注,即可提升分类性能和置信度校准。

详情
AI中文摘要

文本分类模型通常通过监督微调(SFT)进行训练。然而,SFT本质上是从实例级标签进行行为克隆,因此无法充分捕捉样本之间的相对偏好关系,这限制了模型塑造决策边界和校准预测置信度的能力。在本文中,我们提出ClaHF,一种受人类反馈启发的强化学习(RL)框架,用于文本分类,该框架在分类流程中整合了偏好建模和RL优化,而无需额外的人工标注。与以往仅依赖实例级监督的工作不同,ClaHF同时构建多个候选预测及其相对排名关系,并在奖励模型(RM)中联合建模Top-1偏好以及非最优候选之间的顺序。这种设计将传统的标签监督转换为可以直接应用于策略优化的偏好信号。我们在八个分类任务上进行了系统评估,涵盖三种场景类别。结果表明,ClaHF在各种语言模型(LMs)上一致提升了分类性能和置信度校准。数据和代码可在https://anonymous.4open.science/r/ClaHF上获取。

英文摘要

Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model's ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at https://anonymous.4open.science/r/ClaHF.

2605.17437 2026-05-19 cs.SE cs.LG 版本更新

A semantic mutation metric for metamorphic relation adequacy in scientific computing programs

一种用于科学计算程序蜕变关系充分性的语义变异度量

Meng Li, Xiaohua Yang, Jie Liu, Shiyu Yan

发表机构 * School of Computing, University of South China(南华大学计算机学院) Hunan Engineering Research Center of Software Evaluation and Testing for Intellectual Equipment(湖南软件测评与智能设备工程研究中心) CNNC Key Laboratory on High Trusted Computing(中核集团高可信计算重点实验室)

AI总结 本文提出了一种基于领域语义算子的语义变异得分(SMS),旨在解决经典变异得分在科学计算中忽略领域语义的问题,通过引入五个领域语义算子来评估蜕变关系集的充分性。

Comments Submitted to Information and Software Technology (IST), Elsevier. Manuscript: 93 pages in elsarticle review mode (12pt double-spaced, ~28-35 pp typeset). Supplementary code and 12-PUT pool at https://github.com/meng004/P2-Semantic-Mutation

详情
AI中文摘要

背景。蜕变测试用于解决科学计算中的测试预言(test oracle)问题,但经典变异得分(MS)基于语法层面的AST变异,忽略了领域语义。目标。我们提出语义变异得分(SMS),它建立在五个领域语义算子之上(守恒侵蚀、算子替换、超参数、轨迹翻转、结构注入)。SMS在一个已刻画的极限下几乎处处退化为MS,因此任何基于SMS的结论在经典情形下都与已有的变异测试文献保持一致。方法。在四类单输出浮点到浮点程序(数值、概率、代理、机器学习)上采用12-PUT x 5-MP设计,并配以一个三层归因分类器,将真正的语义故障与容差、OOD、统计和人为产物类别区分开。在相同提示下进行的同源/跨源消融实验分离出LLM来源多样性的贡献。LLM生成的变异体在AST归一化层面与默认配置的cosmic-ray语法变异池进行比较。结果。在点估计标准下,预注册的Cliff's delta大效应阈值未被满足;观察到的效应位于中等效应范围内。在相同提示下,跨源池化没有明显改变delta,表明在此设计中LLM身份不是关键因素。LLM生成的变异体与默认cosmic-ray语法变异体在AST层面的重叠很小;在默认的一阶语法配置下,超参数、结构注入和轨迹翻转类别不可达。结论。SMS是面向科学计算中领域语义蜕变关系集的向后兼容充分性度量。一阶不可达性证据与效应大小问题无关。

英文摘要

Context. Metamorphic Testing addresses the test-oracle problem in scientific computing, but classical Mutation Score operates on syntactic AST mutations and misses domain semantics. Objective. We propose the Semantic Mutation Score (SMS), built on five domain-semantic operators (Conservation Erosion, Operator Substitution, Hyperparameter, Trajectory Flip, Structural Injection). SMS degenerates almost everywhere to MS in a characterised limit, so any SMS-based conclusion remains consistent with prior mutation-testing literature in the classical regime. Method. A 12-PUT x 5-MP design over four single-output float-to-float classes (numeric, probabilistic, surrogate, machine-learning) is paired with a three-layer attribution classifier separating true semantic faults from tolerance, OOD, statistical, and artefact categories. A same-source / cross-source ablation under an identical prompt isolates the LLM-source-diversity contribution. LLM-generated mutants are compared against a default-configuration cosmic-ray syntactic pool at the AST-normalised level. Results. The pre-registered large-effect threshold for Cliff's delta is not met under the point-estimate criterion; the observed effect lies in the medium-effect range. Cross-source pooling under an identical prompt does not appreciably shift delta, indicating that LLM identity is not the lever within this design. AST-level overlap between LLM-generated and default cosmic-ray syntactic mutants is small; the Hyperparameter, Structural Injection, and Trajectory Flip classes are unreachable under default first-order syntactic configurations. Conclusion. SMS is a backward-compatible adequacy metric for domain-semantic metamorphic-relation sets in scientific computing. The first-order unreachability evidence is independent of the effect-size question.
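
To make the notion of metamorphic-relation adequacy concrete, the sketch below computes the classical mutation score (the MS that SMS is said to degenerate to), not SMS itself. The program under test (`mean`), the two hand-written mutants, and the three metamorphic relations are all hypothetical toy examples chosen for illustration; none of them comes from the paper's 12-PUT pool or its five semantic operators.

```python
def mean(xs):
    """Toy program under test."""
    return sum(xs) / len(xs)


def mutant_wrong_denominator(xs):
    """Hypothetical mutant: divides by n + 1 instead of n."""
    return sum(xs) / (len(xs) + 1)


def mutant_drops_last(xs):
    """Hypothetical mutant: silently ignores the last element."""
    return sum(xs[:-1]) / len(xs)


def close(a, b, tol=1e-9):
    return abs(a - b) <= tol


# Each metamorphic relation returns True when the relation HOLDS for f on xs.
def mr_permutation(f, xs):
    return close(f(list(reversed(xs))), f(xs))


def mr_scaling(f, xs):
    return close(f([2 * x for x in xs]), 2 * f(xs))


def mr_shift(f, xs):
    return close(f([x + 10 for x in xs]), f(xs) + 10)


def mutation_score(mutants, relations, xs):
    """Fraction of mutants killed, i.e. violating at least one relation."""
    killed = sum(1 for m in mutants if any(not mr(m, xs) for mr in relations))
    return killed / len(mutants)


MUTANTS = [mutant_wrong_denominator, mutant_drops_last]
SOURCE_INPUT = [1.0, 2.0, 4.0]
weak_score = mutation_score(MUTANTS, [mr_permutation, mr_scaling], SOURCE_INPUT)
full_score = mutation_score(MUTANTS, [mr_permutation, mr_scaling, mr_shift], SOURCE_INPUT)
```

In this toy setup the wrong-denominator mutant is still linear and order-independent, so it survives the permutation and scaling relations and is only killed once the shift relation is added. That is the sense in which a mutation score measures how adequate a relation set is.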

2605.17432 2026-05-19 cs.LG cs.CR 版本更新

DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

DP-SelFT: 大语言模型的差分隐私选择性微调

Haichao Sha, Zihao Wang, Yuncheng Wu, Hong Chen, Wei Dong

发表机构 * Renmin University of China(中国人民大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出DP-SelFT框架,通过选择性微调方法在保持差分隐私的同时提升大语言模型的隐私-效用权衡。

详情
AI中文摘要

大型语言模型(LLMs)通常通过微调适应下游任务,但微调数据往往包含敏感信息,可能被所得模型泄露。差分隐私(DP)为此类泄露提供形式化保护,但LLM的DP微调仍因梯度裁剪和噪声注入而导致效用显著下降。现有工作通过将DP与LoRA等参数高效微调方法结合来改进这一权衡,这类方法约束的是更新的形式。在本文中,我们研究一个互补方向:选择性微调,它约束的是更新施加的位置。我们提出DP-SelFT,一个用于LLM差分隐私选择性微调的框架。DP-SelFT解决参数选择中三个DP特有的挑战:避免重复的隐私成本、在噪声估计下提高稳定性、以及选择在裁剪和加噪更新下仍然有用的参数。它首先构建一个轻量级的DP合成数据集,并仅在该合成数据上进行选择,因此选择阶段不产生额外的隐私成本。然后,它通过在合成训练集划分上临时训练候选层子集并在合成验证集划分上评估它们,来进行层级选择。关键在于,这种临时训练是在与下游DP微调相匹配的扰动机制下进行的,其最坏情况扰动与DP噪声规模相同。这有利于选出不仅可学习、而且对带噪声的隐私更新具有鲁棒性的层子集。在基准任务上的实验表明,在相同隐私保障下,DP-SelFT相对于现有DP微调基线一致地改进了隐私-效用权衡。

英文摘要

Large language models (LLMs) are commonly adapted to downstream tasks through fine-tuning, but fine-tuning data often contains sensitive information that may be leaked by the resulting model. Differential privacy (DP) offers formal protection against such leakage, yet DP fine-tuning of LLMs still suffers from substantial utility degradation due to gradient clipping and noise injection. Existing work improves this trade-off by combining DP with parameter-efficient fine-tuning methods such as LoRA, which constrain the form of updates. In this work, we study a complementary direction: selective fine-tuning, which constrains where updates are applied. We propose DP-SelFT, a framework for differentially private selective fine-tuning of LLMs. DP-SelFT addresses three DP-specific challenges in parameter selection: avoiding repeated privacy cost, improving stability under noisy estimates, and selecting parameters that remain useful under clipped and noisy updates. It first constructs a lightweight DP synthetic dataset and performs selection only on this synthetic data, so the selection stage incurs no additional privacy cost. It then conducts layer-level selection by temporarily training candidate layer subsets on a synthetic training split and evaluating them on a synthetic validation split. Crucially, this temporary training is performed under a perturbation regime matched to downstream DP fine-tuning, with worst-case perturbations of the same scale as DP noise. This favors layer subsets that are not only learnable but also robust to noisy private updates. Experiments on benchmark tasks show that DP-SelFT consistently improves the privacy-utility trade-off over existing DP fine-tuning baselines under the same privacy guarantees.
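
The utility loss mentioned above comes from the two steps of a standard DP-SGD style update: per-example gradient clipping and Gaussian noise injection. The following is a minimal stdlib sketch of that generic mechanism, not of DP-SelFT's layer selection. The names `clip_gradient` and `dp_average_gradient`, and the choice of noise standard deviation `noise_multiplier * max_norm`, follow the common DP-SGD recipe and are assumptions about the setting rather than details taken from the paper.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


# Illustrative call with a seeded generator so the run is repeatable.
rng = random.Random(0)
example_grads = [[3.0, 4.0], [0.3, 0.4]]
private_grad = dp_average_gradient(example_grads, max_norm=1.0,
                                   noise_multiplier=1.0, rng=rng)
```

Because the noise scale is fixed by the clipping bound rather than by the number of trainable parameters, updating fewer, well-chosen parameters is one plausible way to improve the signal-to-noise ratio, which is consistent with the motivation for selective fine-tuning.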

2605.17431 2026-05-19 cs.LG cs.AI 版本更新

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

MATE:利用累积转移嵌入记忆解决上下文马尔可夫决策过程

Himchan Hwang, Hyeokju Jeong, Gene Chung, Seungyeon Kim, Sangwoong Yoon, Frank Chongwoo Park

发表机构 * Seoul National University(首尔大学) Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院(UNIST))

AI总结 MATE通过使用累积转移嵌入的记忆架构,解决了由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs),在保持后验信念的同时,避免了传统方法的计算和梯度问题,实现了高效且性能优异的解决方案。

详情
AI中文摘要

我们提出了MATE,一种简单而有效的记忆架构,用于解决由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs)。在CMDPs中,最优智能体可以通过维持上下文的后验信念来在线适应。MATE用求和聚合的记忆替代了不可行的后验,利用后验的排列不变性来保留可证明的充分表达性。与先前的记忆架构相比,MATE避免了Transformer的逐步展开成本增长和与循环神经网络(RNNs)通常相关的梯度问题。在多样化的基准测试中,MATE展示了清晰的计算优势,同时实现了与标准序列模型基线相当的性能。

英文摘要

We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior's permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.
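
A sum-aggregated memory can be sketched in a few lines. The example below keeps a running sum of per-transition embeddings, so each step costs O(1) and the memory does not depend on the order in which transitions arrived. The `embed` function here is a hypothetical fixed feature map standing in for whatever learned embedding network the method uses; the scalar states and the class name `SumMemory` are likewise illustrative assumptions.

```python
def embed(state, action, reward, next_state):
    """Hypothetical fixed feature map; a learned network would go here."""
    return [state * action, reward, next_state - state, 1.0]


class SumMemory:
    """Order-invariant memory: the running sum of transition embeddings."""

    def __init__(self, dim=4):
        self.total = [0.0] * dim

    def update(self, transition):
        """Add one (state, action, reward, next_state) transition in O(dim)."""
        phi = embed(*transition)
        self.total = [m + p for m, p in zip(self.total, phi)]
        return self.total


# Illustrative rollout with scalar states and actions.
memory = SumMemory()
for step in [(0.5, 1.0, 0.0, 1.0), (1.0, -1.0, 1.0, 0.5), (0.5, 2.0, -1.0, 2.0)]:
    summary = memory.update(step)
```

The last component of the embedding is a constant 1.0, so the memory also tracks how many transitions have been seen. A policy conditioned on `summary` would receive the same input for any permutation of the same transitions.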

2605.17429 2026-05-19 cs.LG cs.CV 版本更新

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

径向-角向几何用于噪声标签学习中的可靠更新诊断

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Weiguang Qu, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学) Nanjing University of Chinese Medicine(南京中医药大学)

AI总结 本文提出了一种基于径向-角向几何的方法,用于在噪声标签学习中可靠地诊断更新,通过比较观测标签梯度与EMA教师诱导的参考梯度,区分方向一致的困难干净样本更新与由损坏标签引起的冲突更新。

详情
AI中文摘要

噪声标签方法通常从损失、置信度或熵等前向空间信号来估计样本可靠性。这些信号表明样本是否难以预测,但并不直接检验其观测标签是否会带来可靠的参数更新。这一差距很重要,因为困难的干净样本和错误标记的样本可能具有相似的损失,却会诱导出不同的更新。我们将可靠性估计重新表述为对观测标签更新的诊断。样本级经验Fisher迹提供了一个反向空间的更新能量度量:对于分类器层,它分解为一个预测残差项和一个特征敏感性项,因此捕获了标量损失之外的信息。然而,迹仍是一个径向幅度信号,无法判断一次大的更新是有益还是有害。因此,我们提出了相对几何冲突(RGC),它将观测标签梯度与由EMA教师诱导的参考梯度进行比较。冲突项有助于区分幅度大但方向一致的困难干净样本更新与由损坏标签引起的大幅冲突更新。在合成和真实世界的噪声标签基准上,RGC在我们的评估协议下提高了困难干净样本的保留率和准确率。

英文摘要

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
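
The radial and angular signals can be illustrated on a linear softmax classifier, where the cross-entropy gradient with respect to the weight matrix is the outer product $(p - e_y)h^\top$. Its squared Frobenius norm is $\lVert p - e_y\rVert^2 \lVert h\rVert^2$, a residual term times a feature term. The sketch below computes that trace and a cosine between the observed-label gradient and a reference gradient. Using the teacher's label as the reference target and a plain cosine as the conflict measure are assumptions made for illustration; the abstract does not give the exact RGC formula.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual(probs, label):
    """Cross-entropy gradient w.r.t. the logits: p - onehot(label)."""
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def fisher_trace(probs, label, features):
    """Squared norm of the classifier-layer gradient (p - e_y) h^T."""
    res = residual(probs, label)
    return sum(r * r for r in res) * sum(h * h for h in features)


def gradient_cosine(probs, observed_label, reference_label):
    """Cosine between observed-label and reference gradients.

    Both gradients share the same feature vector h, so the cosine of the
    outer products equals the cosine of the two residual vectors.
    """
    r_obs = residual(probs, observed_label)
    r_ref = residual(probs, reference_label)
    dot = sum(a * b for a, b in zip(r_obs, r_ref))
    norm = math.sqrt(sum(a * a for a in r_obs)) * math.sqrt(sum(b * b for b in r_ref))
    return dot / norm


# Illustrative sample: the model favors class 0, the observed label is class 1.
probs = softmax([2.0, 0.5, -1.0])
features = [0.5, -1.0, 2.0]
energy = fisher_trace(probs, 1, features)
alignment = gradient_cosine(probs, observed_label=1, reference_label=0)
```

A large `energy` with `alignment` near 1 would correspond to a hard but consistent update, while a large `energy` with negative `alignment` points at an update that pulls against the reference direction.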

2605.17428 2026-05-19 cs.LG cs.AI 版本更新

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

渐进泛化增强:结合深度耦合RND-PPO和领域优先噪声注入的稳健作物管理强化学习

Wu Yang

发表机构 * Chongho Bridge Group Limited(中和农信)

AI总结 本文提出了一种渐进泛化增强方法,通过深度耦合RND-PPO和领域优先噪声注入,解决农业强化学习中早期学习效率与后期泛化能力的平衡、内在和外在奖励的简单加法结合以及统一噪声注入策略的问题,从而提高作物管理的鲁棒性。

详情
AI中文摘要

我们在gym-DSSAT玉米灌溉任务上的初步实验表明,±2摄氏度的温度噪声会使在无噪声条件下训练的PPO策略的经济收益减少11.9%,这是现有研究未充分解决的系统性鲁棒性缺陷。本文针对阻碍农业RL系统实际部署的三个相互关联的限制:早期学习效率与后期泛化能力之间的权衡;探索增强PPO中内在奖励和外在奖励的简单相加;以及忽视农业状态变量间已被经验证实的敏感性差异的统一测量噪声注入策略。我们引入了三项系统性创新:渐进泛化增强(PGA),实现一个三阶段课程(0-800回合无噪声训练,800-1200回合渐进增强,1200-2000回合完整增强);深度耦合的RND-PPO架构,具有双通道GAE归一化、随训练进度衰减的内在系数和语义离散化;以及具有层次激活的领域优先噪声注入。实验评估显示:在佛罗里达,相比最先进的BERT-DQN,产量提高了8.43%,氮肥利用效率提高了16.42%;在萨拉戈萨,产量提高了5.61%(但由于地中海气候条件严苛,经济评分降低了3.67%);在综合扰动下,性能保留率为94.4% vs 80.0%。所有实验均使用5个随机种子,在NVIDIA A100 GPU上进行,每次运行约4.2±0.3小时(2000回合,2048步缓冲区,mini-batch大小为64)。

英文摘要

Our preliminary experiments on gym-DSSAT maize irrigation tasks revealed that ±2 °C temperature noise causes an 11.9% reduction in economic returns for PPO policies trained under clean conditions, a systematic robustness deficit that existing research has not adequately addressed. This paper tackles three interconnected limitations impeding practical deployment of agricultural RL systems: the trade-off between early-stage learning efficiency and late-stage generalization capability; the naive additive combination of intrinsic and extrinsic rewards in exploration-augmented PPO; and uniform measurement noise injection strategies that disregard empirically validated differential sensitivity across agricultural state variables. We introduce three systematic innovations: Progressive Generalization Augmentation (PGA) implementing a three-phase curriculum (clean training 0-800 episodes, progressive 800-1200, full augmentation 1200-2000); a deeply coupled RND-PPO architecture with dual-channel GAE normalization, progress-decayed intrinsic coefficients, and semantic discretization; and domain-prioritized noise injection with hierarchical activation. Our experimental evaluation demonstrates: 8.43% yield improvement and 16.42% nitrogen use efficiency improvement over SOTA BERT-DQN in Florida; 5.61% yield improvement in Zaragoza (though 3.67% lower economic score due to challenging Mediterranean climate); and 94.4% vs 80.0% performance retention under combined perturbations. All experiments used 5 random seeds on NVIDIA A100 GPUs with 4.2 ± 0.3 hours per run (2000 episodes, 2048-step buffer, 64 mini-batch size).
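
The three-phase curriculum can be written as a schedule that maps the episode index to an augmentation strength. The phase boundaries (800 and 1200 episodes) come from the abstract; the linear ramp in the middle phase and the idea of scaling a ±2 °C noise range by this strength are assumptions made for this sketch, since the abstract only calls the phase "progressive".

```python
import random


def augmentation_strength(episode, clean_end=800, ramp_end=1200):
    """Return a value in [0, 1]: 0 = clean training, 1 = full augmentation."""
    if episode < clean_end:
        return 0.0
    if episode >= ramp_end:
        return 1.0
    # Assumed linear ramp across the progressive phase.
    return (episode - clean_end) / (ramp_end - clean_end)


def noisy_temperature(true_temp_c, episode, rng, max_noise_c=2.0):
    """Perturb a temperature reading by up to +/- max_noise_c, scaled by phase."""
    scale = augmentation_strength(episode) * max_noise_c
    return true_temp_c + rng.uniform(-scale, scale)


rng = random.Random(0)
observed = [noisy_temperature(25.0, ep, rng) for ep in (100, 1000, 1500)]
```

During the clean phase the observation is returned unchanged, so early learning is not slowed down by noise, and the perturbation range widens only after episode 800.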

2605.17426 2026-05-19 cs.MA cs.LG 版本更新

Human-Flow Digital Twin for Predicting the Effects of Mobility Introduction on Visitor Circulation

人类流动数字孪生用于预测移动引入对访客流动的影响

Chiharu Shima, Haruki Yonekura, Fukuharu Tanaka, Tatsuya Amano, Hirozumi Yamaguchi

发表机构 * bitA Inc.(bitA公司) The University of Osaka(大阪大学) RIKEN Center for Computational Science(理化学研究所计算科学中心)

AI总结 本文提出了一种利用人类流动数字孪生预测移动引入措施对访客流动影响的框架,通过多智能体模拟器模拟访客根据当前位置和景点吸引力选择目的地的过程,并利用训练好的决策模型来量化移动引入措施对访客数量和流动的影响。

Comments An accepted paper at the 27th IEEE International Conference on Mobile Data Management (MDM 2026). Project page: https://mc.net.ist.osaka-u.ac.jp/en/activity/wakayama-castle-mobility_2023/

详情
AI中文摘要

我们提出了一种框架,利用人流数字孪生预测移动引入措施的效果。该数字孪生包含一个多智能体模拟器,能够表示访客如何根据当前位置和景点吸引力等因素选择目的地。我们从干预前实测的人流数据、景点间距离、景点吸引力和出行量中提取访客选择目的地的数据,并用这些数据训练模拟器中每个智能体的决策模型。训练好的决策模型是一个函数,输入访客的当前状态和周围环境信息,输出访客下一步将前往的景点。通过将移动引入措施表示为景点间距离或景点吸引力的变化,该框架可以在多智能体模拟器中重现引入移动措施后的人流,从而量化访客数量和流动变化等影响。我们使用在日本和歌山城公园内引入和未引入移动措施时测得的人流数据评估了所提出的方法。使用多层感知机决策模型重现引入移动措施后的人流时,空间人口分布的余弦相似度超过0.7,证实了该方法能够复现移动引入所引起的流动变化。

英文摘要

We propose a framework for predicting the effects of mobility introduction measures using a human-flow digital twin. This digital twin incorporates a multi-agent simulator that can represent how visitors choose destinations depending on factors such as their current location and the attractiveness of spots. We extract data on how visitors selected destinations with respect to measured pre-intervention human-flow data, inter-spot distances, spot attractiveness, and travel volumes, and use these data to train each agent's decision model of this simulator. The trained decision model is a function that takes a visitor's current state and surrounding environmental information as input and outputs which spot the visitor will move toward next. By expressing mobility introduction measures as changes to inter-point distances or to spot attractiveness, the framework can reproduce human flows with mobility introduction in the multi-agent simulator and thereby quantify effects such as changes in visitor counts and circulation. We evaluated the proposed method using human-flow data measured with and without introducing mobility within Wakayama Castle Park in Japan. When reproducing flows with mobility introduction using a multi-layer perceptron decision model, the cosine similarity of the spatial population distribution exceeded 0.7, confirming that the approach can replicate the flow changes caused by the mobility introduction.
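
The evaluation metric is the cosine similarity between the simulated and the measured spatial population distributions. A minimal sketch follows, assuming each distribution is a vector of visitor counts per spot; the spot counts in the example are made up for illustration and are not data from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical visitor counts at four spots: measured vs. simulated.
measured = [120, 80, 40, 10]
simulated = [100, 90, 50, 20]
similarity = cosine_similarity(measured, simulated)
```

Because the cosine ignores overall scale, it compares how visitors are spread across spots rather than the total visitor count, so a separate check on totals is still useful.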

2605.17419 2026-05-19 cs.LG cs.AI 版本更新

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

学习位移鲁棒的表示以在降雨预报不确定性下进行滑坡预警

Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

发表机构 * Osaka University(大阪大学) RIKEN Center for Computational Science(理化学研究所计算科学中心) Tanta University(坦塔大学)

AI总结 本文提出了一种鲁棒于降雨场位移的滑坡预警系统,通过学习降雨和地形数据的潜在表示,以提高在降雨预报不确定性下的滑坡预测精度。

详情
AI中文摘要

由降雨引发的滑坡已成为全球范围内日益增长的风险,因为气候变化加剧了极端降雨事件。为了提供足够的撤离时间,实时灾害监测的滑坡预警系统(LEWS)必须通过整合观测降雨与短期降雨预报来估计近未来滑坡风险,这些预报来自时空环境数据流。尽管最近的滑坡预测方法通过统计和深度学习方法提高了预测性能,但大多数方法假设降雨输入是准确的。然而,在实际应用中,滑坡预测依赖于降雨预报,这些预报通常包含由于预测不确定性导致的降雨场空间位移。这种位移会改变局部累积降雨并降低预测准确性。为了解决这一挑战,我们提出了一种新的LEWS,其对降雨场位移具有鲁棒性。关键思想是学习降雨和地形数据的潜在表示,这些表示在降雨场运动中的位移下保持稳定,从而实现可靠的地理空间数据整合以估计滑坡风险。滑坡预测模型通过使用降雨-运动-感知对比学习(RMCL)进行训练,该方法引入了时间相关的降雨场扰动以模拟预报引起的降雨驱动时空环境数据流中的位移。实验使用了日本两年的降雨和地形数据,覆盖了19个地区中的滑坡事件。所提出的系统在精度上比最先进的基线高出高达37%。这些结果表明,将降雨建模为移动的空间场并在学习过程中处理降雨场位移显著提高了操作预警系统中短期滑坡预测的可靠性。

英文摘要

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.

2605.17403 2026-05-19 cs.LG 版本更新

Self-Supervised Learning for Sparse Matrix Reordering

稀疏矩阵重新排序的自监督学习

Ziwei Li, Tao Yuan, Fangfang Liu, Shuzi Niu, Huiyuan Li, Wenjia Wu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出了一种自监督学习方法,通过多重网格图网络捕捉结构信息,基于路径三元组不等式推导三元组采样策略,并引入end-max链损失函数以减少预测分数满足这些不等式的三元组数量,从而在稀疏矩阵重新排序中减少填充元并加速LU分解。

Comments Accepted by DASFAA 2026

详情
AI中文摘要

使用适当的排序重新排列稀疏矩阵的行或列可以显著减少填充元(fill-in),即矩阵分解过程中引入的新非零元素,从而减少内存使用和运行时间。然而,找到使填充元最少的排序是NP完全问题。现有方法,包括图论方法和深度学习方法,依赖于缺乏理论保证的替代目标。填充路径定理(Fill-Path Theorem)以路径三元组不等式的形式,揭示了填充元的产生与矩阵稀疏结构之间直接而内在的关系。本文首先使用多重网格图网络来捕获每个顶点的结构信息。然后基于这些不等式推导出三元组采样策略。最后,我们引入end-max链损失函数,以减少预测分数满足这些不等式的三元组数量。在公开的SuiteSparse矩阵集合上的实验评估表明,所提出的方法在填充元减少和LU分解时间加速方面均优于现有方法。

英文摘要

Rearranging the rows or columns of a sparse matrix using an appropriate ordering can significantly reduce fill-ins, i.e., new nonzeros introduced during matrix factorization, decreasing memory usage and runtime. However, finding an ordering that minimizes fill-ins is NP-complete. Existing approaches, including graph-theoretic and deep learning methods, rely on surrogate objectives without theoretical guarantees. The Fill-Path Theorem reveals a direct and intrinsic relationship between fill-in generation and the sparse structure of the matrix as path triplet inequalities. Here we first employ a multigrid graph network to capture structural information for each vertex. We then derive a triplet sampling strategy based on inequalities. Finally, we introduce an end-max chain loss function to reduce the number of triplets whose predicted scores satisfy these inequalities. Experimental evaluations on the publicly available SuiteSparse matrix collection demonstrate the superiority of the proposed method in terms of both fill-in reduction and speedup in LU factorization time.
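
How strongly the ordering affects fill-in can be seen with the classical elimination game on the matrix's adjacency graph: eliminating a vertex connects all of its remaining neighbors, and every newly created edge is one fill-in. The sketch below assumes a symmetric sparsity pattern (the Cholesky-style setting), which is a simplification relative to the LU factorization evaluated in the paper; the function name and the star-graph example are illustrative.

```python
from itertools import combinations


def count_fill_ins(num_vertices, edges, order):
    """Count fill-in edges created by eliminating vertices in the given order."""
    adjacency = {v: set() for v in range(num_vertices)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    fill = 0
    for v in order:
        neighbors = adjacency.pop(v)          # remaining neighbors of v
        for u in neighbors:
            adjacency[u].discard(v)
        # Eliminating v makes its remaining neighbors pairwise adjacent.
        for a, b in combinations(sorted(neighbors), 2):
            if b not in adjacency[a]:
                adjacency[a].add(b)
                adjacency[b].add(a)
                fill += 1
    return fill


# Star graph ("arrowhead" matrix): vertex 0 is connected to vertices 1..4.
STAR_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]
center_first = count_fill_ins(5, STAR_EDGES, [0, 1, 2, 3, 4])
center_last = count_fill_ins(5, STAR_EDGES, [1, 2, 3, 4, 0])
```

Eliminating the hub first turns the four leaves into a clique, while eliminating it last creates no new nonzeros at all. The same matrix can therefore factor densely or sparsely depending only on the permutation.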

2605.17398 2026-05-19 cs.CL cs.LG 版本更新

MiniGPT: Rebuilding GPT from First Principles

MiniGPT:从第一原理重新构建GPT

Jibin Joseph

发表机构 * Department of Computer Science(计算机科学系) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出MiniGPT,一个基于PyTorch从头实现的GPT风格自回归语言模型,旨在在研究nanoGPT设计后,从第一原理重新构建GPT核心流程,同时保持模型和训练代码独立编写。MiniGPT实现了词嵌入、位置嵌入、因果多头自注意力、预层归一化Transformer块、残差连接、前馈MLP层、下一词交叉熵训练(教师强制)、验证跟踪、检查点选择和自回归文本生成。

Comments 13 pages, 2 figures

详情
AI中文摘要

本文提出了MiniGPT,一个在PyTorch中从零实现的紧凑GPT风格自回归语言模型。其目的是在研究了Andrej Karpathy的nanoGPT设计之后,从第一性原理重新构建GPT核心流程,同时在单个notebook中独立编写模型和训练代码。MiniGPT实现了词元嵌入和位置嵌入、因果多头自注意力、pre-LayerNorm的Transformer块、残差连接、前馈MLP层、下一词元交叉熵训练(教师强制)、验证跟踪、检查点选择以及自回归文本生成。本文在Tiny Shakespeare数据集上使用字符级分词评估了该实现。一个0.83M参数的基线模型在3000次训练迭代后达到1.7236的验证损失。一个更强的10.77M参数配置使用更大的上下文长度和改进的训练设置,达到1.4780的最佳验证损失,并生成具有可识别的莎士比亚风格对话结构的文本。MiniGPT并未引入新的语言模型架构。相反,它记录了一条从原始文本到训练好的字符级生成的清晰且可复现的实现路径,包括设计选择、训练行为、生成质量以及实际限制。

英文摘要

This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on the Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.
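
The causal constraint in self-attention means position t may only attend to positions 0..t. The sketch below shows a single attention head in plain Python lists rather than PyTorch, so it runs without dependencies; it is an illustration of the masking idea and not code from the MiniGPT notebook. The function name and the tiny inputs are assumptions.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors; returns (outputs, weights), where
    weights[t][s] is zero whenever s > t.
    """
    seq_len, dim = len(queries), len(queries[0])
    outputs, weights = [], []
    for t in range(seq_len):
        # Scores only over the allowed prefix 0..t (the causal mask).
        scores = [sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (seq_len - t - 1)
        weights.append(row)
        outputs.append([sum(row[s] * values[s][j] for s in range(seq_len))
                        for j in range(len(values[0]))])
    return outputs, weights


# Three positions with two-dimensional vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = causal_attention(x, x, x)
```

The first position can only see itself, so its output equals its own value vector. This is the property that lets next-token training with teacher forcing evaluate every position of a sequence in one pass without leaking future tokens.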

2605.17393 2026-05-19 cs.AI cs.LG cs.MA 版本更新

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

异质信息瓶颈协调图用于多智能体强化学习

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)(澳大利亚人工智能研究所)

AI总结 本文提出异质信息瓶颈协调图(HIBCG),通过理论指导机制解决多智能体强化学习中协调图的边存在性和信息传递容量分配问题,通过信息瓶颈方法构建组对齐的块对角先验,实现边存在性和信息容量的理论验证。

详情
AI中文摘要

协调图是合作式多智能体强化学习(MARL)中的核心抽象,然而现有的稀疏图学习方法缺乏有理论依据的机制来决定哪些边应当存在以及每条边应当承载多少信息。当前方法依赖启发式标准,既无法为学到的拓扑结构提供形式化保证,也没有原则性的方法为结构不同的智能体关系分配不同的通信容量。为了解决这个问题,我们提出了异质信息瓶颈协调图(HIBCG),它学习一个组感知的稀疏图,其中边的存在性和消息容量都有理论依据。以图信息瓶颈(GIB)为底层工具,HIBCG首先构建一个组对齐的块对角先验,为边的保留提供闭式判据(确定哪些边应当存在以及每个组块的密度),然后在所得拓扑上控制每个智能体的特征带宽,压缩消息以仅保留与任务相关的内容。我们证明了组对齐的先验严格收紧了拓扑学习的变分界,目标函数可按组块分解,从而实现差异化的边控制,并且容量分配遵循注水(water-filling)原则。

英文摘要

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention (determining which edges should exist and at what density per group block) and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
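
For readers unfamiliar with the water-filling principle, the sketch below implements its classical form from communication theory: a fixed budget is poured over channels with different noise floors, each channel receiving max(0, level - noise), with the common water level chosen so that the allocations sum to the budget. This is the textbook allocation rule only. How HIBCG maps agents, edges, or features onto "channels" is not stated in the abstract, so the names and the example values here are assumptions.

```python
def water_filling(noise_levels, total_budget):
    """Allocate total_budget across channels by classical water-filling."""
    sorted_noise = sorted(noise_levels)
    water_level = 0.0
    # Try using the k quietest channels, from all of them down to one.
    for k in range(len(sorted_noise), 0, -1):
        water_level = (total_budget + sum(sorted_noise[:k])) / k
        if water_level > sorted_noise[k - 1]:
            break  # all k selected channels lie below the water level
    return [max(0.0, water_level - noise) for noise in noise_levels]


# Three channels with noise floors 1, 2, and 5, sharing a budget of 3.
allocation = water_filling([1.0, 2.0, 5.0], 3.0)
```

Channels whose noise floor sits above the water level receive nothing, so the rule produces sparsity on its own: capacity is concentrated where it is most useful instead of being spread uniformly.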

2605.17390 2026-05-19 cs.SE cs.LG cs.LO 版本更新

NOETHER: A Constructive Framework for Metamorphic Pattern Discovery from Operator Algebras

NOETHER:从算子代数中构造性地发现蜕变模式的框架

Meng Li, Xiaohua Yang, Jie Liu, Shiyu Yan

发表机构 * School of Computing, University of South China(南华大学计算机学院) Hunan Engineering Research Center of Software Evaluation and Testing for Intellectual Equipment(湖南软件测评与智能设备工程研究中心) CNNC Key Laboratory on High Trusted Computing(中核集团高可信计算重点实验室)

AI总结 本文提出NOETHER框架,使从程序诱导的算子代数到蜕变模式(MetaPattern)集的下游步骤机械且可证明,以解决蜕变关系识别中的基础问题,并在三个算子代数领域上检验了该框架,其下游算法具有代数闭包和多项式时间可判定性保证。

Comments 71 pages, 18 tables, 1 figure. Under review at ACM Transactions on Software Engineering and Methodology. Supplementary materials (algorithm reference implementation, 84-MR PWR corpus, SE(3) case study harness, three-tier METRIC+ replication) at https://github.com/meng004/P1-MetaPattern

详情
AI中文摘要

背景。蜕变测试已被IEEE/ISO软件测试标准认可,并越来越多地被推荐用于AI系统,但其进展受限于蜕变关系(MR)识别这一瓶颈:现有方法(结构化框架、挖掘和进化流水线、LLM辅助方法、MetaPattern目录)都建立在归纳基础之上,留下三个基础问题未解决:起源、闭包和可迁移性。目标。我们提出一个框架,其从程序诱导的算子代数到MetaPattern集的下游步骤是机械且可证明的,而上游的代数整理是一个明确陈述的经验假设,带有显式的适用范围前提。方法。NOETHER是一个两层框架。上游层是对反复出现的数学结构(对称性、序、自伴、时间反演、极限、定性动力学、方法比较、关系等价)的八块分解。下游的CONSTRUCT-MP算法生成具有代数闭包(定理1)和多项式时间可判定性(定理2)保证的MetaPattern集。我们在三个算子代数领域上检验了该框架。结果。在Boltzmann反应堆物理上,NOETHER将先前的归纳目录系统化;在等变ML上,它推导出针对旋转不变性、伴随对偶性和训练轨迹可逆性的可执行MR;在关系查询优化器上,它检验了关系等价块。核心可证伪预测(在保持齐次性的变异算子上的L*-盲性)在适用范围内的基底上成立。绝对完备性猜想(定理1')在压水堆(PWR)堆芯扩散上被两个两两独立的反例证伪,这些反例识别出五个Translate-extension维度。结论。归纳从逐程序的MR采样转移到逐领域的代数层;下游步骤是演绎且机械的。

英文摘要

Context. Metamorphic Testing is recognised in IEEE/ISO software-testing standards and increasingly recommended for AI systems, but its progress is bottlenecked by metamorphic relation (MR) identification: existing approaches (structured frameworks, mining and evolutionary pipelines, LLM-assisted methods, MetaPattern catalogues) share an inductive grounding that leaves three foundational questions open: origin, closure, and transferability. Objective. We propose a framework whose downstream step from program-induced operator algebra to MetaPattern set is mechanical and provable, while the upstream curation of the algebra is a stated empirical hypothesis with explicit scope precondition. Method. NOETHER is a two-layer framework. The upstream layer is an eight-block decomposition over recurrent mathematical structures (symmetry, order, self-adjoint, time-reversal, limit, qualitative-dynamics, method-comparison, relational equivalence). The downstream CONSTRUCT-MP algorithm produces a MetaPattern set with algebraic-closure (Theorem 1) and polynomial-time decidability (Theorem 2) guarantees. We test the framework on three operator-algebraic domains. Results. On Boltzmann reactor physics NOETHER systematises a prior inductive catalogue; on equivariant ML it derives executable MRs for rotation invariance, adjoint duality, and training-trajectory reversibility; on relational query optimisers it exercises the relational-equivalence block. The central falsifiable prediction (L*-blindness on homogeneity-preserving mutators) holds on the in-scope substrate. The absolute-completeness conjecture (Theorem 1') is falsified on PWR core diffusion via two pairwise-independent counterexamples that identify five Translate-extension dimensions. Conclusion. Induction is relocated from per-program MR sampling to a per-domain algebraic layer; the downstream step is deductive and mechanical.

2605.17380 2026-05-19 cs.AI cs.CR cs.LG 版本更新

ADR: An Agentic Detection System for Enterprise Agentic AI Security

ADR:一种用于企业代理AI安全的代理检测系统

Chenning Li, Pan Hu, Justin Xu, Baris Ozbas, Olivia Liu, Caroline Van, Manxue Li, Wei Zhou, Mohammad Alizadeh, Pengyu Zhang, KK Sriramadhesikan, Ming Zhang

发表机构 * Uber

AI总结 本文提出ADR系统,一种大规模、经过生产验证的企业框架,用于安全地管理通过模型上下文协议(MCP)运行的AI代理。该系统解决了三个关键问题:观测有限、鲁棒性不足和检测成本高,并通过三个组件实现了这些目标:ADR传感器、ADR探索器和ADR检测器。

Comments Accepted at MLSys 2026 (Industry Track)

详情
AI中文摘要

我们提出了代理AI检测与响应(ADR)系统,这是首个大规模、经过生产验证的企业框架,用于安全地管理通过模型上下文协议(MCP)运行的AI代理。我们识别出该领域存在的三个持续挑战:(1)观测有限——现有的终端检测与响应(EDR)工具只能看到文件写入,而无法看到代理推理、提示或连接意图到执行的因果链;(2)鲁棒性不足——静态防御受限于预定义规则,无法在多样化的攻击技术和企业环境中泛化;(3)高检测成本——基于LLM的推理在大规模上成本过高。ADR通过三个组件解决这些挑战:ADR传感器用于高保真的代理遥测,ADR探索器用于系统性的预部署红队行动和困难示例生成,以及ADR检测器用于可扩展的、两阶段在线检测,结合快速初步筛查与上下文感知推理。在Uber部署超过十个月,ADR在生产中保持了可靠的检测,随着采用的增加,已覆盖超过7,200个唯一主机,每天处理超过10,000个代理会话,发现了数百个凭证泄露,涵盖26类,并启用了向左预防层(97.2%的精度,206个检测到的凭证)。为了验证该方法并促进社区采用,我们引入了ADR-Bench(302个任务,17种技术,133个MCP服务器),其中ADR实现了零误报,同时检测了67%的攻击——在F1分数上,比三个最先进的基线(ALRPHFS、GuardAgent、LlamaFirewall)高出2-4倍。在AgentDojo(公共提示注入基准)上,ADR检测了所有攻击,仅在93个任务中产生了3个误报。

英文摘要

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

2605.16234 2026-05-19 cs.LG cs.AI cs.CL 版本更新

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

没有免费的交换:Transformer中的协议依赖层冗余

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了Transformer中层冗余问题,通过比较替换和交换两种协议,发现它们在压缩中的效果存在显著差异,且在相同评估器下,不同协议可能导致层剪枝结果的变化,尤其在高替换距离时更为明显。

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

详情
AI中文摘要

当研究人员询问两个Transformer层是否在压缩中“等价”时,他们常常混淆了不同的测试方法。替换测试询问是否可以将一层的映射替换为另一层的映射;交换测试询问是否当两层位置交换时,它们近似可交换。两者都是基于输出的swap-KL探测器,但它们并不总是一致:在预训练的Transformer中,协议差距可能在相同评估器下改变哪些层看起来可以安全剪枝,尤其是在替换距离较高时。我们跨检查点和架构测量了两种协议。在Pythia训练轨迹(410M和1.4B)上,替换-交换差距从初始化到收敛逐渐增大。在8B规模的WikiText-2合同下,Qwen3-8B进入了一个发散阶段:交换引导的移除比替换引导的在相同层预算下更安全,而Llama-3.1-8B在剪枝成本上两者持平,尽管交换KL较低,这表明指标差距不必一对一映射到移除。在层移除或合并之前,应在目标检查点上对两种swap-KL进行评分;该诊断仅需未标记的正向传递。

英文摘要

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

2605.15694 2026-05-19 cs.LG 版本更新

Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices

超越边缘:在超低功耗无线设备上实现变压器模型的分布式推断

Alexander Gräfe, Ding Huo, Vincent de Bakker, Johannes Berger, Marco Zimmerling, Sebastian Trimpe

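Affiliations * RWTH Aachen University, TU Darmstadt

AI summary This paper presents CATS, a framework for distributed transformer inference on ultra-low-power wireless devices that lets multiple devices collaboratively execute models far larger than a single device can handle. Its core scheme is co-designed across transformer partitioning, wireless communication, and training: the SomeGather communication primitive reduces bandwidth and RAM usage, a partitioning method built on it enables efficient model parallelism, and message-dropout during training yields models that are robust to message loss at inference time.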

详情
AI中文摘要

Transformer模型正迅速成为现代物联网(IoT)应用的核心,但其计算和内存需求远超单个典型超低功耗IoT设备的能力。我们提出了CATS,一种用于超低功耗无线设备的分布式变压器推断框架,使多个设备能够协同执行远大于单个设备能处理的模型。CATS的核心是一种通信感知的分布式变压器推断方案,结合了变压器划分、无线通信和训练。它采用SomeGather,一种新的剪枝通信原语,选择性广播激活列以减少通信带宽和RAM使用,而不牺牲模型精度。基于SomeGather,我们设计了一种划分方法,利用该原语实现高效的模型并行。为应对不可靠的无线通信,CATS在训练期间采用消息丢弃,模拟数据包丢失,并在推断时产生对消息丢失具有鲁棒性的模型。在实际实验中,我们证明CATS首次将分布式变压器推断带到了超低功耗无线设备上,部署在多达16个设备上,协同执行的变压器模型大小是单个设备能运行的14倍。

英文摘要

Transformer models are rapidly becoming a cornerstone of modern Internet of Things (IoT) applications, yet their computational and memory demands far exceed the capabilities of a single typical ultra-low-power IoT device. We present CATS, a framework for distributed transformer inference on ultra-low-power wireless devices, enabling multiple devices to collaboratively execute models far larger than what a single device can sustain. At its core, CATS is a communication-aware distributed transformer inference scheme co-designed across transformer partitioning, wireless communication and training. It employs SomeGather, a new pruned communication primitive that selectively broadcasts activation columns to reduce communication bandwidth and RAM usage without sacrificing model accuracy. Building on SomeGather, we design a partitioning method that exploits this primitive for efficient model parallelism. To cope with unreliable wireless communication, CATS employs message-dropout during training, which mimics packet losses and yields models that are robust to message loss during inference. In real-world experiments, we show that CATS brings distributed transformer inference to ultra-low-power wireless devices for the first time, with deployments on up to 16 devices that collaboratively execute transformer models up to 14 times larger than what a single device can run.

2605.15622 2026-05-19 cs.LG 版本更新

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

位置:深度学习中零阶优化被低估,而非无能

Sijia Liu, Yicheng Lang, Soumyadeep Pal, Changsheng Wang, Yancheng Huang, Chongyu Fan, James Diffenderfer, Bhavya Kailkhura, Yihua Zhang

发表机构 * OPTML Lab, Michigan State University, USA(密歇根州立大学OPTML实验室) Lawrence Livermore National Laboratory, USA(劳伦斯利弗莫尔国家实验室)

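AI summary This position paper argues that zeroth-order (ZO) optimization in deep learning is underexplored rather than underpowered. It sets out six positions spanning the algorithmic, systems, and evaluation stack. It revisits the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses, and it identifies three underexplored opportunities: subspace and spectral views of ZO, the forward-only nature of ZO as a systems advantage for communication-efficient training, and the need to de-obfuscate ZO evaluations from task complexity.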

Comments Accepted by ICML 2026 Position Paper Track as a Spotlight Paper

详情
AI中文摘要

零阶(ZO)优化,通过函数评估的有限差分来学习,由于其内存效率和适用于灰箱或黑箱管道的适用性,最近在深度学习中重新受到关注。然而,ZO方法往往被忽视,因为估计方差和不利的查询复杂性被认为是根本无法扩展的。我们主张这一结论可能是误导的:ZO优化是被低估的,而不是无能的。我们证明了许多看似限制性的因素源于短视的发展实践,尤其是全空间、元素-wise、估计器中心的设计。我们阐述了六个涵盖算法、系统和评估栈的立场。首先,我们通过方差控制、方差-查询权衡和方向导数视角重新审视估计器中心ZO方法的可行性边界。然后,我们识别出三个未被充分利用的机会:(i)子空间和谱观点的ZO,使通过优雅的查询扩展实现可解释的方差减少;(ii)ZO作为系统优势,为通信高效、管道友好的和资源受限的训练提供优势;(iii)需要去模糊化ZO评估与任务复杂性之间的关系。我们强烈倡导围绕ZO优化的独特优势重新思考,并采取相应行动,打开通往大规模、系统感知和资源高效学习的可行路径。

英文摘要

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
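
The abstract describes ZO optimization as learning from finite differences of function evaluations without backpropagation. A minimal sketch of one common estimator of this kind (a randomized two-point central difference along Gaussian directions) is given below. The function names `zo_gradient` and `zo_sgd`, the smoothing radius `mu`, and the query count are illustrative assumptions, not the authors' method.

```python
import random


def zo_gradient(f, x, mu=1e-4, num_queries=20, rng=None):
    """Estimate the gradient of f at x from function values only.

    Assumption: Gaussian random directions and a central difference;
    the paper discusses many estimator designs, this is just one.
    """
    rng = rng or random.Random(0)
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional derivative of f along u, from two forward evaluations.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += slope * u[i] / num_queries
    return grad


def zo_sgd(f, x0, lr=0.05, steps=300, **estimator_kwargs):
    """Plain gradient descent driven by the ZO estimate (no backpropagation)."""
    x = list(x0)
    rng = random.Random(0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng, **estimator_kwargs)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

Each step costs `2 * num_queries` function evaluations, which is the variance-query tradeoff the abstract refers to: more queries per step lower the estimator variance at a higher evaluation cost.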

2605.15586 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University(国立台湾大学) The University of Tokyo(东京大学) RIKEN Center for Advanced Intelligence Project(日本理化学研究院先进智能项目中心)

AI总结 本文提出了一种新的框架BICL,通过设计偏置的标签生成过程来克服传统互补标签学习在多类设置中的限制,从而在CIFAR-100和TinyImageNet-200上实现了传统方法的七倍以上准确率提升。

Comments 33 pages, 16 figures, 18 tables

详情
AI中文摘要

互补标签学习(CLL)是一种弱监督范式,其中实例被标记为不属于其类别的标签。尽管已有十年的研究,CLL方法主要在10类分类任务中具有竞争力,而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设,这在多类设置中严重稀释了学习信号。在本文中,我们证明通过故意设计偏置(非均匀)的生成过程,将互补标签限制在类别的子集,可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling(BICL),一个涵盖数据收集到训练的原理性框架,利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习,比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

2605.15508 2026-05-19 cs.LG cs.CL 版本更新

STS: Efficient Sparse Attention with Speculative Token Sparsity

STS: 高效稀疏注意力与推测性标记稀疏性

Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出STS,一种无需模型再训练的稀疏注意力机制,通过利用较小的草稿模型识别出的重要标记来预测更大目标模型的重要标记,从而在大规模语言模型推理中实现高效的稀疏注意力计算,显著提升速度并保持准确性。

Comments 14 pages, 12 figures

详情
AI中文摘要

注意力的二次复杂性对大型语言模型(LLM)推理造成了严重的内存和计算瓶颈。这一挑战在新兴的代理应用中尤为突出,这些应用需要处理数百万标记序列。我们提出STS,一种稀疏注意力机制,无需模型再训练。STS利用关键洞察:由较小的草稿模型识别出的重要标记对更大目标模型的重要标记具有高度预测性。通过整合到推测解码框架中,STS将草稿模型的注意力分数重新利用,动态构建标记和头部层面的稀疏性掩码。该掩码有效剪枝目标LLM中的昂贵注意力计算。我们的评估显示,STS在代表性的基准NarrativeQA上实现了约90%稀疏度下的2.67倍加速,与密集注意力相比,准确性降解可忽略不计。STS在稀疏性与准确性权衡上建立了新的状态-of-the-art,通过在给定准确性预算下实现更高的稀疏度水平,优于先前技术。

英文摘要

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million-token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of the tokens that are important for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup at approximately 90% sparsity on the representative NarrativeQA benchmark, with negligible accuracy degradation compared to dense attention. STS establishes a new state of the art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
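
The idea of turning draft-model attention scores into a token-and-head-wise sparsity mask can be sketched as follows. This is a simplified illustration under stated assumptions: the per-head top-k rule, the `keep_ratio` parameter, and the function name `sparsity_mask` are hypothetical, and the abstract does not say how draft heads are mapped to target heads.

```python
def sparsity_mask(draft_scores, keep_ratio=0.1):
    """Build a boolean keep-mask per head from draft attention scores.

    draft_scores[h][t] is the (assumed) attention weight that draft head h
    places on context token t. True means the target model attends to t.
    """
    mask = []
    for head_scores in draft_scores:
        n = len(head_scores)
        k = max(1, round(keep_ratio * n))  # always keep at least one token
        ranked = sorted(range(n), key=lambda t: head_scores[t], reverse=True)
        keep = set(ranked[:k])
        mask.append([t in keep for t in range(n)])
    return mask
```

With `keep_ratio=0.1` the mask corresponds to roughly 90% sparsity, the operating point quoted in the abstract.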

2605.15487 2026-05-19 cs.LG cs.CV eess.IV 版本更新

Learning Normalized Energy Models for Linear Inverse Problems

学习归一化能量模型以解决线性逆问题

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

Affiliations * Rice University, Flatiron Institute, New York University

AI总结 本文提出了一种新的能量模型,用于解决线性逆问题,通过引入基于协方差的正则化项来提高不同测量条件下的一致性,从而计算出归一化的后验密度,无需额外训练或微调,同时实现了能量引导的自适应采样、无偏的Metropolis-Hastings修正步骤以及通过贝叶斯规则估计退化算子。

Comments ICML 2026

详情
Journal ref
Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK
AI中文摘要

生成扩散模型可以为成像中的逆问题提供强大的先验概率模型,但现有实现存在两个关键限制:(i) 先验密度以隐式方式表示,(ii) 它们依赖于似然近似,这会引入采样偏见。我们通过引入一种新的能量模型来解决这些挑战,该模型针对去噪进行了训练,并引入了基于协方差的正则化项,以确保在不同测量条件下的一致性。训练后的模型能够为各种线性逆问题计算归一化的后验密度,而无需额外的重新训练或微调。除了保留扩散模型的采样能力外,这还使以前不可用的能力得以实现:能量引导的自适应采样,可以实时调整采样计划,无偏的Metropolis-Hastings修正步骤,以及通过贝叶斯规则估计退化算子。我们验证了该方法在多个数据集(ImageNet、CelebA、AFHQ)和任务(修复、去模糊)上的性能,证明了其与现有基线相比具有竞争力或更优的表现。

英文摘要

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: $(i)$ the prior density is represented implicitly, and $(ii)$ they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.

2605.14005 2026-05-19 cs.CL cs.LG 版本更新

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

毒藤:针对推测解码的隐秘加速-崩溃攻击

Shuoyang Sun, Chang Dai, Hao Fang, Kuofeng Gao, Xinhao Zhong, Yi Sun, Fan Mo, Shu-Tao Xia, Bin Chen

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) South China University of Technology(华南理工大学) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Huawei Technology(华为技术)

AI总结 本文提出Mistletoe攻击,通过优化降质目标和语义保留目标,隐秘地降低推测解码的接受长度τ,从而减少加速效果,同时保持输出质量。

详情
AI中文摘要

推测解码已成为加速大型语言模型(LLM)推理的广泛采用技术,通过并行生成多个候选token并用目标模型验证。然而,其效率关键依赖于平均接受长度τ,即每个验证步骤中多少候选token能被接受。本文识别了基于模型的推测解码中的新机制层漏洞:drafter被训练去近似目标模型分布,但这种近似不可避免地不完美。这种drafter-目标不匹配创造了一个隐藏的攻击面,其中小扰动可以保持目标模型的可见行为,同时显著降低候选token的接受性。我们提出Mistletoe,一种针对推测解码的隐秘加速-崩溃攻击。Mistletoe直接针对推测解码的接受机制。它联合优化一个降质目标,以减少drafter-目标的一致性,以及一个语义保留目标,以约束目标模型的输出分布。为了解决这两个目标之间的冲突,我们引入了一个null-space投影机制,其中降质梯度被投影到局部语义保留方向之外,从而抑制候选token的接受,同时最小化语义漂移。在各种推测解码系统上的实验表明,Mistletoe显著降低了平均接受长度τ,崩溃速度提升,并降低了平均token吞吐量,同时保持输出质量和困惑度。我们的工作强调推测解码引入了超越现有输出鲁棒性的机制层攻击面,呼吁对LLM加速系统进行更鲁棒的设计。

英文摘要

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $τ$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $τ$, collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
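
The projection step mentioned in the abstract, where degradation gradients are projected away from the local semantic-preserving direction, can be illustrated with a single-direction orthogonal projection. This is a minimal sketch; the names `project_away`, `grad_degrade`, and `grad_semantic` are assumptions, and the actual mechanism in the paper may operate on a larger subspace.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_away(grad_degrade, grad_semantic, eps=1e-12):
    """Remove from grad_degrade its component along grad_semantic.

    The result is orthogonal to grad_semantic, so a small step along it
    leaves the semantic objective unchanged to first order.
    """
    norm_sq = dot(grad_semantic, grad_semantic)
    if norm_sq < eps:
        # No semantic direction to protect; keep the gradient as is.
        return list(grad_degrade)
    scale = dot(grad_degrade, grad_semantic) / norm_sq
    return [gd - scale * gs for gd, gs in zip(grad_degrade, grad_semantic)]
```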

2605.13520 2026-05-19 cond-mat.stat-mech cs.LG 版本更新

Beyond Explained Variance: A Cautionary Tale of PCA

超越解释方差:PCA的警示故事

Gionni Marchetti

发表机构 * Barcelona, Spain(西班牙巴塞罗那)

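AI summary Using a fossil teeth dataset from the early mammalian insectivore Kuehneotherium, this paper shows where PCA scatterplots fall short for visualizing data that lie on a nonlinear low-dimensional manifold. An analysis based on t-SNE and persistent homology reveals a ring-like structure, and a generative probabilistic-geometric model is proposed to explain the observed distribution of pairwise distances.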

Comments 12 pages, 10 figures. Corrected a typo and amended Ref. [57], which should be the same as in Ref. [22]

详情
AI中文摘要

我们针对主成分分析(PCA)在通过二维散点图可视化高维数据时的不足进行了探讨,重点关注早期哺乳类昆虫食性动物Kuehneotherium的化石牙齿数据集。虽然Jolliffe和Cadima(Philosophical Transactions of the Royal Society A, 2016)报告的PCA散点图显示在PC2 < 0区域的聚类,但基于t-SNE和持久同调(PH)的分析揭示出环状结构,没有明显的聚类,并且内在维数为一。我们进一步提出一个生成概率-几何模型,其中数据从单位圆上均匀采样。在此模型下,成对余弦距离遵循正弦分布,与观察到的U型分布定性一致,从而独立支持基于t-SNE和持久同调的分析。

英文摘要

We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 < 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on t-SNE and persistent homology.
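
The claim that pairwise cosine distances follow an arcsine distribution under the unit-circle model can be checked numerically. For two independent uniform angles, the angle difference is uniform, so the cosine distance $d = 1 - \cos\Delta$ has CDF $P(d \le t) = \arccos(1 - t)/\pi$ on $[0, 2]$. The sketch below samples points on a circle and compares the empirical CDF with this formula; the function names and sample size are illustrative choices, not taken from the paper.

```python
import math
import random


def cosine_distances_on_circle(n_points, rng):
    """Pairwise cosine distances between points drawn uniformly on a unit circle."""
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_points)]
    pts = [(math.cos(a), math.sin(a)) for a in angles]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            cos_sim = pts[i][0] * pts[j][0] + pts[i][1] * pts[j][1]
            dists.append(1.0 - cos_sim)
    return dists


def arcsine_cdf(t):
    """CDF of the cosine distance 1 - cos(delta) for uniform delta."""
    t = min(max(t, 0.0), 2.0)
    return math.acos(1.0 - t) / math.pi


def empirical_cdf(values, t):
    """Fraction of values that are at most t."""
    return sum(1 for v in values if v <= t) / len(values)
```

The density is U-shaped: mass concentrates near distances 0 and 2 and is lowest around 1, which matches the qualitative shape the abstract reports.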

2605.13415 2026-05-19 cs.CL cs.AI cs.LG 版本更新

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

KIT-TIP-NLP 在 MultiPride 上的持续学习:多语言基础模型

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

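Affiliations * Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan

AI summary This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media. It distinguishes reclamatory from non-reclamatory usage of LGBTQ+-related slurs in English, Spanish, and Italian tweets by combining data-driven model selection, semantic-preserving augmentation, inductive transfer learning, and domain-specific knowledge injection.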

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

详情
AI中文摘要

本文提出了一种多阶段框架,用于检测多语言社交媒体中重新使用的侮辱性语言。该框架解决了在英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战。该框架处理了三个交织的方法学挑战:数据稀缺、类别不平衡和跨语言的情感表达差异。该框架整合了通过交叉验证的数据驱动模型选择、通过回译的语义保留增强、具有动态周期级欠采样的归纳迁移学习,以及通过掩码语言模型注入的领域特定知识。系统评估了八个多语言嵌入模型,XLM-RoBERTa被选为基础模型,基于宏平均F1分数。通过GPT-4o-mini回译进行的数据增强有效将训练语料库增加了三倍,同时保留了语义内容和类别分布比例。该框架生成了四个最终运行用于评估,其中RUN 1是带有增强和欠采样的归纳迁移学习,RUN 2是带有掩码语言模型预训练,RUN 3和RUN 4是通过语言特定决策阈值优化的先前预测。语言特定的阈值优化表明,最优决策边界在不同语言中存在显著差异。这反映了模型置信度分数的分布差异和重新使用语言使用的语言差异。基于阈值的优化在不需模型重新训练的情况下,带来了2-5%的绝对F1提升。该方法完全可复现,所有代码和实验设置可在https://github.com/rbg-research/MultiPRIDE-Evalita-2026上找到。

英文摘要

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 uses masked language modeling pre-training, and RUN 3 and RUN 4 are the previous predictions refined via language-specific decision thresholds optimized through ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields a 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
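
Per-language threshold selection from ROC analysis can be sketched as below. The abstract does not state which ROC criterion was optimized, so this example assumes Youden's J statistic (true-positive rate minus false-positive rate); the function names and the toy scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Pick the score threshold that maximizes Youden's J = TPR - FPR.

    scores: model confidence for the positive class, one per example.
    labels: 1 for positive, 0 for negative. Both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def thresholds_by_language(data):
    """data maps a language code to a (scores, labels) pair."""
    return {lang: best_threshold(s, y) for lang, (s, y) in data.items()}
```

Because only the decision threshold changes, the model itself is not retrained, which is the property the abstract highlights.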

2605.11975 2026-05-19 cs.LG 版本更新

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

随机最小成本到达-避免强化学习

Jingduo Pan, Taoran Wu, Yiling Xue, Bai Xue

发表机构 * Key Laboratory of System Software (Chinese Academy of Sciences), Institute of Software, Chinese Academy of Sciences, Beijing, China(中国科学院系统软件重点实验室,软件研究所,中国科学院,北京,中国) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,北京,中国) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学交叉学科学院,中国科学院大学,北京,中国)

AI总结 本文研究了随机最小成本到达-避免强化学习问题,提出了一种新的方法来在满足概率至少p的到达-避免约束的同时最小化预期累积成本。通过引入到达-避免概率证书(RAPCs)和基于收缩的Bellman公式,该方法能够将到达-避免考虑整合到强化学习中,并在概率约束下实现成本优化。

Comments Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

我们研究了随机最小成本到达-避免强化学习,其中智能体必须在概率至少p的情况下满足到达-避免规范,同时在随机环境中最小化预期累积成本。现有的安全和约束强化学习方法通常无法在随机环境中联合强制概率到达-避免约束并优化成本。为了解决这一挑战,我们引入了到达-避免概率证书(RAPCs),这些证书可以识别出从哪些状态可以满足随机到达-避免约束。基于RAPCs,我们开发了一种基于收缩的Bellman公式,该公式作为一种原理性的替代方法,用于将到达-避免考虑整合到强化学习中,从而在概率约束下实现成本优化。我们建立了所提出算法在结果目标下几乎确定收敛到局部最优策略。在MuJoCo模拟器中的实验显示了改进的成本性能和一致更高的到达-避免满足率。

英文摘要

We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost when learning in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.

2605.11461 2026-05-19 cs.AI cs.LG 版本更新

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破赢家通吃:合作策略优化提升大语言模型的多样化推理

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

发表机构 * ISEE Lab, Sun Yat-sen University(中山大学ISEE实验室)

AI总结 本文提出Group Cooperative Policy Optimization (GCPO)方法,通过改变训练范式从 rollout 竞争转向团队合作,提升大语言模型在推理任务中的准确性和解题多样性。

详情
AI中文摘要

基于验证器的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的核心范式,然而流行的基于群体的优化算法如GRPO常常面临探索崩溃问题,即模型过早收敛于一组高分模式,缺乏探索新解的能力。最近的研究尝试通过添加熵正则化或多样性奖励来缓解这一问题,但这些方法并未改变赢家通吃的本质,即rollouts仍为个体优势竞争而非合作最大化全局多样性。在本文中,我们提出Group Cooperative Policy Optimization(GCPO),将训练范式从rollout竞争转向团队合作。具体而言,GCPO将独立rollout评分替换为团队层面的信用分配:rollout被奖励其对团队有效解覆盖的贡献,而非其个体准确性。该覆盖被描述为奖励加权语义嵌入上的确定体体积,其中只有正确且非冗余的rollout才对这一体积做出贡献。在优势估计过程中,GCPO将集体团队奖励重新分配给每个单个rollout,根据其对团队的平均边际贡献。这种合作训练范式将优化方向导向非冗余的正确推理路径。在多个推理基准测试中,GCPO在现有方法的基础上显著提高了推理准确性和解题多样性。代码将在https://github.com/bradybuddiemarch/gcpo上发布。

英文摘要

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns and lose the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the \textit{winner-takes-all} nature of training, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than by its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
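
The notion of coverage as a determinant volume over reward-weighted embeddings can be illustrated with a small sketch. It makes several assumptions that the abstract does not confirm: coverage is taken as $\log\det(I + G)$ with $G$ the Gram matrix of reward-scaled embeddings, and credit is a leave-one-out difference rather than the average marginal contribution the paper describes. All function names are hypothetical.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def coverage(embeddings, rewards):
    """log det(I + G) for the Gram matrix G of reward-weighted embeddings."""
    w = [[r * x for x in e] for e, r in zip(embeddings, rewards)]
    n = len(w)
    m = [[sum(p * q for p, q in zip(w[i], w[j])) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return math.log(det(m))


def leave_one_out_credit(embeddings, rewards):
    """Coverage lost when each rollout is removed from the team."""
    full = coverage(embeddings, rewards)
    credit = []
    for i in range(len(embeddings)):
        rest_e = embeddings[:i] + embeddings[i + 1:]
        rest_r = rewards[:i] + rewards[i + 1:]
        credit.append(full - coverage(rest_e, rest_r))
    return credit
```

In this toy formulation a rollout with zero reward adds nothing to the volume, and two rollouts with identical embeddings share less credit than a rollout pointing in a new direction, which mirrors the "correct and non-redundant" condition in the abstract.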

2605.10871 2026-05-19 physics.med-ph cs.AI cs.LG 版本更新

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

吸引子-血管耦合理论:为基于智能手机光电容积图的AAMI标准无创血压估计提供形式基础和实证验证

Timothy Oladunni, Farouk Ganiyu Adewumi

发表机构 * Department of Computer Science, Morgan State University(莫根州立大学计算机科学系)

AI总结 本文提出了一种数学框架,证明心脏吸引子几何编码了足够的血压信息,用于AAMI标准估计,并通过校准的无创血压模型验证了该理论,利用光电容积图(PPG)进行血压估计。

详情
AI中文摘要

本文提出吸引子-血管耦合理论(AVCT),一种数学框架,证明心脏吸引子几何编码了足够的血压(BP)信息,足以用于AAMI标准估计,并通过使用光电容积图(PPG)的校准无创血压模型验证了该理论。AVCT基于心脏稳定性理论,并通过Takens延迟嵌入和吸引子形态提取进行操作化。两个定理、一个命题和一个推论正式证明了PPG吸引子特征用于血压估计的使用,并预测了特征重要性层次。一个使用脉搏传导时间(PTT)和心脏稳定性指数(CSI)吸引子特征训练的LightGBM模型在严格留一受试者出交叉验证(LOSO-CV)上进行了评估,评估了来自BIDMC ICU(n=9)和VitalDB手术数据(n=37)的46名受试者,共29,684个窗口。该模型实现了收缩压(SBP)的平均绝对误差(MAE)为2.05 mmHg,舒张压(DBP)的MAE为1.67 mmHg,相关系数r=0.990和r=0.991,满足AAMI/IEEE SP10要求的MAE低于5 mmHg。每个受试者的中位数MAE为1.87/1.54 mmHg,70%/76%的受试者个体满足AAMI标准。使用九个智能手机吸引子特征的PPG-only消融与ECG+PPG模型的误差在0.05 mmHg以内,证明了仅使用智能手机摄像头即可实现临床级血压跟踪,超过了以往使用更少传感器的LOSO-CV结果。所有四个AVCT预测都得到了定量确认,从未校准到校准估计的误差减少了91.5%(epsilon_cal=0.915)。与后验可解释AI方法不同,AVCT预测的特征满足可解释AI可信度(EAT)框架的建筑忠实性标准,并将血压估计扎根于非线性动力学系统理论。

英文摘要

This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.

2605.10236 2026-05-19 cs.LG cs.AI 版本更新

When Does Non-Uniform Replay Matter in Reinforcement Learning?

在强化学习中非均匀回放何时起作用?

Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Warsaw(华沙大学) UC Berkeley(伯克利加州大学) Amazon FAR(亚马逊FAR)

AI总结 本文研究了非均匀回放在强化学习中的有效性,发现回放体积、预期近期性和回放分布熵是决定因素,并提出了一种简单有效的截断几何回放策略以提高样本效率。

详情
AI中文摘要

现代非策略强化学习算法通常依赖于简单的均匀回放采样,但非均匀回放何时以及为何优于这一强基线仍不清楚。在多样化的强化学习设置中,我们证明非均匀回放的有效性由三个因素决定:回放体积、每环境步骤回放的转换数量;预期近期性,即所采样转换的近期程度;以及回放采样分布的熵。我们的主要贡献是明确非均匀回放何时有益,并为现代非策略强化学习中的回放设计提供实用指导。我们发现,当回放体积较低时,非均匀回放最有益,且即使在预期近期性相当时,高熵采样也很重要。受这些发现的启发,我们采用了一种简单的截断几何回放策略,该策略倾向于近期经验,同时保持高熵并带来可忽略的计算开销。在大规模并行模拟、单任务和多任务设置中,包括在五个强化学习基准套件上评估的三种现代算法,这种回放采样策略在低体积情况下提高了样本效率,而在高回放体积时仍具有竞争力。

英文摘要

Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
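
A truncated geometric replay distribution can be sampled in constant time by inverting its CDF. In the sketch below, the age $k$ of a transition (0 is the newest) is drawn with probability proportional to $(1-p)^k$ for $k = 0, \dots, N-1$. The parametrization, the default `p`, and the function names are assumptions for illustration; the abstract does not give the exact form used in the paper.

```python
import math
import random


def sample_age(buffer_size, p, rng):
    """Draw an age in [0, buffer_size) from a truncated geometric law."""
    q = 1.0 - p
    u = rng.random()  # uniform in [0, 1)
    # Invert the CDF F(k) = (1 - q**(k + 1)) / (1 - q**buffer_size).
    k = int(math.log(1.0 - u * (1.0 - q ** buffer_size)) / math.log(q))
    return min(k, buffer_size - 1)  # guard against floating-point edge cases


def sample_batch(buffer, batch_size, p=0.001, rng=None):
    """Sample transitions with a bias toward the most recent entries.

    Assumes buffer is ordered oldest to newest.
    """
    rng = rng or random.Random(0)
    n = len(buffer)
    return [buffer[n - 1 - sample_age(n, p, rng)] for _ in range(batch_size)]
```

Smaller values of `p` flatten the distribution toward uniform sampling and raise its entropy, while larger values concentrate sampling on recent experience, so `p` trades off the expected recency and entropy factors named in the abstract.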

2605.09855 2026-05-19 cs.LG Version update

Concordia: Self-Improving Synthetic Tables for Federated LLMs

Jimin Huang, Duanyu Feng, Nuo Chen, Xiaoyu Wang, Zhiqiang Zhang, Xueqing Peng, Mingquan Lin, Prayag Tiwari, Guojun Xiong, Alejandro Lopez-Lira, Sophia Ananiadou

Affiliations: University of Manchester; National University of Singapore; New York University; University of Minnesota; Halmstad University; Harvard University; University of Florida

AI summary: This paper studies how self-improving synthetic tables can improve the adaptation of large language models in federated learning when raw data cannot be shared, and proposes Concordia, a tri-level optimization framework that uses parameter-efficient LoRA training and lightweight utility scorers to improve federated validation utility and cross-client stability.

Comments 12 pages

Abstract

Federated learning (FL) enables training large language models (LLMs) without sharing raw data, but adapting LLMs under strict data isolation and non-IID client distributions remains challenging in practice. Synthetic data offers a natural privacy-preserving surrogate for local training, yet existing federated pipelines typically treat synthetic generation as static or loosely coupled with downstream optimization, leading to rapidly diminishing utility under heterogeneous clients. We study federated adaptation of LLMs on tabular tasks where raw records and validation data cannot be shared, and local training must rely entirely on synthetic tables. We propose Concordia, a tri-level optimization framework that aligns synthetic data generation with federated validation utility despite these constraints. At the client level, models are adapted via parameter-efficient LoRA training on synthetic tables. Clients additionally learn lightweight utility scorers from private validation feedback to reweight synthetic samples during local training. At the outer level, each client refines its own synthetic table generator using group-relative policy optimization (GRPO), guided by an ensemble of heterogeneous scorers shared across clients, without aggregating generator parameters or exposing validation data. Experiments on privacy-sensitive tabular benchmarks from finance and healthcare demonstrate that Concordia consistently improves federated performance, cross-client stability, and robustness to distribution shift compared to static and decoupled synthetic-data baselines.
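To make the outer level more concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO, where each sample's reward is standardized against the other samples in its group. How Concordia combines the shared scorers is not stated in the abstract; averaging them in `ensemble_reward` is an assumption, and the two scorer functions and the row fields are hypothetical stand-ins.

```python
import statistics

def ensemble_reward(sample, scorers):
    """Average the scores from several scorers (the averaging rule is an assumption)."""
    return sum(scorer(sample) for scorer in scorers) / len(scorers)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of generated samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical scorers: each maps a synthetic row (a dict) to a utility score.
scorers = [lambda row: float(row["age"] >= 0), lambda row: float(row["income"] > 0)]
group = [
    {"age": 30, "income": 50_000},
    {"age": -1, "income": 20_000},
    {"age": 40, "income": 0},
]
rewards = [ensemble_reward(row, scorers) for row in group]
advantages = group_relative_advantages(rewards)
```

Rows that score above the group mean receive a positive advantage and the rest a negative one, so the generator update depends only on relative quality within the group.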

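2605.09040 2026-05-19 cs.AI cs.IR cs.LG Version update

UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

Hongwei Zhang, Qiqiang Zhong, Jiangxia Cao, Yiyang Lv, Huanjie Wang, Liwei Guan, Jing Yao, Yiyu Wang, Junfeng Shu, Zhaojie Liu, Han Li

Affiliations: Kuaishou Technology

AI summary: This paper proposes UxSID, a framework that uses semantic-group shared interest memory and a dual-level attention strategy to model ultra-long user sequences efficiently and with semantic awareness, achieving state-of-the-art performance and a lift in advertising revenue.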

Comments Work in progress

Abstract

Modeling ultra-long user sequences involves a difficult trade-off between efficiency and effectiveness. While current paradigms rely on either item-specific search or item-agnostic compression, we propose UxSID, a framework exploring a third path: semantic-group shared interest memory. By utilizing Semantic IDs (SIDs) and a dual-level attention strategy, UxSID captures target-aware preferences without the heavy cost of item-specific models. This end-to-end architecture balances computational parsimony with semantic awareness, achieving state-of-the-art performance and a 0.337% revenue lift in a large-scale advertising A/B test.

2605.08738 2026-05-19 cs.LG cs.AI cs.CL Version update

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

Affiliations: Qwen Team, Alibaba Inc.; MBZUAI; KAUST

AI summary: This paper studies how to apply pruning and knowledge distillation in large-scale pretraining, examining the initialization advantage of pruning, the effect of expert compression on the final model, and the effectiveness of training strategies, and compresses Qwen3-Next-80A3B to a 23A2B model that remains competitive.

Abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
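The third finding, that KD combined with the language modeling loss beats KD alone, can be illustrated with a per-token sketch. The abstract does not say how the two losses are weighted or which divergence is used, so the forward KL in `kd_loss`, the convex weighting, and the `kd_weight` parameter are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits):
    """Forward KL(teacher || student) for one token position."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(teacher, student) if t > 0.0)

def lm_loss(student_logits, target_index):
    """Cross-entropy of the student against the observed next token."""
    return -math.log(softmax(student_logits)[target_index])

def combined_loss(student_logits, teacher_logits, target_index, kd_weight=0.5):
    """Convex combination of the two losses (the weighting scheme is an assumption)."""
    return (kd_weight * kd_loss(student_logits, teacher_logits)
            + (1.0 - kd_weight) * lm_loss(student_logits, target_index))
```

Setting `kd_weight=1.0` recovers the KD-only objective that the abstract uses as the comparison point; any value below 1.0 adds a direct signal from the observed next token.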

2605.07790 2026-05-19 cs.LG cs.CV Version update

Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

Hugo Vigna, Samuel Bontemps

Affiliations: CentraleSupélec – Université Paris-Saclay; ESILV – Léonard de Vinci

AI summary: This paper proposes Hessian Surgery, which perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining, improving both balanced accuracy and standard deviation on CIFAR-10 and ISIC-2019.

Comments The code is available here: https://github.com/hugovigna/hessian-surgery.git

Abstract

The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Hessian Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.

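2605.00264 2026-05-19 cs.LG cs.GT Version update

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

Claire Chen, Yuheng Zhang

AI summary: This paper proposes a KL-regularized offline learning approach that achieves pessimism-free equilibrium recovery in general-sum games, with an accelerated statistical rate and a computationally efficient algorithm.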

Abstract

Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

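2604.23267 2026-05-19 cs.CL cs.LG Version update

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi

Affiliations: Max Planck Institute for Software Systems; Boston University

AI summary: This paper compares fine-tuning and in-context learning in large language models from a formal language learning perspective. Using a task with precise language boundaries, controlled string sampling, and no data contamination, it finds that fine-tuning outperforms in-context learning on in-distribution generalization, that the two perform comparably on out-of-distribution generalization, and that their inductive biases diverge at higher proficiency levels.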

Comments Accepted at ACL 2026 (Main)

Abstract

Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
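The discriminative test described above can be sketched as a pairwise ranking check. The abstract does not specify how strings are paired or how results are aggregated, so scoring every (in-language, out-of-language) pair is an assumption; the toy language a^n b^n and the `toy_log_prob` scorer are hypothetical stand-ins for a real formal language and a real model's string log-probability.

```python
def in_language(s):
    """Membership test for the toy language a^n b^n with n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

def pairwise_accuracy(log_prob, positives, negatives):
    """Fraction of (in-language, out-of-language) pairs ranked correctly."""
    wins = sum(log_prob(p) > log_prob(q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

def toy_log_prob(s):
    """Hypothetical stand-in for a model's log-probability of generating s."""
    return -1.0 * s.count("ba") - 0.5 * abs(s.count("a") - s.count("b"))

positives = ["ab", "aabb", "aaabbb"]
negatives = ["ba", "abab", "aab", "abb"]
score = pairwise_accuracy(toy_log_prob, positives, negatives)
```

Because membership in a formal language is decidable by construction, the positive and negative sets can be generated without any risk of overlap with pretraining data, which is the property the abstract emphasizes. With a real model, comparing strings of similar length would help avoid a length bias in the log-probabilities.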

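2604.23135 2026-05-19 cs.LG Version update

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

William Feng, Ethan Lou, Aryan Sharma

Affiliations: Yale University

AI summary: This study examines paraphrase-induced failure modes in Lean 4 autoformalization. Applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems, it finds that failures at the code-generation layer dominate paraphrase sensitivity and that failure types differ by dataset; the results provide a failure-mode taxonomy and motivate targeted training interventions.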

Abstract

Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.

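2604.18966 2026-05-19 cs.LG cs.AI Version update

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

Affiliations: Department of Engineering, University of Cambridge; CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; The Alan Turing Institute, London

AI summary: This paper studies iterative reward-guided post-training through a generate-score-align protocol and proposes TabGRAA, a group-relative alignment method that improves tabular language models by comparing the group-averaged policy/reference log-ratios of high- and low-reward generated groups. Across five mixed-type benchmarks it outperforms additional supervised fine-tuning and achieves the strongest average trade-off on fidelity and downstream utility, while keeping empirical privacy diagnostics near the supervised baseline.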

Abstract

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity-utility-privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.

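2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph Version update

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

Affiliations: AMLab; University of Amsterdam

AI summary: This paper proposes Mosaic, a model that generates ensemble members through learned functional perturbations and operates on native-resolution grids using mesh-aligned block-sparse attention, capturing long-range dependencies at linear cost. At 1.5° resolution it matches or outperforms models trained at finer resolution and achieves state-of-the-art results.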

Comments Accepted to ICML 2026

Abstract

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6$\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2604.15851 2026-05-19 cs.LG cs.AI cs.CR Version update

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar, Kamalika Chaudhuri, Yu-Xiang Wang, Ruihan Wu

Affiliations: Halıcıoğlu Data Science Institute, UC San Diego; Department of Computer Science and Engineering, UC San Diego; Department of Electrical Engineering, National Taiwan University; OpenAI

AI summary: This paper introduces DPrivBench, a benchmark for evaluating how well large language models reason about differential privacy. It finds substantial gaps in current models' reasoning on advanced algorithms and identifies directions for improving automated DP reasoning.

Abstract

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
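As an illustration of the kind of question a benchmark instance poses ("does this mechanism satisfy the stated guarantee?"), the sketch below numerically checks the textbook Laplace mechanism. This is a standard example and is not taken from DPrivBench itself; the function names, the grid, and the chosen epsilon are assumptions. A grid check is only a sanity test, not a proof.

```python
import math

def laplace_log_pdf(x, mean, scale):
    """Log-density of the Laplace distribution."""
    return -abs(x - mean) / scale - math.log(2.0 * scale)

def worst_log_ratio(answer_a, answer_b, scale, grid):
    """Largest absolute log-density ratio between two output distributions on a grid."""
    return max(
        abs(laplace_log_pdf(x, answer_a, scale) - laplace_log_pdf(x, answer_b, scale))
        for x in grid
    )

epsilon = 0.5
sensitivity = 1.0                  # neighboring datasets change the query answer by at most 1
scale = sensitivity / epsilon      # standard Laplace-mechanism calibration
grid = [i / 10.0 for i in range(-100, 101)]
observed = worst_log_ratio(0.0, 1.0, scale, grid)
```

With the standard calibration the observed log-ratio stays at or below `epsilon`, while halving the noise scale would exceed it. The advanced algorithms the abstract mentions are hard precisely because no such one-line bound is available for them.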

2604.15762 2026-05-19 cs.LG Version update

Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions

Huan Lin, Lianghui Ding

Affiliations: Institute of Image Communication and Network Engineering, School of Integrated Circuits, Shanghai Jiao Tong University

AI summary: This paper proposes a decentralized imitation learning framework that encodes local interactions with a physics-informed graph neural network, enabling robust recovery of UAV swarms under large-scale failures and fragmented topologies.

Abstract

Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.

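2604.12288 2026-05-19 stat.ML cs.LG stat.ME Version update

SMART Fine-tuning Factor Augmented Neural Lasso

Jinhang Chai, Jianqing Fan, Cheng Gao, Qishuo Yin

Affiliations: Department of Operations Research and Financial Engineering

AI summary: This paper proposes SMART, a residual tuning framework that incorporates a pre-trained source model as an augmented feature, for high-dimensional nonparametric regression with variable selection. Using a low-rank factor structure and a residual tuning decomposition, it handles covariate and posterior shifts jointly and derives minimax-optimal excess risk bounds.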

Comments Authors are listed in alphabetical order

Abstract

Fine-tuning is a widely used strategy for adapting pre-trained models to new tasks, yet its methodology and theoretical properties in high-dimensional nonparametric settings with variable selection have not yet been developed. We propose a source-model-augmented residual tuning (SMART) framework, which incorporates the pre-trained source model as an augmented feature into the target learner and estimates only the residual target-specific component. The approach is widely applicable, from parametric and sparse models to neural networks and black-box machine learning models. We focus on the development of fine-tuning factor-augmented neural Lasso, resulting in SMART-FAN-Lasso. This transfer-learning framework for high-dimensional nonparametric regression with variable selection simultaneously handles covariate and posterior shifts. We use a low-rank factor structure to manage high-dimensional dependent covariates and a residual tuning decomposition in which the target function is expressed as a function of the source model and other target-specific variables, thereby reducing the effective complexity of the target task. We derive minimax-optimal excess risk bounds, characterizing the precise conditions, in terms of relative sample sizes and function complexities, under which fine-tuning yields statistical acceleration over single-task learning. Extensive numerical experiments across diverse covariate- and posterior-shift scenarios demonstrate that SMART-FAN-Lasso consistently outperforms standard baselines and achieves near-oracle performance even under severe target sample size constraints, empirically validating the derived rates.

2604.11852 2026-05-19 q-bio.QM cs.AI cs.LG Version update

Limitations of Sequence-Based Protein Representations for Parkinson's Disease Classification: A Leakage-Free Benchmark

César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández

Affiliations: Higher School of Mechanical and Electrical Engineering, Instituto Politécnico Nacional; Research Center for Computing, Instituto Politécnico Nacional

AI summary: This paper studies the limitations of sequence-based protein representations for Parkinson's disease classification. A leakage-free benchmark of multiple representations derived from protein primary sequences shows that sequence information alone has limited discriminative power, and that more informative biological features are needed.

Comments 36 pages, 10 figures, 9 tables. Updated title, abstract, figures, and revised experimental discussion

Abstract

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 ± 0.028 and ROC-AUC of 0.748 ± 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
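A short arithmetic check helps put the reported k-mer numbers in context. The precision and recall values come from the abstract; the comparison with an always-positive classifier assumes roughly balanced classes, which precision near 0.50 at recall near 0.98 suggests but the abstract does not state. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def always_positive_metrics(prevalence):
    """Precision, recall, and F1 of a classifier that predicts positive for every sample."""
    precision = prevalence   # a positive prediction is correct with probability = prevalence
    recall = 1.0             # no positive sample is ever missed
    return precision, recall, f1_score(precision, recall)

kmer_f1 = f1_score(0.50, 0.98)                    # precision and recall from the abstract
_, _, trivial_f1 = always_positive_metrics(0.5)   # assumes balanced classes
```

Here `kmer_f1` comes out near 0.662 and `trivial_f1` is exactly 2/3, so under the balanced-class assumption the k-mer F1 of about 0.667 is what a classifier gets by labeling everything positive. That is consistent with the abstract's reading of these scores as a bias toward positive predictions rather than real discrimination.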

2604.09450 2026-05-19 cs.LG cs.AI eess.IV Version update

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

Affiliations: Beijing Jiaotong University; Dalian University of Technology

AI summary: This paper proposes ECHO, an efficient diffusion-based vision-language model for chest X-ray report generation. One-step block diffusion and a response-asymmetric diffusion training strategy substantially improve generation efficiency and textual coherence while maintaining clinical accuracy.

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving up to $8\times$ inference speedup with negligible degradation in clinical accuracy.

2604.06398 2026-05-19 physics.ao-ph cs.LG physics.comp-ph 版本更新

Calibration of a neural network ocean closure for improved mean state and variability

神经网络海洋闭合的校准以提高均值状态和变异性

Pavel Perezhogin, Alistair Adcroft, Laure Zanna

发表机构 * Courant Institute School of Mathematics, Computing, and Data Science, New York University, New York, NY, USA(科朗学院数学、计算与数据科学学院,纽约大学,纽约,纽约州,美国) Program in Atmospheric and Oceanic Sciences, Princeton University, Princeton, NJ, USA(大气与海洋科学项目,普林斯顿大学,普林斯顿,新泽西州,美国)

AI总结 本文利用集合卡尔曼反演(EKI)方法校准神经网络参数化的参数,以改进粗分辨率海洋模型的平均态和变率;校准后误差降至未参数化模型的约1/3.3至1/1.7。

详情
AI中文摘要

全球海洋模型在平均态和变率上存在偏差,在粗分辨率下尤为明显,因为此时中尺度涡旋无法被解析。为减小这些偏差,参数化系数通常只是凭经验临时(ad hoc)调整。本文将参数调整表述为一个校准问题,并采用集合卡尔曼反演(EKI)求解。我们在两个粗分辨率的理想化海洋模型中,优化了中尺度涡旋神经网络参数化的参数。与未参数化的模型相比,校准后的参数化使时间平均流体界面及其变率的误差缩小为原来的1/1.7至1/3.3,具体取决于度量和配置。EKI方法对混沌海洋动力学导致的时间平均统计量噪声具有鲁棒性。此外,我们提出了一种高效的校准方案,通过精心选择初始条件,免去了积分至统计平衡的过程。这些结果表明,系统性的校准可以显著改进粗分辨率海洋模拟,并为减少全球海洋模型中的偏差提供了一条实用途径。

英文摘要

Global ocean models exhibit biases in the mean state and variability, particularly at coarse resolution, where mesoscale eddies are unresolved. To address these biases, parameterization coefficients are typically tuned ad hoc. Here, we formulate parameter tuning as a calibration problem using Ensemble Kalman Inversion (EKI). We optimize parameters of a neural network parameterization of mesoscale eddies in two idealized ocean models at coarse resolution. The calibrated parameterization reduces errors by factors of 1.7-3.3 in the time-averaged fluid interfaces and their variability compared to the unparameterized model, depending on the metric and configuration. The EKI method is robust to noise in time-averaged statistics arising from chaotic ocean dynamics. Furthermore, we propose an efficient calibration protocol that bypasses integration to statistical equilibrium by carefully choosing an initial condition. These results demonstrate that systematic calibration can substantially improve coarse-resolution ocean simulations and provide a practical pathway for reducing biases in global ocean models.
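
The calibration idea in this abstract can be illustrated with a minimal sketch of Ensemble Kalman Inversion on a toy problem. This is not the authors' setup: the scalar parameter, the forward model, the noise variance, and the function names `eki_step` and `calibrate` are all assumptions made for illustration. Each iteration moves every ensemble member toward the data using only ensemble covariances, so no gradients of the forward model are needed.

```python
import random


def eki_step(thetas, forward, y, noise_var, rng):
    """One EKI update for a scalar parameter and a scalar observation."""
    outputs = [forward(t) for t in thetas]
    n = len(thetas)
    theta_mean = sum(thetas) / n
    out_mean = sum(outputs) / n
    # Ensemble cross-covariance (parameter vs. output) and output variance.
    c_tg = sum((t - theta_mean) * (g - out_mean)
               for t, g in zip(thetas, outputs)) / (n - 1)
    c_gg = sum((g - out_mean) ** 2 for g in outputs) / (n - 1)
    gain = c_tg / (c_gg + noise_var)
    # Each member is nudged toward a noise-perturbed copy of the observation.
    return [t + gain * (y + rng.gauss(0.0, noise_var ** 0.5) - g)
            for t, g in zip(thetas, outputs)]


def calibrate(forward, y, noise_var=1e-2, n_members=50, n_iters=20, seed=0):
    """Run EKI from a standard-normal initial ensemble; return the ensemble mean."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
    for _ in range(n_iters):
        thetas = eki_step(thetas, forward, y, noise_var, rng)
    return sum(thetas) / len(thetas)


if __name__ == "__main__":
    # Hypothetical forward model: the "true" parameter for y = 5 is theta = 2.
    print(calibrate(lambda theta: 2.0 * theta + 1.0, y=5.0))
```

In the paper's setting the forward model would be a coarse-resolution ocean simulation returning time-averaged statistics, and the parameter would be the vector of neural network weights; the scalar version above only shows the shape of the update.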

2604.00919 2026-05-19 quant-ph cond-mat.stat-mech cs.LG 版本更新

Multi-Mode Quantum Annealing for Generative Representation Learning with Boltzmann Priors

面向玻尔兹曼先验生成式表示学习的多模式量子退火

Gilhan Kim, Daniel K. Park

发表机构 * Department of Statistics and Data Science, Yonsei University, Seoul 03722, Republic of Korea(统计与数据科学系,延世大学,首尔03722,大韩民国) Department of Applied Statistics, Yonsei University, Seoul 03722, Republic of Korea(应用统计学系,延世大学,首尔03722,大韩民国) Department of Quantum Information, Yonsei University, Seoul 03722, Republic of Korea(量子信息系,延世大学,首尔03722,大韩民国)

AI总结 本文提出了一种基于量子退火的框架,利用通用玻尔兹曼先验改进变分自编码器,通过三种互补的退火模式实现高效训练、无条件生成和条件生成,展示了在MNIST、Fashion-MNIST和CelebA上的稳定训练和高质量生成,同时在异常检测和金融数据中表现出色。

Comments 25 pages, 8 figures

详情
AI中文摘要

基于能量的模型通过结构化的能量景观表示数据,为统计物理和机器学习提供了自然的桥梁。玻尔兹曼机是此类模型中特别有吸引力的一类,能够捕捉潜在变量间的复杂相互作用,但其在现代生成学习中的应用受到经典方法难以从一般(非受限)玻尔兹曼分布中采样的限制。本文开发了一种基于量子退火的框架,使变分自编码器能够使用通用玻尔兹曼先验。该框架采用三种互补的退火模式,分别适用于学习和部署的不同阶段:非绝热量子退火提供无偏的玻尔兹曼样本以实现高效训练,较慢的退火使样本集中在所学先验的低能配置附近以实现无条件生成,条件退火配合外部场将所学能量景观引导至属性特定区域以实现条件生成和语义编辑。我们在D-Wave Advantage2处理器上使用多达2000个量子比特,在MNIST、Fashion-MNIST和CelebA上展示了稳定的训练和高质量的生成,与具有相同编码器-解码器架构的高斯先验VAE相比,收敛更快且重建损失更低。除了生成之外,所学的能量函数还提供分布外检测信号,带来重建损失之外的判别能力。在单类MNIST实验中,这些分数能够将分布内样本与留出的数字类别区分开,并在金融数据中改进了对市场状态转换(regime shifts)的检测。这些结果确立了量子退火作为基于能量的表示学习和生成建模的一种实用且可控的物理机制,其能力超出了可处理的经典方法的范围。

英文摘要

Energy-based models provide a natural bridge between statistical physics and machine learning by representing data through structured energy landscapes. Boltzmann machines are a particularly compelling class of such models for capturing complex interactions among latent variables, but their use in modern generative learning has been limited by the classical intractability of sampling from general (non-restricted) Boltzmann distributions. Here we develop a quantum-annealing-based framework that enables variational autoencoders with general Boltzmann priors. The framework employs three complementary annealing modes tailored to different stages of learning and deployment: diabatic quantum annealing provides unbiased Boltzmann samples for efficient training, slower annealing concentrates samples near low-energy configurations of the learned prior for unconditional generation, and conditional annealing with external fields steers the learned energy landscape toward attribute-specific regions for conditional generation and semantic editing. Using up to 2000 qubits on a D-Wave Advantage2 processor, we demonstrate stable training and high-quality generation on MNIST, Fashion-MNIST, and CelebA, achieving faster convergence and lower reconstruction loss than a Gaussian-prior VAE with the same encoder-decoder architecture. Beyond generation, the learned energy function provides out-of-distribution detection signals that add discriminative power beyond reconstruction loss. We demonstrate that these scores separate in-distribution samples from held-out digit classes in one-class MNIST experiments and improve the detection of market regime shifts in financial data. These results establish quantum annealing as a practical and controllable physical mechanism for energy-based representation learning and generative modeling beyond the reach of tractable classical approaches.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG 版本更新

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

外科AI的比较研究:数据、计算和扩展的潜力与局限

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

发表机构 * Center for Applied AI, Chicago Booth(应用人工智能中心,芝加哥大学布斯商学院) Surgical Data Science Collective(外科数据科学联盟) Children’s National Hospital(国家儿童医院) Operations Management & Tolan Center for Healthcare, Chicago Booth(运营管理与Tolan医疗中心,芝加哥大学布斯商学院)

AI总结 本文以2026年最先进的AI方法为基础,通过手术工具检测的案例研究考察其性能与局限,发现即使使用数十亿参数的模型并进行大量训练,当前的视觉语言模型在神经外科手术工具检测任务中仍表现不足;增大模型规模和延长训练时间带来的性能提升呈递减趋势,表明当前AI在手术应用中仍面临显著挑战。

详情
AI中文摘要

最近的人工智能(AI)模型在多个生物医学任务基准上已匹配或超越了人类专家,但特别是在外科手术基准方面,这些基准往往缺失于主要的医学基准套件中。由于手术需要整合多种任务,一般能力的AI模型可能成为协作工具,如果性能可以得到提升。一方面,通过扩展架构大小和训练数据的常规方法具有吸引力,尤其是由于每年有数百万小时的手术视频数据生成。另一方面,为AI训练准备手术数据需要显著更高的专业水平,并且在该数据上训练需要昂贵的计算资源。这些权衡描绘了现代AI是否以及在多大程度上能够帮助外科实践的不确定图景。在本文中,我们通过使用2026年最先进的AI方法进行外科手术工具检测的案例研究来探讨这个问题。我们证明,即使使用多十亿参数模型和大量训练,当前的视觉语言模型在看似简单的神经外科手术工具检测任务中仍表现不足。此外,我们展示了扩展实验,表明增加模型规模和训练时间仅导致相关性能指标的边际改善。因此,我们的实验表明,当前模型在手术使用案例中仍可能面临重大障碍。此外,一些障碍无法通过额外的计算能力简单地“解决”并持续存在于不同的模型架构中,提出了数据和标签可用性是否是唯一限制因素的问题。我们讨论了这些约束的主要贡献者,并提出了潜在的解决方案。

英文摘要

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

2603.18972 2026-05-19 cs.LG 版本更新

Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

两全其美的多对决老虎机:康多塞与博尔达目标下面向随机和对抗性偏好的统一算法

S Akash, Pratik Gajane, Jawar Singh

AI总结 本文针对康多塞和博尔达两种目标,提出了首批“两全其美”的多对决老虎机算法,无需事先知道所处环境是随机还是对抗性的,即可在两种环境下同时取得最优或接近最优的遗憾界。

详情
AI中文摘要

多对决老虎机问题中,学习者每轮选择m≥2个臂并仅观察胜者;该问题自然出现在排名和推荐系统等许多应用中,但一个基本问题一直悬而未决:单一算法能否在不知道所处环境的情况下,在随机和对抗性环境中都达到最优?我们对此给出了肯定答案,提出了首批适用于康多塞和博尔达目标的“两全其美”多对决老虎机算法。对于康多塞设置,我们提出MetaDueling,一种黑盒归约方法,它将多方胜者反馈转换为无偏的成对信号,从而把任意对决老虎机算法转换为多对决老虎机算法。将该归约应用于Versatile-DB,得到首个“两全其美”的多对决老虎机算法:它在对抗性偏好下达到O(√(KT))的伪遗憾,在随机偏好下达到实例最优的O(∑_{i≠a*} logT/Δ_i)的伪遗憾,两者同时成立且无需关于环境的先验知识。对于博尔达设置,我们提出SA-MiDEX,一种兼顾随机与对抗环境的算法,它在随机环境中达到O(K²logKT + Klog²T + ∑_{i:Δ_i^B>0} KlogKT/(Δ_i^B)²)的遗憾,在对抗性环境中达到O(K√(TlogKT) + K^{1/3}T^{2/3}(logK)^{1/3})的遗憾,同样无需先验知识。对于康多塞设置,我们给出了与上界匹配的下界。对于博尔达设置,我们的上界相对于下界是近似最优的(相差至多K倍),并与文献中已知的最佳结果一致。

英文摘要

Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose $\texttt{MetaDueling}$, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with $\texttt{Versatile-DB}$ yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\left(\sum_{i \neq a^\star} \frac{\log T}{Δ_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose $\texttt{SA-MiDEX}$, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: Δ_i^{\mathrm{B}} > 0} \frac{K\log KT}{(Δ_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.

2603.18702 2026-05-19 cs.LG 版本更新

Off-Policy Learning with Limited Supply

有限供应下的离策略学习

Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito

发表机构 * Keio University(庆应义塾大学) Institute of Science Tokyo(东京科学大学) Meiji University(明治大学) Yale University(耶鲁大学) LY Corporation(LY公司) Hanjuku-kaso, Co., Ltd.

AI总结 本文研究了在情境老虎机中受限供应下的离策略学习问题,提出了一种新的OPLS方法,通过考虑用户间的相对预期奖励来更高效地分配有限供应的物品,实验证明其在有限供应情境下的优越性。

Comments Published as a conference paper at WWW 2026

详情
AI中文摘要

我们研究了情境老虎机中的离策略学习(OPL),这在推荐系统和在线广告等广泛的实际应用中起着关键作用。典型的OPL在情境老虎机中假设一个无约束环境,其中策略可以无限次选择同一物品。然而,在许多实际应用中,包括优惠券分配和电子商务,有限供应通过分布式优惠券的预算限制或产品库存限制来限制物品。在这些设置中,贪心地选择当前用户预期奖励最高的物品可能导致该物品的早期耗尽,使其无法为未来可能生成更高预期奖励的用户使用。因此,最优的无约束设置中的OPL方法在有限供应设置中可能变得次优。为了解决这个问题,我们提供了一个理论分析,显示传统贪心OPL方法可能无法最大化策略性能,并证明在有限供应设置中必须存在性能更优的策略。基于这一见解,我们引入了一种新的方法,称为有限供应下的离策略学习(OPLS)。与简单选择预期奖励最高的物品不同,OPLS关注相对预期奖励较高的物品,从而更有效地分配有限供应的物品。我们在合成和现实数据集上的实验证明,OPLS在具有有限供应的情境老虎机问题中优于现有的OPL方法。

英文摘要

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
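
The depletion problem described in this abstract can be made concrete with a toy example. The sketch below is not the OPLS algorithm itself: the reward table, the supply counts, and in particular the "relative" scoring rule (a user's expected reward minus the item's average over all users) are assumptions chosen only to illustrate why greedy selection can be suboptimal when supply is limited.

```python
# Hypothetical expected rewards for two users arriving in this order.
REWARDS = [
    {"A": 0.6, "B": 0.5},  # user 1 slightly prefers A
    {"A": 0.9, "B": 0.1},  # user 2 strongly prefers A
]
SUPPLY = {"A": 1, "B": 1}  # one unit of each item

# Average expected reward of each item across users (assumed known here).
ITEM_MEAN = {item: sum(r[item] for r in REWARDS) / len(REWARDS) for item in SUPPLY}


def greedy_score(user_rewards, item):
    """Greedy rule: rank items by the current user's expected reward."""
    return user_rewards[item]


def relative_score(user_rewards, item):
    """Illustrative rule: rank items by reward relative to other users."""
    return user_rewards[item] - ITEM_MEAN[item]


def allocate(rewards, supply, score):
    """Serve users in order, giving each the best-scoring item still in stock."""
    remaining = dict(supply)
    total = 0.0
    for user_rewards in rewards:
        available = [item for item, left in remaining.items() if left > 0]
        if not available:
            break
        choice = max(available, key=lambda item: score(user_rewards, item))
        remaining[choice] -= 1
        total += user_rewards[choice]
    return total


if __name__ == "__main__":
    print("greedy:  ", allocate(REWARDS, SUPPLY, greedy_score))
    print("relative:", allocate(REWARDS, SUPPLY, relative_score))
```

Under the greedy rule, user 1 takes item A and user 2 is left with item B, for a total expected reward of about 0.7. Under the relative rule, user 1 takes item B and item A is kept for user 2, for a total of about 1.4.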

2603.14462 2026-05-19 cs.LG cs.AI 版本更新

STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks

STAG-CN:用于蜂箱传感器网络疾病发病预测的时空养蜂场图卷积网络

Sungwoo Kang

AI总结 该研究提出STAG-CN模型,通过建模蜂箱间关系来预测疾病发病,利用时空图卷积网络结合物理位置和气候传感器相关性,验证了共享环境响应模式比空间接近性更有效。

Comments Null result after running with 10 seeds

详情
AI中文摘要

蜜蜂蜂群损失威胁着全球授粉服务,但当前监测系统将每个蜂箱视为孤立单元,忽略了疾病在养蜂场之间传播的空间路径。本文提出时空养蜂场图卷积网络(STAG-CN),一种对蜂箱间关系建模以预测疾病发病的图神经网络。STAG-CN基于双邻接图运行,该图结合了蜂箱会话之间的物理共置关系和气候传感器相关性,并通过基于因果扩张卷积和Chebyshev谱图卷积的“时间-空间-时间”三明治架构处理多变量物联网传感器流。在韩国AI Hub养蜂数据集(数据集#71488)上采用扩展窗口时间交叉验证进行评估,STAG-CN在三天预测范围上达到0.607的F1分数。消融研究显示,仅使用气候邻接矩阵即可达到完整模型的性能(F1=0.607),而仅使用物理邻接矩阵时F1=0.274,表明对于疾病发病预测,共享的环境响应模式比空间邻近性携带更强的预测信号。这些结果为精准养蜂中基于图的生物安全监测提供了概念验证,表明蜂箱间的传感器相关性编码了单蜂箱方法无法察觉的疾病相关信息。

英文摘要

Honey bee colony losses threaten global pollination services, yet current monitoring systems treat each hive as an isolated unit, ignoring the spatial pathways through which diseases spread across apiaries. This paper introduces the Spatio-Temporal Apiary Graph Convolutional Network (STAG-CN), a graph neural network that models inter-hive relationships for disease onset prediction. STAG-CN operates on a dual adjacency graph combining physical co-location and climatic sensor correlation among hive sessions, and processes multivariate IoT sensor streams through a temporal-spatial-temporal sandwich architecture built on causal dilated convolutions and Chebyshev spectral graph convolutions. Evaluated on the Korean AI Hub apiculture dataset (dataset #71488) with expanding-window temporal cross-validation, STAG-CN achieves an F1 score of 0.607 at a three-day forecast horizon. An ablation study reveals that the climatic adjacency matrix alone matches full-model performance (F1 = 0.607), while the physical adjacency alone yields F1 = 0.274, indicating that shared environmental response patterns carry stronger predictive signal than spatial proximity for disease onset. These results establish a proof-of-concept for graph-based biosecurity monitoring in precision apiculture, demonstrating that inter-hive sensor correlations encode disease-relevant information invisible to single-hive approaches.

2603.12145 2026-05-19 cs.LG cs.AI cs.SE 版本更新

Automatic Generation of High-Performance RL Environments

自动生成高性能强化学习环境

Seth Karten, Rahul Dev Appapogu, Chi Jin

发表机构 * Princeton University(普林斯顿大学) Independent Researcher(独立研究者)

AI总结 本文提出了一种闭环方法,通过最小的计算成本生成等效的高性能强化学习环境,展示了三种不同的工作流程,并在五个环境中验证了无仿真到仿真的差距,同时展示了新的环境创建方法。

Comments 20 pages, 5 figures

详情
AI中文摘要

将复杂的强化学习(RL)环境转换为高性能实现传统上需要数月的专业工程工作。我们提出了一种闭环方法,以最小的计算成本生成等效的高性能环境。我们的方法使用通用提示模板、分层验证(属性、交互和运行测试)、迭代修复和跨后端策略转移来验证无仿真到仿真的差距。我们展示了三个不同的工作流程跨越五个环境:(1)从Game Boy模拟器PyBoy直接翻译到我们的EmuRust(通过Rust IPC)和从Pokemon Showdown翻译到我们的PokeJAX(通过JAX);(2)通过与现有高性能实现的吞吐量一致性进行验证,如Puffer Pong、MJX和Brax在匹配的GPU批次大小下;(3)新环境的创建:TCGJax,第一个Pokemon TCG Pocket环境,从网页提取的规范中创建。在2亿个参数下,环境开销低于训练时间的4%。我们的闭环方法验证了所有五个环境的等效性。TCGJax,由一个不在公共存储库中的私有参考合成,用于控制代理预训练数据的污染问题。

英文摘要

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.

2603.11276 2026-05-19 stat.ML cs.LG 版本更新

RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

RIE-Greedy: 基于正则化的探索策略用于上下文老虎机

Tong Li, Thiago de Queiroz Casanova, Eric M. Schwartz, Victor Kostyuk, Dehan Kong, Joseph J. Williams

发表机构 * University of Toronto(多伦多大学) University of Michigan(密歇根大学)

AI总结 本文提出了一种基于正则化的探索策略(RIE-Greedy),利用模型拟合过程中的随机性作为内在探索源,理论证明其在两臂老虎机情况下等价于Thompson Sampling,并在大规模商业环境中优于epsilon-greedy等基准方法。

详情
AI中文摘要

现实中的复杂奖励模型的上下文老虎机问题通常使用迭代训练的模型(如提升树)来解决。然而,直接应用简单的有效探索策略(如Thompson Sampling或UCB)在这些黑箱估计器上很困难。现有方法依赖于复杂的假设或不可行的程序,难以在实践中验证和实现。本文探讨了一种无探索(纯贪婪)的动作选择策略,利用模型拟合过程中的随机性作为内在探索源。更具体地说,我们注意到基于交叉验证的正则化过程中的随机性可以自然地诱导出Thompson Sampling-like的探索。我们证明了这种正则化诱导的探索在两臂老虎机情况下在理论上等价于Thompson Sampling,并在大规模商业环境中相对于epsilon-greedy和其他最先进的方法在经验上实现了可靠的探索。总体而言,本文揭示了正则化估计器训练本身如何诱导有效的探索,为上下文老虎机设计提供了理论洞察和实践指导。

英文摘要

Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--such as Thompson Sampling or UCB--on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) action selection strategy, that exploits the randomness inherent in model fitting process as an intrinsic source of exploration. More specifically, we note that the stochasticity in cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. We show that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches. Overall, our work reveals how regularized estimator training itself can induce effective exploration, offering both theoretical insight and practical guidance for contextual bandit design.

2603.10935 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

具有聚类感知可行区域的球形VAE:保证防止后验崩溃

Zegu Zhang, Jian Zhang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种理论保证非崩溃解的新型框架,通过利用球壳几何和聚类感知约束,防止VAE中的后验崩溃问题,并在合成和现实数据集上实现了100%的崩溃预防。

Comments 8 pages, 6 figures

详情
AI中文摘要

变分自编码器(VAEs)经常受到后验崩溃的影响,其中潜在变量在近似后验退化为先验时变得无信息。尽管最近的研究将崩溃描述为由数据协方差属性决定的相变,但现有方法主要旨在避免而非消除崩溃。我们引入了一种新的框架,通过利用球壳几何和聚类感知约束,从理论上保证非崩溃解。我们的方法将数据转换为球壳,通过K-means计算最优聚类分配,并定义一个在聚类内方差W和崩溃损失δ-collapse之间的可行区域。我们证明当重构损失被限制在这个区域内时,崩溃解在数学上被排除在可行参数空间之外。关键的是,我们引入了规范约束机制,确保解码器输出保持与球壳几何兼容,而不限制表示能力。与以往方法不同,我们的方法提供了严格的理论保证,计算开销小,且不施加对解码器输出的限制。在合成和现实数据集上的实验表明,在传统VAE完全失败的条件下,实现了100%的崩溃预防,重构质量匹配或超过最先进的方法。我们的方法不需要显式的稳定性条件(例如σ² < λ_max),并且适用于任意神经网络架构。代码可在https://github.com/tsegoochang/spherical-vae-with-Cluster获取。

英文摘要

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $δ_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $σ^2 < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.

2603.03099 2026-05-19 cs.LG cs.AI 版本更新

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

为何Adam能胜过SGD:二阶矩归一化产生更尖锐的尾部

Ruinan Jin, Yingbin Liang, Shaofeng Zou

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University(俄亥俄州立大学电气与计算机工程系) School of Electrical, Computer and Energy Engineering, Arizona State University(亚利桑那州立大学电气、计算机与能源工程学院)

AI总结 本文揭示了Adam中的关键二阶矩归一化机制,并通过停止时间/鞅分析,在经典有界方差模型下,证明了Adam在高概率收敛行为上优于SGD,前者对置信参数δ的依赖为δ^{-1/2},而SGD则至少为δ^{-1}。

Comments 68 pages

详情
AI中文摘要

尽管Adam在许多应用中表现出比SGD更快的实证收敛速度,但现有的大多数理论保证与SGD几乎相同,无法充分解释实证性能差距。在本文中,我们揭示了Adam中的关键二阶矩归一化,并开发了一种停止时间/鞅分析,该分析在经典有界方差模型(一个二阶矩假设)下,能够证明Adam在高概率收敛行为上优于SGD。具体而言,我们建立了两种方法高概率收敛行为之间的第一个理论区分:Adam对置信参数δ的依赖为δ^{-1/2},而SGD对应的高概率保证至少需要δ^{-1}的依赖。

英文摘要

Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $δ^{-1/2}$ dependence on the confidence parameter $δ$, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a $δ^{-1}$ dependence.
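
For readers unfamiliar with the second-moment normalization this abstract refers to, here is a minimal sketch of the standard Adam update next to a plain SGD step, for a single scalar parameter. It shows the textbook update rule only, not the paper's stopping-time analysis; the default hyperparameters are the commonly used ones, and the function names are illustrative.

```python
import math


def sgd_step(theta, grad, lr=1e-3):
    """Plain SGD: the step size scales linearly with the gradient."""
    return theta - lr * grad


def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` is (first moment, second moment, step count)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Dividing by sqrt(v_hat) is the second-moment normalization.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, (m, v, t)


if __name__ == "__main__":
    for grad in (0.01, 100.0):
        adam_theta, _ = adam_step(0.0, grad, (0.0, 0.0, 0))
        print(grad, sgd_step(0.0, grad), adam_theta)
```

On the first step the normalized update has magnitude close to `lr` whether the gradient is 0.01 or 100, while the SGD step differs by four orders of magnitude. This insensitivity to gradient scale is the mechanism the abstract singles out.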

2602.24238 2026-05-19 cs.LG 版本更新

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

时间序列基础模型作为交通预测的强基线:一项大规模基准分析

Javier Yanes-Pulido, Filipe Rodrigues

发表机构 * Technical University of Denmark(丹麦技术大学)

AI总结 本文通过在十个真实世界数据集上评估最新时间序列模型Chronos-2的零样本性能,证明了通用时间序列基础模型在交通预测中的有效性,展示了其在多数数据集上达到或超越传统统计基线和专用深度学习架构的准确性,尤其在长预测范围内表现突出。

Comments 6 pages

详情
AI中文摘要

准确预测交通动态对于城市交通和基础设施规划至关重要。尽管近期工作在深度学习模型中取得了优异表现,但这些方法通常需要特定数据集的训练、架构设计和超参数调整。本文评估了通用时间序列基础模型是否能作为交通任务的预测器,通过在十个涵盖高速公路交通量和流、城市交通速度、自行车共享需求和电动汽车充电站数据的真实世界数据集上,对最新模型Chronos-2的零样本性能进行基准测试。在一致的评估协议下,我们发现,即使没有任何任务特定的微调,Chronos-2在大多数数据集上均达到或超越了传统统计基线和专用深度学习架构的准确性,特别是在长预测范围。除了点预测外,我们还通过预测区间覆盖和锐度评估其原生概率输出,证明Chronos-2在无需特定数据集训练的情况下也提供了有用的不确定性量化。总体而言,本研究支持将时间序列基础模型作为交通预测研究的关键基准。

英文摘要

Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
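
The abstract evaluates probabilistic forecasts by prediction-interval coverage and sharpness. The sketch below uses the common definitions (fraction of observations inside the interval, and mean interval width); the paper's exact formulas are not given in the abstract, so treat these as assumptions. The sample numbers are made up.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


def interval_sharpness(lower, upper):
    """Mean interval width; narrower intervals are sharper."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


if __name__ == "__main__":
    # Hypothetical hourly traffic counts with forecast interval bounds.
    y_true = [10, 12, 15, 20]
    lower = [8, 11, 16, 18]
    upper = [12, 13, 18, 22]
    print(interval_coverage(y_true, lower, upper))  # 3 of 4 inside -> 0.75
    print(interval_sharpness(lower, upper))         # widths 4, 2, 2, 4 -> 3.0
```

The two metrics are read together: a forecaster can reach high coverage trivially with very wide intervals, so a useful model keeps coverage near the nominal level while keeping the width small.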

2602.23566 2026-05-19 cs.LG cs.AI 版本更新

Flowette: Flow Matching with Graphette Priors for Graph Generation

Flowette: 基于Graphette先验的图生成流匹配

Asiri Wijesinghe, Sevvandi Kandanaarachchi, Daniel M. Steinberg, Cheng Soon Ong

发表机构 * CSIRO’s Data61(CSIRO数据61) Australian National University(澳大利亚国立大学)

AI总结 本文提出Flowette框架,利用基于图神经网络的transformer学习图表示上的速度场,结合最优传输耦合与正则化,并引入graphettes这一新的概率图结构模型族作为结构先验;实验表明将结构先验与基于流的训练相结合是有效的。

Comments 48 Pages

详情
AI中文摘要

我们研究具有重复子图motif的图的生成建模。我们提出了Flowette,一个连续流匹配框架,利用基于图神经网络的transformer学习带有节点和边属性的图表示上的速度场。我们的模型通过基于最优传输的耦合促进拓扑感知对齐,并通过正则化促进全局结构一致性。为了引入领域驱动的结构先验,我们提出graphettes,一种新的概率图结构模型族,它通过受控的结构编辑将graphons推广到环、星形和树等motif。我们从理论上分析了该框架的耦合、不变性和结构性质,在合成和分子基准上对其进行了评估,并通过受控消融实验分离出结构先验、最优传输耦合和正则化项各自的贡献。Flowette总体上取得了有竞争力的性能,并在多个基准的若干指标上达到最先进的结果,突显了将结构先验与基于流的训练相结合在建模复杂图分布方面的有效性。

英文摘要

We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework that employs a graph neural network-based transformer to learn a velocity field over graph representations with node and edge attributes. Our model promotes topology-aware alignment through optimal transport-based coupling and encourages global structural coherence through regularisation. To incorporate domain-driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs such as rings, stars, and trees. We theoretically analyze the coupling, invariance, and structural properties of the framework, evaluate it on synthetic and molecular benchmarks, and isolate the contributions of the structural prior, the optimal-transport coupling, and the regularisation terms through controlled ablations. Flowette achieves competitive performance overall, attaining state-of-the-art results on several metrics across multiple benchmarks, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.

2602.21426 2026-05-19 cs.LG stat.CO 版本更新

Proximal-IMH: Proximal Posterior Proposals for Independent Metropolis-Hastings with Approximate Operators

Proximal-IMH: 面向近似算子的独立Metropolis-Hastings近端后验提议

Youguang Chen, George Biros

发表机构 * Oden Institute for Computational Engineering and Sciences(奥登计算工程与科学研究所)

AI总结 本文提出Proximal-IMH,一种改进的独立Metropolis-Hastings采样方案:通过一个辅助优化问题校正来自近似后验的样本,以消除其偏差,在贴合精确模型与保持近似参考点附近的稳定性之间取得折中,从而提高接受率和混合效率。

详情
AI中文摘要

我们考虑从科学、工程和成像领域的贝叶斯反问题所产生的后验分布中采样的问题。我们的方法属于贝叶斯推断中常用的独立Metropolis-Hastings(IMH)采样算法族。假设存在一个采样成本更低但可能有显著偏差的近似后验分布,我们提出Proximal-IMH:该方案通过一个辅助优化问题校正来自近似后验的样本,从而消除这一偏差。由此得到的局部调整在贴合精确模型与保持近似参考点附近的稳定性之间进行权衡。在理想化设置下,我们证明近端校正能够使近似后验与精确后验更加吻合,从而提高接受率并改善混合性。该方法适用于线性和非线性输入-输出算子,尤其适合精确后验采样成本过高的反问题。我们给出了数值实验,其中包括多模态先验和数据驱动先验,并采用非线性输入-输出算子。结果表明,Proximal-IMH稳定地优于现有的IMH变体。

英文摘要

We consider the problem of sampling from a posterior distribution arising in Bayesian inverse problems in science, engineering, and imaging. Our method belongs to the family of independence Metropolis-Hastings (IMH) sampling algorithms, which are common in Bayesian inference. Relying on the existence of an approximate posterior distribution that is cheaper to sample from but may have significant bias, we introduce Proximal-IMH, a scheme that removes this bias by correcting samples from the approximate posterior through an auxiliary optimization problem. This yields a local adjustment that trades off adherence to the exact model against stability around the approximate reference point. For idealized settings, we prove that the proximal correction tightens the match between approximate and exact posteriors, thereby improving acceptance rates and mixing. The method applies to both linear and nonlinear input-output operators and is particularly suitable for inverse problems where exact posterior sampling is too expensive. We present numerical experiments including multimodal and data-driven priors with nonlinear input-output operators. The results show that Proximal-IMH reliably outperforms existing IMH variants.
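
As background for this abstract, the following is a minimal sketch of a plain independence Metropolis-Hastings sampler, the baseline family the method belongs to. It does not implement the proximal correction, which the abstract does not specify in detail. The Gaussian target, the wider Gaussian proposal standing in for a cheap approximate posterior, and all function names are assumptions for illustration.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of N(mean, std^2) up to the shared constant -0.5*log(2*pi)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std)


def independence_mh(log_target, log_proposal, sample_proposal, n_steps, x0, seed=0):
    """IMH: proposals are drawn independently of the current state."""
    rng = random.Random(seed)
    x = x0
    log_w = log_target(x) - log_proposal(x)  # log importance weight of current state
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = sample_proposal(rng)
        log_w_new = log_target(x_new) - log_proposal(x_new)
        # Accept with probability min(1, w_new / w).
        if rng.random() < math.exp(min(0.0, log_w_new - log_w)):
            x, log_w = x_new, log_w_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


if __name__ == "__main__":
    samples, rate = independence_mh(
        log_target=lambda x: log_normal(x, 1.0, 1.0),    # "exact" posterior
        log_proposal=lambda x: log_normal(x, 0.0, 2.0),  # biased approximation
        sample_proposal=lambda rng: rng.gauss(0.0, 2.0),
        n_steps=20000,
        x0=0.0,
    )
    print(sum(samples) / len(samples), rate)
```

The acceptance rate depends entirely on how well the proposal matches the target, which is why the abstract's idea of correcting approximate-posterior samples before the accept/reject step can improve acceptance and mixing.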

2602.21265 2026-05-19 cs.CL cs.LG cs.SE 版本更新

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

ToolMATH: 一种用于在系统性工具目录约束下评估长周期工具使用的诊断基准

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出ToolMATH,一种基于数学的诊断基准,用于评估在可控工具目录条件下长周期工具使用的性能,通过将分步MATH解决方案转换为可重用的Python工具,并配对需要顺序工具使用、中间输出重用和逻辑连接工具调用链的问题,从而评估模型在不同工具目录条件下的适应性、鲁棒性和工具连接性。

Comments Submitted to NeurIPS Evaluation & Dataset Track

详情
AI中文摘要

我们介绍了ToolMATH,一种用于评估在可控工具目录条件下长周期工具使用的数学基础诊断基准。ToolMATH将分步MATH解决方案转换为具有自然语言描述和类型化架构的可重用Python工具,并配对每个问题与一个需要顺序工具使用、中间输出重用和逻辑连接工具调用链的工具环境。ToolMATH通过构建黄金工具和难度分级的干扰项来控制工具可用性和目录难度。ToolMATH还结合了行为条件度量指标,使诊断评估超越最终准确性。基于这些测量,ToolMATH强调三个评估轴:(1)适应性衡量在黄金工具被完全替换为干扰项时保留的黄金成功程度;(2)鲁棒性衡量在添加干扰项作为噪声时的稳定性;(3)工具连接性衡量模型是否在长执行的工具调用链中保持准确性。此外,跟踪级失败分析描述了模型在每种工具目录条件下如何失败。这些诊断揭示了不同的模型特征:可靠的工具使用、工具回避、适应性替代以及不可靠工具目录的影响。总体而言,ToolMATH提供了一个受控的测试平台,用于评估语言模型如何适应变化的工具可用性,保持对干扰项的鲁棒性,并在长周期工具使用轨迹中保持正确性。

英文摘要

We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness measures stability when distractors are added as noise; and (3) Tool Connectivity measures whether models preserve accuracy over long executed tool-call chains. Furthermore, trace-level failure analyses characterize how models fail under each tool-catalog condition. Together, these diagnostics reveal distinct model profiles: reliable tool use, tool avoidance, adaptive substitution, and impacts of unreliable tool catalogs. Overall, ToolMATH provides a controlled testbed for evaluating how language models adapt to changing tool availability, remain robust to distractors, and maintain correctness across long-horizon tool-use trajectories.

2602.18895 2026-05-19 q-fin.RM cs.LG 版本更新

Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?

大语言模型能否在信用风险模型中作为事后可解释性工具?

Wenxi Geng, Dingyuan Liu, Liya Li, Yiqing Wang

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Georgia Institute of Technology(佐治亚理工学院) Southern Methodist University(南方卫理公会大学)

AI总结 本文研究大语言模型是否能作为信用风险模型的事后可解释性接口,评估其在保持特征重要性排名和生成自主解释方面的能力。

Comments 39 pages, 1 figure

详情
AI中文摘要

大语言模型(LLMs)在将基于模型的解释转化为人类可读的叙述方面展现出了潜力。本研究评估LLMs能否作为信用风险模型的事后可解释性接口,重点考察其保持特征重要性排名和自主生成解释的能力。使用LendingClub数据集,我们在三个主要LLMs(包括GPT-4-turbo、Claude-Sonnet-4.5和Gemini-2.5-Flash)上,将LLM输出与SHAP归因和基于系数的归因进行比较。结果表明,在受控提示下,LLMs能够可靠地重现参考排名,但在自主生成解释时与参考归因的一致性有限。这些发现表明,在信用风险治理中,LLMs最适合用作叙述接口,而不是替代正式的归因方法。

英文摘要

Large language models (LLMs) have shown promise in translating model-based explanations into human-readable narratives. This study evaluates whether LLMs can serve as post-hoc explainability interfaces for credit risk models, focusing on their ability to preserve feature-importance rankings and generate autonomous explanations. Using a LendingClub dataset, we compare LLM outputs with SHAP and coefficient-based attributions on three major LLMs, including GPT-4-turbo, Claude-Sonnet-4.5, and Gemini-2.5-Flash. Results indicate that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously. These findings suggest that LLMs are best deployed as narrative interfaces rather than substitutes for formal attribution methods in credit risk governance.

2602.18227 2026-05-19 cs.LG 版本更新

Parameter-Efficient Domain Adaptation of Physics-Informed Self-Attention based GNNs for AC Power Flow Prediction

面向交流潮流预测的物理信息自注意力GNN的参数高效领域自适应

Redwanul Karim, Changhun Kim, Timon Conrad, Nora Gourmelon, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany(模式识别实验室,埃尔兰根-纽伦堡大学,埃尔兰根,德国) Institute of Electrical Energy Systems, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany(电气能源系统研究所,埃尔兰根-纽伦堡大学,德国)

AI总结 本文研究基于自注意力的物理信息GNN的参数高效领域自适应:通过基于物理的损失鼓励符合基尔霍夫定律的行为,并将适应限制为低秩更新,从而在电压等级偏移下实现可控的效率-精度权衡。

详情
AI中文摘要

当在中压(MV)电网上训练的模型被部署到高压(HV)网络时,领域偏移下准确的交流潮流(AC-PF)预测至关重要。现有的物理信息图神经网络(GNN)求解器通常依赖全量微调来实现跨电压等级迁移,这带来高昂的再训练成本,并且难以控制目标域适应与源域保留之间的稳定性-可塑性权衡。我们研究基于自注意力的物理信息GNN的参数高效领域自适应:通过基于物理的损失鼓励符合基尔霍夫定律的行为,同时将适应限制为低秩更新。具体而言,我们将低秩适应(LoRA)应用于注意力投影,并选择性地解冻预测头以调节适应容量。这种设计为电压等级偏移下的物理约束逆估计提供了可控的效率-精度权衡。在多种电网拓扑上,所提出的LoRA+PHead适应方法恢复了接近全量微调的精度,目标域RMSE差距为$2.6 \times 10^{-4}$,同时将可训练参数数量减少了$85.46\%$。基于物理的残差与全量微调相当;然而,相对于全量微调,LoRA+PHead在领域偏移下使中压源域保留下降了4.7个百分点(17.9% vs. 22.6%),但仍能实现参数高效且物理一致的AC-PF估计。

英文摘要

Accurate AC power flow (AC-PF) prediction under domain shift is critical when models trained on medium-voltage (MV) grids are deployed on high-voltage (HV) networks. Existing physics-informed graph neural network (GNN) solvers typically rely on full fine-tuning for cross-regime transfer, incurring high retraining cost and offering limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention. We study parameter-efficient domain adaptation for physics-informed self-attention-based GNNs, encouraging Kirchhoff-consistent behavior via a physics-based loss while restricting adaptation to low-rank updates. Specifically, we apply low-rank adaptation (LoRA) to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity. This design yields a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation under voltage-regime shift. Across multiple grid topologies, the proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with a target-domain RMSE gap of $2.6 \times 10^{-4}$ while reducing the number of trainable parameters by $85.46\%$. The physics-based residual remains comparable to full fine-tuning; however, relative to Full FT, LoRA+PHead reduces MV source retention by 4.7 percentage points (17.9% vs. 22.6%) under domain shift, while still enabling parameter-efficient and physically consistent AC-PF estimation.
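The low-rank update mentioned in this abstract can be sketched in a few lines. This is a minimal pure-Python illustration of generic LoRA on one projection matrix, not the authors' implementation: the 8x8 layer size, rank 2, the scaling `alpha`, and the names `lora_weight` and `trainable_fraction` are assumptions chosen for readability, and the resulting parameter fraction is unrelated to the 85.46% figure reported above.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def count(m):
    """Number of entries in a matrix."""
    return sum(len(row) for row in m)

rng = random.Random(0)
d_out, d_in, rank = 8, 8, 2  # illustrative sizes only (assumption)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable factor
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

w_adapted = lora_weight(w, a, b, alpha=4.0)
trainable_fraction = (count(a) + count(b)) / count(w)
print(trainable_fraction)
```

Because `b` starts at zero, the adapted weight equals the frozen weight before any training step, so adaptation begins from the source-domain model; only `a` and `b` would receive gradients.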

2602.17684 2026-05-19 cs.LG cs.AI 版本更新

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

CodeScaler: 通过奖励模型扩展代码大语言模型的训练和测试时间推理

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

发表机构 * LARK, HKUST(GZ)(LARK,香港科技大学(广州)) Kuaishou Technology(快手科技) UCL(伦敦大学学院) UZH(苏黎世大学) NUS(新加坡国立大学)

AI总结 本文提出CodeScaler,一种通过奖励模型扩展代码生成模型的训练和测试时间推理的框架,通过精心编纂的偏好数据和语法感知的代码提取,实现了在四个编码基准上比基于执行的RL提升1.55分,在Qwen3-14B-Base上提升4.23分,并在无测试用例的情况下通过合成数据进一步提升14.64分,同时在推理时间减少10倍的延迟,且在代码、通用和推理领域均优于现有奖励模型。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)通过利用单元测试的执行反馈推动了代码大语言模型的最新进展,但其可扩展性从根本上受到高质量测试用例可用性和可靠性的影响。我们提出CodeScaler,一种奖励模型,旨在扩展代码生成的强化学习训练和测试时间推理。CodeScaler是在经过验证的代码问题上精心编纂的偏好数据上训练的,并结合语法感知的代码提取和保持有效性的奖励塑造,以确保稳定和稳健的优化。在四个编码基准上,CodeScaler在Qwen3-8B-Base上比基于执行的RL提升1.55分,在Qwen3-14B-Base上提升4.23分。通过进一步扩展到44K问题并添加额外的合成数据,CodeScaler在无任何测试用例的情况下,相对于基础模型提升了14.64分。在推理时间,CodeScaler作为有效的测试时间扩展方法,实现了与单元测试方法相当的性能,同时在推理时间减少了10倍的延迟。此外,CodeScaler在RM-Bench上不仅在代码领域(+3.3分)上优于现有奖励模型,还在通用和推理领域(平均+2.7分)上也表现优异。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).

2602.12687 2026-05-19 cs.LG cs.AI 版本更新

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

信任不确定的教师:通过校准的不确定性提炼暗知识

Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

发表机构 * Ewha Womans University(梨花女子大学) Tencent(腾讯)

AI总结 本文提出校准不确定性提炼(CUD)框架,通过从分布角度重新审视知识蒸馏,使暗知识更忠实地被访问。CUD鼓励教师在有信息的地方揭示不确定性,并引导学生学习校准而非锐化确定性,从而在易例中获益于自信信号,在难例中获益于结构化不确定性,提升了学生在分布偏移和长尾输入上的准确性和可靠性。

详情
AI中文摘要

知识蒸馏的核心在于迁移教师丰富的“暗知识”,即揭示类别间关系和不确定性分布的细微概率模式。尽管这一理念已被广泛接受,但用传统交叉熵训练的教师往往无法保留此类信号。它们的分布会坍缩成尖锐、过度自信的峰,看似果断,实则脆弱:除硬标签外几乎不提供额外信息,甚至会微妙地妨碍表示层面的迁移。这种过度自信在高基数任务中尤其成问题,因为在这类任务中,众多合理类别之间的细微差别对指导紧凑的学生模型最为重要。此外,这种脆弱的目标会降低分布偏移下的鲁棒性,使学生在真实条件下容易出现校准失准。为解决这一局限,我们从分布角度重新审视蒸馏,并提出校准不确定性蒸馏(CUD),该框架旨在让暗知识能够被更忠实地获取。CUD不是不加批判地沿用教师的过度自信,而是鼓励教师在不确定性具有信息量之处将其显露出来,并引导学生从经过校准而非被锐化为确定性的目标中学习。通过在迁移前直接塑造教师的预测分布,我们的方法兼顾准确性与校准,使学生既能受益于简单样本上的自信信号,也能受益于困难样本上的结构化不确定性。在多样化的基准测试中,CUD得到的学生不仅更准确,而且在分布偏移下校准更好,在模糊的长尾输入上也更可靠。

英文摘要

The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge': subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that are calibrated rather than sharpened into certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also better calibrated under shift and more reliable on ambiguous, long-tail inputs.

2602.11130 2026-05-19 cs.LG cs.CV 版本更新

Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Meltdown: 点云条件化3D扩散Transformer中的回路与分岔

Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins

发表机构 * Institute for Machine Learning, JKU Linz(机器学习研究所,林茨大学)

AI总结 该研究探讨了点云条件化3D扩散变换器在输入变化下的失败模式,揭示了Meltdown现象,通过机制性案例研究展示了其成因,并提出了PowerRemap方法以抑制该现象。

详情
AI中文摘要

稀疏点云是3D表面重建中常见的输入模态,在手术导航和自动驾驶感知等安全关键场景中也是如此。最近的点云条件化3D扩散Transformer借助学习到的先验,在这一场景下取得了最先进的结果。我们表明,这些模型在现实的输入变化下可能发生灾难性失败,并给出一项解释其原因的机制性案例研究。我们识别出一种称为Meltdown的失败模式:对稀疏输入点云施加微小的表面内扰动,就可能使重建输出碎裂成数百个互不连通的部分。在我们研究的两个开放权重的最先进架构(WaLa、Make-a-Shape)上,在真实世界数据集(GSO、SimJEB)以及DDPM和DDIM两种采样下,对抗搜索在89.9-100%的形状上诱发了Meltdown。我们沿前向传播追踪Meltdown:它由点在表面上分布的均匀程度决定,经点云编码器忠实传递,并由扩散骨干网络中去噪早期的一次交叉注意力写入所确定。扩散轨迹集合在该确定步骤附近表现出对称性破缺,这与反向过程发生分岔相一致。通过一组幅度匹配的对照实验,我们表明模型所确定的变量是方向性的,集中在该写入的扰动漂移的一个低秩子空间中。受此发现启发,我们提出PowerRemap,一种测试时控制方法,通过重塑局部化写入的奇异值谱来抑制这种漂移,在WaLa上的挽救率为98.3%,在Make-a-Shape上为84.6%。这些结果将回路级的交叉注意力机制与轨迹级的失败解释联系起来,展示了机制分析如何解释并指导条件扩散Transformer的行为。

英文摘要

Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

2602.07884 2026-05-19 cs.LG cs.AI 版本更新

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

GRAFT:分离排名与校准用于生存分析

Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

发表机构 * KAUST(阿卜杜拉国王科技大学) CEMSE KAUST(阿卜杜拉国王科技大学计算机、电气与数学科学与工程学部) Duke University(杜克大学)

AI总结 本文提出GRAFT模型,通过分离预测排名与生存校准,解决生存分析中排名与校准之间的权衡问题,该模型结合线性AFT模型与非线性残差神经网络,并利用随机门进行自动特征选择,从而在公开基准测试中实现了更好的判别能力和校准性能。

详情
AI中文摘要

生存分析受到删失数据、高维特征和非线性交互的挑战。经典模型提供可解释性和优越的校准能力,但局限于线性或预定义的功能形式,而深度学习模型具有灵活性并实现了强大的判别性能,但倾向于产生校准不佳的生存估计。为了解决这一权衡问题,我们提出GRAFT(Gated Residual Accelerated Failure Time),一种新的AFT模型,该模型将预测排名与生存校准分离。GRAFT的混合架构结合了线性AFT模型与非线性残差神经网络,并整合了随机门用于自动特征选择。该模型通过优化可微的、C-index对齐的排名损失进行训练,利用局部Kaplan-Meier估计器的随机条件插补,而校准的生存估计则通过简单的后训练校准获得。在公开基准测试中,GRAFT在判别能力和校准性能上优于基线模型,同时在高噪声设置中保持稳健和稀疏。

英文摘要

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.
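The "prognostic ranking" that the abstract separates from calibration is usually scored with the concordance index. The sketch below computes the standard Harrell's C-index on censored data; it is not GRAFT's differentiable ranking loss, and the function name and the four-subject toy data are hypothetical.

```python
def harrell_c_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs that the model ranks correctly.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before times[j]. It is concordant when the
    model assigns i the higher risk; ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [2.0, 4.0, 5.0, 7.0]
events = [1, 0, 1, 1]          # subject 1 is censored at t = 4
risks = [0.9, 0.3, 0.6, 0.1]   # higher risk means earlier predicted failure
c = harrell_c_index(times, events, risks)
print(c)
```

Note that the score depends only on the ordering of `risks`, which is why a model can rank well and still produce poorly calibrated survival probabilities.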

2602.05742 2026-05-19 stat.ML cs.LG math.ST stat.TH 版本更新

Fast Rates for Nonstationary Weighted Risk Minimization

非平稳加权风险最小化中的快速收敛速率

Tobias Brock, Thomas Nagler

发表机构 * LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 本文研究了非平稳条件下加权经验风险最小化方法的样本外预测误差,提出了一种将超额风险分解为学习项和分布漂移相关项的通用分解方法,并在混合条件下证明了学习误差的Oracle不等式,考虑了权重向量的有效样本量、权重和假设类的复杂性以及数据依赖性。

详情
AI中文摘要

加权经验风险最小化是分布漂移下进行预测的常用方法。本文研究其在非平稳条件下的样本外预测误差。我们给出超额风险的一般分解,将其分为学习项和与分布漂移相关的误差项,并在混合条件下证明了学习误差的oracle不等式。该学习界在任意权重类上一致成立,并考虑了权重向量诱导的有效样本量、权重类和假设类的复杂度以及潜在的数据依赖性。我们在使用线性模型、基函数逼近和神经网络的(自)回归问题中展示了结果的适用性和紧致性;当特化到无权重且平稳的设置时,可恢复极小极大最优速率(至多相差对数因子)。

英文摘要

Weighted empirical risk minimization is a common approach to prediction under distribution drift. This article studies its out-of-sample prediction error under nonstationarity. We provide a general decomposition of the excess risk into a learning term and an error term associated with distribution drift, and prove oracle inequalities for the learning error under mixing conditions. The learning bound holds uniformly over arbitrary weight classes and accounts for the effective sample size induced by the weight vector, the complexity of the weight and hypothesis classes, and potential data dependence. We illustrate the applicability and sharpness of our results in (auto-) regression problems with linear models, basis approximations, and neural networks, recovering minimax-optimal rates (up to logarithmic factors) when specialized to unweighted and stationary settings.
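The trade-off between drift and effective sample size can be illustrated with a toy example. The sketch below uses Kish's formula for the effective sample size, which is one common definition and may differ from the one used in the paper; the decay factor 0.9, the step-change data, and all function names are assumptions made for illustration.

```python
def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def weighted_mean(values, weights):
    """Minimizer of sum_i w_i * (y_i - c)^2 over constants c (weighted ERM, squared loss)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

n = 100
uniform = [1.0] * n
decay = [0.9 ** (n - 1 - t) for t in range(n)]  # the newest observation has weight 1

# A drifting sequence: the mean jumps from 0 to 1 halfway through.
y = [0.0] * 50 + [1.0] * 50

ess_uniform = effective_sample_size(uniform)  # equals n for equal weights
ess_decay = effective_sample_size(decay)      # much smaller than n
fit_uniform = weighted_mean(y, uniform)       # averages over both regimes
fit_decay = weighted_mean(y, decay)           # tracks the most recent regime
print(ess_uniform, ess_decay, fit_uniform, fit_decay)
```

Down-weighting old data reduces the drift error of the fit at the price of a smaller effective sample, which is the tension the decomposition in the abstract separates into a drift term and a learning term.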

2602.04872 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning

多层交叉注意力是多模态上下文学习中可证明最优的

Nicholas Barnfield, Subhabrata Sen, Pragya Sur

发表机构 * Harvard University(哈佛大学)

AI总结 本文研究了多模态上下文学习中多层交叉注意力机制的理论最优性,证明了在多模态数据下,交叉注意力机制在梯度流优化下可达到贝叶斯最优,同时指出单层线性自注意力无法在任务分布下统一恢复贝叶斯最优预测。

详情
AI中文摘要

近期进展迅速推动了我们对现代基于注意力的神经网络中上下文学习机制的理解。然而,现有结果仅专注于单模态数据;相比之下,多模态数据的上下文学习的理论基础仍不清晰。我们引入了一个数学上可处理的框架来研究多模态学习,并探讨了在何种情况下Transformer-like架构可以在上下文中恢复贝叶斯最优性能。为了建模多模态问题,我们假设观测数据来自一个潜在因子模型。我们的第一个结果是对表达性的否定:我们证明单层线性自注意力无法在任务分布下统一恢复贝叶斯最优预测。为了解决这一限制,我们引入了一种新的线性化交叉注意力机制,并在交叉注意力层和上下文长度都较大的情况下进行研究。我们证明,当使用梯度流优化时,这种交叉注意力机制可证明是贝叶斯最优的。我们的结果强调了深度对上下文学习的好处,并确立了交叉注意力在多模态分布中的可证明效用。

英文摘要

Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.

2602.03535 2026-05-19 cs.LG cs.NA math.NA math.OC 版本更新

Sparse Training of Neural Networks based on Multilevel Mirror Descent

基于多级镜像下降法的神经网络稀疏训练

Yannick Lunk, Sebastian J. Scott, Leon Bungert

发表机构 * Institute of Mathematics(数学研究所) University of Würzburg(维尔茨堡大学) Institute of Mathematics, CAIDAS University of Würzburg(数学研究所,CAIDAS维尔茨堡大学)

AI总结 本文提出了一种基于线性化Bregman迭代/镜像下降的动态稀疏训练算法,通过交替静态和动态稀疏模式更新来利用自然产生的稀疏性,结合稀疏诱导Bregman迭代与自适应冻结网络结构,以高效探索稀疏参数空间并保持稀疏性。通过多级优化框架保证收敛性,并实验证明该算法在标准基准上能产生高稀疏性和准确性的模型,同时在理论FLOPs数量和训练时间上均有显著提升。

详情
AI中文摘要

我们介绍了一种基于线性化Bregman迭代/镜像下降的动态稀疏训练算法,该算法通过在静态和动态稀疏模式更新之间交替,利用自然产生的稀疏性。关键思想是将稀疏诱导的Bregman迭代与自适应冻结网络结构相结合,以在保持稀疏性的同时高效探索稀疏参数空间。我们通过将方法嵌入多级优化框架中,提供收敛保证。此外,我们实验证明,我们的算法可以在标准基准上产生高度稀疏且准确的模型。我们还显示,与SGD训练相比,理论上的FLOPs数量从标准Bregman迭代的38%减少到我们的方法的6%,同时保持测试精度。我们还显示,当使用稀疏感知的CPU实现时,训练时间可减少约50%。

英文摘要

We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy. We additionally show a training time reduction of about 50% when using a sparsity-aware CPU implementation of our method.

2602.02830 2026-05-19 cs.LG stat.ME 版本更新

SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs

SC3D:动态和可微的因果发现用于时序和瞬时图

Sourajit Das, Dibyajyoti Chakraborty, Romit Maulik

发表机构 * Institute for Computational Data Science(计算数据科学研究所) School of Mechanical Engineering(机械工程学院)

AI总结 本文提出SC3D,一种动态和可微的因果发现方法,用于处理时序和瞬时图,通过两阶段可微框架联合学习滞后特定的邻接矩阵和瞬时有向无环图,提升了因果结构的稳定性和准确性。

Comments 12 pages

详情
AI中文摘要

从多变量时间序列中发现因果结构是一个关键问题,因为相互作用跨越多个滞后并可能涉及瞬时依赖。此外,动态图的搜索空间本质上是组合性的。在本研究中,我们提出稳定因果动态可微发现(SC3D),一种两阶段可微框架,联合学习滞后特定的邻接矩阵以及如果存在的话瞬时有向无环图(DAG)。在第一阶段,SC3D通过节点级预测进行边预选以获得滞后和瞬时边的掩码,而第二阶段通过优化具有稀疏性的似然函数并强制瞬时块的无环性来细化这些掩码。在合成SVAR系统、非线性和混沌基准、非平稳动态和现实世界数据集上的数值结果表明,SC3D在稳定性和准确性方面优于现有基线,能够更准确地恢复滞后和瞬时因果结构。

英文摘要

Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose Stable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic SVAR systems, nonlinear and chaotic benchmarks, nonstationary dynamics and real-world datasets demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing baselines.
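The abstract does not say which acyclicity constraint SC3D enforces on the instantaneous block. As an assumption, the sketch below shows one widely used differentiable choice, a NOTEARS-style trace penalty truncated at the number of nodes; the function names and the 3-node matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def acyclicity_penalty(w):
    """Truncated trace penalty: sum over k = 1..d of tr((W*W)^k) / k!.

    W*W is the elementwise square, so all entries are non-negative. Any
    directed cycle contains a simple cycle of length at most d, so the
    penalty is zero exactly when the weighted graph is acyclic.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]
    power = [row[:] for row in a]  # holds (W*W)^k at step k
    penalty, factorial = 0.0, 1.0
    for k in range(1, d + 1):
        factorial *= k
        penalty += sum(power[i][i] for i in range(d)) / factorial
        power = matmul(power, a)
    return penalty

# Edges 0 -> 1 -> 2 form a DAG.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
# Adding the edge 2 -> 0 closes a cycle of length 3.
cyclic = [row[:] for row in dag]
cyclic[2][0] = 0.5

print(acyclicity_penalty(dag), acyclicity_penalty(cyclic))
```

In a gradient-based pipeline such a penalty would be added to the sparsity-regularized likelihood so that the instantaneous adjacency matrix is pushed toward a DAG.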

2602.01733 2026-05-19 stat.ML cs.LG 版本更新

ST-BCP: Tightening Coverage Bound for Backward Conformal Prediction via Non-Conformity Score Transformation

ST-BCP:通过非一致性分数转换紧缩后向符合预测的覆盖界

Junxian Liu, Hao Zeng, Hongxin Wei

发表机构 * Department of Statistics and Data Science(统计与数据科学系) Southern University of Science and Technology(南方科技大学) College of Mathematics and Statistics(数学与统计学院) Chongqing University(重庆大学)

AI总结 本文提出ST-BCP方法,通过引入数据依赖的非一致性分数转换来缩小后向符合预测中的覆盖界差距,实验表明该方法有效减少了覆盖差距。

详情
AI中文摘要

符合预测(CP)提供了一个用于不确定性量化的统计框架,能够构造具有覆盖保证的预测集。CP得到的预测集大小不受控制,而后向符合预测(BCP)反转了这一范式:它对预测集大小强制设定预定义的上界,并估计由此得到的覆盖保证。然而,BCP框架中由马尔可夫不等式带来的松弛,使估计的覆盖界与经验覆盖率之间存在显著差距。在本文中,我们提出ST-BCP,一种新方法,通过对非一致性分数进行数据依赖的变换来缩小覆盖差距。具体而言,我们构造了一种可计算的变换,并证明其优于作为基线的恒等变换。大量实验表明了该方法的有效性,在常见基准上将平均覆盖差距从4.20%降至1.12%。

英文摘要

Conformal Prediction (CP) provides a statistical framework for uncertainty quantification that constructs prediction sets with coverage guarantees. While CP yields uncontrolled prediction set sizes, Backward Conformal Prediction (BCP) inverts this paradigm by enforcing a predefined upper bound on set size and estimating the resulting coverage guarantee. However, the looseness induced by Markov's inequality within the BCP framework causes a significant gap between the estimated coverage bound and the empirical coverage. In this work, we propose ST-BCP, a novel method that applies a data-dependent transformation to nonconformity scores to narrow the coverage gap. In particular, we develop a computable transformation and prove that it outperforms the baseline identity transformation. Extensive experiments demonstrate the effectiveness of our method, reducing the average coverage gap from 4.20% to 1.12% on common benchmarks.
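To see why a guarantee obtained from Markov's inequality can be loose, consider a non-negative variable E with mean 1, for which Markov gives P(E >= 1/alpha) <= alpha. The sketch below compares that bound with the exact tail of a unit-mean exponential variable. It only illustrates the slack of the inequality; it is not the BCP or ST-BCP procedure, and the choice of distribution is an assumption.

```python
import math

def markov_bound(alpha):
    """Markov's inequality for E >= 0 with mean at most 1: P(E >= 1/alpha) <= alpha."""
    return alpha

def exact_tail_unit_exponential(alpha):
    """Exact P(E >= 1/alpha) when E ~ Exponential(1), which has mean 1."""
    return math.exp(-1.0 / alpha)

rows = []
for alpha in (0.5, 0.2, 0.1):
    bound = markov_bound(alpha)
    exact = exact_tail_unit_exponential(alpha)
    rows.append((alpha, bound, exact, bound - exact))
    print(f"alpha={alpha}: bound={bound:.3f} exact={exact:.6f} slack={bound - exact:.6f}")
```

The bound holds for every alpha but is far from tight in the tail, so a coverage level derived from it can be much more pessimistic than the coverage actually observed.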

2601.21357 2026-05-19 cs.LG 版本更新

Beyond Objective-Based Improvement: Stationarity-Aware Expected Improvement for Bayesian Optimization

超越基于目标值的改进:面向贝叶斯优化的驻点感知期望改进

Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah

发表机构 * Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA, USA(加州大学伯克利分校化学与生物分子工程系)

AI总结 本文提出一种新的采集函数EI-GN(基于梯度范数的期望改进),将一阶平稳性条件纳入改进原则,从而促进在表现优异且接近驻点的区域采样;通过把趋向平稳性的进展嵌入采集准则,它提供了更丰富的改进概念。

详情
AI中文摘要

贝叶斯优化(BO)是一种用于优化昂贵黑盒函数的原则性框架,期望改进(EI)是其最广泛使用的采集函数之一。尽管在经验上很成功,EI并不考虑一阶最优性条件,而只依赖目标值的改进。因此,在改进准则不提供信息的区域,它的采集信号可能趋于消失,从而限制了其引导搜索的有效性。我们提出基于梯度范数的期望改进(EI-GN),这是一种新的采集函数,它将改进原则扩展到包含一阶平稳性,促进在表现优异且接近驻点的区域采样。我们推导了EI-GN易于计算的闭式表达式,并表明它与基于改进的采集框架保持一致。通过把趋向平稳性的进展嵌入采集准则,EI-GN提供了一个更丰富、信息量更大的改进概念。在标准BO基准上的实验结果显示,它相对基线方法取得了一致的提升,我们还进一步展示了其在控制策略学习中的适用性。

英文摘要

Bayesian Optimization (BO) is a principled framework for optimizing expensive black-box functions, with Expected Improvement (EI) among its most widely used acquisition functions. Despite its empirical success, EI is agnostic to first-order optimality conditions, relying solely on objective-value improvement. As a result, it can exhibit vanishing acquisition signals where the improvement criterion is uninformative, limiting its effectiveness in guiding search. We propose Expected Improvement via Gradient Norms (EI-GN), a novel acquisition function that extends the improvement principle to incorporate first-order stationarity, promoting sampling in regions that are both high-performing and close to stationary points. We derive a tractable closed-form expression for EI-GN and show that it remains consistent with the improvement-based acquisition framework. By embedding progress toward stationarity into the acquisition criterion, EI-GN provides a richer and more informative notion of improvement. Empirical results on standard BO benchmarks demonstrate consistent gains over baseline methods, and we further illustrate its applicability to control policy learning.
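For reference, the classical EI that this abstract starts from has a well-known closed form under a Gaussian posterior. The sketch below implements that standard formula for maximization; it is not EI-GN, whose expression is not given in the abstract, and the posterior values used are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal CDF, written with erfc to stay accurate in the far tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization when f(x) ~ N(mu, sigma^2) and `best` is the incumbent."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)  # no posterior uncertainty left
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

best = 1.0
ei_promising = expected_improvement(mu=1.0, sigma=0.5, best=best)
ei_flat = expected_improvement(mu=-4.0, sigma=0.5, best=best)  # mean far below the incumbent
print(ei_promising, ei_flat)
```

The second value is numerically negligible, which illustrates the vanishing acquisition signal the abstract refers to: where the posterior mean sits far below the incumbent, EI offers almost no guidance to the search.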

2601.20888 2026-05-19 stat.ML cs.LG math.ST stat.CO stat.TH 版本更新

Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators

Latent-IMH: 高效的贝叶斯推断用于具有近似算子的反问题

Youguang Chen, George Biros

发表机构 * Oden Institute for Computational Engineering and Sciences(奥登计算工程与科学研究院) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文研究在参数到观测算子A计算成本高昂的贝叶斯线性反问题中,如何高效地从后验分布采样。利用A的分解结构可以构造低成本的近似算子A~;在此基础上,本文提出基于Metropolis-Hastings独立采样器的Latent-IMH方法:先用近似算子生成中间潜变量,再用精确算子对其进行修正,从而把计算成本转移到离线阶段。本文通过KL散度和混合时间界对其进行了理论分析,数值实验显示其计算效率优于NUTS等现有方法。

详情
AI中文摘要

我们研究贝叶斯线性反问题中的后验分布采样,其中参数到观测算子A的计算成本高昂。在许多应用中,A可以按某种方式分解,从而便于构造低成本的近似算子A~。在此框架下,我们提出Latent-IMH,一种基于Metropolis-Hastings独立(IMH)采样器的采样方法。Latent-IMH先使用近似算子A~生成中间潜变量,再使用精确的A对其进行修正。其主要优势在于将计算成本转移到离线阶段。我们利用KL散度和混合时间界对Latent-IMH的性能进行了理论分析。通过在若干模型问题上的数值实验,我们表明,在合理的假设下,它在计算效率上优于No-U-Turn采样器(NUTS)等最先进方法。在某些情况下,Latent-IMH可以比现有方案快几个数量级。

英文摘要

We study sampling from posterior distributions in Bayesian linear inverse problems where $A$, the parameters to observables operator, is computationally expensive. In many applications, $A$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\tilde{A}$. In this framework, we introduce Latent-IMH, a sampling method based on the Metropolis-Hastings independence (IMH) sampler. Latent-IMH first generates intermediate latent variables using the approximate $\tilde{A}$, and then refines them using the exact $A$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases, Latent-IMH can be orders of magnitude faster than existing schemes.
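The IMH building block named in this abstract can be sketched generically: proposals are drawn independently of the current state and accepted according to the ratio of importance weights. The code below is a plain IMH sampler on a one-dimensional toy problem, not Latent-IMH itself; the Gaussian target and proposal stand in for the "exact" and "approximate" posteriors and are assumptions, as are all function names.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def imh_sample(log_target, log_proposal, draw_proposal, n_steps, rng):
    """Independence Metropolis-Hastings: proposals ignore the current state."""
    x = draw_proposal(rng)
    log_w = log_target(x) - log_proposal(x)  # log importance weight of the current state
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = draw_proposal(rng)
        log_w_y = log_target(y) - log_proposal(y)
        log_ratio = log_w_y - log_w  # accept with probability min(1, w(y) / w(x))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x, log_w = y, log_w_y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

rng = random.Random(0)
# Toy stand-ins: a cheap, wider proposal N(0, 2^2) for an "exact" target N(1, 1).
samples, rate = imh_sample(
    log_target=lambda x: log_normal_pdf(x, 1.0, 1.0),
    log_proposal=lambda x: log_normal_pdf(x, 0.0, 2.0),
    draw_proposal=lambda r: r.gauss(0.0, 2.0),
    n_steps=20000,
    rng=rng,
)
posterior_mean = sum(samples) / len(samples)
print(posterior_mean, rate)
```

The closer the proposal is to the target, the closer the acceptance rate gets to one, which is why a good cheap approximation of the posterior makes this kind of sampler efficient.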

2601.19300 2026-05-19 cs.LG 版本更新

Queue Length Regret Bounds for Contextual Queueing Bandits

上下文排队赌博机的队列长度遗憾界

Seoungbin Bae, Garyeong Kang, Dabeen Lee

发表机构 * Department of Industrial & Systems Engineering, KAIST(韩国科学技术院工业与系统工程系) Department of Mathematical Sciences, Seoul National University(首尔国立大学数学科学系) Research Institute of Mathematics, Seoul National University(首尔国立大学数学研究所) Korea Institute for Advanced Study(韩国高级研究院) Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能跨学科项目)

AI总结 本文提出上下文排队赌博机,一种在学习未知服务速率的同时进行调度的上下文感知框架:智能体根据作业的异质上下文特征选择作业并将其匹配到服务器,服务速率由带有未知服务器特定参数的逻辑模型决定。借助策略切换队列和耦合论证,本文给出了新的队列长度遗憾分解;算法CQB-$\varepsilon$达到$\widetilde{\mathcal{O}}(T^{-1/4})$的遗憾上界,面向对抗性上下文的算法CQB-Opt达到$\mathcal{O}(\log^2 T)$的遗憾上界。

详情
AI中文摘要

我们提出上下文排队赌博机(contextual queueing bandits),一种在学习未知服务速率的同时进行调度的新型上下文感知框架。每个作业带有异质的上下文特征,智能体据此选择一个作业并将其与一台服务器匹配,以最大化离开速率。服务/离开速率由关于上下文特征的逻辑模型决定,该模型带有未知的服务器特定参数。为了评估策略的性能,我们考虑队列长度遗憾,其定义为该策略与最优策略之间队列长度的差值。分析中的主要挑战在于:在给定时间步,我们的策略与最优策略下队列中剩余作业的特征列表可能不同,因为两者可能以不同顺序处理作业。为此,我们提出策略切换队列的思想,并配以精细的耦合论证。由此得到一种新的队列长度遗憾分解框架,使我们能够理解选择次优作业-服务器对的短期影响及其对队列状态差异的长期影响。我们证明算法CQB-$\varepsilon$达到$\widetilde{\mathcal{O}}(T^{-1/4})$的遗憾上界。我们还考虑了上下文由对手选择的设置,针对该设置,我们的第二个算法CQB-Opt达到$\mathcal{O}(\log^2 T)$的遗憾上界。最后,我们给出实验结果以验证理论发现。

英文摘要

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

2601.16414 2026-05-19 cs.LG cs.AI 版本更新

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0: 一个全面的开源工具包,用于可访问和可重复的临床深度学习

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

发表机构 * University of Illinois Urbana-Champaign, Urbana, IL, USA(伊利诺伊大学厄巴纳-香槟分校) PyHealth Research Initiative(PyHealth研究计划) University of Illinois College of Medicine, Chicago, IL, USA(伊利诺伊大学医学院) The University of Edinburgh, Edinburgh, UK(爱丁堡大学) Health Data Research UK, London, UK(英国健康数据研究) Department of Biomedical Engineering, Bangladesh University of Engineering(孟加拉国工程大学生物医学工程系)

AI总结 本文提出PyHealth 2.0,一个全面的开源工具包,旨在解决临床AI研究中的可重复性和可访问性问题,通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化方法,实现7行代码即可完成预测建模。

Comments Under Review

详情
AI中文摘要

基线难以复现、计算成本高以及对领域专业知识的要求,一直是临床AI研究的障碍。为应对这些挑战,我们推出PyHealth 2.0,一个增强的临床深度学习工具包,最少只需7行代码即可完成预测建模。PyHealth 2.0有三项关键贡献:(1) 一个全面的工具包,在单一框架内统一了15+数据集、20+临床任务、25+模型、5+可解释性方法以及包括共形预测(conformal prediction)在内的不确定性量化,从而解决可复现性和兼容性难题;该框架支持多种临床数据模态(信号、影像和电子健康记录),并支持5+医学编码标准之间的转换;(2) 以可访问性为核心的设计,兼顾多模态数据和多样的计算资源,处理速度最高提升39倍、内存占用降低20倍,从16GB笔记本电脑到生产系统均可使用;(3) 一个拥有400多名成员的活跃开源社区,通过详尽的文档、可复现的研究贡献以及与学术医疗系统和产业伙伴的合作来降低领域专业知识门槛,并通过RHealth提供多语言支持。PyHealth 2.0建立了一个开源基础和社区,推动可访问、可复现的医疗AI发展。可通过pip install pyhealth安装。

英文摘要

Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities (signals, imaging, and electronic health records) with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Install with: pip install pyhealth

2601.09071 2026-05-19 cs.LG 版本更新

Resolving Predictive Multiplicity for the Rashomon Set

解决Rashomon集的预测多样性

Parian Haghighat, Hadis Anahideh, Cynthia Rudin

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) Duke University(杜克大学)

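AI summary: This paper addresses inconsistent predictions within the Rashomon set and proposes three approaches (outlier correction, local patching, and pairwise reconciliation) to reduce disagreement among predictions and improve model reliability. Experiments show that these methods reduce disagreement while maintaining competitive accuracy.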

详情
AI中文摘要

多个同样准确的模型对于给定的预测任务的存在导致了预测多样性,其中一组称为Rashomon集的模型在准确性上相似,但个体预测却存在分歧。这种不一致性削弱了在高风险应用中对一致预测的信任。我们提出了三种方法来减少Rashomon集中成员之间的不一致性。第一种方法是异常值修正,异常值具有无法被良好模型正确预测的标签,异常值可能导致Rashomon集在局部区域有高方差的预测,因此修正它们可以降低方差。第二种方法是局部修补,在测试点的局部区域,模型可能因为某些模型存在偏差而相互矛盾。我们可以通过验证集检测并修正这些偏差,从而减少多样性。第三种方法是成对协调,我们找到在测试点周围区域上意见不一致的模型对,并修改这些不一致的预测,使其更少偏向。这三种方法可以单独或共同使用,各自具有独特的优势。协调后的预测可以被提炼成一个单一的可解释模型用于现实部署。在多个数据集上的实验表明,我们的方法在减少不一致度的同时保持了竞争性的准确性。

英文摘要

The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity, where a "Rashomon set" of models achieves similar accuracy but diverges in individual predictions. This inconsistency undermines trust in high-stakes applications where we want consistent predictions. We propose three approaches to reduce inconsistency among predictions for the members of the Rashomon set. The first approach is outlier correction. An outlier has a label that none of the good models are capable of predicting correctly. Outliers can cause the Rashomon set to have high-variance predictions in a local area, so fixing them can lower variance. Our second approach is local patching. In a local region around a test point, models may disagree with each other because some of them are biased. We can detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is pairwise reconciliation, where we find pairs of models that disagree on a region around the test point. We modify predictions that disagree, making them less biased. These three approaches can be used together or separately, and they each have distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
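
To make the idea of predictive multiplicity concrete, here is a minimal Python sketch. It is an illustration only and not the authors' method: the toy predictions, the tolerance `epsilon`, and the function names `rashomon_set` and `pairwise_disagreement` are all assumptions introduced for this example. It keeps the models whose accuracy is within `epsilon` of the best one, then measures how often two such models disagree on individual points.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of points predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def rashomon_set(all_preds, labels, epsilon):
    """Keep models whose accuracy is within epsilon of the best model (assumed definition)."""
    accs = {name: accuracy(p, labels) for name, p in all_preds.items()}
    best = max(accs.values())
    return {name: all_preds[name] for name, a in accs.items() if a >= best - epsilon}


def pairwise_disagreement(preds_by_model):
    """Mean fraction of points on which two models in the set differ."""
    pairs = list(combinations(preds_by_model.values(), 2))
    if not pairs:
        return 0.0
    rates = [sum(a != b for a, b in zip(p, q)) / len(p) for p, q in pairs]
    return sum(rates) / len(rates)


# Toy data: hypothetical labels and per-model predictions.
labels = [0, 0, 1, 1, 1, 0]
all_preds = {
    "m1": [0, 0, 1, 1, 0, 0],  # 5 of 6 correct
    "m2": [0, 1, 1, 1, 1, 0],  # 5 of 6 correct
    "m3": [1, 1, 0, 1, 1, 0],  # 3 of 6 correct
}
rset = rashomon_set(all_preds, labels, epsilon=0.05)
disagreement = pairwise_disagreement(rset)
print(sorted(rset), round(disagreement, 3))
```

In this toy case `m1` and `m2` are equally accurate yet differ on a third of the points, which is the kind of inconsistency the three approaches aim to reduce.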

2601.08118 2026-05-19 cs.AI cs.LG 版本更新

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

MirrorBench: 一个评估对话用户代理人类化能力的基准测试

Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli

发表机构 * SAP Labs(SAP实验室)

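AI summary: This paper presents MirrorBench, a benchmark for evaluating how human-like conversational user-proxy agents are. By combining several lexical-diversity metrics with LLM-judge-based metrics, it reveals systematic gaps between user proxies and real human users.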

Comments KDD 2026 (Dataset & Benchmark Track)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作人类模拟器,既用于评估对话系统,也用于生成微调数据。然而,简单的

英文摘要

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of user proxy agents. We present MirrorBench, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. MirrorBench combines three lexical-diversity metrics (MATTR, Yule's K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.
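
Two of the lexical-diversity metrics named above have standard textbook definitions, sketched below in Python. This is not MirrorBench's implementation: the whitespace tokenization, the window size, and the function names are assumptions for illustration, and HD-D is not covered. MATTR averages the type-token ratio over every sliding window of fixed length; Yule's K is computed as 10^4 (sum over m of m^2 V_m, minus N) / N^2, where V_m is the number of word types occurring m times and N is the token count.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio over all windows of length `window`."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K: higher values indicate more repetition (less diversity)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    # freq_of_freq[m] = number of word types that occur exactly m times.
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(m * m * v for m, v in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


# Hypothetical utterance, split on whitespace.
tokens = "the cat sat on the mat and the dog sat too".split()
print(mattr(tokens, window=4), yules_k(tokens))
```

A windowed ratio such as MATTR is less sensitive to utterance length than a plain type-token ratio, which matters when comparing verbose proxy output with shorter human turns.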

2601.07122 2026-05-19 cs.CR cs.AI cs.LG 版本更新

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

通过一个鲁棒的LLM赋能的多智能体强化学习框架增强云网络韧性

Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu

发表机构 * State Key Laboratory of Mathematical Engineering and Advanced Computing(数学工程与先进计算国家重点实验室) Henan Key Laboratory of Information Security(河南省信息安全重点实验室) National Engineering Research Center for Mobile Network Technologies(移动网络技术国家工程研究中心) Beijing University of Posts and Telecommunications(北京邮电大学) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所)

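AI summary: This paper proposes an LLM-empowered multi-agent reinforcement learning framework that aims to improve the defense capability and resilience of cloud networks, using a hierarchical architecture and Human-in-the-Loop support to improve adaptability and interpretability.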

详情
AI中文摘要

尽管虚拟化和资源池化赋予了云网络结构灵活性和弹性扩展能力,但它们不可避免地扩大了攻击面并挑战了网络的网络安全性。基于强化学习(RL)的防御策略已被开发用于在对抗条件下优化资源部署和隔离策略,以通过维护和恢复网络可用性来增强系统韧性。然而,现有方法缺乏鲁棒性,因为它们需要重新训练才能适应网络结构、节点规模、攻击策略和攻击强度的动态变化。此外,缺乏人类在回路(HITL)支持限制了可解释性和灵活性。为了解决这些限制,我们提出了CyberOps-Bots,一种由大语言模型(LLMs)赋能的分层多智能体强化学习框架。受MITRE ATT&CK的战术-技术模型启发,CyberOps-Bots具有双层架构:(1)一个上层LLM代理,包含四个模块——ReAct规划、IPDRR基础感知、长短时记忆和动作/工具整合,执行全局意识、人类意图识别和战术规划;(2)下层RL代理,通过异构分离预训练开发,执行原子防御动作,以在本地网络区域中执行。这种协同作用保留了LLM的适应性和可解释性,同时确保了可靠的RL执行。在真实云数据集上的实验表明,与最先进的算法相比,CyberOps-Bots在不重新训练的情况下,网络可用性保持在68.5%更高,并且在场景切换时实现了34.7%的性能提升。据我们所知,这是首次建立具有HITL支持的鲁棒LLM-RL框架用于云防御的研究。

英文摘要

While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) an upper-level LLM agent with four modules (ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration) performs global awareness, human intent recognition, and tactical planning; (2) lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains 68.5% higher network availability and achieves a 34.7% jumpstart performance gain when scenarios shift, without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense.

2601.06858 2026-05-19 eess.SP cs.LG 版本更新

Deep Learning-Based Channel Extrapolation for Dual-Band Massive MIMO Systems

基于深度学习的双频大规模MIMO系统的信道外推

Qikai Xiao, Kehui Li, Binggui Zhou, Shaodan Ma

发表机构 * State Key Laboratory of Internet of Things for Smart City and the Department of Electrical and Computer Engineering, University of Macau(物联网智能城市国家重点实验室和澳门大学电子与计算机工程系) Department of Electrical and Electronic Engineering, Imperial College London(帝国理工学院伦敦分校电子与电气工程系)

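AI summary: This paper proposes a deep learning-based multi-domain fusion channel extrapolation method that extrapolates sub-6 GHz channel state information to the mmWave band, reducing the pilot overhead of mmWave CSI acquisition and improving the efficiency of massive MIMO systems.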

详情
AI中文摘要

未来无线通信系统将越来越多地依赖毫米波(mmWave)和sub-6 GHz频段的整合,以满足对高速数据传输和广泛覆盖的异构需求。为了充分利用毫米波频段在大规模多输入多输出(MIMO)系统中的优势,需要高精度的信道状态信息(CSI)。然而,直接估计毫米波信道需要大量的试点开销,因为CSI维度大且由于严重的路径损耗和阻挡衰减导致信噪比(SNR)低。在本文中,我们提出了一种高效的MDFCE(Multi-Domain Fusion Channel Extrapolator)来外推sub-6 GHz频段的CSI到毫米波频段的CSI,从而减少双频大规模MIMO系统中毫米波CSI获取的试点开销。与基于数学建模的传统信道外推方法不同,所提出的MDFCE结合了专家混合框架和多头自注意力机制,以融合sub-6 GHz CSI的多域特征,旨在有效且高效地表征从sub-6 GHz CSI到毫米波CSI的映射。仿真结果表明,MDFCE在各种天线阵列规模和信噪比水平上,相比现有方法在较少的训练试点情况下实现了更优的性能,同时表现出更高的计算效率。

英文摘要

Future wireless communication systems will increasingly rely on the integration of millimeter wave (mmWave) and sub-6 GHz bands to meet heterogeneous demands on high-speed data transmission and extensive coverage. To fully exploit the benefits of mmWave bands in massive multiple-input multiple-output (MIMO) systems, highly accurate channel state information (CSI) is required. However, directly estimating the mmWave channel demands substantial pilot overhead due to the large CSI dimension and the low signal-to-noise ratio (SNR) caused by severe path loss and blockage attenuation. In this paper, we propose an efficient Multi-Domain Fusion Channel Extrapolator (MDFCE) to extrapolate sub-6 GHz band CSI to mmWave band CSI, so as to reduce the pilot overhead for mmWave CSI acquisition in dual-band massive MIMO systems. Unlike traditional channel extrapolation methods based on mathematical modeling, the proposed MDFCE combines the mixture-of-experts framework and the multi-head self-attention mechanism to fuse multi-domain features of sub-6 GHz CSI, aiming to characterize the mapping from sub-6 GHz CSI to mmWave CSI effectively and efficiently. The simulation results demonstrate that MDFCE achieves superior performance with fewer training pilots than existing methods across various antenna array scales and signal-to-noise ratio levels, while showing much higher computational efficiency.

2601.06163 2026-05-19 cs.CV cs.LG 版本更新

Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Forget-It-All: 通过概念感知神经元掩码实现多概念机器去学习

Kaiyuan Deng, Bo Hui, Gen Li, Jie Ji, Minghai Qin, Geng Yuan, Xiaolong Ma

发表机构 * The University of Arizona(亚利桑那大学) The University of Tulsa(塔尔萨大学) Clemson University(克莱姆森大学) Western Digital Corporation(西部数据公司) University of Georgia(佐治亚大学)

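AI summary: This work proposes the Forget-It-All framework, which leverages model sparsity to tackle multi-concept unlearning, improving unlearning effectiveness while preserving generation quality.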

Comments Accepted to ICML 2026

详情
Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
AI中文摘要

文本到图像(T2I)扩散模型的广泛应用引发了对其可能生成版权、不当或敏感图像的担忧。作为实际解决方案,机器去学习旨在在不重新训练的情况下删除不需要的概念。尽管现有方法在单概念去学习中有效,但去除多个概念时往往面临显著挑战,包括去学习效果、生成质量和对超参数和数据集的敏感性。我们通过利用模型稀疏性,从独特角度看待多概念去学习,并提出Forget It All(FIA)框架。FIA首先引入对比概念显著性以量化每个权重连接对目标概念的贡献。然后通过结合时间信息和空间信息,识别出概念敏感神经元,确保只选择那些一致响应目标概念的神经元。最后,FIA从识别的神经元中构建掩码,并将其融合成统一的多概念掩码,其中对一般内容生成有广泛支持的无概念神经元被保留,而概念特定神经元被修剪以去除目标。FIA是无训练的,需要最少超参数调整即可用于新任务,实现即插即用。在三个不同的去学习任务上进行了广泛的实验,证明FIA在多概念去学习中实现了更可靠的性能,提高了遗忘效果同时保持生成的保真度和质量。代码可在https://github.com/kaiyuan02415/Forget-It-All获取。

英文摘要

The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery. As a practical solution, machine unlearning aims to erase unwanted concepts without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle when removing multiple concepts, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. We take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires minimal hyperparameter tuning for new tasks, enabling plug-and-play use. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining generation fidelity and quality. Code is available at https://github.com/kaiyuan02415/Forget-It-All

2601.06162 2026-05-19 cs.LG cs.CV 版本更新

Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

忘却众多,忘却正确:扩散模型中可扩展且精确的概念反学习

Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma

发表机构 * The University of Arizona(亚利桑那大学) Clemson University(克莱姆森大学) The University of Tulsa(塔尔萨大学)

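AI summary: This paper proposes ScaPre, a unified framework for precise concept unlearning at large scale in diffusion models. It addresses conflicting updates, imprecise mechanisms, and reliance on additional data, improving the efficiency and precision of unlearning.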

Comments Accepted at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR) 2026
AI中文摘要

文本到图像的扩散模型已取得显著进展,但其使用引发了版权和滥用问题,促使研究机器反学习。然而,将多概念反学习扩展到大规模场景仍然困难,因为存在三个挑战:(i)冲突的权重更新会阻碍反学习或降低生成质量;(ii)不精确的机制会导致对相似内容的损害;(iii)依赖额外数据或模块,造成可扩展性瓶颈。为了解决这些问题,我们提出了可扩展-精确概念反学习(ScaPre),一种专门针对大规模反学习的统一框架。ScaPre引入了冲突感知的稳定设计,整合了谱迹正则化和几何对齐,以稳定优化、抑制冲突并保持全局结构。此外,Informax解耦器识别与概念相关的参数并自适应地重新加权更新,严格将反学习限制在目标子空间内。ScaPre产生了一个高效的闭式解,无需额外数据或子模型。在对象、风格和显性内容上的全面实验表明,ScaPre能够有效移除目标概念并保持生成质量。它比最佳基线在可接受的质量限制内能忘却多达$\times \mathbf{5}$更多的概念,实现了大规模反学习的最先进精度和效率。代码可在https://github.com/kaiyuan02415/scapre获取。

英文摘要

Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to $5\times$ more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning. Code is available at https://github.com/kaiyuan02415/scapre

2601.04855 2026-05-19 cs.LG cs.AI 版本更新

Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

重新思考图神经网络与缺失特征:挑战、评估和一个稳健的解决方案

Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Matono Akiyoshi, Andrea Passerini, Xin Liu, Manfred Jaeger

发表机构 * University of Trento, Trento, Italy(特伦托大学) UiT, The Arctic University of Norway, Tromsø, Norway(北极大学) Aalborg University, Aalborg, Denmark(奥尔堡大学)

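AI summary: This paper addresses missing node features in graph neural networks. It proposes a robust solution and designs more realistic missingness mechanisms and evaluation protocols, improving model robustness.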

详情
AI中文摘要

处理缺失节点特征是部署图神经网络(GNNs)在现实领域如医疗和传感器网络中的关键挑战。现有研究主要针对相对温和的场景,即基准数据集,其中节点特征具有高维但稀疏的特征和由完全随机缺失(MCAR)机制生成的不完整数据。对于(a),我们理论证明高稀疏性显著限制了缺失性导致的信息损失,使所有模型看起来稳健,从而防止了对性能的有意义比较。为克服这一限制,我们引入了一个合成和三个真实世界的数据集,具有密集且语义丰富的特征。对于(b),我们超越MCAR并设计了更真实的缺失机制的评估协议。此外,我们提供了理论背景,明确陈述了缺失过程的假设,并分析了这些假设对不同方法的影响。基于此分析,我们提出了GNNmim,一种简单但有效的基线模型,用于具有不完整特征数据的节点分类。实验表明,GNNmim在各种数据集和缺失性制度下与专门设计的架构具有竞争力。

英文摘要

Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.
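
The two points (a) and (b) above can be illustrated with a small Python sketch. It is not taken from the paper: the value-dependent mechanism, the zero-fill imputation, and all function names are assumptions made for this example. It contrasts an MCAR mask with one simple non-MCAR mask, and shows the intuition behind (a): when absent values are filled with zero, masking an entry that was already zero changes nothing, so sparse features lose little.

```python
import random


def mcar_mask(features, rate, rng):
    """MCAR: every entry is dropped with the same probability, independent of its value."""
    return [[rng.random() < rate for _ in row] for row in features]


def value_dependent_mask(features, threshold):
    """A hypothetical non-MCAR mechanism: entries above `threshold` go missing."""
    return [[value > threshold for value in row] for row in features]


def apply_mask(features, mask, fill=0.0):
    """Replace masked entries with `fill` (zero imputation, an assumption)."""
    return [[fill if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(features, mask)]


def changed_fraction(features, masked):
    """Fraction of entries whose value actually changed after masking."""
    total = sum(len(row) for row in features)
    changed = sum(a != b for row, mrow in zip(features, masked)
                  for a, b in zip(row, mrow))
    return changed / total


# Toy node-feature matrices: one sparse, one dense.
sparse = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
dense = [[0.5, 1.5, 1.0, 2.5], [3.0, 0.7, 1.2, 2.0]]
mask = mcar_mask(sparse, rate=0.5, rng=random.Random(0))
sparse_loss = changed_fraction(sparse, apply_mask(sparse, mask))
dense_loss = changed_fraction(dense, apply_mask(dense, mask))
print(sparse_loss, dense_loss)
```

With the same mask, the dense matrix loses every masked entry while the sparse one can lose at most its two nonzero entries, which is why sparse benchmarks can make every method look robust.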

2601.03425 2026-05-19 cs.LG cs.AI 版本更新

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

领域专精的幻觉:揭示混合专家模型中的领域不变‘ standing committee ’

Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu

发表机构 * The Fin AI(Fin AI) Georgia Institute of Technology(佐治亚理工学院) Cornell University(康奈尔大学) Stevens Institute of Technology(史蒂文斯理工学院) The University of Manchester(曼彻斯特大学)

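AI summary: This study questions the assumption that Mixture of Experts models achieve domain specialization through sparse routing. It proposes the COMMITTEEAUDIT framework, which analyzes routing behavior at the level of expert groups rather than individual experts, and finds a domain-invariant Standing Committee. This reveals a structural bias toward centralized computation and suggests that specialization in Mixture of Experts models is far less pervasive than expected.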

Comments Accepted by ACL 2026 main conference. Camera-ready version

详情
AI中文摘要

混合专家模型被广泛假设通过稀疏路由实现领域专精。在本工作中,我们通过引入COMMITTEEAUDIT框架,质疑这一假设,该框架在专家组层面而非个体专家层面分析路由行为。在三个代表性模型和MMLU基准测试中,我们揭示了一个领域不变的standing committee。这是一个紧凑的路由专家联盟,能够跨领域、层和路由预算持续捕获大多数路由质量,即使在架构已包含共享专家的情况下。定性分析进一步显示,standing committee锚定推理结构和语法,而外围专家处理领域特定知识。这些发现揭示了模型对集中计算的强结构偏倚,表明混合专家模型中的专精程度远低于人们普遍认为的水平。这种固有偏倚也表明,当前的训练目标,如强制均匀专家利用的负载平衡损失,可能与模型的自然优化路径相悖,从而限制了训练效率和性能。

英文摘要

Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.
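
As a rough illustration of what "a compact coalition of routed experts that captures the majority of routing mass across domains" could look like in code, consider the Python sketch below. The abstract does not describe COMMITTEEAUDIT's actual procedure, so everything here is hypothetical: the routing-mass numbers, the top-k intersection rule, and the function names.

```python
def top_k_experts(mass, k):
    """The k experts with the largest routing mass in one domain."""
    return set(sorted(mass, key=mass.get, reverse=True)[:k])


def standing_committee(mass_by_domain, k):
    """Experts that rank in the top k for every domain (an assumed criterion)."""
    tops = [top_k_experts(mass, k) for mass in mass_by_domain.values()]
    return set.intersection(*tops)


def committee_share(mass, committee):
    """Share of one domain's routing mass captured by the committee."""
    return sum(mass[e] for e in committee) / sum(mass.values())


# Hypothetical per-domain routing mass for four experts.
mass_by_domain = {
    "math":    {"e0": 0.40, "e1": 0.30, "e2": 0.20, "e3": 0.10},
    "law":     {"e0": 0.35, "e1": 0.35, "e2": 0.05, "e3": 0.25},
    "biology": {"e0": 0.45, "e1": 0.25, "e2": 0.15, "e3": 0.15},
}
committee = standing_committee(mass_by_domain, k=2)
shares = {d: committee_share(m, committee) for d, m in mass_by_domain.items()}
print(sorted(committee), shares)
```

If routing were truly domain-specialized, the intersection would be small or empty; a non-empty committee that holds most of the mass in every domain is the pattern the abstract reports.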

2601.01123 2026-05-19 cs.LG cs.AI 版本更新

Learning from Historical Activations in Graph Neural Networks

在图神经网络中学习历史激活

Yaniv Galron, Hadar Sinai, Haggai Maron, Moshe Eliasof

Affiliations: Technion – Israel Institute of Technology; NVIDIA; Ben-Gurion University of the Negev; University of Cambridge

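AI summary: This paper proposes HISTOGRAPH, a two-stage attention-based final aggregation layer that applies layer-wise and then node-wise attention, using nodes' activation history and the graph structure to refine the features used for final prediction. It improves on traditional methods across multiple graph classification benchmarks.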

Comments ICLR 2026

详情
AI中文摘要

图神经网络(GNNs)在社交网络、分子化学等领域展现了显著的成功。GNNs的关键组成部分是池化过程,其中模型计算的节点特征被结合成一个有信息量的最终描述符,用于下游任务。然而,先前的图池化方案依赖于最后一个GNN层的特征作为池化或分类层的输入,这可能未能充分利用模型前向传递过程中先前层产生的重要激活,即历史图激活。这种差距在节点表示在许多图神经层中显著变化的情况下尤为明显,并且在深度架构中受到过平滑问题的加剧。为弥合这一差距,我们引入HISTOGRAPH,一种新颖的两阶段注意力最终聚合层,首先在中间激活上应用统一的层间注意力,随后进行节点间注意力。通过建模节点表示在层间的演变,我们的HISTOGRAPH利用节点的激活历史和图结构来优化最终预测所用的特征。在多个图分类基准上的实验证明,HISTOGRAPH提供了强大的性能,能够一致地改进传统技术,特别是在深度GNNs中表现出特别强的鲁棒性。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable success in domains such as social networks and molecular chemistry. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and it is worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves on traditional techniques, with particularly strong robustness in deep GNNs.

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML 版本更新

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

在联合嵌入预测世界模型中成功因素是什么?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

Affiliations: Meta FAIR; Inria Paris; Ecole normale supérieure / PSL; New York University

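AI summary: This paper studies what drives success in physical planning with joint-embedding predictive world models (JEPA-WMs). It analyzes how the model architecture, training objective, and planning algorithm affect planning success, and proposes a model that outperforms existing baselines on navigation and manipulation tasks.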

Comments V2: added AdaLN-zero; added a table comparing JEPA-WMs with baselines, with std reflecting per-seed variability only (no variability across epochs); reordered figures in the main body of the paper. V3: added data scaling experiments and a theoretical appendix section on autoregressive rollout; accepted at TMLR.

详情
AI中文摘要

人工智能领域长期存在的挑战是开发能够解决广泛物理任务并泛化到新、未见过的任务和环境的智能体。一种流行的近期方法是通过状态-动作轨迹训练世界模型,然后使用规划算法解决新任务。规划通常在输入空间中进行,但最近出现的一类方法引入了在学习的表示空间中优化的规划算法,其承诺通过抽象无关细节来提高规划效率。在本工作中,我们将此类模型称为JEPA-WMs,并研究使此类算法有效技术选择。我们提出了一项全面研究几个关键组件,旨在找到该类中的最佳方法。我们使用模拟环境和真实世界机器人数据进行了实验,并研究了模型架构、训练目标和规划算法对规划成功的影响。我们结合发现,提出了一种在导航和操作任务中优于两个现有基线方法(DINO-WM和V-JEPA-2-AC)的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms上获得。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2512.24365 2026-05-19 physics.geo-ph cs.LG 版本更新

A Critical Assessment of PINNs and Operator Learning for Geotechnical Engineering

对PINNs和操作学习在土木工程中的关键评估

Krishna Kumar

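AI summary: This paper assesses PINNs and operator learning for geotechnical engineering, comparing several neural-network methods against finite-difference and particle-based references on geotechnical benchmarks, and weighs PINN inversion against automatic differentiation through a conventional solver.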

详情
AI中文摘要

科学机器学习(SciML)为土木工程中的数值流程提供了神经网络替代方案。本文将多层感知器(MLPs)、物理信息神经网络(PINNs)、深度操作网络(DeepONet)和图网络模拟器(GNS)与有限差分和粒子方法在地质基准测试中进行基准测试,并通过传统求解器比较PINN反演与自动微分(AD)。我们评估了每种方法在 extrapolation、训练、推理成本、跨问题实例转移和物理准确性方面的表现。一个在两年内训练的MLP能够拟合数据,但在第十年使用ReLU预测约290毫米,使用tanh或sigmoid预测约60毫米,而参考值为99.3毫米。一个带有时间域在[0,1]内的PINN在该区间内匹配闭合形式,但超出该范围失败,因为残差约束了仅在采样处的拟合。对于一维波动方程,PINN训练速度比有限差分方法慢约96,000倍且精度较低。DeepONet避免了PINN重新训练,但对于弹性基础上的梁,其训练成本等于约180万次有限差分求解,推理速度比直接求解器更慢。GNS通过局部粒子相互作用改进了几何转移,尽管公式仍需要轨迹、大规模训练集和大量内存。在逆向波动基准测试中,通过有限差分求解器的自动微分在几秒内恢复了材料剖面,误差约为1%。结果支持SciML的谨慎应用。神经网络适合在验证域内进行插值和模式识别,而逆向分析应在存在可靠正向求解器时首先尝试可微分的物理基础求解器。

英文摘要

Scientific machine learning (SciML) offers neural-network alternatives to numerical workflows in geotechnical engineering. This paper benchmarks multi-layer perceptrons (MLPs), physics-informed neural networks (PINNs), deep operator networks (DeepONet), and graph network simulators (GNS) against finite-difference and particle-based references on geotechnical benchmarks, and compares PINN inversion with automatic differentiation (AD) through a conventional solver. We evaluate each method for extrapolation, training, and inference cost, transfer across problem instances, and physics accuracy. An MLP trained on two years of Terzaghi consolidation fits the data, but at year ten predicts ~290 mm with ReLU and ~60 mm with tanh or sigmoid, against a reference of 99.3 mm. A PINN on a damped oscillator with a time domain inside [0,1] matches the closed form within that interval but fails outside, since the residual constrains the fit only where it is sampled. For the 1D wave equation, PINN training is ~96,000 times slower than finite-difference methods and less accurate. DeepONet avoids PINN retraining, yet for the beam on elastic foundation, its training cost equals ~1.8 million finite-difference solves, and inference is slower per query than the direct solver. GNS improves geometric transfer through local particle interactions, though formulations still need trajectories, large training sets, and substantial memory. In the inverse wave benchmark, AD through the finite-difference solver recovers the material profile in seconds with ~1% error. The results support a cautious role for SciML. Neural networks suit interpolation and pattern recognition inside validated domains, while inverse analysis should first try differentiable physics-based solvers when a reliable forward solver exists.

2512.23178 2026-05-19 math.OC cs.LG stat.ML 版本更新

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

针对重尾噪声下的非光滑凸优化的截断梯度方法:一种细化分析

Zijian Liu

发表机构 * Stern School of Business(斯特恩商学院)

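AI summary: This paper gives a refined analysis of clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise, providing faster rates for both high-probability and in-expectation convergence along with new lower bounds.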

Comments A preliminary conference version is accepted at ICLR 2026. This full version includes the formal statements of lower bounds and their proofs. v3: fixed some typos

详情
AI中文摘要

在重尾噪声下的优化问题近年来变得流行,因为它更好地拟合了许多现代机器学习任务,如经验观察所捕获的。具体来说,而不是对梯度噪声有有限的二阶矩,已被认识到一个有界的p阶矩,其中p∈(1,2]更现实(例如上界由σ_l^p对于某些σ_l≥0)。一个简单而有效的操作,梯度截断,已知能成功处理这个新的挑战。具体来说,截断随机梯度下降(Clipped SGD)保证了非光滑凸(resp.强凸)问题的高概率速率O(σ_l ln(1/δ)T^{1/p-1})(resp. O(σ_l^2 ln^2(1/δ)T^{2/p-2})),其中δ∈(0,1]是失败概率,T∈N是时间范围。在本文中,我们为Clipped SGD提供了一种细化分析,并提供了两个速率,O(σ_l d_{eff}^{-1/(2p)} ln^{1-1/p}(1/δ) T^{1/p-1})和O(σ_l^2 d_{eff}^{-1/p} ln^{2-2/p}(1/δ) T^{2/p-2}),比上述最佳结果更快,其中d_{eff}≥1是我们称为“广义有效维度”的量。我们的分析在两个方面优于现有方法:更有效地利用Freedman不等式和更精细的截断误差界在重尾噪声下。此外,我们将细化分析扩展到期望收敛,并获得新的速率,突破了已知的下界。最后,为了补充研究,我们为高概率和期望收敛建立了新的下界。值得注意的是,期望下界与我们的新上界相匹配,表明我们的细化分析在期望收敛方面是最佳的。

英文摘要

Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded $\mathfrak{p}$-th moment where $\mathfrak{p}\in(1,2]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{\mathfrak{l}}^{\mathfrak{p}}$ for some $\sigma_{\mathfrak{l}}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate $\mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{1/\mathfrak{p}-1})$ (resp. $\mathcal{O}(\sigma_{\mathfrak{l}}^2\ln^2(1/\delta)T^{2/\mathfrak{p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two rates, $\mathcal{O}(\sigma_{\mathfrak{l}}d_{\rm eff}^{-1/(2\mathfrak{p})}\ln^{1-1/\mathfrak{p}}(1/\delta)T^{1/\mathfrak{p}-1})$ and $\mathcal{O}(\sigma_{\mathfrak{l}}^2d_{\rm eff}^{-1/\mathfrak{p}}\ln^{2-2/\mathfrak{p}}(1/\delta)T^{2/\mathfrak{p}-2})$, faster than the aforementioned best results, where $d_{\rm eff}\ge1$ is a quantity we call the generalized effective dimension. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.
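
For readers unfamiliar with the clipping operation, the Python sketch below shows the standard norm-clipping rule (rescale a stochastic gradient to norm tau whenever its norm exceeds tau) inside a plain SGD loop. It is a generic illustration and not the paper's algorithm or tuning: the objective f(x) = ||x||_1, the symmetric Pareto noise with tail index 1.5 (finite moments only of order below 1.5), the constant step size, and the threshold are all assumptions chosen for the example.

```python
import math
import random


def clip(g, tau):
    """Rescale g to have Euclidean norm tau if its norm exceeds tau."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= tau:
        return list(g)
    return [x * tau / norm for x in g]


def subgradient_l1(x):
    """A subgradient of the nonsmooth convex function f(x) = ||x||_1."""
    return [(v > 0) - (v < 0) for v in x]


def heavy_tailed_noise(dim, rng, alpha=1.5):
    """Zero-mean symmetric Pareto noise; infinite variance when alpha <= 2."""
    return [rng.choice((-1.0, 1.0)) * (rng.paretovariate(alpha) - 1.0)
            for _ in range(dim)]


def clipped_sgd(x0, steps, lr, tau, rng):
    """Clipped SGD with a constant step size; returns the averaged iterate."""
    x = list(x0)
    avg = [0.0] * len(x)
    for t in range(1, steps + 1):
        noise = heavy_tailed_noise(len(x), rng)
        g = clip([s + n for s, n in zip(subgradient_l1(x), noise)], tau)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]  # running mean
    return avg


# Arbitrary illustrative settings, not the schedule analyzed in the paper.
x_avg = clipped_sgd([5.0, -3.0], steps=2000, lr=0.01, tau=2.0, rng=random.Random(0))
print(x_avg)
```

Because every clipped step has norm at most lr times tau, a single extreme noise draw cannot throw the iterate arbitrarily far, which is the property high-probability analyses of this kind rely on.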

2512.11089 2026-05-19 stat.ML cs.LG 版本更新

TPV: Parameter Perturbations Through the Lens of Test Prediction Variance

TPV:通过测试预测方差的透镜进行参数扰动分析

Devansh Arpit

发表机构 * Modelable AI(可建模人工智能)

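AI summary: This paper introduces test prediction variance (TPV) as a unifying framework for analyzing post-training robustness. By studying the first-order sensitivity of model outputs to parameter perturbations, it places SGD noise, label noise, quantization, and pruning under a single lens, and proposes a TPV-based pruning criterion and model selection method.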

Comments ICML 2026

详情
AI中文摘要

我们引入测试预测方差(TPV)——训练模型输出对参数扰动的一阶敏感性——作为分析训练后鲁棒性的统一框架。TPV是一个完全标签无关的对象,其迹形式将训练好的模型几何结构与特定扰动机制分离,将SGD噪声、标签噪声、量化和剪枝置于同一个视角下。所得到的表达式恢复了SGD和量化噪声的宽谷假设,并给出了标签噪声的Jacobian谱特征,将标签噪声TPV与非线性网络中的良性过拟合联系起来。理论上,我们证明在过参数化极限下,训练集TPV收敛到其测试集对应值,无论泛化性能如何,提供了首个结果:预测方差在局部参数扰动下可以通过训练输入单独推断。经验上,这种稳定性在更广泛的范围内成立,包括非常低的宽度。此外,TPV与测试损失相关联,使其具有实际应用价值:JBR,一种基于TPV几何匹配的无标签剪枝准则,实现了最先进的基线;以及基于训练集的模型选择信号,适用于分布内和迁移学习场景。代码可在github.com/devansharpit/TPV获得。

英文摘要

We introduce test prediction variance (TPV), the first-order sensitivity of a trained model's outputs to parameter perturbations, as a unifying framework for analyzing post-training robustness. TPV is a fully label-free object whose trace form separates the geometry of the trained model from the specific perturbation mechanism, placing SGD noise, label noise, quantization, and pruning under a single lens. The resulting expressions recover the wide-minima hypothesis for SGD and quantization noise, and yield a distinct Jacobian-spectral characterization for label noise, connecting label-noise TPV with benign overfitting in nonlinear networks. Theoretically, we prove that training-set TPV converges to its test-set counterpart in the overparameterized limit, irrespective of generalization performance, providing the first result that prediction variance under local parameter perturbations can be inferred from training inputs alone. Empirically, this stability holds far more broadly, including at very low widths. Further, TPV correlates well with test loss, enabling practical applications: JBR, a label-free pruning criterion derived from TPV geometry that matches state-of-the-art baselines, and a training-set-based model selection signal for in-distribution and transfer learning scenarios. Code available at github.com/devansharpit/TPV.
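
A first-order view of prediction variance can be checked numerically on a toy model. The Python sketch below is an assumption-laden illustration, not the paper's definition of TPV: it assumes a scalar linear model f(x; w) = w . x, isotropic Gaussian parameter noise with standard deviation `sigma`, and the generic first-order formula tr(J Sigma J^T), which for Sigma = sigma^2 I reduces to sigma^2 ||J||^2. Any averaging over test inputs or normalization used by the authors is not reproduced.

```python
import random


def jacobian_linear(x):
    """For f(x; w) = w . x, the Jacobian with respect to w is x itself."""
    return list(x)


def first_order_variance(jac, sigma):
    """tr(J Sigma J^T) with Sigma = sigma^2 I, i.e. sigma^2 * ||J||^2."""
    return sigma ** 2 * sum(j * j for j in jac)


def monte_carlo_variance(w, x, sigma, samples, rng):
    """Empirical variance of the output under random parameter perturbations."""
    outs = []
    for _ in range(samples):
        w_pert = [wi + rng.gauss(0.0, sigma) for wi in w]
        outs.append(sum(a * b for a, b in zip(w_pert, x)))
    mean = sum(outs) / samples
    return sum((o - mean) ** 2 for o in outs) / (samples - 1)


# Hypothetical weights and a single input.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, -1.0]
predicted = first_order_variance(jacobian_linear(x), sigma=0.1)
estimated = monte_carlo_variance(w, x, 0.1, 20000, random.Random(0))
print(predicted, estimated)
```

Note that the predicted value depends only on the input and the perturbation scale, never on a label, which matches the abstract's description of TPV as a label-free object. For a nonlinear network the first-order formula is only an approximation.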

2511.21654 2026-05-19 cs.LG 版本更新

EvilGenie: A Reward Hacking Benchmark

EvilGenie: 一个奖励黑客基准

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

发表机构 * Cambridge Boston Alignment Initiative(剑桥波士顿对齐倡议) MIT FutureTech(麻省理工 FutureTech)

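AI summary: This paper presents EvilGenie, a benchmark for reward hacking in programming settings, where agents can reward hack by hardcoding test cases or editing test files. Reward hacking is measured with held-out unit tests, LLM judges, and test file edit detection, and the LLM judge is found to be highly effective in unambiguous cases.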

详情
AI中文摘要

我们介绍了EvilGenie,一个用于编程环境中的奖励黑客基准。我们从LiveCodeBench中获取问题,并创建了一个环境,使代理能够通过硬编码测试用例或编辑测试文件等方式轻易进行奖励黑客。我们通过三种方式测量奖励黑客:保留的单元测试、LLM判断和测试文件编辑检测。我们验证了这些方法与人类审查和彼此之间的对比。我们发现LLM判断在无歧义情况下检测奖励黑客非常有效,而保留的测试用例使用仅带来最小的改进。除了使用Inspect的basic_agent框架测试许多模型外,我们还测量了三个流行专有编码代理(OpenAI的Codex、Anthropic的Claude Code和Google的Gemini CLI)的奖励黑客率。我们观察到Codex和Claude Code表现出明显的奖励黑客行为,而所有三个代理都表现出不一致的行为。我们的代码库可在https://github.com/JonathanGabor/evilgenie_inspect找到。

英文摘要

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/evilgenie_inspect .
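
Of the three measurements listed in the abstract, test file edit detection is the simplest to picture. The sketch below is only an assumption about how such a check could work: it hashes every file in a test directory before and after an agent run and reports the differences. The function names `snapshot` and `edited_files` are hypothetical and are not taken from the EvilGenie codebase.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir):
    """Map each file under test_dir (relative path) to the SHA-256 of its bytes."""
    root = Path(test_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def edited_files(before, after):
    """Return the paths that were added, removed, or modified between two snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A harness would call `snapshot` once before handing control to the agent and once afterwards; any non-empty result from `edited_files` flags the run for review.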

2511.21016 2026-05-19 cs.LG cs.CL 版本更新

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

门控KalmaNet:通过测试时岭回归实现渐逝记忆层

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

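Affiliations: University of Pennsylvania; AWS Agentic AI

AI summary: This paper proposes Gated KalmaNet (GKA), a layer grounded in the Kalman filter that reduces to online ridge regression at test time, accounting for the full past while retaining the constant memory and linear compute of linear state-space models (SSMs). GKA outperforms existing SSM layers on short-context tasks and improves results on long-context RAG and LongQA.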

Comments 30 pages, 10 figures. Accepted at CVPR 2026

详情
AI中文摘要

线性状态空间模型(SSMs)提供了一种高效的替代softmax注意力机制的方案,具有恒定的内存和线性计算,但其损失性、渐逝的过去总结对需要回忆的任务造成了伤害。我们提出门控KalmaNet(GKA,发音为“gee-ka”),这是一种能够考虑完整过去同时保持SSM风格效率的层。我们的方法基于卡尔曼滤波(KF),并证明了现有的几种SSM层(DeltaNet、门控DeltaNet、Kimi Delta Attention)是在恒等误差协方差假设下的卡尔曼滤波递归近似。相比之下,GKA保持完整的误差协方差并计算精确的卡尔曼增益。在稳态假设下,这可以简化为具有恒定内存和线性计算的在线岭回归。标准的卡尔曼滤波方程在低精度设置(如bfloat16)下数值不稳定且难以在GPU上并行化。我们通过(1)输入依赖的门控进行自适应正则化以控制岭回归的条件数,以及(2)Chebyshev迭代,证明其在低精度下比传统迭代求解器更稳定。我们进一步开发了针对硬件的分块内核以提高训练效率。实证上,GKA在短上下文任务中优于现有的SSM层(如Mamba2、门控DeltaNet),并在长上下文RAG和LongQA任务中达到128k token的相对改进超过10%。我们还展示了当扩展到ImageNet分类时,GKA优于Mamba。我们的代码,包括用于训练和推理的Triton内核(vLLM),以及在HuggingFace上8B和32B规模的GKA基于混合模型的模型库,均以Apache 2.0许可证发布。

英文摘要

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF equations are numerically unstable in low-precision settings (e.g., bfloat16) and hard to parallelize on GPUs. We address this with (1) adaptive regularization via input-dependent gating to control the ridge regression's condition number, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low precision. We further develop hardware-aware chunk-wise kernels for efficient training. Empirically, GKA outperforms existing SSM layers (e.g., Mamba2, Gated DeltaNet) on short-context tasks and achieves more than 10% relative improvement on long-context RAG and LongQA up to 128k tokens. We further show GKA outperforms Mamba when extended to ImageNet classification. Our code, including Triton kernels for training and inference (vLLM), along with a model zoo of GKA-based Hybrid models at 8B and 32B scale on HuggingFace, is released under Apache 2.0.
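
To make the pairing of ridge regression and Chebyshev Iteration concrete, here is a minimal sketch of the textbook Chebyshev iteration applied to a small ridge system (K^T K + λI) x = K^T y. It is not the paper's kernel: the gating, chunking, and low-precision handling are omitted, and the names `ridge_system` and `chebyshev_solve` are hypothetical. It assumes the spectrum is bracketed by λ from below and by λ plus the squared Frobenius norm of K from above, which shows why the regularizer controls the condition number.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def ridge_system(K, y, lam):
    """Build A = K^T K + lam * I and b = K^T y for keys K and targets y."""
    n = len(K[0])
    A = [[sum(row[i] * row[j] for row in K) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(K, y)) for i in range(n)]
    return A, b


def chebyshev_solve(A, b, lmin, lmax, iters=200):
    """Solve A x = b for symmetric positive definite A with spectrum in [lmin, lmax].

    Unlike conjugate gradients, the step sizes depend only on the spectral
    bounds, so no inner products of the residual are needed.
    """
    d = (lmax + lmin) / 2.0  # centre of the spectral interval
    c = (lmax - lmin) / 2.0  # half-width of the spectral interval
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = [0.0] * len(b)
    alpha = 0.0
    for i in range(1, iters + 1):
        if i == 1:
            p = list(r)
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 2 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    return x


K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lam = 1.0
A, b = ridge_system(K, [1.0, 0.0, 1.0], lam)
frob_sq = sum(v * v for row in K for v in row)
x = chebyshev_solve(A, b, lmin=lam, lmax=lam + frob_sq)
```

Raising `lam` narrows the ratio `lmax / lmin`, so the same number of iterations reaches a smaller residual.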

2511.16309 2026-05-19 cs.CV cs.LG 版本更新

Sparse Autoencoders are Topic Models

稀疏自编码器是主题模型

Leander Girrbach, Zeynep Akata

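Affiliations: Technical University of Munich (TUM); Munich Center for Machine Learning (MCML); Helmholtz Munich

AI summary: This paper views sparse autoencoders (SAEs) as topic models. It proposes a continuous topic model (CTM) for embedding spaces and derives the SAE objective as a maximum a posteriori estimator under that model, which implies that SAE features are thematic components rather than steerable directions.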

Comments ICML 2026

详情
AI中文摘要

稀疏自编码器(SAEs)被用于分析嵌入,但其作用和实用价值存在争议。我们提出了一种新的视角,通过展示它们可以自然地被理解为主题模型。我们受到潜在狄利克雷分配(LDA)的启发,提出了一种连续主题模型(CTM)用于嵌入空间,并在此模型下推导出SAE目标作为最大后验估计器。这种观点表明SAE特征是主题性组件而非可调节方向。为了验证我们的理论发现,我们引入了SAE-TM主题建模框架,该框架:(1)训练SAE以学习可重用的主题原子;(2)将它们解释为下游数据中的词分布;(3)将它们合并到任意数量的主题中而无需重新训练。SAE-TM在文本和图像数据集上比强大的基线产生更连贯的主题,同时保持多样性。最后,我们分析了图像数据集中的主题结构,并追踪了日本木版画中主题随时间的变化。我们的工作将SAEs定位为跨模态大规模主题分析的有效工具。代码可在https://github.com/ExplainableML/SAE-TM获取。

英文摘要

Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We propose a continuous topic model (CTM) inspired by Latent Dirichlet Allocation (LDA) for embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. To confirm our theoretical findings, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code is available at https://github.com/ExplainableML/SAE-TM .

2511.11934 2026-05-19 cs.LG cs.CV 版本更新

A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

基于表示和训练范式转变的分布外检测系统分析

Claudio César Claros Olivares, Austin J. Brockmeier

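Affiliations: Department of Electrical & Computer Engineering, University of Delaware

AI summary: This paper benchmarks CSFs for out-of-distribution (OOD) detection through a representation-centric lens, analyzing the effect of architecture, training paradigm, and dataset. It proposes a PCA-based projection-filtering procedure that improves detector performance and a Neural Collapse-based approach for predicting the shortlist of competitive detectors.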

详情
AI中文摘要

我们通过表示中心的视角系统评估了分布外检测(OOD)的CSFs。我们的研究涵盖了CNN和ViT架构、多种训练范式、四个图像分类源数据集(CIFAR-10、CIFAR-100、SuperCIFAR-100和TinyImageNet),以及通过CLIP衍生的语义距离将OOD数据集分为近、中、远三个区域。为了比较这些设置下的CSFs,我们采用了一种多重比较受控的排名流程,该流程在无阈值排名指标(AURC和AUGRC)下识别出统计上不可区分的顶级聚类。主要经验发现是,竞争性检测器家族更依赖于学习的表示而不是单纯的分数设计。对于CNN和ViT,简单的概率分数在误分类检测中占主导地位。在CNN中,基于边界的分数在近OOD区域最强,而几何感知分数如NNGuide、fDBD和CTM在移位严重性增加时变得更具竞争力。在微调的ViT中,顶级聚类主要由重建和残差分数主导。为了解释这些排名变化,我们使用神经坍塌(NC)指标分析最后一层表示。得到的图景在不同架构中是一致的:原型和边界感知分数在表示更坍塌且与分类器权重更好对齐时更强,而弱坍塌区域则更青睐梯度和流形基于的分数。基于这些见解,我们提出两个贡献:一种基于PCA的投影过滤过程,可以提高检测器性能,以及一种利用训练分类器计算的NC测量来预测其竞争性的分布外检测器短名单的方法,而无需任何额外的分布外数据。

英文摘要

We present a systematic benchmark of out-of-distribution (OOD) detection CSFs through a representation-centric lens. Our study spans CNN and ViT backbones, multiple training paradigms, four image-classification source datasets (CIFAR-10, CIFAR-100, SuperCIFAR-100, and TinyImageNet), and OOD datasets grouped into near, mid, and far regimes using CLIP-derived semantic distances. To compare CSFs across these settings, we employ a multiple-comparison-controlled rank pipeline that identifies top cliques of statistically indistinguishable winners under threshold-free ranking metrics (AURC and AUGRC). The main empirical finding is that the competitive detector family depends more on the learned representation than on score design alone. For both CNNs and ViTs, simple probabilistic scores dominate misclassification detection. On CNNs, margin-based scores are strongest in near-OOD regimes, while geometry-aware scores such as NNGuide, fDBD, and CTM become more competitive as shift severity increases. On fine-tuned ViTs, the top cliques are led mainly by reconstruction- and residual-based scores. To interpret these ranking shifts, we analyze the last-layer representation using Neural Collapse (NC) metrics. The resulting picture is consistent across architectures: prototype- and boundary-aware scores become stronger when the representation is more collapsed and better aligned with classifier weights, whereas weaker-collapse regimes favor gradient- and manifold-based scores. Building on these insights, we propose two contributions: a simple PCA-based projection-filtering procedure that improves detector performance, and an approach that uses NC measurements computed from a trained classifier to predict its competitive out-of-distribution detector shortlist, without requiring any additional OOD data.
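
The abstract ranks CSFs with threshold-free metrics such as AURC. As a minimal sketch of the usual definition, the area under the risk-coverage curve can be computed by sorting samples from most to least confident and averaging the error rate of the retained samples at each coverage level. The sketch assumes that a CSF assigns one confidence score per sample; the function name `aurc` is hypothetical and the paper's exact evaluation code may differ.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    confidences: one score per sample, higher meaning more confident.
    correct: one boolean per sample, True if the prediction was right.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / kept)  # selective risk when keeping the top `kept` samples
    return sum(risks) / len(risks)
```

A score function that places every misclassified sample at the low-confidence end achieves the smallest value for a given accuracy, which is why the metric needs no threshold.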

2511.08704 2026-05-19 cs.CV cs.LG 版本更新

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

重新思考生成图像预训练:我们离扩大下一步像素预测还有多远?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

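Affiliations: Google DeepMind

AI summary: This paper studies the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end, yet under-explored framework for unified vision models. Transformers trained on 32x32 images are evaluated on three metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. The optimal scaling strategy is strongly task-dependent, and as image resolution increases, model size must grow faster than data size. Projections indicate that compute, not the amount of training data, is the main bottleneck; with compute growing four to five times annually, pixel-by-pixel image modeling is forecast to become feasible within five years.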

Comments Accepted by ICML 2026

详情
AI中文摘要

本文研究了自回归下一步像素预测的扩展特性,一种简单、端到端但尚未充分探索的统一视觉模型框架。从32x32分辨率的图像开始,我们训练了一系列Transformer模型,使用IsoFlops配置在计算预算高达7e19 FLOPs的情况下进行训练,并评估了三个不同的目标指标:下一步像素预测目标、ImageNet分类准确率和基于生成的完成度(通过Fr'echet距离测量)。首先,最优扩展策略高度依赖于任务。在固定的32x32分辨率下,图像分类和图像生成的最优扩展特性不同,其中生成最优设置要求数据量增长是分类最优设置的三到五倍。其次,随着图像分辨率的增加,最优扩展策略表明模型大小必须比数据量增长得更快。令人惊讶的是,通过投影我们的发现,我们发现主要瓶颈是计算能力,而不是训练数据量。随着计算能力每年增长四到五倍,我们预测在五年内可以实现像素级图像建模。

英文摘要

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed resolution of 32x32, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
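
As a rough check of the scale behind this forecast, assume purely for illustration that the four-to-five-fold annual growth compounds for five full years. The cumulative increase in available compute would then be roughly three orders of magnitude:

```latex
\[
4^{5} = 1024 \qquad\text{and}\qquad 5^{5} = 3125
\]
```

The abstract does not state how much additional compute the authors project to be necessary, so these figures only bound what the stated growth rate would supply.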

2511.03828 2026-05-19 cs.LG 版本更新

From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning

从静态约束到动态适应:样本级约束放松用于离线到在线强化学习

Lipeng Zu, Yu Qian, Shayok Chakraborty, Xiaonan Zhang

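Affiliations: Department of Computer Science, Florida State University, Tallahassee, FL, USA

AI summary: This paper proposes DARE, a framework that relaxes constraints at the sample level based on behavioral consistency. It addresses the tension in offline-to-online reinforcement learning between retaining offline conservatism and adapting to online feedback, improving fine-tuning stability and outperforming strong baselines.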

详情
AI中文摘要

离线到在线强化学习(O2O RL)面临在保留离线保守性与适应在线反馈下的分布偏移挑战。此挑战出现因为数据行为在微调期间演变,使得数据来源成为约束处理的误导基础,从而导致目标-数据不匹配。因此,我们提出了动态对齐用于放松(DARE),一种基于行为模型的行为一致性分布感知框架,用于样本级约束放松。据我们所知,DARE是第一个通过后验诱导交换机制将约束放松条件化于行为一致性,超越二元离线/在线数据区别的方法。重要的是,DARE仅需要每个样本的行为对齐,使它能够在许多离线算法上进行实例化,具有灵活的行为模型和微调目标选择。我们提供理论分析,显示基于行为的样本交换一致地提高了离线样本人群与在线样本人群之间的区分。在D4RL上的实验表明,DARE一致提高了微调稳定性,并在强离线到在线基线之上实现了优越的最终性能。(代码可在https://github.com/lpzu/DARE上公开获取。)

英文摘要

Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves during fine-tuning, rendering data origin a misleading basis for constraint handling and thereby leading to objective-data mismatch. We therefore propose Dynamic Alignment for RElaxation (DARE), a distribution-aware framework for sample-level constraint relaxation based on the behavioral consistency with a behavior model. To our knowledge, DARE is the first to condition constraint relaxation on behavioral consistency via a posterior-induced exchange mechanism, moving beyond a binary offline/online data distinction. Importantly, DARE requires only per-sample behavioral alignment, enabling instantiation on top of many offline algorithms with flexible choices of behavior models and fine-tuning objectives. We provide a theoretical analysis showing that behavior-based sample exchange consistently improves the distinction between offline-like and online-like subsets. Experiments on D4RL demonstrate that DARE consistently improves fine-tuning stability and achieves superior final performance over strong offline-to-online baselines. (The code is publicly available at https://github.com/lpzu/DARE .)

2511.02610 2026-05-19 cs.LG 版本更新

Towards Migrating Neural Network Implementations

向神经网络实现迁移迈进

Nadia Daoudi, Ivan Alfonso, Jordi Cabot

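Affiliations: Luxembourg Institute of Science and Technology; University of Luxembourg; Technology University of Luxembourg Luxembourg

AI summary: This paper proposes an approach for automatically migrating neural network code across deep learning frameworks, using a pivot neural network model to create an abstraction of the network before migration and so ease migration between neural network libraries.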

Comments To appear at the International Conference on AI-powered Software (AIware 2026)

详情
AI中文摘要

智能系统的开发(即通过AI组件增强的系统)得益于神经网络(NNs)的快速进步。由于神经网络设计和实现的支持,各种库和框架随之涌现。选择框架取决于可用功能、易用性、文档和社区支持等因素。在采用某个NN框架后,组织可能后来选择切换到另一个框架,如果性能下降、需求变化或新功能被引入。不幸的是,由于缺乏专门针对NNs的迁移方法,跨库迁移NN实现具有挑战性。这导致了更多的现代化时间与努力,因为手动更新是必要的,以避免依赖过时的实现并确保与新功能的兼容性。在本文中,我们提出了一种自动迁移神经网络代码跨深度学习框架的方法。我们的方法利用一个中间NN模型来创建迁移前的抽象。我们通过两个流行的NN框架,即PyTorch和TensorFlow,验证了我们的方法。我们还讨论了在两个框架之间迁移代码的挑战以及我们的方法如何处理这些问题。对五个NN的实验评估显示,我们的方法成功地迁移了它们的代码,并生成了与原始功能等效的NN。我们的工作成果已在线上可用。

英文摘要

The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.

2510.26384 2026-05-19 cs.AI cs.LG 版本更新

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Scales++: 一种计算高效的评估子集选择方法,基于认知尺度嵌入

Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz

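Affiliations: Thomson Reuters Foundational Research; University of Oxford; Imperial College London

AI summary: This paper proposes Scales++, an item-centric method that selects evaluation subsets based on the intrinsic properties of task items. It reduces upfront selection cost while retaining predictive fidelity, making large language model evaluation more efficient, and it improves cold-start performance and interpretability.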

Comments 9 pages, 2 figures, 4 tables

详情
AI中文摘要

对大规模语言模型(LLMs)进行全面评估的高昂成本需要创建小而有代表性的数据子集(即小型基准),以实现高效的评估同时保留预测保真度。当前的方法基于模型为中心的范式,根据现有模型的集体性能选择基准项目。这些方法受限于前期成本高、无法立即处理新基准(冷启动)以及假设未来模型会共享前代模型的失败模式的脆弱性。在本文中,我们提出了一种新的以项目为中心的基准子集选择方法,认为选择应基于任务项目的内在属性,而不是模型特定的失败模式。我们通过一种新的方法Scales++来实现这种以项目为中心的高效基准方法,其中数据选择基于基准样本的认知需求。实证研究表明,Scales++将前期选择成本降低了超过18倍,同时实现了有竞争力的预测保真度。在Open LLM Leaderboard上,使用仅0.25%的数据子集,我们预测完整基准分数的均方误差为3.2%,在Humanity's Last Exam上,使用2.0%的样本预测完整分数的均方误差为2.9%。我们证明这种以项目为中心的方法可以在不显著降低保真度的情况下更高效地评估模型,同时提供更好的冷启动性能和更可解释的基准测试。

英文摘要

The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.

2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA 版本更新

Tongyi DeepResearch Technical Report

通义深研技术报告

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

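Affiliations: Tongyi Lab; Alibaba Group

AI summary: This report presents Tongyi DeepResearch, an agentic large language model designed for long-horizon, deep information-seeking research tasks. An end-to-end training framework combining agentic mid-training and agentic post-training enables scalable reasoning and information seeking across complex tasks, supported by a highly scalable, fully automatic data synthesis pipeline that needs no costly human annotation. The model achieves state-of-the-art performance on a range of agentic deep research benchmarks.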

Comments https://tongyi-agent.github.io/blog

详情
AI中文摘要

我们介绍了通义深研,一种专为长周期、深度信息检索任务设计的代理大语言模型。为了激励自主深度研究代理,通义深研通过端到端训练框架结合代理中期和后期训练,实现了在复杂任务中的可扩展推理和信息检索。我们设计了一个高度可扩展的数据合成管道,完全自动化,无需依赖昂贵的人工标注,并赋能所有训练阶段。通过为每个阶段构建定制化环境,我们的系统在整个过程中实现了稳定一致的交互。通义深研拥有305亿总参数,每token仅激活33亿个参数,在多个代理深度研究基准测试中,包括人类最后考试、浏览比较、浏览比较-中文、WebWalkerQA、xbench-DeepSearch、FRAMES和xbench-DeepSearch-2510,均取得了最先进的性能。我们开源了该模型、框架和完整解决方案,以赋能社区。

英文摘要

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO 版本更新

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

M2H:基于高效窗口交叉任务注意力的多任务学习用于单目空间感知

U. V. B. L Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science(地球观测科学系)

AI总结 本文提出M2H框架,通过高效的窗口交叉任务注意力模块,实现单目图像上的语义分割、深度估计、边缘检测和表面法线估计,同时在计算效率上优于现有方法。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

详情
AI中文摘要

在边缘设备上部署实时空间感知需要高效的多任务模型,这些模型能够在利用互补任务信息的同时最小化计算开销。本文介绍了Multi-Mono-Hydra(M2H),一种新的多任务学习框架,用于从单张单目图像中进行语义分割、深度、边缘和表面法线估计。与传统方法依赖独立单任务模型或共享编码器-解码器架构不同,M2H引入了基于窗口的跨任务注意力模块,实现了结构化的特征交换同时保留任务特定的细节,提高了任务间预测的一致性。M2H基于轻量级的ViT-based DINOv2主干网络,优化了实时部署,并作为支持动态环境中3D场景图构建的单目空间感知系统的基础。全面评估显示,M2H在NYUDv2上优于最先进的多任务模型,在Hypersim上超越了单任务深度和语义基线,在Cityscapes数据集上实现了更优的性能,同时在笔记本硬件上保持计算效率。除了基准测试外,M2H还在真实世界数据上得到了验证,证明了其在空间感知任务中的实用性。

英文摘要

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2510.16609 2026-05-19 cs.LG cs.AI cs.CC cs.DS 版本更新

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

先验知识使其成为可能:从次线性图算法到LLM测试时方法

Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless

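Affiliations: Toyota Technological Institute at Chicago; Columbia University; Google Research

AI summary: This paper studies the theoretical basis of how prior knowledge and external information interact in test-time augmentation, modeling multi-step reasoning as s-t connectivity on a knowledge graph. It relates the number of augmentation steps to graph structure: when the prior knowledge graph is split into small components, finding a path requires Ω(√n) queries, whereas once the density of correct knowledge passes a threshold and a giant component forms, an expected constant number of queries suffices.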

详情
AI中文摘要

测试时增强,如检索增强生成(RAG)或工具使用,关键依赖于模型参数知识与外部检索信息之间的相互作用。然而,这种关系的理论基础仍不明确。具体来说,不清楚在少量增强步骤下需要多少预训练知识来回答查询,这在实践中是理想的属性。为了解决这个问题,我们将多步推理建模为知识图中的s-t连通性问题。我们将模型的预训练参数知识表示为部分、可能嘈杂的子图。我们将增强视为查询一个 oracle 以获得真实的边,从而扩展模型的知识。然后,我们表征了在部分先验知识下,模型生成准确答案所需的必要和充分的增强步骤数。一个关键结果表明:如果包含n个顶点的知识图被分割成小组件,则通过增强找到路径是低效的,需要Ω(√n)次查询。另一方面,一旦正确知识的密度超过阈值,形成大组件,我们可以通过预期常数次查询找到路径。

英文摘要

Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
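
The setup described above (prior knowledge as a partial subgraph, augmentation as oracle queries for true edges) can be sketched with a union-find structure. This is only an illustration of the problem statement, not the paper's algorithm or its query strategy: the oracle is modeled, as an assumption, by a pre-supplied list of true edges revealed one per query, and the name `queries_until_connected` is hypothetical.

```python
def find(parent, v):
    """Return the component representative of v, compressing the path on the way."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def union(parent, a, b):
    """Merge the components containing a and b."""
    parent[find(parent, a)] = find(parent, b)


def queries_until_connected(n, prior_edges, oracle_answers, s, t):
    """Count the oracle queries used before s and t fall in the same component.

    prior_edges: edges the model already knows (its parametric knowledge).
    oracle_answers: true edges revealed in order, one per query (hypothetical oracle).
    Returns (queries_used, connected).
    """
    parent = list(range(n))
    for a, b in prior_edges:
        union(parent, a, b)
    queries = 0
    for a, b in oracle_answers:
        if find(parent, s) == find(parent, t):
            break
        queries += 1
        union(parent, a, b)
    return queries, find(parent, s) == find(parent, t)
```

With a fragmented prior such as `[(0, 1), (2, 3), (4, 5)]` every bridge between components costs a query, while a prior that already links s to t needs none, which mirrors the two regimes in the abstract at toy scale.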

2510.16252 2026-05-19 cs.LG cs.CL 版本更新

WEBSERV: A Full-Stack and RL-Ready Web Environment for Training Web Agents at Scale

WEBSERV: 一个全栈且适合强化学习的网页环境,用于大规模训练网页代理

Yuxuan Lu, Ziyi Wang, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Xianfeng Tang, Chen Luo, Yisi Sang, Jin Lai, Dakuo Wang

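Affiliations: Northeastern University; Amazon

AI summary: This paper presents WebServ, a full-stack, RL-ready web environment for training web agents at scale. On the server side, Incus containers reduce launch latency and storage requirements; on the browser side, WebServ provides an observation and action interface derived automatically from the DOM and a robust action execution backend. WebServ achieves state-of-the-art single-prompt results on WebArena-Lite, and models trained with RL inside it surpass the compared baselines.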

详情
AI中文摘要

针对网页代理强化学习需求,本文提出WebServ,一个全栈且适合强化学习的网页环境,用于大规模训练网页代理。当前网页环境存在不足:服务器端Docker设置过于资源密集,无法支持大规模并行展开;浏览器端接口产生噪声观察,执行动作在现代单页应用中不可靠,并遗漏视觉交互提示。我们引入WebServ,一个全栈、适合强化学习的网页环境,解决这些限制。在服务器端,WebServ使用Incus容器,通过块级拷贝-写入减少启动延迟约5倍,持久化存储减少约240倍,使单台主机支持200+个隔离环境。在浏览器端,WebServ提供一个紧凑的、站点无关的观察和动作接口,自动从DOM派生,并提供人类对齐的交互提示,以及使用网络感知等待的稳健动作执行后端。在WebArena-Lite上,WebServ实现了最先进的单提示结果,受控比较确认在GPT-4o、OpenAI-o3和Llama-3.1-8B上均优于普通WebArena。我们进一步在WebServ中完全训练Qwen3-4B和Qwen3-30B-A3B;RL训练的4B模型在均值准确率上达到55.5%,超过了Claude 4.5 Sonnet(50.0%)和WebAgent-R1中的RL训练8B模型(51.8%)

英文摘要

Reinforcement learning (RL) for web agents demands environments that are both effective for evaluation and efficient enough for large-scale on-policy training. Current web environments fall short: server-side Docker setups are too resource-intensive for massive parallel rollouts, while browser-side interfaces produce noisy observations, execute actions unreliably under modern single-page applications, and omit visual interactivity cues. We introduce WebServ, a full-stack, RL-ready web environment that addresses these limitations end-to-end. On the server side, WebServ uses Incus containers with block-level copy-on-write, reducing launch latency by ~5x and persistent storage by ~240x, enabling 200+ concurrent isolated environments on a single host. On the browser side, WebServ provides a compact, site-agnostic observation and action interface derived automatically from the DOM with human-aligned interactivity cues, and a robust action execution backend using network-aware waiting for reliable SPA support. On WebArena-Lite, WebServ achieves state-of-the-art single-prompt results, with controlled comparisons confirming consistent gains across GPT-4o, OpenAI-o3, and Llama-3.1-8B over vanilla WebArena. We further train Qwen3-4B and Qwen3-30B-A3B with RL entirely within WebServ; the RL-trained 4B model achieves 55.5% mean accuracy, surpassing both Claude 4.5 Sonnet (50.0%) and the RL-trained 8B model from WebAgent-R1 (51.8%).

2510.13068 2026-05-19 cs.LG cs.AI cs.HC 版本更新

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

NeuroRVQ:多尺度生物信号分词用于生成式基础模型

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

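Affiliations: Imperial College London; Cogitat; National and Kapodistrian University of Athens; Archimedes Research Unit; Aristotle University of Thessaloniki; Northeastern University London

AI summary: This paper proposes NeuroRVQ, a multi-scale biosignal tokenizer that decomposes biosignals with multi-scale temporal convolutions and combines hierarchical RVQ codebooks with a phase-aware loss to reconstruct signals with high fidelity, showing that high-quality tokenization matters for downstream performance.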

详情
AI中文摘要

生物信号如脑电图(EEG)、心电图(ECG)和肌电信号(EMG)在多个时间和频谱尺度上编码生理活动,产生丰富但对机器学习具有挑战性的表示。训练以预测掩码信号标记为基础模型的方法在学习通用生物信号表示方面显示出前景,但其性能取决于分词器保留高频动态和高保真重建信号的能力。我们引入NeuroRVQ,一种适用于高保真信号重建的多模态生物信号分词家族。为了捕获完整的频谱,NeuroRVQ通过多尺度时序卷积将生物信号分解为频特定表示,每个表示编码为层次化的RVQ代码本以保留高频细节,并结合一种新的相位感知训练损失,该损失尊重傅里叶相位的环形拓扑。通过调整时间分辨率、时间核的数量和大小以及RVQ深度,此设计适应每种生物信号模态的频谱-时间特性。为验证分词质量驱动下游性能,我们为每种模态训练一个简单的掩码标记基础模型(NeuroRVQ-FM)使用相应的NeuroRVQ分词器。NeuroRVQ-FM家族在与现有模态特定基础模型相比时实现了竞争或更优的下游性能,证明了高保真分词是有效生物信号建模的关键因素。

英文摘要

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.
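
The hierarchical RVQ codebooks mentioned above build on residual vector quantization, which the following minimal sketch illustrates in its generic form: each stage quantizes whatever the previous stages left unexplained. The codebooks here are tiny hand-written examples rather than learned ones, and the multi-scale convolutions and phase-aware loss of NeuroRVQ are not reproduced; `rvq_encode` and `rvq_decode` are hypothetical names.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual of the previous one."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


coarse = [(0, 0), (4, 4), (-4, -4)]                # stage 1: large-scale structure
fine = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]  # stage 2: small corrections
codes = rvq_encode((4.9, 4.2), [coarse, fine])
reconstruction = rvq_decode(codes, [coarse, fine])
```

Adding stages shrinks the residual that remains after decoding, which is the sense in which deeper RVQ preserves finer detail.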

2510.10528 2026-05-19 cs.CL cs.LG 版本更新

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Merlin's Whisper:通过黑盒说服提示增强大语言模型的高效推理

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

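Affiliations: Department of Computing, The Hong Kong Polytechnic University; Sea AI Lab; Peking University

AI summary: This paper proposes Whisper, a framework that uses black-box persuasive prompting to reduce the number of tokens large reasoning models (LRMs) spend on reasoning while preserving performance, and reports substantial token reductions across multiple benchmarks.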

Comments ACL 2026 (Long Paper), camera-ready version

详情
AI中文摘要

大型推理模型(LRMs)通过逐步思考在解决复杂任务方面表现出色。然而,这种漫长的推理过程带来了显著的计算和延迟开销,阻碍了LRMs的实用部署。本文提出了一种通过黑盒说服提示来减轻LRMs过度思考的新方法。通过将LRMs视为黑盒通信者,我们研究如何说服它们生成简洁响应而不影响准确性。我们引入了Whisper,一个迭代细化框架,能够从多种视角生成高质量的说服提示。在多个基准测试中的实验表明,Whisper在保持性能的同时,能够显著减少token使用量。值得注意的是,Whisper在简单的GSM8K问题上对Qwen3模型系列实现了平均3倍的响应长度减少,并在所有基准测试中实现了平均约40%的token减少。对于闭源API,Whisper在MATH-500上分别使Claude-3.7和Gemini-2.5的token使用量减少了46%和50%。进一步分析显示,Whisper在数据领域、模型规模和家族中的广泛应用,凸显了黑盒说服提示作为提升LRM效率的实用策略的潜力。

英文摘要

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.

2510.10140 2026-05-19 cs.LG cs.CR stat.ML 版本更新

Adversarial Attacks on Downstream Weather Forecasting Models: Application to Tropical Cyclone Trajectory Prediction

对下游天气预测模型的对抗攻击:应用于热带气旋轨迹预测

Yue Deng, Francisco Santos, Pang-Ning Tan, Lifeng Luo

发表机构 * Michigan State University(密歇根州立大学)

AI总结 本文研究了深度学习天气预测模型面对对抗攻击的脆弱性,提出了一种新的攻击方法Cyc-Attack,通过扰动上游预测生成对抗性轨迹,在更准确匹配攻击者目标轨迹的同时使扰动更难被检测。

Comments Compared with the previous version, we added zeroth-order optimization methods as baselines, clarified the motivation for using a surrogate model, and provided a more detailed investigation of the upstream attack

详情
AI中文摘要

基于深度学习的天气预测(DLWF)模型利用过去的天气观测数据生成未来的预测,支持包括热带气旋(TC)预测在内的广泛下游应用。在本文中,我们研究了这些模型面对对抗攻击的脆弱性:对上游预测的细微扰动即可改变下游的TC轨迹预测。尽管近来针对DLWF模型的对抗攻击研究有所增长,但要构造出能将下游输出引向攻击者指定轨迹的扰动上游预测仍然具有挑战性。首先,传统的TC检测系统是不透明、不可微的黑箱,这使得标准的基于梯度的攻击不可行。其次,TC事件极为罕见,导致严重的类别不平衡问题,因而难以设计出这样的攻击方法:其对上游预测的扰动既能产生看起来真实的气旋路径,又能与攻击者的目标轨迹一致。为了克服这些限制,我们提出了Cyc-Attack,一种通过扰动DLWF模型的上游预测来生成对抗性轨迹的新方法。该方法使用可微的替代模型来近似TC检测器的输出,从而可以应用基于梯度的攻击。Cyc-Attack还采用了偏度感知的损失函数和核膨胀策略来解决不平衡问题。最后,基于距离的梯度加权方案和正则化用于约束扰动并消除看起来不真实的轨迹,从而使对抗性上游预测更难被检测。我们的实验表明,与传统攻击方法相比,Cyc-Attack在匹配攻击者目标轨迹方面具有更高的真阳性率,同时误报率更低、扰动更隐蔽。

英文摘要

Deep learning-based weather forecasting (DLWF) models leverage past weather observations to generate future forecasts, supporting a wide range of downstream applications, including tropical cyclone (TC) prediction. In this paper, we investigate their vulnerability to adversarial attacks, where subtle perturbations to the upstream forecasts can alter the downstream TC trajectory predictions. Although research into adversarial attacks on DLWF models has grown recently, it remains challenging to craft perturbed upstream forecasts that steer the downstream outputs toward attacker-specified trajectories. First, conventional TC detection systems are opaque, non-differentiable black boxes, making standard gradient-based attacks infeasible. Second, the extreme rarity of TC events leads to a severe class imbalance problem, making it difficult to develop attack methods for perturbing upstream forecasts that produce realistic-looking cyclone paths aligned with the attacker's target trajectories. To overcome these limitations, we propose Cyc-Attack, a novel method for perturbing the upstream forecasts of DLWF models to generate adversarial trajectories. The proposed method uses a differentiable surrogate model to approximate the TC detector's output, enabling the application of gradient-based attacks. Cyc-Attack also employs a skewness-aware loss function with a kernel dilation strategy to address the imbalance problem. Finally, a distance-based gradient weighting scheme and regularization are used to constrain the perturbations and eliminate unrealistic-looking trajectories, thereby making the adversarial upstream forecasts less easily detectable. Our experiments show that Cyc-Attack achieves a higher true positive rate in matching the attacker's target trajectories, along with lower false alarm rates and stealthier perturbations than conventional attack methods.

2510.06388 2026-05-19 cs.LG cs.DS stat.ML 版本更新

Truthful Calibration Errors for Multi-Class Prediction

多类预测中的诚实校准误差

Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu

发表机构 * Peking University(北京大学) Northwestern University(西北大学) Microsoft Research, New England(微软研究院(新英格兰)) Northeastern University(东北大学) Khoury College of Computer Sciences(计算机科学学院)

AI总结 本文研究了多类预测中诚实校准误差的实用作用,提出了完美诚实校准误差以处理标签分布的多维线性属性,并分析了这些诚实误差在决策理论上的影响,从而解释并缓解了分箱校准误差的排名鲁棒性问题。

详情
AI中文摘要

校准预测之所以有用,是因为其数值可以被解释为概率。校准误差因此被广泛用于评估、比较和调整概率预测器。最近,Haghtalab等人(2024)引入了一个额外的要求:诚实性。如果预测器通过报告真实的条件标签分布来最小化其预期测量误差,则校准度量是诚实的。许多标准的经验校准误差是非诚实的:预测器可能通过扭曲其概率而不是报告真实值来显得更校准。我们研究了诚实性在多类预测中校准测量的实用作用。首先,我们引入了完美诚实校准误差以处理标签分布的多维线性属性,推广了Hartline等人(2025)中二元预测的诚实校准误差。此框架包括完整的多类校准和类内校准。我们还确定了置信度校准的诚实修正。其次,我们分析了这些诚实误差的决策理论影响。对于校准预测器,诚实校准误差保持了Blackwell主导性:更信息丰富的校准预测器不会产生更大的预期误差。第三,我们表明这种决策理论解释解释并缓解了已观察到的分箱校准误差的排名鲁棒性问题。经验上,非诚实的置信度校准误差在分箱数量变化时可能逆转模型排名,而我们的诚实误差在不同分箱选择下提供更稳定的排名。

英文摘要

Calibrated predictions are useful because their numerical values can be interpreted as probabilities. Calibration errors are therefore widely used to evaluate, compare, and tune probabilistic predictors. Recently, Haghtalab et al. (2024) introduced an additional requirement for such measures: truthfulness. A calibration measure is truthful if a predictor minimizes its expected measured error by reporting the true conditional label distribution. Many standard empirical calibration errors are non-truthful: a predictor may appear better calibrated by distorting its probabilities rather than reporting them truthfully. We study the practical role of truthfulness for calibration measurement in multiclass prediction. First, we introduce perfectly truthful calibration errors for multidimensional linear properties of the label distribution, generalizing the truthful calibration error for binary predictions in Hartline et al. (2025). This framework includes full multiclass calibration and classwise calibration. We also identify a truthful correction for confidence calibration. Second, we characterize the decision-theoretic implications of these truthful errors. For calibrated predictors, truthful calibration errors preserve the Blackwell dominance: a more informative calibrated predictor receives no larger expected error. Third, we show that this decision-theoretic interpretation explains and mitigates the well-observed ranking robustness problem of binned calibration errors. Empirically, non-truthful confidence-based errors can reverse model rankings when the number of bins changes, while our truthful errors give more stable rankings across binning choices.
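The ranking instability described in this abstract can be reproduced with the standard binned confidence calibration error (the usual ECE). The sketch below is a toy illustration only: it does not implement the paper's truthful errors, which the abstract does not define, and the two "models" with two predictions each are made-up data chosen so that the ranking flips between 1 bin and 10 bins.

```python
def binned_confidence_ece(confidences, correct, n_bins):
    """Standard binned expected calibration error for top-label confidence.

    confidences: predicted top-label probabilities in [0, 1].
    correct:     1 if the top-label prediction was right, else 0.
    n_bins:      number of equal-width confidence bins.
    """
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        conf_sum[b] += c
        acc_sum[b] += y
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(acc_sum[b] - conf_sum[b]) / count[b]  # |accuracy - mean confidence|
            ece += (count[b] / n) * gap
    return ece


# Hypothetical toy predictors (illustrative data, not from the paper).
model_a = ([0.6, 0.9], [1, 0])
model_b = ([0.8, 0.8], [1, 0])

a_1bin = binned_confidence_ece(*model_a, n_bins=1)     # about 0.25
b_1bin = binned_confidence_ece(*model_b, n_bins=1)     # about 0.30
a_10bin = binned_confidence_ece(*model_a, n_bins=10)   # about 0.65
b_10bin = binned_confidence_ece(*model_b, n_bins=10)   # about 0.30
```

With a single bin, model A looks better calibrated than model B; with ten bins the order reverses, which is the kind of bin-dependent ranking the abstract refers to.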

2510.05921 2026-05-19 cs.CL cs.LG 版本更新

Prompt reinforcing for long-term planning of large language models

通过提示强化实现大语言模型的长期规划

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

发表机构 * Heinrich-Heine-Universität Düsseldorf(杜塞尔多夫海因里希·海涅大学)

AI总结 本文提出了一种受强化学习启发的提示优化框架,仅通过修改LLM代理的任务指令提示来实现长期规划,提升了文本到SQL和任务导向对话等多轮交互任务的表现,并能泛化到不同的LLM代理,还可利用多种LLM作为元提示代理。

详情
AI中文摘要

大型语言模型(LLMs)在广泛自然语言处理任务中取得了显著成功,并可通过提示进行适应。然而,它们在多轮交互中仍表现不足,常依赖错误的早期假设,无法随时间跟踪用户目标,使此类任务尤其具有挑战性。先前对话系统的工作表明,长期规划对于处理交互任务至关重要。在本工作中,我们提出了一种受强化学习启发的提示优化框架,仅通过修改LLM代理的任务指令提示即可实现此类规划。通过生成回合间的反馈并利用经验回放进行提示重写,我们的方法在文本到SQL和任务导向对话等多轮任务中显示出显著改进。此外,该方法能跨不同LLM代理泛化,并可利用多种LLM作为元提示代理。这促使未来在受强化学习启发的无参数优化方法上的研究。

英文摘要

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior work in dialogue systems has shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning by modifying only the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

2510.01479 2026-05-19 cs.LG cs.SY eess.SY 版本更新

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

密度比加权行为克隆:从受污染的数据集中学习控制策略

Shriram Karpoora Sundara Pandian, Ali Baheri

发表机构 * Department of Cybersecurity(网络安全系) Rochester Institute of Technology(罗切斯特理工学院) Mechanical Engineering Department(机械工程系)

AI总结 本文提出了一种鲁棒的模仿学习方法Density-Ratio Weighted Behavioral Cloning,利用一个小规模、经过验证的干净参考集估计轨迹级密度比,以优先学习干净的专家行为并降低受污染数据的权重或将其丢弃,从而在无需了解污染机制的情况下提升策略性能。

详情
AI中文摘要

离线强化学习(RL)可基于固定数据集进行策略优化,因而适用于无法进行在线探索的安全关键应用。然而,这些数据集常受到对抗性投毒、系统错误或低质量样本的污染,导致标准行为克隆(BC)和离线RL方法的策略性能下降。本文介绍了密度比加权行为克隆(Weighted BC),一种鲁棒的模仿学习方法:它利用一个小规模、经过验证的干净参考集,通过二元判别器估计轨迹级密度比。这些比值经截断后用作BC目标中的权重,以优先学习干净的专家行为,同时降低受污染数据的权重或将其丢弃,而无需了解污染机制。我们建立了理论保证,证明该方法能够收敛到干净的专家策略,且有限样本界与污染率无关。我们还建立了一个全面的评估框架,在连续控制基准上涵盖多种投毒协议(奖励、状态、转移和动作)。实验表明,Weighted BC即使在高污染比例下也能保持接近最优的性能,优于传统BC、批量约束Q学习(BCQ)和行为正则化Actor-Critic(BRAC)等基线方法。

英文摘要

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).
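The weighting step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the discriminator outputs d = P(clean | trajectory), uses the standard classifier-based density-ratio identity d / (1 - d) with a class-prior correction, and the clip value, function names, and example numbers are hypothetical.

```python
def density_ratio_weights(clean_probs, n_clean, n_mixed, clip_max=5.0):
    """Convert discriminator outputs into clipped density-ratio weights.

    clean_probs: discriminator estimates d = P(clean | trajectory) for each
                 trajectory in the possibly corrupted training set.
    n_clean:     size of the clean reference set used to train the discriminator.
    n_mixed:     size of the corrupted training set used to train it.
    clip_max:    hypothetical upper bound applied to each ratio.
    """
    weights = []
    for d in clean_probs:
        d = min(max(d, 1e-6), 1.0 - 1e-6)               # keep the ratio finite
        ratio = (d / (1.0 - d)) * (n_mixed / n_clean)   # estimate of p_clean / p_mixed
        weights.append(min(ratio, clip_max))            # clipping bounds any single weight
    return weights


def weighted_bc_loss(weights, neg_log_likelihoods):
    """Weighted average of per-trajectory behavioral cloning losses."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * nll for w, nll in zip(weights, neg_log_likelihoods)) / total


# Illustrative values only: the third trajectory looks corrupted to the discriminator.
weights = density_ratio_weights([0.5, 0.9, 0.01], n_clean=100, n_mixed=100)
loss = weighted_bc_loss(weights, [1.0, 1.0, 100.0])
```

In this toy run the suspicious trajectory receives a weight near 0.01, so its large loss contributes little to the objective, while the clip keeps the confidently clean trajectory from dominating.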

2510.00304 2026-05-19 cs.LG cs.AI 版本更新

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

在不断变化的世界中学习的障碍:对学习能力丧失的数学理解

Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

发表机构 * ETH Zürich(苏黎世联邦理工学院) Apple(苹果公司)

AI总结 本文研究了在非平稳环境中深度学习模型因学习能力丧失(LoP)而失效的问题,通过动力系统理论分析了LoP的两个主要机制,并探讨了缓解策略。

详情
AI中文摘要

深度学习模型在静态数据上表现优异,但在非静态环境中因一种称为学习能力丧失(LoP)的现象而表现不佳,即其未来学习能力下降。本文首次从原理上研究了基于梯度的学习中的LoP。基于动力系统理论,我们通过在参数空间中识别稳定的流形来正式定义LoP,这些流形会捕获梯度轨迹。我们的分析揭示了两种主要机制,这些机制创造了这些陷阱:来自激活饱和的冻结单元和来自表征冗余的克隆单元流形。我们的框架揭示了一个根本性的矛盾:在静态设置中促进泛化的属性,如低秩表示和简单性偏差,直接在持续学习场景中促成LoP。我们通过数值模拟验证了我们的理论分析,并探讨了架构选择或针对性扰动作为潜在的缓解策略。

英文摘要

Deep learning models excel on stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.

2509.22849 2026-05-19 cs.CC cs.DM cs.LG cs.NE 版本更新

Parameterized Hardness of Zonotope Containment and Neural Network Verification

Zonotope包含与神经网络验证的参数化困难性

Vincent Froese, Moritz Grillo, Christoph Hertrich, Moritz Stargalla

发表机构 * Technische Universität Berlin(柏林工业大学) Max Planck Institute for Mathematics in the Sciences(马克斯·普朗克科学数学研究所) University of Technology Nuremberg(纽伦堡工业大学)

AI总结 本文研究2层ReLU网络所计算函数的正性判定问题,证明其以输入维度d为参数时是W[1]-难的,并由此得到Zonotope(非)包含、最大值近似和Lp-Lipschitz常数计算与近似等问题的困难性结果,表明朴素枚举方法在指数时间假设下本质上是最优的。

Comments 20 pages, 5 figures, paper accepted at ICLR 2026

详情
AI中文摘要

具有ReLU激活函数的神经网络是机器学习中广泛使用的模型。因此,深入理解此类网络所计算函数的性质至关重要。最近,确定这些性质的(参数化)计算复杂性引起了越来越多的关注。在本工作中,我们填补了若干空白,并解决了Froese等人[COLT '25]提出的一个关于网络验证相关问题参数化复杂性的公开问题。特别是,我们证明了以d为参数时,判定由2层ReLU网络计算的函数f:R^d→R的正性(从而满射性)是W[1]-难的。这一结果还表明,Zonotope(非)包含问题关于d是W[1]-难的,该问题在计算几何、控制理论和机器人学中具有独立的研究价值。此外,我们还证明了在2层ReLU网络中以任意乘法因子近似最大值、计算2层网络的Lp-Lipschitz常数(p∈(0,∞])以及近似3层网络的Lp-Lipschitz常数都是NP难的,并且关于d是W[1]-难的。值得注意的是,我们的困难性结果是目前已知最强的,表明求解这些基础问题的朴素枚举方法在指数时间假设下本质上都是最优的。

英文摘要

Neural networks with ReLU activations are a widely used model in machine learning. It is thus important to have a profound understanding of the properties of the functions computed by such networks. Recently, there has been increasing interest in the (parameterized) computational complexity of determining these properties. In this work, we close several gaps and resolve an open problem posed by Froese et al. [COLT '25] regarding the parameterized complexity of various problems related to network verification. In particular, we prove that deciding positivity (and thus surjectivity) of a function $f\colon\mathbb{R}^d\to\mathbb{R}$ computed by a 2-layer ReLU network is W[1]-hard when parameterized by $d$. This result also implies that zonotope (non-)containment is W[1]-hard with respect to $d$, a problem that is of independent interest in computational geometry, control theory, and robotics. Moreover, we show that approximating the maximum within any multiplicative factor in 2-layer ReLU networks, computing the $L_p$-Lipschitz constant for $p\in(0,\infty]$ in 2-layer networks, and approximating the $L_p$-Lipschitz constant in 3-layer networks are NP-hard and W[1]-hard with respect to $d$. Notably, our hardness results are the strongest known so far and imply that the naive enumeration-based methods for solving these fundamental problems are all essentially optimal under the Exponential Time Hypothesis.

2509.18150 2026-05-19 cs.LG cs.AI 版本更新

Improving MLLM Training Efficiency via Stage-Aware Sparsity

通过阶段感知稀疏性提升MLLM训练效率

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

发表机构 * Peking University(北京大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出了一种基于稀疏表示的高效训练框架STS,通过阶段感知设计适应不同训练阶段的冗余,采用视觉标记压缩器和层动态跳过器来减少计算开销,验证了其在多种MLLM架构上的有效性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在各种领域中表现出色,但训练效率低下,由于长输入序列和未充分利用的层间操作导致大量计算冗余。值得注意的是,这种冗余并非静态,而是随训练阶段变化。基于此观察,我们关注训练过程本身,提出了一种基于稀疏表示的高效训练框架,称为稀疏训练方案(STS)。不同于统一的稀疏性策略,STS采用阶段感知设计,适应训练过程中不同的冗余来源。具体而言,该框架包含两个互补组件:视觉标记压缩器,通过在模态对齐过程中压缩视觉标记来减少信息负载;层动态跳过器,通过在指令微调过程中动态跳过不必要的层来减轻计算开销。我们的方法广泛适用于多种MLLM架构,并已在多个基准上进行了广泛评估,证明了其有效性和效率。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV 版本更新

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn: 通过对比学习赋能机器无学习

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出CoUn框架,通过对比学习和监督学习调整保留数据的表示,以提高机器无学习的有效性,实验表明其在多个数据集和模型架构上均优于现有方法。

详情
AI中文摘要

机器无学习(MU)旨在从已训练模型中移除特定'遗忘'数据的影响,同时保留模型对其余'保留'数据的知识。现有的基于标签操纵或模型权重扰动的MU方法往往效果有限。为此,我们引入了CoUn,一种新的MU框架,其灵感来自如下观察:仅使用保留数据从头重新训练的模型,会根据遗忘数据与保留数据的语义相似性对遗忘数据进行分类。CoUn通过对比学习(CL)和监督学习调整已学到的数据表示来模拟这一行为,且两者仅作用于保留数据。具体而言,CoUn(1)利用数据样本之间的语义相似性,通过CL间接调整遗忘数据的表示,(2)通过监督学习使保留数据的表示保持在各自的聚类内。在多种数据集和模型架构上的大量实验表明,CoUn在无学习有效性上始终优于最先进的MU基线。此外,将我们的CL模块集成到现有基线中可以提升其无学习有效性。

英文摘要

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines improves their unlearning effectiveness.

2509.02351 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

序数自适应校正:一种数据导向的带有噪声标签的序数图像分类方法

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology(伊朗科学技术大学计算机工程学院)

AI总结 本文提出了一种数据导向的序数图像分类方法ORDAC,通过利用标签分布学习来建模序数标签的内在模糊性和不确定性,动态调整每个样本的标签分布均值和标准差,从而有效校正噪声标签并提高模型性能。

Comments 10 pages, 5 figures, 5 tables

详情
AI中文摘要

标注数据是训练计算机视觉任务中监督深度学习模型的基本组成部分。然而,标注过程容易产生错误和噪声,在类边界往往模糊的序数图像分类中尤其如此。此类标签噪声会显著降低机器学习模型的性能和可靠性。本文研究序数图像分类任务中标签噪声的检测与校正问题,为此提出了一种新的以数据为中心的方法,称为ORDinal Adaptive Correction(ORDAC),用于自适应地校正噪声标签。该方法利用标签分布学习(LDL)来建模序数标签固有的模糊性和不确定性。在训练过程中,ORDAC动态调整每个样本标签分布的均值和标准差。该方法并不丢弃可能含有噪声的样本,而是对其进行校正,从而充分利用整个训练数据集。我们在年龄估计(Adience)和疾病严重程度检测(糖尿病视网膜病变)基准数据集上,针对多种非对称高斯噪声场景评估了所提方法。结果表明,ORDAC及其扩展版本(ORDAC_C和ORDAC_R)显著提升了模型性能。例如,在含40%噪声的Adience数据集上,ORDAC_R将平均绝对误差从0.86降低到0.62,并将召回率从0.37提高到0.49。该方法在校正原始数据集中固有噪声方面同样有效。这项研究表明,利用标签分布进行自适应标签校正,是在存在噪声数据时增强序数分类模型鲁棒性和准确性的有效策略。

英文摘要

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
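A label distribution parameterized by a mean and a standard deviation, as described in this abstract, can be sketched as a discretized Gaussian over the ordinal classes. This is only an assumed form for illustration: the abstract does not give ORDAC's exact parameterization or its update rule for the mean and standard deviation, and the function names and the choice of 8 classes are hypothetical.

```python
import math


def gaussian_label_distribution(mean, std, n_classes):
    """Discretized Gaussian over ordinal classes 0..n_classes-1, normalized to sum to 1.

    Assumes std > 0. A larger std spreads probability mass over neighboring
    classes, expressing more uncertainty about the annotated label.
    """
    scores = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(n_classes)]
    total = sum(scores)
    return [s / total for s in scores]


def expected_label(dist):
    """Expected ordinal label under a label distribution."""
    return sum(k * p for k, p in enumerate(dist))


# Same annotated class (3), two confidence levels; values are illustrative.
sharp = gaussian_label_distribution(mean=3.0, std=0.5, n_classes=8)
soft = gaussian_label_distribution(mean=3.0, std=2.0, n_classes=8)
```

Under this assumed form, correcting a suspected noisy label amounts to moving `mean` toward a neighboring class and widening or narrowing `std`, rather than discarding the sample.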

2508.06670 2026-05-19 math.NT cs.LG 版本更新

Machines Learn Number Fields, But How? The Case of Galois Groups

机器学会了数域,但如何做到?以Galois群为例

Kyu-Hwan Lee, Seewoo Lee

AI总结 本文借助决策树等可解释的机器学习方法,研究简单模型如何利用Dedekind zeta系数对Q上Galois扩张的Galois群进行分类,从而理解zeta系数的分布如何依赖于Galois群,并证明了新的分类判据。

Comments Accepted version, To appear in Research in Mathematical Sciences

详情
AI中文摘要

通过应用决策树等可解释的机器学习方法,我们研究简单模型如何利用Dedekind zeta系数对Q上次数为4、6、8、9和10的Galois扩张的Galois群进行分类。对机器学习结果的解读使我们能够理解zeta系数的分布如何依赖于Galois群,并证明了对这些扩张的Galois群进行分类的新判据。结合先前的结果,这项工作为机器学习驱动的数学研究新范式提供了又一个例子。

英文摘要

By applying interpretable machine learning methods such as decision trees, we study how simple models can classify the Galois groups of Galois extensions over $\mathbb{Q}$ of degrees 4, 6, 8, 9, and 10, using Dedekind zeta coefficients. Our interpretation of the machine learning results allows us to understand how the distribution of zeta coefficients depends on the Galois group, and to prove new criteria for classifying the Galois groups of these extensions. Combined with previous results, this work provides another example of a new paradigm in mathematical research driven by machine learning.

2507.21035 2026-05-19 cs.AI cs.LG cs.MA q-bio.GN 版本更新

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS:通过代码驱动的基因表达分析进行科学发现的多智能体框架

Haoyang Liu, Yijiang Li, Haohan Wang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 该研究提出GenoMAS多智能体框架,通过类型消息传递协议协调六个专门的LLM代理,以实现基因表达数据的高效处理和科学发现,其在数据预处理和基因识别任务上均优于现有方法。

Comments 51 pages (14 pages for the main text, 10 pages for references, and 27 pages for the appendix)

详情
AI中文摘要

基因表达分析对于许多生物医学发现至关重要,但从原始转录组数据中提取见解仍然极具挑战性,这归因于多个大型半结构化文件的复杂性和对大量领域专业知识的需求。当前的自动化方法往往受到不灵活的工作流或完全自主代理的限制,这些代理缺乏进行严谨科学探究所需的精确度。GenoMAS则另辟蹊径,通过集成结构化工作流的可靠性与自主代理的适应性,提出了一支基于LLM的科学家团队。GenoMAS通过类型消息传递协议协调六个专门的LLM代理,每个代理都为共享的分析画布贡献互补的强项。GenoMAS的核心是一个引导规划框架:编程代理将高层任务指南展开为动作单元,并在每个节点选择前进、修订、绕过或回溯,从而在保持逻辑一致性的同时,灵活适应基因组数据的特性。在GenoTEX基准测试中,GenoMAS在数据预处理方面达到了89.13%的复合相似度相关性,在基因识别方面达到了60.48%的F1分数,分别超过了最佳现有方法10.61%和16.85%。除了指标外,GenoMAS还揭示了由文献支持的生物合理基因-表型关联,同时调整了潜在混杂因素。代码可在https://github.com/Liu-Hy/GenoMAS上获得。

英文摘要

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

2507.16307 2026-05-19 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

钙钛矿-R1:一个专门领域的大型语言模型,用于智能发现前驱体添加剂和实验设计

Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院) School of Chemistry and Life Resource, Renmin University of China(中国人民大学化学与生命资源学院)

AI总结 本研究提出Perovskite-R1,一个专门用于发现钙钛矿太阳能电池前驱体添加剂和实验设计的大型语言模型,通过系统挖掘和整理1232篇高质量科学文献,并整合33269种候选材料,构建了领域特定的指令微调数据集,从而提升材料发现的效率。

Comments 24 pages; 5 figures

详情
Journal ref Communications Materials 7, 86 (2026)
AI中文摘要

钙钛矿太阳能电池(PSCs)因其卓越的功率转换效率和有利的材料特性而迅速成为下一代光伏技术的有力竞争者。尽管有这些进展,长期稳定性、环境可持续性和可扩展制造等挑战仍然阻碍其商业化。前驱体添加剂工程显示出通过提高PSCs的性能和耐久性来解决这些问题的潜力。然而,科学文献的爆炸性增长以及材料、工艺和设备架构之间的复杂相互作用,使研究人员难以高效地访问、组织和利用该领域内的领域知识。为此,我们介绍了Perovskite-R1,一个具有先进推理能力的专门大型语言模型(LLM),专门用于发现和设计PSC前驱体添加剂。通过系统挖掘和整理1232篇高质量科学出版物,并整合一个包含33,269种候选材料的全面库,我们使用自动问答生成和推理链的方法构建了一个领域特定的指令微调数据集。在该数据集上微调QwQ-32B模型,得到了Perovskite-R1,它可以智能地综合文献见解,生成创新且实用的解决方案用于缺陷钝化和前驱体添加剂的选择。对几个模型提出策略的实验验证证实了它们在提高材料稳定性和性能方面的有效性。我们的工作展示了领域适应的LLM在加速材料发现中的潜力,并提供了一个闭环框架,用于智能、数据驱动的钙钛矿光伏研究进展。

英文摘要

Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

发表机构 * Stanford University(斯坦福大学) Toyota Research Institute(丰田研究院)

AI总结 本文提出了一种几何感知的4D视频生成模型,通过跨视角点图对齐进行训练,以确保生成视频在多视角下的3D一致性,从而在单个RGB-D图像输入下生成时空一致的未来视频序列,并在不依赖相机姿态的情况下实现稳定的视觉和空间对齐预测。

Comments ICLR 2026; Project website: https://robot4dgen.github.io

详情
AI中文摘要

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力,但生成既在时间上连贯、又在不同相机视角间几何一致的视频仍然是一项重大挑战。为此,我们提出了一种4D视频生成模型,在训练过程中以跨视角点图对齐作为监督,以保证生成视频的多视角3D一致性。通过这种几何监督,模型学习到共享的3D场景表示,使其能够在每个视角仅给定一张RGB-D图像、且不依赖相机位姿作为输入的情况下,从新视角生成时空对齐的未来视频序列。与现有基线方法相比,我们的方法在多个模拟和真实世界机器人数据集上产生了视觉上更稳定、空间上更对齐的预测。我们进一步表明,预测的4D视频可借助现成的6自由度位姿跟踪器恢复机器人末端执行器轨迹,由此得到的机器人操作策略能够很好地泛化到新的相机视角。

英文摘要

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.23549 2026-05-19 cs.AI cs.HC cs.LG 版本更新

CooT: Learning to Coordinate In-Context with Coordination Transformers

CooT: 利用协调Transformer学习上下文内协调

Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

发表机构 * Graduate Institute of Communication Engineering, National Taiwan University (NTU)(国立台湾大学通信工程研究所) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心) University of Utah(犹他大学)

AI总结 本研究提出CooT框架,通过上下文学习实现实时合作伙伴适应,解决了多智能体系统中协调不熟悉合作伙伴的挑战,其核心方法是通过观察学习对齐动作与合作伙伴意图,主要贡献是实现了在多样合作伙伴行为下的泛化能力。

Comments ICML 2026

详情
AI中文摘要

在多智能体系统中,协调不熟悉合作伙伴仍然是一个重大挑战。现有方法,如基于种群的方法,通过多样性提高鲁棒性,但通常缺乏在训练分布之外高效适应的机制。此外,微调在少样本设置中不可行,因为其交互成本高。为了解决这些限制,我们提出了CooT,一个利用上下文学习(ICL)进行实时合作伙伴适应的框架。与以往专注于任务泛化的ICL方法不同,CooT旨在在多样化的合作伙伴行为上实现泛化。在行为偏好智能体的轨迹上训练,它通过观察学习对齐动作与合作伙伴意图。我们在两个具有挑战性的多智能体基准测试中评估了CooT:Overcooked和Google Research Football。结果表明,CooT在性能上始终优于基于种群的方法、基于梯度的微调和Meta-RL基线,实现了稳定且快速的适应,而无需参数更新。人类评估也发现CooT是更受青睐的合作者,我们的消融实验确认了其快速适应新合作伙伴并在突然合作伙伴变化下保持稳定的能力,使其在现实世界的人机协作中具有可靠性。

英文摘要

Effective coordination among unfamiliar partners remains a major challenge in multi-agent systems. Existing approaches, such as population-based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond the training distribution. Moreover, fine-tuning is impractical in few-shot settings due to its high interaction cost. To address these limitations, we propose CooT, a framework that leverages in-context learning (ICL) for real-time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior-preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi-agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real-world human-AI collaboration.

2506.23287 2026-05-19 cs.LG q-bio.QM 版本更新

HDTree: Generative Modeling of Cellular Hierarchies for Robust Lineage Inference

HDTree: 用于鲁棒谱系推断的细胞层次生成建模

Zelin Zang, WenZhe Li, Yongjie Xu, Chang Yu, Changxi Chi, Jingbo Zhou, Zhen Lei, Stan Z. Li

发表机构 * Centre for Artificial Intelligence and Robotics(人工智能与机器人中心) Hong Kong Institute of Science and Innovation(香港创新科学研究院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) School of Engineering, Westlake University(西湖大学工程学院)

AI总结 本文提出HDTree,一种用于鲁棒谱系推断的生成建模框架,通过统一的层次代码库和量化扩散过程捕捉细胞层次关系,提升稳定性与可扩展性,并在通用和单细胞数据集上验证了其在谱系推断准确性、重建质量和层次一致性方面的优越性。

Comments accepted by ICML26

详情
AI中文摘要

在单细胞研究中,追踪和分析高通量单细胞分化轨迹对于理解生物过程至关重要。关键在于对支配细胞发育的层次结构的稳健建模。传统方法在计算成本、性能和稳定性方面存在局限。基于VAE的方法虽有所进展,但仍需要分支特定的网络模块,限制了其可扩展性和稳定性,同时常遭遇后验崩溃问题。为克服这些挑战,我们引入HDTree,一种用于稳健谱系推断的生成建模框架。HDTree通过统一的层次代码库在层次化潜在空间中捕捉树状关系,并利用量化扩散过程建模连续细胞状态转换。通过将生成过程与Waddington景观对齐,该方法不仅提高了稳定性和可扩展性,还增强了推断谱系的生物学合理性。HDTree的有效性通过在通用和单细胞数据集上的比较得到验证,其在谱系推断准确性、重建质量和层次一致性方面均优于现有方法。这些贡献使细胞分化路径的准确高效建模成为可能,为生物学发现提供可靠见解。代码可在 https://github.com/zangzelin/code_HDTree_icml 获取。

英文摘要

In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding biological processes. Key to this is the robust modeling of hierarchical structures that govern cellular development. Traditional methods face limitations in computational cost, performance, and stability. VAE-based approaches have made strides but still require branch-specific network modules, limiting their scalability and stability, while often suffering from posterior collapse. To overcome these challenges, we introduce HDTree, a generative modeling framework designed for robust lineage inference. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and employs a quantized diffusion process to model continuous cell state transitions. By aligning the generative process with the Waddington landscape, this method not only improves stability and scalability but also enhances the biological plausibility of inferred lineages. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in lineage inference accuracy, reconstruction quality, and hierarchical consistency. These contributions enable accurate and efficient modeling of cellular differentiation paths, offering reliable insights for biological discovery. Code is available at https://github.com/zangzelin/code_HDTree_icml.

2506.17312 2026-05-19 cs.SI cs.AI cs.LG 版本更新

Heterogeneous Temporal Hypergraph Neural Network

异构时序超图神经网络

Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin

发表机构 * School of Cyberspace, Hangzhou Dianzi University(杭州电子科技大学信息学院) Data Security Governance Zhejiang Engineering Research Center(浙江数据安全治理工程研究中心) College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 本文提出了一种异构时序超图神经网络(HTHGN),旨在捕捉复杂异构时序超图中的高阶交互关系,通过引入层次注意力机制和对比学习来提升模型对异构节点和超边之间丰富语义的捕捉能力。

Comments Accepted by IJCAI 2025

详情
AI中文摘要

图表示学习(GRL)已成为建模图结构数据的有效技术。在建模现实复杂网络中的异质性和动态性时,针对复杂异构时序图(HTGs)设计的GRL方法已被提出,并在各领域取得了成功应用。然而,大多数现有GRL方法主要关注保留低阶拓扑信息,而忽视了更高阶的组交互关系,这些关系更符合现实网络。此外,大多数现有超图方法只能建模静态同构图,限制了它们对HTGs中高阶交互关系的建模能力。因此,为了同时使GRL模型能够捕捉HTGs中的高阶交互关系,我们首先提出了异构时序超图的正式定义和不依赖额外信息的$P$-均匀异构超边构造算法。然后提出了一种新的异构时序超图神经网络(HTHGN),以完全捕捉HTGs中的高阶交互关系。HTHGN包含一个层次注意力机制模块,同时在异构节点和超边之间进行时间消息传递,以捕捉由超边带来的更宽广感受场中的丰富语义。此外,HTHGN通过最大化HTG中低阶相关异构节点对之间的一致性来进行对比学习,以避免低阶结构的模糊性问题。在三个真实世界HTG数据集上的详细实验结果验证了所提出HTHGN在建模HTGs中高阶交互关系的有效性,并展示了显著的性能提升。

英文摘要

Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving the low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and a $P$-uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN) is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in a wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs in the HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.

2506.06114 2026-05-19 cs.LG 版本更新

Scalable unsupervised feature selection via weight stability

通过权重稳定性实现可扩展的无监督特征选择

Xudong Zhang, Renato Cordeiro de Amorim

发表机构 * School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe, UK(埃塞克斯大学计算机科学与电子工程学院,英国威文豪)

AI总结 本文提出了一种基于Minkowski加权k-均值的无监督特征选择方法,通过聚合不同Minkowski指数下的特征权重来识别稳定且信息丰富的特征,从而提升聚类性能。

详情
AI中文摘要

无监督特征选择对于在高维数据中提升聚类性能至关重要,其中无关特征可能会掩盖有意义的结构。在本文中,我们引入了Minkowski加权k-均值++,一种新的Minkowski加权k-均值初始化策略。我们的初始化策略利用数据本身得出的特征相关性估计,以概率方式选择质心。在此基础上,我们提出了两种新的特征选择算法,FS-MWK++,通过聚合不同Minkowski指数下的特征权重来识别稳定且信息丰富的特征,以及SFS-MWK++,一种基于子采样的可扩展变体。我们通过理论分析支持我们的方法,证明在显式假设噪声特征和聚类结构的情况下,相关特征在不同Minkowski指数下均被赋予比噪声特征更高的权重。我们的软件可在https://github.com/xzhang4-ops1/FSMWK找到。

英文摘要

Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted $k$-means++, a novel initialisation strategy for the Minkowski Weighted $k$-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical analysis, demonstrating that, under explicit assumptions on noise features and cluster structure, relevant features are assigned consistently higher weights than noise features across a range of Minkowski exponents. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
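
To make the idea of aggregating feature weights across Minkowski exponents concrete, here is a minimal sketch. It assumes the closed-form weight update commonly given for Minkowski Weighted $k$-means in earlier work, $w_v = 1 / \sum_u (D_v / D_u)^{1/(p-1)}$ for per-feature within-cluster dispersions $D_v$ and exponent $p > 1$; the abstract does not restate this formula. The function names, the example dispersions, and the plain averaging over exponents are all hypothetical and are not the FS-MWK++ procedure itself.

```python
def mwk_feature_weights(dispersions, p):
    """Feature weights for one cluster from per-feature dispersions D_v.

    Assumes p > 1 and strictly positive dispersions (the closed form
    divides one dispersion by another).
    """
    if p <= 1:
        raise ValueError("this closed form needs p > 1")
    exponent = 1.0 / (p - 1.0)
    return [
        1.0 / sum((d_v / d_u) ** exponent for d_u in dispersions)
        for d_v in dispersions
    ]


def aggregate_weights(dispersions, exponents):
    """Average the weights over several exponents (illustrative aggregation only).

    Simplification: the same dispersions are reused for every exponent,
    although in the real algorithm the dispersions depend on p as well.
    """
    per_p = [mwk_feature_weights(dispersions, p) for p in exponents]
    return [sum(column) / len(per_p) for column in zip(*per_p)]


# Hypothetical dispersions: feature 0 is compact in the cluster, features 1-2 are noisy.
dispersions = [1.0, 4.0, 4.0]
weights_p2 = mwk_feature_weights(dispersions, p=2.0)
stable = aggregate_weights(dispersions, exponents=[1.5, 2.0, 3.0])
```

In this toy setting the compact feature keeps the largest weight for every exponent tried, which is the kind of stability the abstract describes for relevant features.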

2506.03837 2026-05-19 cond-mat.supr-con cond-mat.mtrl-sci cs.AI cs.LG 版本更新

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025: 一个用于人工智能驱动临界温度预测的环境压力高温超导体基准数据集

Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * 1. School of Physics and Beijing Key Laboratory of Opto-electronic Functional Materials & Micro-nano Devices, Renmin University of China, Beijing 100872, China 2. Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China 3. Hefei National Laboratory, Hefei 230088, China

AI总结 本文提出HTSC-2025基准数据集,包含2023至2025年由理论物理学家基于BCS超导理论预测的高温超导材料,旨在促进人工智能在超导材料发现中的应用。

Comments 7 pages, 2 figures

详情
Journal ref
Chinese Physics B 34, 100301 (2025)
AI中文摘要

高温超导材料的发现对人类工业和日常生活具有重要意义。近年来,利用人工智能(AI)预测超导转变温度的研究日益流行,大多数工具声称实现了显著的准确性。然而,该领域缺乏广泛接受的基准数据集,严重阻碍了不同AI算法之间的公平比较以及这些方法的进一步发展。在本工作中,我们提出了HTSC-2025,一个环境压力高温超导基准数据集。该数据集全面涵盖了基于BCS超导理论由理论物理学家在2023至2025年间发现的理论预测超导材料,包括著名的X₂YH₆系统、钙钛矿MXH₃系统、M₃XH₈系统、源自LaH₁₀结构演化的笼状BCN掺杂金属原子系统,以及从MgB₂演化而来的二维蜂窝状系统。HTSC-2025基准数据集已开源在https://github.com/xqh19970407/HTSC-2025并将持续更新。该基准数据集对加速基于人工智能方法的超导材料发现具有重要意义。

英文摘要

The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.

2506.01523 2026-05-19 cs.LG stat.ML 版本更新

Beyond RLHF: A Unified Theoretical Framework of Alignment

超越RLHF:对齐的统一理论框架

Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

发表机构 * KRAFTON UC Berkeley(加州大学伯克利分校) MIT(麻省理工学院) POSTECH

AI总结 本文提出了一种统一的对齐理论框架,通过将对齐视为基于成对偏好的分布学习,推导出三种新的对齐目标,并证明了它们在非渐近情况下具有O(1/n)的收敛性,为RLHF提供了理论支持。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为控制大型语言模型(LLMs)输出质量的主流对齐范式。然而,现有理论未能为RLHF目标本身提供有力的理论依据,并且由于不同方法通常在不同框架下分析,难以比较各种方法的保证。为建立统一的对齐框架,本文探讨在何种假设下可以推导出现有或新的训练目标并获得理论保证。为此,本文将对齐重新表述为基于成对偏好的分布学习,并引入一个概率假设,描述偏好如何揭示关于目标LM的信息。由此我们提出三种有原则的对齐目标:偏好最大似然估计、偏好蒸馏和反向KL最小化。我们证明它们都以非渐近的$O(1/n)$速率收敛到目标LM,并自然地避免退化。特别是,反向KL与RLHF目标高度相似,为RLHF提供了有力的理论支持。此外,本文的理论首次解释了如下实证发现:同策略(on-policy)目标(如RLHF)通常优于似然式目标(如DPO)。最后,实验结果表明,所提出的目标在多个任务和模型上可与强基线相媲美。

英文摘要

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for the RLHF objective itself and do not allow comparisons of the guarantees between various methods because different methods are often analyzed under different frameworks. Toward a unified framework for alignment, we ask under what assumptions we can derive existing or new training objectives and obtain theoretical guarantees. To this end, we reframe alignment as distribution learning from pairwise preferences, which makes a probabilistic assumption describing how preferences reveal information about the target LM. This leads us to propose three principled alignment objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We prove that they all enjoy strong non-asymptotic $O(1/n)$ convergence to the target LM, naturally avoiding degeneracy. In particular, reverse KL highly resembles the RLHF objective, providing strong justification for RLHF. Furthermore, our theory explains, for the first time, the empirical finding that on-policy objectives (e.g., RLHF) typically outperform likelihood-style objectives (e.g., DPO). Finally, empirical results indicate that the proposed objectives are competitive with strong baselines across several tasks and models.
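
For context, the resemblance between reverse KL minimization and the RLHF objective can be seen from a standard identity for KL-regularized reward maximization. The notation below is assumed here and is not taken from the abstract: a reward $r$, a reference policy $\pi_{\mathrm{ref}}$, and a regularization strength $\beta > 0$. The paper's own formulation may differ.

```latex
\begin{align*}
J(\pi) &= \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
          - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big), \\
\pi^{\star}(y \mid x) &= \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)}{Z(x)},
  \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big), \\
J(\pi) &= -\beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi^{\star}(\cdot \mid x)\big) + \beta \log Z(x).
\end{align*}
```

Since $\beta \log Z(x)$ does not depend on $\pi$, maximizing $J$ under these assumptions is the same as minimizing the reverse KL divergence from $\pi$ to the tilted target $\pi^{\star}$.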

2505.11143 2026-05-19 stat.ML cs.LG 版本更新

Nash: Neural Adaptive Shrinkage for Structured High-Dimensional Regression

Nash: 用于结构化高维回归的神经自适应收缩

William R. P. Denault

发表机构 * Departments of Statistics and Human Genetics(统计学与人类遗传学系)

AI总结 本文提出Nash框架,通过神经网络整合协变量特定的侧信息,实现高维稀疏回归,提升模型适应性和准确性。

详情
AI中文摘要

稀疏线性回归是数据分析中的基本工具。然而,传统方法在协变量具有结构或来自异质来源时往往表现不佳。在生物医学应用中,协变量可能来自不同的模态或根据潜在图结构进行组织。我们引入了神经自适应收缩(Nash),一种统一的框架,通过神经网络将协变量特定的侧信息整合到稀疏回归中。Nash在每个协变量的基础上自适应地调节惩罚项,学习调整正则化而无需交叉验证。我们使用一种分裂变分经验贝叶斯算法,将先验学习与后验推断解耦,将每轮扫描的M步骤从每个神经网络传递的O(p)次减少到一次批量传递,相对于之前提出的坐标上升CAVI方法,在p在10²到10⁴之间时,实测时间加速了74到106倍。在真实数据上的实验表明,Nash在准确性和适应性上优于现有方法。

英文摘要

Sparse linear regression is a fundamental tool in data analysis. However, traditional approaches often fall short when covariates exhibit structure or arise from heterogeneous sources. In biomedical applications, covariates may stem from distinct modalities or be structured according to an underlying graph. We introduce Neural Adaptive Shrinkage (Nash), a unified framework that integrates covariate-specific side information into sparse regression via neural networks. Nash adaptively modulates penalties on a per-covariate basis, learning to tailor regularization without cross-validation. We use a split variational empirical Bayes algorithm that decouples prior learning from posterior inference, reducing the M-step from $\mathcal{O}(p)$ neural-network passes per sweep to a single batched pass, a 74 to 106x wall-clock speedup over previously proposed coordinate ascent CAVI for $p$ between $10^2$ and $10^4$. Experiments on real data demonstrate that Nash improves accuracy and adaptability over existing methods.

2505.09203 2026-05-19 cond-mat.mtrl-sci cond-mat.supr-con cs.AI cs.LG 版本更新

InvDesFlow-AL: active learning-based workflow for inverse design of functional materials

InvDesFlow-AL: 基于主动学习的反向设计功能材料工作流程

Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) School of Engineering Science, University of Chinese Academy of Sciences(中国科学院大学工程科学学院)

AI总结 本研究提出了一种基于主动学习的反向设计功能材料框架InvDesFlow-AL,通过迭代优化材料生成过程,提高性能特征的准确性,并在低形成能和低Ehull材料设计中取得显著成果,成功发现超导材料Li₂AuH₆。

Comments 29 pages, 11 figures

详情
Journal ref
npj Computational Materials 11, 364 (2025)
AI中文摘要

开发具有特定性能的功能材料的反向设计方法对于推进可再生能源、催化、能量存储和碳捕集等领域的进步至关重要。基于扩散原理的生成模型可以直接生成满足性能约束的新材料,从而显著加速材料设计过程。然而,现有生成和预测晶体结构的方法往往受限于低成功率。在本工作中,我们提出了一种新的反向材料设计生成框架InvDesFlow-AL,该框架基于主动学习策略。该框架可以迭代优化材料生成过程,逐步引导其向期望的性能特征发展。在晶体结构预测方面,InvDesFlow-AL模型实现了RMSE为0.0423 Å,相比现有生成模型性能提高了32.96%。此外,InvDesFlow-AL已成功应用于低形成能和低Ehull材料的设计。它可以系统地生成具有逐步降低形成能的材料,同时在多样化的化学空间中不断扩展探索。这些结果充分证明了所提出的基于主动学习的生成模型在加速材料发现和反向设计中的有效性。为进一步证明该方法的有效性,我们以InvDesFlow-AL探索的常压下BCS超导体搜索为例。结果,我们成功发现了Li₂AuH₆作为传统BCS超导体,具有超高的转变温度140 K。这一发现为反向设计在材料科学中的应用提供了有力的实证支持。

英文摘要

Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for generating and predicting crystal structures often remain limited by low success rates. In this work, we propose a novel inverse material design generative framework called InvDesFlow-AL, which is based on active learning strategies. This framework can iteratively optimize the material generation process to gradually guide it towards desired performance characteristics. In terms of crystal structure prediction, the InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing a 32.96% improvement in performance compared to existing generative models. Additionally, InvDesFlow-AL has been successfully validated in the design of low-formation-energy and low-Ehull materials. It can systematically generate materials with progressively lower formation energies while continuously expanding the exploration across diverse chemical spaces. These results fully demonstrate the effectiveness of the proposed active learning-driven generative model in accelerating material discovery and inverse design. To further prove the effectiveness of this method, we took the search for BCS superconductors under ambient pressure as an example explored by InvDesFlow-AL. As a result, we successfully identified Li$_2$AuH$_6$ as a conventional BCS superconductor with an ultra-high transition temperature of 140 K. This discovery provides strong empirical support for the application of inverse design in materials science.

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild: 面向真实场景机器人策略的灵巧人类交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出DexWild框架,通过结合人类和机器人示范数据,提升机器人在多样化环境中的泛化能力,实验表明其在未见环境中的成功率显著高于传统方法。

Comments In RSS 2025. Website at https://dexwild.github.io

详情
AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径,但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集,但其高成本限制了可扩展性。相反,如果人们可以像在日常生活中一样使用自己的手来收集数据呢?在DexWild中,一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据,我们创建了DexWild-System,一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练,相较于单独训练每个数据集,其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略,只需少量额外的机器人特定数据。实验结果表明,DexWild显著提高了性能,在未见环境中实现了68.5%的成功率,几乎是仅使用机器人数据训练的策略的四倍,并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io.

2505.06852 2026-05-19 cs.LG stat.ML 版本更新

Improving Random Forests by Smoothing

通过平滑改进随机森林

Ziyi Liu, Phuc Luong, Mario Boley, Daniel F. Schmidt

发表机构 * Faculty of Information Technology, Monash University(莫纳什大学信息科技学院) Faculty of Computer and Information Science, University of Haifa(海法大学计算机与信息科学学院)

AI总结 本文提出一种基于核的平滑机制,通过引入局部正则性来增强随机森林的预测性能,同时保留其自适应分区能力,特别是在数据稀缺情况下提升了预测效果。

Comments v2: Accepted manuscript. 30 pages (18 main + 12 appendix), 6 figures

详情
AI中文摘要

随机森林回归是一种强大的非参数方法,通过数据驱动的分区适应局部数据特征,在各种应用领域中表现出色。然而,随机森林预测的分段常数性质意味着每个分区都是独立预测的,忽略了潜在的函数平滑性。特别是在小数据情况下,输入空间内缺乏信息共享可能导致性能不佳。在本文中,我们提出了一种基于核的平滑机制,通过引入局部正则性来增强随机森林,同时保留其自适应分区能力。我们的方法将核平滑应用于随机森林的分段常数输出,有效地结合了基于树的方法的适应性和核方法的平滑性假设。我们证明这种平滑过程可以被解释为在重新采样训练输入的情况下捕捉树切分点的变异性/不确定性。实验证实,所提出的平滑随机森林模型在各种测试案例中一致提高了预测性能,特别是在数据稀缺的情况下。代码、数据集和实验结果可在 https://github.com/Neal-Liu-Ziyi/SmoothedRandomForest.git 公开获取。

英文摘要

Random forest regression is a powerful non-parametric method that adapts to local data characteristics through data-driven partitioning, making it effective across diverse application domains. However, the piecewise constant nature of random forest predictions means each partition is predicted independently, ignoring potential smoothness in the underlying function. Particularly in the small data regime, this lack of information sharing across the input space can lead to suboptimal performance. In this work, we propose a kernel-based smoothing mechanism that enhances random forests by introducing local regularity to their predictions while preserving their adaptive partitioning capabilities. Our approach applies kernel smoothing to the piecewise constant outputs of random forests, effectively combining the adaptability of tree-based methods with the smoothness assumptions of kernel methods. We show that this smoothing procedure can be interpreted as capturing the variability/uncertainty in the tree cut points under resampling of the training inputs. Empirical results demonstrate that the proposed smoothed random forest model consistently improves predictive performance across diverse test cases, particularly in data-scarce settings. Code, datasets, and experiment results are publicly available at https://github.com/Neal-Liu-Ziyi/SmoothedRandomForest.git.
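
As a rough illustration of kernel smoothing applied to a piecewise constant predictor, the sketch below convolves a one-dimensional step function (standing in for a single tree's output) with a Gaussian kernel in closed form. The Gaussian kernel, the fixed bandwidth, and all names are assumptions made for this example; the abstract does not specify the kernel or how its bandwidth is chosen.

```python
import math


def norm_cdf(t):
    """Standard normal CDF; also handles t = +inf and t = -inf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def step_predict(x, cuts, values):
    """Piecewise constant prediction: values[i] applies between cuts[i-1] and cuts[i]."""
    i = 0
    while i < len(cuts) and x > cuts[i]:
        i += 1
    return values[i]


def smoothed_predict(x, cuts, values, bandwidth):
    """Expected step prediction under Gaussian noise N(x, bandwidth^2) on the input.

    Each leaf value is weighted by the Gaussian probability mass that
    falls inside that leaf's interval.
    """
    edges = [-math.inf] + list(cuts) + [math.inf]
    total = 0.0
    for value, lo, hi in zip(values, edges[:-1], edges[1:]):
        mass = norm_cdf((hi - x) / bandwidth) - norm_cdf((lo - x) / bandwidth)
        total += value * mass
    return total


# Hypothetical one-split "tree": predicts 0.0 left of the cut at 0.0 and 1.0 right of it.
cuts, values = [0.0], [0.0, 1.0]
raw_left = step_predict(-0.1, cuts, values)
raw_right = step_predict(0.1, cuts, values)
smooth_left = smoothed_predict(-0.1, cuts, values, bandwidth=1.0)
smooth_right = smoothed_predict(0.1, cuts, values, bandwidth=1.0)
```

The raw predictions jump from 0 to 1 across the cut, while the smoothed ones change gradually, so nearby inputs share information across the partition boundary.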

2505.02621 2026-05-19 cs.LG math.OC stat.ML 版本更新

Mirror Mean-Field Langevin Dynamics

镜像均场 Langevin 动力学

Anming Gu, Juno Kim

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出镜像均场 Langevin 动力学(MMFLD),用于优化受限在 $\mathbb{R}^d$ 子集上的概率测度,并通过统一的对数 Sobolev 不等式获得连续 MMFLD 的线性收敛性保证,以及其时间-粒子离散化版本的统一时间传播混沌结果。

Comments ICML 2026

详情
AI中文摘要

均场 Langevin 动力学(MFLD)在 $\mathbb{R}^d$ 上的 Wasserstein 空间上最小化一个熵正则化的非线性凸函数,并最近因其作为无限宽度两层神经网络等相互作用粒子系统的梯度下降动力学模型而受到关注。然而,许多感兴趣的问题具有受限的域,而现有的均场算法由于全局扩散项无法解决此类问题。我们通过将 MFLD 扩展到镜像 Langevin 框架,提出镜像均场 Langevin 动力学(MMFLD),以研究受限在 $\mathbb{R}^d$ 的凸子集上的概率测度的优化。我们通过统一的对数 Sobolev 不等式获得了连续 MMFLD 的线性收敛性保证,并获得了其时间-粒子离散化版本的统一时间传播混沌结果。

英文摘要

The mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over $\mathbb{R}^d$, and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of $\mathbb{R}^d$ by proposing the mirror mean-field Langevin dynamics (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

2505.00409 2026-05-19 eess.AS cs.AI cs.LG 版本更新

Perceptual implications of automatic anonymization in pathological speech

病态语音中自动匿名化的感知影响

Soroosh Tayebi Arasteh, Saba Afza, Tri-Thien Nguyen, Lukas Buess, Maryam Parvin, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Hiu Ching Hung, Mahshad Lotfinia, Thomas Gorges, Elmar Noeth, Maria Schuster, Seung Hee Yang, Andreas Maier

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. Department of Urology, Stanford University, Stanford, CA, USA. Department of Radiology, Stanford University, Stanford, CA, USA. Lab for AI in Medicine, RWTH Aachen University, Aachen, Germany. Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany. Institute of Radiology, University Hospital Erlangen, Erlangen, Germany. Department of Foreign Language Education, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. Department of Otorhinolaryngology, Head and Neck Surgery, Ludwig-Maximilians-Universität München, Munich, Germany. Speech & Language Processing Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.

AI总结 本研究通过结构化协议评估自动匿名化病态语音的人感知影响,发现匿名化在不同疾病中存在显著差异,且感知质量下降,但临床严重程度评分保持稳定,同时发现感知结果与计算隐私指标脱钩。

详情
AI中文摘要

自动匿名化日益用于促进伦理共享的临床语音,但其感知和临床后果仍不明确。我们通过结构化协议,使用十名母语和非母语德语听众(涵盖临床和信号处理专业知识)对自动匿名化的病态语音进行了以人为中心的评估。受试者包括来自CLP、构音障碍、构语障碍、失声及成人和儿童对照组的180名德语说话者。每段原始录音及其自动匿名化版本在四个任务上进行评估:零样本图灵式辨别、少量样本辨别后短暂熟悉、5点质量评分以及4点盲评临床严重程度评分由资深语音病学家完成。听众在零样本和少量样本任务中检测到匿名化准确率分别为91%和93%,不同疾病之间存在显著差异(p=0.008),且熟悉度降低该差异。感知质量在0-100分上下降了30分(p<0.001),重新组织了各组的感知质量等级。母语影响了可检测性但不影响质量退化,而领域专业知识影响了质量退化但不影响可检测性,形成双分离现象;说话者性别和年龄无明显偏差。临床严重程度评分在构音障碍、构语障碍和失声中保持几乎完美的一致(二次加权Cohen's kappa 0.87-0.94),无录音移位超过一级。关键发现是感知结果与标准计算隐私指标脱钩:计算上匿名化最强的病态语音在感知上最不明显,反之亦然。这些发现支持了按疾病类型和听众类型、经临床验证的评估作为许可匿名语音用于临床使用的最低标准。

英文摘要

Automatic anonymization is increasingly used to enable ethical sharing of clinical speech, yet its perceptual and clinical consequences remain undercharacterized. We present a human-centered evaluation of automatically anonymized pathological speech, using a structured protocol with ten native and non-native German listeners spanning clinical and signal-processing expertise. The cohort comprised 180 German speakers from cleft lip and palate (CLP), Dysarthria, Dysglossia, Dysphonia, and adult and child controls. Each original recording and its automatically anonymized counterpart was evaluated on four tasks: zero-shot Turing-style discrimination, few-shot discrimination after brief familiarization, 5-point quality rating, and 4-point blinded clinical severity rating by a senior phoniatrician. Listeners detected anonymization at 91% zero-shot and 93% few-shot accuracy, with significant variation across disorders (p=0.008) that attenuated with familiarization. Perceived quality dropped by 30 points on a 0-100 scale (p<0.001), reorganizing the perceived-quality hierarchy across groups. Native language modulated detectability but not quality degradation, while domain expertise modulated quality degradation but not detectability, a double dissociation between the two listener attributes; speaker sex and age produced no detectable bias. Clinical severity ratings were preserved at near-perfect agreement in Dysarthria, Dysglossia, and Dysphonia (quadratic-weighted Cohen's kappa 0.87-0.94), with no recording shifting by more than one grade. Crucially, perceptual outcomes were decoupled from the standard computational privacy metric: the pathology with the strongest computational anonymization was the least perceptually conspicuous, and vice versa. These findings argue for disorder-stratified, listener-stratified, clinician-validated evaluation as the minimum standard for licensing anonymized speech for clinical use.
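
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of quadratic-weighted Cohen's kappa for two sets of ordinal ratings. It uses the standard definition with disagreement weights $(i-j)^2/(k-1)^2$ for $k$ grades. The severity grades in the example are hypothetical and are not data from the study.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, num_grades):
    """Cohen's kappa with quadratic weights; grades are integers 0..num_grades-1."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Observed confusion counts between the two sets of ratings.
    observed = [[0.0] * num_grades for _ in range(num_grades)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_grades)) for j in range(num_grades)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two ratings were statistically independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when all ratings fall in one grade")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical 4-point severity grades for six recordings, before and after anonymization.
original = [0, 1, 2, 3, 1, 2]
anonymized = [0, 1, 2, 2, 1, 2]
kappa = quadratic_weighted_kappa(original, anonymized, num_grades=4)
```

Because the weights grow with the squared distance between grades, a one-grade shift (as in the fourth recording here) is penalized far less than a shift across the whole scale.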

2504.16397 2026-05-19 cs.DB cs.LG 版本更新

Compass: SLO-aware Query Planner for Compound AI Serving at Scale

Compass: 一种面向大规模复合AI服务的SLO感知查询计划器

Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出Compass,首个面向大规模复合AI工作负载的SLO感知查询计划器,通过将多查询、多SLO规划问题分解为可处理的子问题,利用查询间和查询内的计划相似性减少搜索步骤,并借助计划分析器提高每步效率,从而在资源竞争下最大化满足SLO的有效吞吐量(goodput)。

详情
AI中文摘要

将多个算子集成到一条流水线中的复合AI服务正在兴起,使生成式AI会议助手、自动驾驶和沉浸式游戏等终端用户应用成为可能。这些工作负载跨越多样化的部署空间,从纯云查询到跨基础设施层级的边缘辅助查询,同一应用中往往两者兼有。要实现高服务有效吞吐量(goodput),即满足流水线延迟、准确性和成本方面的服务级别目标(SLO),需要联合规划算子的放置、配置和资源分配。然而,多样化的SLO、变化的运行环境(如异构设备速度)以及大量竞争共享基础设施的查询使规划空间急剧膨胀,现有方法难以实现实时服务和成本高效的部署。本文提出Compass,首个SLO感知的查询计划器,用于优化跨多样化部署空间的大规模复合AI工作负载。Compass将多查询、多SLO规划问题分解为可处理的子问题,同时保持全局决策质量,并利用查询内和跨查询的计划相似性来大幅减少搜索步骤。它还通过计划分析器提高每步效率,该分析器进行选择性分析,以很小的分析成本获得高保真的性能估计。在运行时,Compass执行查询-计划二分匹配,以在资源竞争下最大化满足SLO的有效吞吐量。真实场景评估表明,Compass将服务有效吞吐量提高2.4-5.1倍,将部署成本降低3.8-4.5倍,并将规划速度提升4.2-10.5倍,实现秒级的服务响应和接近最优的决策质量。

英文摘要

The rise of compound AI serving that integrates multiple operators in a pipeline enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, often including both within an application. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to slash the search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contentions. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.

2504.07347 2026-05-19 stat.ML cs.LG math.PR 版本更新

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

面向LLM推理和AI代理的吞吐量最优调度算法

J. G. Dai, Tianze Deng, Yueying Li, Tianyi Peng

发表机构 * School of Operations Research and Information Engineering, Cornell University(康奈尔大学运筹学与信息工程学院) Operations Management, Booth School of Business, University of Chicago(芝加哥大学布斯商学院运营管理系) Decision, Risk and Operations, Columbia Business School, Columbia University(哥伦比亚大学哥伦比亚商学院决策、风险与运营系)

AI总结 本文从排队论角度研究了LLM推理系统的吞吐量优化问题,证明了工作保持调度算法在DAG和Fork-Join路由拓扑中能实现最大吞吐量,并揭示了批量处理网络中K-FCFS调度的流极限框架,评估了Orca和Sarathi-Serve的吞吐量最优性,同时指出批量大小限制和循环路由拓扑对吞吐量的影响。

详情
AI中文摘要

随着大型语言模型(LLM)和AI代理的需求迅速增长,优化高效LLM推理系统变得至关重要。尽管已有大量针对系统级工程的努力,但从数学建模和排队视角进行探索的却很少。本文开发了LLM推理的排队基础。特别地,我们研究了LLM推理系统的吞吐量方面。我们证明了一类广泛的'工作保持'调度算法在单个请求和AI代理工作负载中都能实现最大吞吐量,建立了'工作保持'作为从业者的关键设计原则。技术上,我们开发了在K-FCFS调度下的多类批量处理网络的流极限框架,这可能具有独立价值。对实际系统的评估证实Orca和Sarathi-Serve是吞吐量最优的,使从业者放心,而FasterTransformer和原生vLLM则不是最大稳定,应谨慎使用。我们的分析还揭示了诸如批量大小限制和循环路由拓扑等约束如何复杂化吞吐量的图景,指向排队论与LLM系统设计交汇处丰富的开放问题。

英文摘要

As demand for Large Language Models (LLMs) and AI agents grows rapidly, optimizing systems for efficient LLM inference becomes critical. While significant efforts have targeted system-level engineering, little has been explored from a mathematical modeling and queueing perspective. In this paper, we develop the queueing fundamentals for LLM inference. In particular, we study the throughput aspect of LLM inference systems. We prove that a large class of "work-conserving" scheduling algorithms achieve maximum throughput for both individual requests and AI-agent workloads with directed acyclic graph (DAG) and fork-join routing topologies, establishing "work-conserving" as a key design principle for practitioners. Technically, we develop a fluid-limit framework for multi-class batched processing networks under $K$-FCFS scheduling, which may be of independent interest. Evaluations of real-world systems confirm that Orca and Sarathi-Serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our analysis also reveals how constraints such as batch size limits and cyclic routing topologies complicate the throughput picture, pointing to rich open questions at the intersection of queueing theory and LLM system design.

2503.14800 2026-05-19 cs.IR cs.AI cs.LG 版本更新

Long Context Modeling with Ranked Memory-Augmented Retrieval

基于排名记忆增强检索的长上下文建模

Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Flora D. Salim, Imran Razzak

发表机构 * University of New South Wales(新南威尔士大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) University of Southampton(南安普顿大学) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出了一种增强的排名记忆增强检索框架,通过对记忆条目进行动态排名并借鉴排序学习(learning-to-rank)技术,提升语言模型在长上下文任务中的性能和可扩展性。

详情
AI中文摘要

有效管理长期记忆对于语言模型处理扩展上下文至关重要。我们提出增强的排名记忆增强检索(ERMAR)框架,该框架根据相关性对记忆条目进行动态排名。与先前模型不同,ERMAR采用了一种新颖的相关性评分机制和一个针对键值嵌入的逐点(pointwise)重排序模型,其灵感来自信息检索中的排序学习(learning-to-rank)技术。通过整合历史使用模式和自适应检索,ERMAR在标准基准上取得了最先进的结果,在长上下文任务中展现出更优的可扩展性和性能。

英文摘要

Effective long-term memory management is crucial for language models handling extended contexts. We introduce the Enhanced Ranked Memory Augmented Retrieval (ERMAR) framework, which dynamically ranks memory entries based on relevance. Unlike prior models, ERMAR employs a novel relevance scoring mechanism and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques in information retrieval. By integrating historical usage patterns and adaptive retrieval, ERMAR achieves state-of-the-art results on standard benchmarks, demonstrating superior scalability and performance in long-context tasks.

2503.02161 2026-05-19 cs.LG 版本更新

LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion

LLM-TabLogic: 通过提示引导的潜在扩散模型在合成表格数据中保留列间逻辑关系

Yunbo Long, Liming Xu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 本文提出LLM-TabLogic方法,利用大语言模型推理捕捉表格列间的复杂逻辑关系,并通过Score-based Diffusion模型在潜在空间中生成数据,以在不需领域知识的情况下有效保持合成表格数据中的列间关系。

详情
AI中文摘要

合成表格数据越来越多地被用来替代真实数据,作为一种同时保护隐私和解决数据稀缺问题的有效解决方案。然而,除了保持全局统计属性外,合成数据集还必须维持领域特定的逻辑一致性——特别是在供应链等复杂系统中,诸如运输日期、位置和产品类别等字段必须保持逻辑一致性以确保现实应用。现有生成模型往往忽视这些列间关系,导致现实应用中不可靠的合成表格数据。为了解决这些挑战,我们提出了LLM-TabLogic,一种新颖的方法,利用大语言模型推理来捕捉和压缩表格列间的复杂逻辑关系,同时这些条件约束被传递到Score-based Diffusion模型中,在潜在空间中进行数据生成。通过在真实工业数据集上的广泛实验,我们评估了LLM-TabLogic在列推理和数据生成中的表现,将其与SMOTE和最先进的生成模型等五个基线进行比较。我们的结果表明,LLM-TabLogic在逻辑推理方面具有强大的泛化能力,在未见过的表格上实现了超过90%的准确率。此外,我们的方法在数据生成方面优于所有基线,通过完全保留列间关系的同时保持数据保真度、实用性和隐私的最佳平衡。本研究提出了首个在不需领域知识的情况下有效保持合成表格数据中列间关系的方法,为创建逻辑一致的现实表格数据提供了新的见解。代码可在https://github.com/Yunbo-max/TabKG获取。

英文摘要

Synthetic tabular data are increasingly being used to replace real data, serving as an effective solution that simultaneously protects privacy and addresses data scarcity. However, in addition to preserving global statistical properties, synthetic datasets must also maintain domain-specific logical consistency, especially in complex systems like supply chains, where fields such as shipment dates, locations, and product categories must remain logically consistent for real-world usability. Existing generative models often overlook these inter-column relationships, leading to unreliable synthetic tabular data in real-world applications. To address these challenges, we propose LLM-TabLogic, a novel approach that leverages Large Language Model reasoning to capture and compress the complex logical relationships among tabular columns, while these conditional constraints are passed into a Score-based Diffusion model for data generation in latent space. Through extensive experiments on real-world industrial datasets, we evaluate LLM-TabLogic for column reasoning and data generation, comparing it with five baselines including SMOTE and state-of-the-art generative models. Our results show that LLM-TabLogic demonstrates strong generalization in logical inference, achieving over 90% accuracy on unseen tables. Furthermore, our method outperforms all baselines in data generation by fully preserving inter-column relationships while maintaining the best balance between data fidelity, utility, and privacy. This study presents the first method to effectively preserve inter-column relationships in synthetic tabular data generation without requiring domain knowledge, offering new insights for creating logically consistent real-world tabular data. The code is available at https://github.com/Yunbo-max/TabKG.

2503.02087 2026-05-19 cs.RO cs.LG cs.SY eess.SY 版本更新

Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection

基于Dempster-Shafer理论的LiDAR传感器目标检测SOTIF相关用例中的不确定性表示

Milin Patel, Rolf Jung

发表机构 * Institute for Driver Assistance and Connected Mobility(驾驶员辅助与车联网研究所) Kempten University of Applied Sciences(肯普滕应用科学大学)

AI总结 本文提出了一种系统的方法,利用Dempster-Shafer理论构建判定框架,以表示LiDAR传感器目标检测中的不确定性,并通过方差敏感性分析量化和优先处理这些不确定性,以确保自动驾驶场景的安全性。

Comments Submitted as an extended paper of the Vehicle Technology and Intelligent Transport Systems (VEHITS) 2024 conference; to be published by Springer in a CCIS Series book later in 2025

详情
AI中文摘要

LiDAR传感器目标检测中的不确定性源于环境变化和传感器性能限制。表示这些不确定性对于确保预期功能安全(SOTIF)至关重要,SOTIF旨在防止自动驾驶场景中的危险。本文提出了一种系统的方法,用于识别、分类和表示LiDAR目标检测中的不确定性。Dempster-Shafer理论(DST)被用于构建判定框架(FoD)以表示检测结果。基于识别的不确定性来源之间的依赖性,应用条件基本概率分配(BPAs)。Yager的证据组合规则用于解决多个来源的冲突证据,提供一个结构化的框架来评估不确定性对检测准确性的影响。研究应用方差基于敏感性分析(VBSA)来量化和优先处理不确定性,详细说明其对检测性能的具体影响。

英文摘要

Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is employed to construct a Frame of Discernment (FoD) to represent detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among identified uncertainty sources. Yager's Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework to evaluate uncertainties' effects on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.
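The combination step mentioned in the abstract can be illustrated with a minimal sketch of Yager's rule, which assigns the mass of conflicting evidence to the whole frame of discernment instead of renormalising it away. The two-element frame, the source names, and the mass values below are hypothetical and are not taken from the paper.

```python
from itertools import product

def yager_combine(m1, m2, frame):
    """Combine two basic probability assignments (BPAs) with Yager's rule.

    m1, m2: dicts mapping frozenset focal elements to masses summing to 1.
    frame:  the frame of discernment (all possible outcomes).
    """
    frame = frozenset(frame)
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence supports the intersection of the focal elements.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint focal elements contribute to the conflict mass.
            conflict += wa * wb
    # Yager's rule: conflict is treated as ignorance and added to the frame.
    combined[frame] = combined.get(frame, 0.0) + conflict
    return combined

# Hypothetical detection outcomes and evidence sources.
FRAME = {"detected", "missed"}
THETA = frozenset(FRAME)
m_weather = {frozenset({"detected"}): 0.7, THETA: 0.3}
m_range = {frozenset({"missed"}): 0.6, THETA: 0.4}

combined = yager_combine(m_weather, m_range, FRAME)
for focal, mass in combined.items():
    print(sorted(focal), round(mass, 2))
```

In this toy case the two sources disagree strongly, so most of the combined mass ends up on the full frame, which reads as "unknown" rather than as support for either outcome.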

2502.04055 2026-05-19 cs.LG 版本更新

Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation

评估合成表格数据生成中列之间的逻辑关系

Yunbo Long, Liming Xu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) The Alan Turing Institute, London(伦敦艾伦·图灵研究所)

AI总结 本文提出三种评估指标,用于评估合成表格数据中列间逻辑关系的保持情况,并通过实验证明现有方法在保持逻辑一致性方面存在不足,讨论了改进逻辑关系建模的可能路径。

详情
AI中文摘要

当前对合成表格数据的评估主要集中在联合分布建模的质量上,往往忽略了其在保持真实事件序列和列间一致实体关系方面的有效性。本文提出了三种评估指标,用于评估合成表格数据中列间逻辑关系的保持情况。我们通过在真实工业数据集上评估经典和最新生成方法的性能来验证这些指标。实验结果表明,现有方法往往无法严格保持逻辑一致性(例如地理或组织中的层级关系)和依赖性(例如时间序列或数学关系),这些对于保持真实世界表格数据的细粒度真实性至关重要。基于这些见解,本文还讨论了在建模合成表格数据分布时更好地捕捉逻辑关系的可能路径。代码可在https://github.com/Yunbo-max/TabLogicEval获取。

英文摘要

Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking how well the data preserve realistic event sequences and coherent entity relationships across columns. This paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset. Experimental results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data. The code is available at https://github.com/Yunbo-max/TabLogicEval.

2411.03936 2026-05-19 cs.LG stat.ML 版本更新

GUIDE-VAE: Advancing Data Generation with User Information and Pattern Dictionaries

GUIDE-VAE:利用用户信息和模式词典推进数据生成

Kutay Bölat, Simon Tindemans

发表机构 * Department of Electrical Sustainable Energy(电气可持续能源系)

AI总结 本文提出GUIDE-VAE,一种基于用户嵌入和模式词典的生成模型,通过整合用户信息和复杂特征依赖性,提升多用户数据集下的生成性能和样本真实性。

详情
AI中文摘要

多用户数据集的生成建模在科学和工程中变得突出。生成特定用户的样本需要利用用户信息,而传统生成模型,包括变分自编码器(VAEs),通常忽略这一点。本文介绍了GUIDE-VAE,一种新的条件生成模型,利用用户嵌入生成用户引导的数据。通过利用用户之间的共享模式,GUIDE-VAE在多用户设置中提升了性能,即使在数据不平衡显著的情况下。除了整合用户信息外,GUIDE-VAE还采用基于模式词典的协方差组成(PDCC)来提高生成样本的真实性和捕捉复杂特征依赖性。虽然用户嵌入推动了性能提升,但PDCC解决了VAEs中常见的噪声和过平滑问题。所提出的GUIDE-VAE在具有显著用户数据不平衡的多用户智能电表数据集上进行了评估。定量结果表明,GUIDE-VAE在合成数据生成和缺失记录填补任务中表现良好,而定性评估表明其生成的数据更加合理且噪声更少。这些结果确立了GUIDE-VAE作为多用户数据集可控、真实数据生成的有前景工具,具有跨领域应用的潜力。

英文摘要

Generative modelling of multi-user datasets has become prominent in science and engineering. Generating a data point for a given user requires employing user information, and conventional generative models, including variational autoencoders (VAEs), often ignore this. This paper introduces GUIDE-VAE, a novel conditional generative model that leverages user embeddings to generate user-guided data. By leveraging shared patterns across users, GUIDE-VAE improves performance in multi-user settings, even under significant data imbalance. In addition to integrating user information, GUIDE-VAE incorporates a pattern dictionary-based covariance composition (PDCC) to improve the realism of generated samples by capturing complex feature dependencies. While user embeddings drive performance gains, PDCC addresses common issues such as noise and over-smoothing typically seen in VAEs. The proposed GUIDE-VAE was evaluated on a multi-user smart meter dataset characterised by substantial data imbalance across users. Quantitative results show that GUIDE-VAE performs effectively on both synthetic data generation and missing-record imputation tasks, while qualitative evaluations indicate that it produces more plausible and less noisy data. These results establish GUIDE-VAE as a promising tool for controlled, realistic data generation in multi-user datasets, with potential applications across domains that require user-informed modelling.

2410.13846 2026-05-19 cs.CL cs.AI cs.LG 版本更新

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer: 你的长上下文LLM实际上是一个具有轻松适应能力的混合模型

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

发表机构 * Singapore Management University(新加坡管理大学) National University of Singapore(新加坡国立大学) Sea AI Lab, Singapore(新加坡海智实验室)

AI总结 本文提出LightTransfer方法,通过将LLaMA等模型转换为混合架构,实现更高效的生成,实验表明在长上下文理解任务中,即使有半数层被识别为懒层,也能在性能损失小于1.5%的情况下提升2.17倍的吞吐量,并在数学基准AIME24上达到53.3%的分数。

Comments Accepted by TMLR 2025

详情
AI中文摘要

将语言模型扩展到处理更长上下文,会因键值(KV)缓存成本增加而带来显著的内存挑战。受混合模型的效率提升和预训练大型Transformer骨干网络的广泛可用性启发,我们探索将Transformer模型转换为混合架构以实现更高效的生成。在本工作中,我们提出了LightTransfer,一种轻量级方法,将LLaMA等模型转换为混合变体。我们的方法识别出懒层(即那些专注于最近或初始token的层),并将它们的完整注意力替换为流式注意力。这种转换可以在无需任何训练的情况下用于长上下文理解任务,或在需要更强推理能力的o1-like长推理生成任务中进行最小微调。在多样化的基准和模型(如LLaMA、Mistral、QwQ-STILL)上的实验表明,即使有半数层被识别为懒层,LightTransfer在性能损失很小(在LongBench上小于1.5%)的情况下,也能实现高达2.17倍的吞吐量提升,并且应用于先进的o1-like长推理模型QwQ-STILL时,在数学基准AIME24上达到53.3%。

英文摘要

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers (those focusing on recent or initial tokens) and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and, applied to the advanced o1-like long reasoning model QwQ-STILL, achieves 53.3% on the math benchmark AIME24.
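To make "attention restricted to initial and recent tokens" concrete, here is a small sketch of one possible attention mask. It assumes that streaming attention keeps a few initial "sink" tokens plus a sliding window of recent tokens. The function name and the parameters `n_sink` and `window` are illustrative and do not come from the paper.

```python
import numpy as np

def streaming_mask(seq_len, n_sink, window):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Each query sees the first `n_sink` tokens and the `window` most recent
    tokens (including itself), never any future token.
    """
    q = np.arange(seq_len)[:, None]   # query positions, as a column
    k = np.arange(seq_len)[None, :]   # key positions, as a row
    causal = k <= q                   # no attention to future tokens
    is_sink = k < n_sink              # initial tokens stay visible
    is_recent = (q - k) < window      # sliding window of recent tokens
    return causal & (is_sink | is_recent)

mask = streaming_mask(seq_len=8, n_sink=2, window=3)
print(mask.astype(int))
```

Under this assumption, each row has at most `n_sink + window` active entries regardless of sequence length, so the KV cache for such a layer stays bounded.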

2410.04941 2026-05-19 cs.LG cs.AI 版本更新

TOAST: Transformer Optimization using Adaptive and Simple Transformations

TOAST: 使用自适应和简单变换的Transformer优化

Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) Sapienza University of Rome(罗马萨皮恩扎大学) University of Fribourg(弗里堡大学)

AI总结 本文提出TOAST框架,通过利用Transformer内部的冗余性,用轻量级闭式映射(如线性变换或身份函数)近似整个Transformer块,从而在不额外训练的情况下减少参数和计算量,同时保持甚至提升下游性能。

Comments 33 pages, 16 figures, 22 tables

详情
AI中文摘要

基础模型在不同任务上实现了最先进的性能,但其规模和计算需求引发了关于可访问性和可持续性的担忧。现有的效率方法通常需要额外的重新训练或微调,限制了其实用性。最近的研究发现,深度神经网络表现出内部表示的相似性。虽然这种相似性已被用于启用技术如模型缝合和合并,但网络内部的冗余性仍较少被用作效率提升的来源。在本文中,我们介绍了Transformer优化使用自适应和简单变换(TOAST),一个框架利用这些冗余性,用轻量级闭式映射(如线性变换或甚至身份函数)近似整个Transformer块,而无需任何额外训练。在最先进的预训练视觉模型(如ViT、DINOv2、DeiT)和从MNIST到ImageNet-1k的各类数据集上,TOAST在减少参数和计算量的同时,保持并有时提升下游性能。这些结果表明,Transformer深度的大部分可以被简单函数替代,为高效基础模型提供了新的视角。

英文摘要

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.
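As a rough illustration of a "closed-form linear mapping" replacing a block, the sketch below fits a linear map from a block's inputs to its outputs by ordinary least squares. The choice of least squares and the synthetic activations are assumptions made for illustration. The abstract does not specify how TOAST estimates its mappings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 512, 16

# Synthetic stand-ins for the activations entering and leaving one block.
X = rng.normal(size=(n_tokens, d))
W_true = rng.normal(size=(d, d))
Y = X @ W_true + 0.01 * rng.normal(size=(n_tokens, d))

# Closed-form least-squares fit: W_hat minimises ||X @ W - Y||_F.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Relative reconstruction error of the linear surrogate.
rel_err = np.linalg.norm(X @ W_hat - Y) / np.linalg.norm(Y)
print(f"relative error: {rel_err:.4f}")
```

If the relative error on held-out activations is small, the block could in principle be swapped for the matrix `W_hat`. A near-identity `W_hat` would suggest the block can simply be skipped.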

2410.02064 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

对Llama3-8b-Instruct自生成文本识别能力的检查与控制

Christopher Ackerman, Nina Panickssery

AI总结 本研究探讨了LLM是否能识别自身生成的文本,发现Llama3-8b-Instruct模型能够区分自身输出与人类输出,并通过残差流中的特定向量控制其行为和感知,揭示了模型自我归属的认知机制。

Comments 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
AI中文摘要

已报告LLM能够识别其自身生成的文本,这可能对AI安全有重要影响,但研究较少。我们调查这一现象,以确定其在行为层面是否稳健发生,观察行为是如何实现的,以及是否可以控制。首先,我们发现Llama3-8b-Instruct聊天模型(而非基础Llama3-8b模型)能够可靠地区分自身输出与人类输出,并提供证据表明聊天模型很可能利用其在训练后对自身输出的经验来完成文本识别任务。其次,我们识别出残差流中一个在模型正确识别自身生成文本时被差异激活的向量,证明该向量对自我归属相关信息的响应,并提供证据表明该向量与模型中的“自我”概念相关,并展示该向量与模型感知和声明自我归属能力的因果关系。最后,我们证明该向量可用于控制模型的行为和感知,通过将其应用于模型生成输出时,可引导模型声称或否认作者身份;通过将其应用于模型阅读的文本时,可引导模型相信或不相信其写了任意文本。

英文摘要

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

2410.01223 2026-05-19 stat.CO cs.LG 版本更新

Statistical Taylor Expansion: A New and Path-Independent Method for Uncertainty Analysis

统计泰勒展开:一种新的、路径无关的不确定性分析方法

Chengpu Wang

发表机构 * Grossman Street, Melville, NY 11747, USA(美国纽约州梅尔维尔市格罗斯曼街,邮编11747)

AI总结 本文提出了一种新的路径无关的不确定性分析方法,通过将精确输入变量替换为具有已知分布和样本数的随机变量,计算每个结果的均值、偏差和可靠因子,从而实现对输入不确定性的传播追踪,使最终结果成为路径无关的,与传统数学方法不同。

Comments 47 pages, 40 figures

详情
AI中文摘要

作为一种严谨的统计方法,统计泰勒展开扩展了传统泰勒展开,通过将精确输入变量替换为具有已知分布和样本数的随机变量来计算每个结果的均值、偏差和可靠因子。它通过中间步骤追踪输入不确定性的传播,使最终的解析结果成为路径无关的。因此,它与传统数学方法根本不同,后者为每项计算优化计算路径。统计泰勒展开可能为解析表达式的数值计算提供标准化方法。本研究还介绍了称为方差算术的统计泰勒展开的实现,并在广泛的数学应用中展示了相应测试结果。此外,本研究还得出一个重要结论,即库函数中的数值误差可能显著影响结果。理想情况下,每个库函数的值都应通过不确定性偏差来完成。此外,统计泰勒展开与量子物理之间的可能联系也进行了讨论。

英文摘要

As a rigorous statistical approach, statistical Taylor expansion extends the conventional Taylor expansion by replacing precise input variables with random variables of known distributions and sample counts to compute the mean, the deviation, and the reliable factor of each result. It tracks the propagation of the input uncertainties through intermediate steps, so that the final analytic result becomes path independent. Therefore, it differs fundamentally from common approaches in applied mathematics that optimize the computational path for each calculation. Statistical Taylor expansion may standardize numerical computations for analytic expressions. This study also introduces an implementation of statistical Taylor expansion, termed variance arithmetic, and presents corresponding test results across a wide range of mathematical applications. Another important conclusion of this study is that numerical errors in library functions can significantly affect results. It is desirable that each value from library functions be accompanied by an uncertainty deviation. The possible link between statistical Taylor expansion and quantum physics is discussed as well.

2406.09241 2026-05-19 math.OC cs.LG math.PR stat.ML 版本更新

What is the long-run distribution of stochastic gradient descent? A large deviations analysis

随机梯度下降的长期分布是什么?一种大偏差分析

Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos

发表机构 * Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LJK(格勒诺布尔阿尔卑斯大学,法国国家科学研究中心,法国国家信息与自动化技术研究院,格勒诺布尔理工大学,LJK研究所) Institut de Mathématiques de Toulouse, Université de Toulouse, CNRS, UPS(图卢兹数学研究所,图卢兹大学,法国国家科学研究中心,保罗·萨巴蒂尔大学)

AI总结 本文研究了在一般非凸问题中随机梯度下降(SGD)的长期分布。通过基于大偏差理论和随机扰动动力系统的方法,作者发现SGD的长期分布类似于平衡态热力学的玻尔兹曼-吉布斯分布,其中温度等于方法的步长大小,能量水平由问题的目标函数和噪声统计决定。研究还发现,在长期中,(a)问题的临界区域比任何非临界区域被访问的次数指数级更多;(b)SGD的迭代结果在问题的最低能量状态上指数级集中(该状态不总是对应于目标函数的全局最小值);(c)所有其他临界点的连通分量被访问的频率与它们的能量水平呈指数比例关系;最后,(d)任何局部极大值或鞍点的连通分量都被局部最小值的连通分量所主导,后者被访问的次数指数级更多。

Comments 71 pages, 3 figures; presented in ICML 2024

详情
AI中文摘要

在本文中,我们研究了随机梯度下降(SGD)在一般非凸问题中的长期分布。具体而言,我们试图了解SGD更可能访问问题状态空间的哪些区域,以及程度如何。通过基于大偏差理论和随机扰动动力系统的方法,我们证明SGD的长期分布类似于平衡态热力学的玻尔兹曼-吉布斯分布,其中温度等于方法的步长大小,能量水平由问题的目标函数和噪声的统计特性决定。特别地,我们证明在长期中,(a)问题的临界区域比任何非临界区域被访问的次数指数级更多;(b)SGD的迭代结果在问题的最低能量状态上指数级集中(该状态不总是对应于目标函数的全局最小值);(c)所有其他临界点的连通分量被访问的频率与它们的能量水平呈指数比例关系;最后,(d)任何局部极大值或鞍点的连通分量都被局部最小值的连通分量所主导,后者被访问的次数指数级更多。

英文摘要

In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
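The Boltzmann-Gibbs analogy in the abstract can be written schematically as follows. The symbols are placeholders introduced here, not the paper's notation: $\eta$ is the step-size (playing the role of temperature), $E(x)$ is an energy determined jointly by the objective and the noise statistics (so it need not equal the objective itself), and $\pi_\eta$ is the long-run distribution of the iterates.

```latex
\[
  \pi_\eta(x) \;\propto\; \exp\!\left(-\frac{E(x)}{\eta}\right),
  \qquad
  \frac{\pi_\eta(\mathcal{K}_i)}{\pi_\eta(\mathcal{K}_j)}
  \;\approx\; \exp\!\left(-\frac{E_i - E_j}{\eta}\right),
\]
```

Here $\mathcal{K}_i$ and $\mathcal{K}_j$ stand for two connected components of critical points with energy levels $E_i$ and $E_j$. Read this way, a small step-size sharpens the concentration on the minimum-energy component, which matches statements (b) and (c). The precise form and constants are given in the paper, and this display is only a heuristic reading of the abstract.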

2405.19189 2026-05-19 cs.LG 版本更新

DyDiff: Long-Horizon Rollout via Dynamics Diffusion for Offline Reinforcement Learning

DyDiff: 通过动力学扩散实现离线强化学习中的长周期 rollout

Hanye Zhao, Xiaoshen Han, Zhengbang Zhu, Minghuan Liu, Yong Yu, De-Chuan Zhan, Weinan Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China(上海交通大学计算机科学学院) Department of Computer Science, Nanjing University, Nanjing 210093, China(南京大学计算机科学系)

AI总结 本文提出DyDiff,一种通过动力学扩散模型实现离线强化学习中长周期轨迹生成的方法,通过迭代注入学习策略信息,解决行为策略与学习策略不一致的问题,提升长周期rollout的准确性。

Comments 18 pages, 10 figures, 9 tables. The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-026-52028-5

详情
AI中文摘要

随着扩散模型(DMs)在生成逼真合成视觉数据方面的巨大成功,许多研究者探索其在决策和控制中的潜力。大多数工作利用DMs直接从轨迹空间采样,其中DMs可视为动力学模型和策略的结合。在本工作中,我们探讨如何在完全离线设置中解耦DMs作为动力学模型的能力,使学习策略能够生成轨迹。由于DMs从数据集中学习数据分布,其内在策略实际上是数据集诱导的行为策略,导致行为策略与学习策略之间存在不匹配。我们提出Dynamics Diffusion,简称DyDiff,可以迭代地将学习策略的信息注入DMs中。DyDiff在保持策略一致性的同时确保长周期rollout的准确性,并且可以轻松部署在无模型算法上。我们提供了理论分析,证明DMs在长周期rollout上的优势优于其他模型,并在离线强化学习的上下文中验证了DyDiff的有效性,其中提供了一个rollout数据集但没有交互环境。

英文摘要

With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works use DMs to sample directly from the trajectory space, where DMs can be viewed as a combination of dynamics models and policies. In this work, we explore how to decouple DMs' ability as dynamics models in fully offline settings, allowing the learning policy to roll out trajectories. As DMs learn the data distribution from the dataset, their intrinsic policy is actually the behavior policy induced from the dataset, which results in a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion (DyDiff for short), which iteratively injects information from the learning policy into DMs. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency and can be easily deployed on model-free algorithms. We provide theoretical analysis to show the advantage of DMs over other models on long-horizon rollouts and demonstrate the effectiveness of DyDiff in the context of offline reinforcement learning, where a rollout dataset is provided but no online environment is available for interaction.

2405.06415 2026-05-19 stat.ML cs.LG 版本更新

Generalization analysis with deep ReLU networks for metric and similarity learning

基于深度ReLU网络的度量与相似性学习的泛化分析

Junyu Zhou, Puyu Wang, Ding-Xuan Zhou

发表机构 * RPTU Kaiserslautern-Landau(凯撒斯劳滕-兰道工业大学) University of Sydney(悉尼大学)

AI总结 本文研究了度量与相似性学习的泛化性能,通过构建结构化的深度ReLU神经网络来近似真实度量,并推导出显式的泛化误差界,首次为该领域提供了明确的泛化分析。

Comments 15 pages, 1 figure

详情
AI中文摘要

尽管度量与相似性学习已从多个理论角度被广泛研究,但对其泛化性能的深入理解仍显不足。本文通过利用真实度量(即目标函数)的特定结构,研究了度量与相似性学习的泛化行为。特别地,通过推导具有hinge损失的度量与相似性学习的真实度量的显式形式,我们构建了一个结构化的深度ReLU神经网络作为真实度量的近似,其近似能力取决于网络复杂度。这里,网络复杂度通过网络深度、非零权重数量和计算单元数量来表征。基于由此类结构化深度ReLU网络构成的假设空间,我们通过仔细控制近似误差和估计误差,建立了度量与相似性学习的超额风险界。通过选择适当的构造假设空间的容量,推导出显式的超额风险率。迄今为止,这是首次为度量与相似性学习提供显式超额风险界的泛化分析。此外,我们还研究了在更一般损失函数下度量与相似性学习的真实度量的性质。实验表明,所提出模型在经验上具有竞争力,并能更好地捕捉底层的相似性结构。

英文摘要

While metric and similarity learning has been extensively studied from several theoretical perspectives, a rigorous understanding of its generalization performance is still lacking. In this paper, we investigate the generalization behavior of metric and similarity learning by exploiting the specific structure of the true metric (i.e., the target function). In particular, by deriving the explicit form of the true metric for metric and similarity learning with the hinge loss, we construct a structured deep ReLU neural network as an approximation of the true metric, whose approximation ability depends on the network complexity. Here, the network complexity is characterized by the network depth, the number of nonzero weights, and the number of computational units. Based on the hypothesis space consisting of such structured deep ReLU networks, we establish excess risk bounds for metric and similarity learning by carefully controlling both the approximation error and the estimation error. An explicit excess risk rate is derived by choosing the proper capacity of the constructed hypothesis space. To the best of our knowledge, this is the first generalization analysis that provides explicit excess risk bounds for metric and similarity learning. In addition, we investigate properties of the true metric for metric and similarity learning under more general loss functions. Experiments show that the proposed model is empirically competitive and better captures the underlying similarity structure.

2402.15058 2026-05-19 math.AT cs.CG cs.LG 版本更新

Mixup Barcodes: Quantifying Geometric-Topological Interactions between Point Clouds

Mixup Barcodes: 量化点云之间几何-拓扑相互作用

Hubert Wagner, Nickolas Arustamyan, Matthew Wheeler, Peter Bubenik

发表机构 * University of Florida(佛罗里达大学) University of Central Florida(中佛罗里达大学)

AI总结 本文提出了一种新的方法,通过结合标准持续同调与图像持续同调,定义了量化形状及其相互作用的新型方法,引入了混合条形码、总混合度和总百分比混合度等统计量,并开发了相关软件工具,用于机器学习中的特征解缠问题。

Comments To appear at SoCG 2026

详情
AI中文摘要

我们结合标准持续同调与图像持续同调,定义了一种新的方法来表征形状及其相互作用。具体来说,我们引入了(1)混合条形码,它捕捉两个任意维度点集之间的几何-拓扑相互作用(混合);(2)简单的汇总统计量,总混合度和总百分比混合度,它们以单个数字量化相互作用的复杂性;(3)用于操作上述内容的软件工具。作为概念验证,我们将此工具应用于机器学习中出现的问题,特别是研究不同类别的嵌入解缠。结果表明,拓扑混合是一种用于低维和高维数据相互作用表征的有用方法。与持续同调的典型用法相比,新工具对拓扑特征的几何位置更敏感,这通常是可取的。

英文摘要

We combine standard persistent homology with image persistent homology to define a novel way of characterizing shapes and interactions between them. In particular, we introduce: (1) a mixup barcode, which captures geometric-topological interactions (mixup) between two point sets in arbitrary dimension; (2) simple summary statistics, total mixup and total percentage mixup, which quantify the complexity of the interactions as a single number; (3) a software tool for playing with the above. As a proof of concept, we apply this tool to a problem arising from machine learning. In particular, we study the disentanglement in embeddings of different classes. The results suggest that topological mixup is a useful method for characterizing interactions for low and high-dimensional data. Compared to the typical usage of persistent homology, the new tool is sensitive to the geometric locations of the topological features, which is often desirable.

2308.06197 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

利用基本特征的深度知识蒸馏进行复杂面部表情识别

Angus Maiden, Bahareh Nakisa

发表机构 * School of Information Technology, Deakin University(迪肯大学信息技术学院)

AI总结 本文提出了一种基于持续学习的方法,通过知识蒸馏和新颖的预测排序记忆重放,实现了复杂面部表情识别的最新状态,能够在少量样本下准确识别新复合表情类别。

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

详情
AI中文摘要

复杂情绪识别是一种认知任务,迄今为止尚未达到与其他处于或高于人类认知水平的任务相同的优秀性能。通过面部表情识别情绪尤其困难,因为人类面部表达的情绪复杂性。为了使机器在复杂面部表情识别方面达到人类的水平,可能需要实时综合知识和理解新概念,就像人类所做的那样。人类能够仅通过少量示例学习新概念,通过从记忆中蒸馏重要信息。受人类认知和学习的启发,我们提出了一种新的持续学习方法,用于复杂面部表情识别,通过在基本表情类别上构建和保留知识,能够使用少量训练样本准确识别新的复合表情类别。在本工作中,我们还使用GradCAM可视化来展示基本和复合面部表情之间的关系。我们的方法通过知识蒸馏和一种新颖的预测排序记忆重放来利用这种关系,实现了复杂面部表情识别持续学习的最新状态,新类别的总体准确率为74.28%。我们还证明了使用持续学习进行复杂面部表情识别的性能远优于非持续学习方法,比最先进的非持续学习方法提高了13.95%。我们的工作也是首次将少样本学习应用于复杂面部表情识别,仅使用每个类别一个训练样本,就实现了100%的准确率,达到了最先进的水平。

英文摘要

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.

2307.12405 2026-05-19 cs.LG 版本更新

Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach

多类流队列网络的最优控制:一种机器学习方法

Dimitris Bertsimas, Cheol Woo Kim

发表机构 * Sloan School of Management, Massachusetts Institute of Technology(麻省理工学院斯隆管理学院) Operations Research Center, Massachusetts Institute of Technology(麻省理工学院运筹学中心)

AI总结 本文提出了一种机器学习方法,用于多类流队列网络(MFQNETs)的最优控制,通过显式且有洞察力的控制策略,证明了存在分段常数最优策略,并通过OCT-H算法学习最优控制策略,实验表明在大规模网络中,该方法在测试集上达到100%的准确率。

详情
AI中文摘要

我们提出了一种机器学习方法,用于多类流队列网络(MFQNETs)的最优控制,该方法提供了显式且有洞察力的控制策略。我们证明了对于MFQNET控制问题存在分段常数最优策略,各段由通过原点的超平面分隔。我们使用最优分类树(OCT-H)来学习MFQNETs的最优控制策略。我们使用MFQNET控制问题的数值解作为训练集,并应用OCT-H来学习显式的控制策略。此外,我们还展示了理论结果和所提出算法可以扩展到具有不确定服务和到达率的鲁棒MFQNETs。我们报告了具有多达33个服务器和99个类别的实验结果,显示所学策略在测试集上达到100%的准确率。虽然OCT-H的离线训练在大型网络中可能需要数天时间,但在线应用只需毫秒级时间。

英文摘要

We propose a machine learning approach to the optimal control of multiclass fluid queueing networks (MFQNETs) that provides explicit and insightful control policies. We prove that a piecewise constant optimal policy exists for MFQNET control problems, with segments separated by hyperplanes passing through the origin. We use Optimal Classification Trees with hyperplane splits (OCT-H) to learn an optimal control policy for MFQNETs. We use numerical solutions of MFQNET control problems as a training set and apply OCT-H to learn explicit control policies. Furthermore, we show that both the theoretical results and the proposed algorithm extend to robust MFQNETs with uncertain service and arrival rates. We report experimental results with up to 33 servers and 99 classes that demonstrate that the learned policies achieve 100% accuracy on the test set. While the offline training of OCT-H can take days in large networks, the online application takes milliseconds.

2307.09575 2026-05-19 cs.SI cs.LG cs.MA eess.SP 版本更新

Causal Influences over Social Learning Networks

社交学习网络中的因果影响

Mert Kayaalp, Ali H. Sayed

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA USI-SUPSI)(达勒莫勒人工智能研究所(IDSIA USI-SUPSI)) École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院(EPFL))

AI总结 本文研究了由社交图连接的智能体之间的时间动态因果影响,分析了社交学习模型和分布式决策协议的动力学,并推导出揭示智能体对之间因果关系的表达式,解释了网络中的影响流动。结果依赖于图拓扑和每个智能体对推理问题的信息水平。基于这些结论,本文提出了一种算法来对智能体的整体影响进行排名,以发现具有高度影响力的智能体,并提供了一种从原始观测数据中学习必要模型参数的方法。结果和所提算法通过考虑合成数据和真实社交媒体数据进行了示例说明。

Comments Accepted to the Journal of Machine Learning Research

详情
AI中文摘要

本文研究了由社交图连接的智能体之间的时间动态因果影响。具体而言,本文分析了社交学习模型和分布式决策协议的动力学,并推导出揭示智能体对之间因果关系的表达式,解释了网络中的影响流动。结果发现依赖于图拓扑和每个智能体对推理问题的信息水平。基于这些结论,本文提出了一种算法来对智能体的整体影响进行排名,以发现具有高度影响力的智能体。此外,本文还提供了一种从原始观测数据中学习必要模型参数的方法。结果和所提算法通过考虑合成数据和真实社交媒体数据进行了示例说明。

英文摘要

This paper investigates causal influences between agents linked by a social graph and interacting over time. In particular, the work examines the dynamics of social learning models and distributed decision-making protocols, and derives expressions that reveal the causal relations between pairs of agents and explain the flow of influence over the network. The results turn out to be dependent on the graph topology and the level of information that each agent has about the inference problem they are trying to solve. Using these conclusions, the paper proposes an algorithm to rank the overall influence between agents to discover highly influential agents. It also provides a method to learn the necessary model parameters from raw observational data. The results and the proposed algorithm are illustrated by considering both synthetic data and real social media data.

2304.10726 2026-05-19 cs.CR cs.ET cs.LG cs.SE 版本更新

Usenix'23 Extended Version: Smart Learning to Find Dumb Contracts

Usenix'23 延伸版:智能学习以发现愚蠢的合约

Tamer Abdelaziz, Aquinas Hobor

发表机构 * National University of Singapore(新加坡国立大学) University College London(伦敦大学学院)

AI总结 本文提出了一种基于神经网络的深度学习漏洞分析器(DLVA),用于分析以太坊智能合约的字节码,通过训练模型来判断字节码,即使监督 oracle 只能判断源代码。DLVA 的训练算法通用且稳健,能够识别被错误标记的合约,并在速度和准确性方面优于其他智能合约漏洞检测器。

Comments arXiv preprint arXiv:2304.10726, 2023

详情
AI中文摘要

我们介绍了基于神经网络的深度学习漏洞分析器(DLVA)用于以太坊智能合约。我们训练 DLVA 来判断字节码,尽管监督 oracle 只能判断源代码。DLVA 的训练算法是通用的:我们扩展了源代码分析到字节码,无需任何手动特征工程、预定义模式或专家规则。DLVA 的训练算法也是稳健的:它克服了 1.25% 的错误标记合约,并且学生超越教师,发现了 Slither 错误标记的漏洞合约。DLVA 比其他智能合约漏洞检测器快得多:DLVA 在 0.2 秒内检查 29 个漏洞,速度提高了 10-1000 倍。DLVA 有三个关键组件。首先,智能合约到向量(SC2V)使用神经网络将智能合约字节码映射到高维浮点向量。我们对 SC2V 进行基准测试,与 4 种最先进的图神经网络进行比较,证明其提高了模型区分度 2.2%。其次,兄弟检测器(SD)在目标合约的向量与训练集中标记合约的向量在欧几里得距离上接近时分类合约;尽管只能判断我们测试集中的 55.7% 的合约,但其 Slither 预测准确率为 97.4%,假阳性率仅为 0.1%。第三,核心分类器(CC)使用神经网络推断易受攻击的合约,无论向量距离。我们对 DLVA 的 CC 进行基准测试,与 10 种 ML 技术进行比较,证明 CC 提高了准确性 11.3%。总体而言,DLVA 预测 Slither 的标签的总体准确率为 92.7%,假阳性率为 7.2%。最后,我们对 DLVA 与九种著名的智能合约分析工具进行了基准测试。尽管分析时间远少,DLVA 完成了所有查询,以平均 99.7% 的准确性领先,很好地平衡了高真阳性率与低假阳性率。

英文摘要

We introduce the Deep Learning Vulnerability Analyzer (DLVA) for Ethereum smart contracts based on neural networks. We train DLVA to judge bytecode even though the supervising oracle can only judge source. DLVA's training algorithm is general: we extend a source code analysis to bytecode without any manual feature engineering, predefined patterns, or expert rules. DLVA's training algorithm is also robust: it overcame a 1.25% error rate of mislabeled contracts, and--the student surpassing the teacher--found vulnerable contracts that Slither mislabeled. DLVA is much faster than other smart contract vulnerability detectors: DLVA checks contracts for 29 vulnerabilities in 0.2 seconds, a 10-1,000x speedup. DLVA has three key components. First, Smart Contract to Vector (SC2V) uses neural networks to map smart contract bytecode to a high-dimensional floating-point vector. We benchmark SC2V against 4 state-of-the-art graph neural networks and show that it improves model differentiation by 2.2%. Second, Sibling Detector (SD) classifies contracts when a target contract's vector is Euclidean-close to a labeled contract's vector in a training set; although only able to judge 55.7% of the contracts in our test set, it has a Slither-predictive accuracy of 97.4% with a false positive rate of only 0.1%. Third, Core Classifier (CC) uses neural networks to infer vulnerable contracts regardless of vector distance. We benchmark DLVA's CC with 10 ML techniques and show that the CC improves accuracy by 11.3%. Overall, DLVA predicts Slither's labels with an overall accuracy of 92.7% and associated false positive rate of 7.2%. Lastly, we benchmark DLVA against nine well-known smart contract analysis tools. Despite using much less analysis time, DLVA completed every query, leading the pack with an average accuracy of 99.7%, pleasingly balancing high true positive rates with low false positive rates.
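The Sibling Detector idea above (classify only when the target vector is close in Euclidean distance to a labeled training vector, otherwise abstain) can be sketched as a thresholded nearest-neighbour lookup. This is a minimal sketch under assumptions: the function name, the `max_dist` threshold, and the toy 2-D vectors are hypothetical, and the abstract does not state the actual decision rule or distance cutoff.

```python
import math


def sibling_detect(target, labeled, max_dist):
    """Return the label of the nearest labeled vector, or None to abstain.

    `labeled` is a list of (vector, label) pairs; `max_dist` is a
    hypothetical closeness threshold (not taken from the paper).
    """
    best_label, best_dist = None, math.inf
    for vec, label in labeled:
        d = math.dist(target, vec)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_dist else None


# Toy 2-D "contract vectors"; real SC2V vectors are high-dimensional.
train = [((0.0, 0.0), "safe"), ((1.0, 1.0), "vulnerable")]
near = sibling_detect((0.9, 1.1), train, max_dist=0.5)
far = sibling_detect((5.0, 5.0), train, max_dist=0.5)
```

A `None` result corresponds to the contracts the detector cannot judge; in the pipeline described above, those would fall through to the Core Classifier, which does not depend on vector distance.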

2605.17362 2026-05-19 cs.LG 版本更新

Learning Fill-in Reduction Ordering via Graph Policy Optimization for Sparse Matrices

通过图策略优化学习稀疏矩阵的填充减少排序

Ziwei Li, Shuzi Niu, Huiyuan Li, Tao Yuan, Wenjia Wu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出了一种图策略优化方法,通过全局和局部视角建模填充,以减少稀疏矩阵求解器中的填充和内存使用,实验表明该方法在SuiteSparse矩阵集合上实现了显著的改进。

Comments Accepted by ICASSP 2026

详情
AI中文摘要

在大型稀疏求解器中,矩阵重新排序旨在寻找一个排列,以最小化分解过程中的填充,从而减少内存和计算开销。由于最小填充排序问题是NP完全的,且填充隐含在稀疏性模式中,因此通常使用图论启发式方法。现有的强化学习方法要么忽略稀疏性模式,从而错过全局填充信息,要么缺乏局部的精确填充反馈。我们提出了一种图策略优化方法,从全局和局部两个视角对填充进行建模:策略网络和价值网络均使用多跳图神经网络骨干来嵌入全局填充;策略进一步与图上的符号分解交互以提取局部的、步骤级的填充,并通过自适应饱和函数将所得反馈与价值网络对齐,以改善收敛性。在SuiteSparse矩阵集合上,相对于最先进的基线,我们的方法实现了填充平均减少29.3、峰值内存使用平均减少31.3。

英文摘要

Matrix reordering in large sparse solvers seeks a permutation that minimizes factorization fill-in to reduce memory and computation. Because the minimum fill-in ordering problem is NP-complete and fill-in is implicit in the sparsity pattern, graph-theoretic heuristics are used. Existing reinforcement learning methods either ignore sparsity patterns--missing the global fill-in--or lack local exact fill-in feedback. We propose a graph policy optimization method, modeling fill-ins from global and local views: both the policy and value networks use a multi-hop graph neural backbone to embed global fill-in; the policy further interacts with symbolic factorization over graphs to extract local, step-level fill-ins, and the resulting feedback is aligned with the value network via an adaptive saturation function to improve convergence. On the SuiteSparse Matrix Collection, our method achieves mean reductions of 29.3 in fill-ins and 31.3 in peak memory usage over state-of-the-art baselines.
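To make "fill-in is implicit in the sparsity pattern" concrete, the standard elimination game on the matrix graph counts the fill-in edges a given ordering produces: eliminating a vertex connects all of its remaining neighbours. The sketch below is a generic illustration of that step-level fill-in count, not the paper's policy or its symbolic factorization code; the function name and the star-graph example are chosen here for illustration.

```python
from itertools import combinations


def count_fill_in(edges, order):
    """Count fill-in edges created by eliminating vertices in `order`.

    `edges` lists the off-diagonal nonzeros of a symmetric matrix as
    undirected pairs; `order` must contain every vertex exactly once.
    """
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    fill = 0
    for v in order:
        nbrs = adj.pop(v)            # neighbours not yet eliminated
        for n in nbrs:
            adj[n].discard(v)        # remove v from the remaining graph
        # Eliminating v makes its remaining neighbours a clique.
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill += 1
    return fill


# Star graph: hub 0 connected to leaves 1..4.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
hub_first = count_fill_in(star, [0, 1, 2, 3, 4])
hub_last = count_fill_in(star, [1, 2, 3, 4, 0])
```

Eliminating the hub first connects all four leaves pairwise, while eliminating the leaves first creates no new edges, which is why the choice of ordering matters so much for memory.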

2605.17361 2026-05-19 cs.LG cs.AI 版本更新

MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

MasFACT:基于几何感知后验转移的连续多智能体拓扑学习

Xuefei Wang, Jialu Wang, Fengbo Zhang, Yihan Hu, Di Zhang, Yutong Ye, Yikun Ban, Jun Han, Ruijie Wang

发表机构 * Beihang University(北京航空航天大学)

AI总结 本文提出MasFACT框架,通过几何感知后验转移方法,解决多智能体系统中因新任务适应导致的拓扑遗忘问题,提升连续学习任务的准确性和拓扑稳定性。

详情
AI中文摘要

多智能体系统(MAS)借助大型语言模型(LLMs)已成为解决复杂问题的强大范式,其性能关键依赖于底层的智能体间通信拓扑。然而,现有拓扑生成方法主要针对孤立任务进行优化,而现实部署涉及连续演化的任务流,要求先前有效的协作模式被保留和重用而非重新发现或覆盖。本文识别出一种此前未被充分探索的失败模式,即拓扑遗忘,其中适应新任务会使拓扑生成器偏离早期任务所需通信结构。该问题源于智能体层面功能语义和关系通信结构的跨任务不一致。为解决这一挑战,我们提出MasFACT,一种几何感知后验转移框架,通过融合Gromov-Wasserstein最优传输在任务特定智能体空间中转移历史协作知识作为可转移拓扑先验,并通过PAC-Bayes引导的保守后验适应在任务特定可塑性与结构稳定性之间取得平衡。在类别级、领域级和任务级连续设置中的实验表明,MasFACT在提升平均准确率的同时减少了拓扑遗忘,相比强大的拓扑生成和重放基线表现更优,并可无缝集成到不同的MAS拓扑生成器中。

英文摘要

Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, topology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose MasFACT, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that MasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.

2605.17347 2026-05-19 cs.CY cs.CV cs.LG 版本更新

Position: Age Estimation Models Do Not Process Biometric Data

立场:年龄估计模型不处理生物特征数据

Nikita Marshalkin

发表机构 * Sumsub GmbH, Berlin, Germany(Sumsub公司,柏林,德国)

AI总结 本文研究了年龄估计模型是否处理生物特征数据,通过实验表明这些模型无法达到身份识别阈值,因此不涉及身份识别,呼吁研究者和监管机构提高透明度。

Comments 11 pages, 3 figures, 3 tables. Accepted as a position paper at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

当神经网络通过照片估计某人的年龄时,它是否在处理生物特征数据?答案取决于网络在推理过程中是否会产生可区分身份的表示。这个问题对机器学习研究人员来说可能显得微不足道,但它会触发GDPR下的同意要求、BIPA下的法定损害赔偿,或欧盟AI法案下的高风险AI分类。然而,目前没有任何监管指导涉及这一问题。本立场论文提供了实证证据:在3个人脸验证基准上评估的14个模型表明,年龄估计器比识别阈值低若干个数量级。年龄估计模型无法识别个体。我们呼吁研究者就系统存储的内容及其能力提供透明度,并呼吁监管机构区分瞬时处理与模板存储。

英文摘要

When a neural network estimates someone's age from a photograph, does it process biometric data? The answer depends on whether identity-discriminative representations arise within the network during inference, a question that may seem trivial to ML researchers but triggers consent requirements under GDPR, statutory damages under BIPA, or high-risk AI classification under the EU AI Act. Yet no regulatory guidance addresses it. This position paper provides empirical evidence: 14 models evaluated across 3 face verification benchmarks show age estimators fall orders of magnitude short of identification thresholds. Age estimation models cannot identify individuals. We call on researchers to provide transparency about what systems store and can do, and on regulators to distinguish transient processing from template storage.

2605.17339 2026-05-19 cs.LG 版本更新

Bridging the Gap between Sparse Matrix Reordering and Factorization: A Deep Learning Framework for Fill-in Reduction

弥合稀疏矩阵重排与分解之间的差距:一种用于填充减少的深度学习框架

Ziwei Li, Tao Yuan, Shuzi Niu, Huiyuan Li

发表机构 * Institute of Software, Chinese Academy of Sciences, Beijing, China(中国科学院软件研究所,北京,中国) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,北京,中国)

AI总结 本文提出一种深度学习框架,通过谱嵌入最小化填充代理函数,弥合稀疏矩阵重排与分解之间的差距,实验表明其性能优于传统图论算法和深度学习方法。

Comments Accepted by DASFAA 2025

详情
AI中文摘要

稀疏矩阵重排可以显著减少矩阵分解过程中的填充量,从而降低稀疏矩阵计算中的计算和存储需求。寻找最小填充量的重排顺序已知是NP难问题。此外,存在一个悖论:矩阵重排在矩阵分解之前进行,但重排方法旨在减少的填充是由矩阵分解产生的。为了弥合重排与分解之间的差距,我们提出了一种深度学习框架,基于谱嵌入最小化填充代理函数。首先,我们采用多网格-like GNN架构来学习近似其图拉普拉斯矩阵的最小特征向量,即谱嵌入,并捕捉矩阵的全局结构信息。然后,另一个多网格-like GNN架构用于基于秩分布最小化潜在的填充空间。实验结果表明,我们的方法在传统图论算法和深度学习方法中表现具有竞争力。

英文摘要

Sparse matrix reordering can significantly reduce the fill-in during matrix factorization, thereby decreasing the computational and storage requirements in sparse matrix computations. Finding a minimal fill-in ordering is known to be an NP-hard problem. Moreover, there is a paradox: matrix reordering is applied before matrix factorization, but fill-ins that matrix reordering methods aim at are generated from matrix factorization. To bridge the gap between reordering and factorization, we propose a deep learning framework to minimize a fill-in surrogate function based on spectral embedding. First, we employ a multi-grid-like GNN architecture to learn to approximate the smallest eigenvectors of its graph Laplacian matrix, i.e. spectral embedding, and capture the global structural information of the matrix. Then, another multi-grid-like GNN architecture is used to minimize the potential space where fill-in can occur based on the rank distribution. Experimental results indicate that our approach achieves competitive performance compared with traditional graph-theoretic algorithms and deep learning methods.

2605.17334 2026-05-19 cond-mat.mtrl-sci cond-mat.stat-mech cs.LG physics.comp-ph 版本更新

Causal Anomaly Detection for Lithium-Ion Battery Degradation

锂离子电池退化中的因果异常检测

Dieter W. Heermann, Hagen Heermann

发表机构 * Institute for Theoretical Physics, Heidelberg University(海德堡大学理论物理研究所) Intilion GmbH(Intilion公司)

AI总结 本研究提出了一种基于因果图发现和k近邻转移熵的框架,用于通过常规循环 telemetry 数据检测锂离子电池退化,并通过三种信号类别包对异常评分进行组织,以提高检测灵敏度。

详情
AI中文摘要

可靠的早期检测锂离子电池退化需要能够物理解释且能从常规循环 telemetry 数据中计算出的健康指标。我们引入了CausalHealth框架,该框架应用因果图发现和k近邻转移熵对每个循环的电压、电流、温度和电阻时间序列进行处理,并将十二个结果异常评分组织成三个信号类别包(幅度位移、预测残差、复杂性熵)——隔离森林被单独报告,因为它低于包的可靠性阈值——以表征在十个校准分数(5-30%)范围内的检测灵敏度。幅度位移类别在所有七个测试的电池上实现了100%的检测率,覆盖LFP(MIT-Stanford MATR)和LCO(NASA PCoE、CALCE CS2)化学体系,其在渐进衰减电池上在传统容量阈值失效前的提前时间可达402个循环。一个可靠性加权主健康指数(RWMHI)——一个跨包融合的五个高可靠性检测器,按逆系数变异率加权——在长寿命电池上将提前时间提高了15-52个循环,同时保持100%的检测率。通过电化学阻抗谱对一个NMC棱柱电池的验证提供了独立的物理基础:转移熵TE(R→V)与电荷转移电阻R_ct相关(汇总r=+0.990;温度控制部分r=+0.898),对两者进行阿伦尼乌斯分析得到的激活能与已发表的NMC电荷转移动力学一致。这些结果在三个基准数据集上的七块电池上进行了评估。

英文摘要

Reliable early detection of lithium-ion battery degradation requires health indicators that are physically interpretable and computable from routine cycler telemetry without access to the degradation region. We introduce CausalHealth, a framework that applies causal graph discovery and $k$-nearest-neighbour transfer entropy to per-cycle voltage, current, temperature, and resistance time series, and organises twelve resulting anomaly scores into three signal-class bundles (Magnitude-shift, Predictive-residual, Complexity-entropy) -- with Isolation Forest reported separately as it falls below the bundle reliability threshold -- to characterise detection sensitivity across ten commissioning fractions (5-30%). The Magnitude-shift class achieves 100% detection across all seven tested cells spanning LFP (MIT-Stanford MATR) and LCO (NASA PCoE, CALCE CS2) chemistries, with a lead time of up to 402 cycles before conventional capacity-threshold failure on gradual-fade cells. A Reliability-Weighted Master Health Index (RWMHI) -- a cross-bundle fusion of five high-reliability detectors weighted by inverse coefficient of variation -- improves lead time by 15-52 cycles over the class median on long-lived cells while maintaining 100% detection. Validation against electrochemical impedance spectroscopy on an NMC prismatic cell provides independent physical grounding: transfer entropy $\mathrm{TE}(R \to V)$ correlates with charge-transfer resistance $R_{\mathrm{ct}}$ (pooled $r = +0.990$; temperature-controlled partial $r = +0.898$), and an Arrhenius analysis of both quantities yields an activation energy consistent with published NMC charge-transfer kinetics. These results are evaluated on seven cells across three benchmark datasets.

2605.17316 2026-05-19 cs.LG cs.AI 版本更新

Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement

从不完整时空数据中学习高阶结构:具有神经细化的多尺度超图拉普拉斯算子

Keshu Wu, Sixu Li, Zihao Li, Zhiwen Fan, Xiaopeng Li, Yang Zhou

发表机构 * Texas A&M University(德克萨斯农工大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出了一种多尺度超图拉普拉斯(MSHL)框架,通过两阶段方法从不完整时空观测中学习高阶结构。该方法通过发现阶段构建多尺度超图,并在细化阶段引入条件残差网络,以处理高阶关系中的残差特征,从而在交通网络中实现了更准确的缺失数据填补。

详情
AI中文摘要

传感器网络日益成为现代基础设施的核心,然而标准填补基准所假设的均匀随机缺失模式往往不适用于实际场景。环形检测器在校准期间会断线,路边柜子会沉默附近传感器的集群,而新安装的仪器则无法提供历史数据。这些故障会产生结构化的缺失,其值受传感器组之间的高阶关系约束,而非仅仅是成对接近性。现有低秩和图方法往往无法捕捉这种集体结构,当缺失性变得一致时可能会失效。本文引入多尺度超图拉普拉斯(MSHL),一种两阶段框架,用于从不完整的时空观测中学习高阶结构。发现阶段通过互补的拓扑和残差相关证据构建多尺度超图,并采用仅基于观测的选取器,适应支持的交互尺度。细化阶段添加一个小型超图条件残差网络,其安全性由构造保证:在存在信息残差特征时学习非线性修正,在不存在时则退化为线性估计。我们证明MSHL可以表示无法被成对图先验捕捉的组内守恒模式,能够适应最佳固定尺度,至多一个对数因子,将这种优势转移到验证的填补误差中,并允许单侧细化保证。在两个真实交通网络上评估,针对散落单元缺失、连续块断电和整个传感器黑箱在五种速率下,MSHL在高阶结构可识别时优于成对图基线,否则在采样噪声范围内匹配。结果表明,可靠的基础设施学习存在更广泛的原则:缺失数据不应被视为孤立的填补条目,而应视为发现结构的证据。

英文摘要

Sensor networks increasingly govern modern infrastructure, yet the data they lose are rarely missing in the uniform-random patterns assumed by standard imputation benchmarks. Loop detectors go offline during calibration, roadside cabinets silence clusters of nearby sensors, and newly installed instruments provide no history. Such failures create structured absences whose values are constrained by higher-order relations among groups of sensors, not merely by pairwise proximity. Existing low-rank and graph-based methods often miss this collective structure and can fail when missingness becomes coherent. We introduce Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. The Discovery stage builds a multi-scale hypergraph from complementary topology and residual-correlation evidence, with an observation-only selector that adapts to the supported interaction scale. The Refinement stage adds a small hypergraph-conditioned residual network that is safe by construction: it learns nonlinear corrections where informative residual features exist and defers to the linear estimate where they do not. We prove that MSHL represents group-conservation patterns inaccessible to pairwise graph priors, adapts to the best fixed scale up to a logarithmic factor, transfers this advantage to held-out imputation error, and admits a one-sided refinement guarantee. On two real traffic networks evaluated across scattered cell missingness, contiguous block outages, and whole-sensor blackouts at five rates, MSHL improves over a pairwise-graph baseline whenever higher-order structure is identifiable and otherwise matches it within sampling noise. The results point to a broader principle for reliable infrastructure learning: missing data should be treated not as isolated entries to fill, but as evidence of structure to discover.

2605.17314 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了通过较小较弱模型的不匹配错误草稿引导更强学习者的能力,发现这种策略在MATH-500和AIME 2025/2026等任务上表现优异,主要贡献是提出了一种有效的训练方法。

详情
AI中文摘要

我们考察能否利用较小、较弱模型的离策略(off-policy)经验,在更强的学习者中引出在策略强化学习微调(如GRPO)无法达到的能力。我们发现,将一个规模更小但领域训练更充分的模型生成的、数学上错误且与当前问题不匹配的草稿注入更强学习者的GRPO上下文,在留出的MATH-500和分布外的AIME 2025/2026上始终优于标准的在策略GRPO。具体来说,我们使用Mathstral-7B作为学习者,Qwen2.5-Math-1.5B作为草稿模型,8.8K道Level 3-5的MATH问题(留出MATH-500),并使用Dr. GRPO进行训练。不匹配是起作用的关键成分:在保持其他条件不变的情况下,将草稿打乱分配给不匹配的问题,使MATH-500的greedy pass@1相对于匹配-错误变体提升1.62pp(n=10个种子,p=0.0015,Welch's t检验)。事实上,在MATH-500上,不匹配-错误变体在greedy pass@1和采样pass@k两项指标上均领先于我们测试的所有其他变体。在分布外的AIME 2025和2026上,只有不匹配-错误变体在从k=1到k=1024的每个样本预算下(2个种子)都将pass@k提升到Mathstral-7B(其原生[INST]格式)和Qwen2.5-Math-1.5B草稿模型之上(在pass@1024上相对于Mathstral-7B,2025年提升14.2pp,2026年提升9.0pp),并且在pass@1024上,它在两个年份也都领先于无草稿、匹配-错误和不匹配-正确变体。所有变体在测试时使用相同的提示,不注入草稿。该方案在单个GPU上训练,无需SFT、奖励模型、合成数据或produce-critique-revise内循环,在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩;据我们所知,这是该模型目前已发表的最高结果,超过了更重的WizardMath流程在完整MATH上的70.9%(SFT + PPO,加过程/指令奖励模型)。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
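The pass@$k$ numbers quoted above are commonly computed with the unbiased combinatorial estimator introduced alongside the HumanEval benchmark: draw $n$ samples per problem, count the $c$ correct ones, and estimate the chance that a random subset of $k$ contains at least one correct sample. The abstract does not say which estimator the author used, so the sketch below is an assumption about the usual convention rather than a description of this paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples has no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical single problem: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

Per-problem estimates are then averaged over the benchmark. With greedy decoding there is a single deterministic sample, so greedy pass@1 reduces to plain accuracy.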

2605.17307 2026-05-19 q-fin.PM cs.AI cs.LG cs.NE q-fin.TR 版本更新

Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets

面向全球股票市场的多样化投资组合管理的深度强化学习框架

Kamil Kashif, Robert Ślepaczuk

发表机构 * Quantitative Finance Research Group, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融研究组,华沙大学) Quantitative Finance Research Group, Department of Quantitative Finance and Machine Learning, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融与机器学习系量化金融研究组,华沙大学)

AI总结 本文提出并评估了一个深度强化学习框架,用于动态分配全球股票市场投资组合,通过比较五种模型配置,探讨了奖励函数、策略结构、投资组合约束和时间编码器对风险调整后表现的影响。

Comments 67 pages, 11 figures, 16 tables

详情
AI中文摘要

本研究开发并评估了一个深度强化学习框架,用于动态分配全球股票市场投资组合。Soft Actor-Critic算法被用于在马尔可夫决策过程中学习连续的投资组合权重,将交易成本、换手惩罚和多样化约束纳入奖励函数中。比较了五种模型配置,这些配置在奖励公式、策略结构(扁平与分层Dirichlet)、投资组合约束和时间编码器(LSTM与Transformer)方面有所不同,并通过走步优化在2003-2026年的纳斯达克100、日经225和欧元 Stoxx 50十六个外样本折上进行了评估。结果表明,强化学习策略在欧元 Stoxx 50市场中实现了有竞争力的风险调整后表现,其中观察到统计显著的异常收益,但核心假设仅部分得到验证:没有策略在HAC稳健推断下相对于持有策略实现统计显著的超额收益。制度分析揭示,强化学习在不确定性升高时期增加价值,而跨市场的集合聚合提高了风险调整后表现,并确认了地理多样化的好处。

英文摘要

This study develops and evaluates a deep reinforcement learning framework for dynamic portfolio allocation across global equity markets. The Soft Actor-Critic algorithm is used to learn continuous portfolio weights within a Markov Decision Process, incorporating transaction costs, turnover penalties, and diversification constraints into the reward function. Five model configurations are compared, varying in reward formulation, policy structure (flat versus hierarchical Dirichlet), portfolio constraints, and temporal encoder (LSTM versus Transformer), and evaluated via walk-forward optimization across sixteen out-of-sample folds spanning 2003-2026 on the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. Results show that RL strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50, where statistically significant abnormal returns are observed, but the central hypothesis is only partially confirmed: no strategy achieves statistically significant excess returns relative to Buy and Hold under HAC-robust inference across all markets. Regime analysis reveals that RL adds the most value during periods of elevated uncertainty, while ensemble aggregation across markets improves risk-adjusted performance and confirms the benefits of geographic diversification.

2605.17304 2026-05-19 cs.LG cs.CL 版本更新

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression

压缩上下文,保持承诺:可验证大语言模型上下文压缩的正式框架

Natalia Trukhina, Vadim Vashkelis

发表机构 * Embedded Intelligence Lab (EMILAB)(嵌入式智能实验室)

AI总结 本文提出Context Codec框架,通过语义层面的压缩方法,确保在压缩对话历史时保留关键承诺,解决现有方法在压缩过程中缺乏对语义承诺保留的明确规范的问题。

详情
AI中文摘要

LLM上下文不仅仅是token;它是一组承诺。长期对话累积了目标、约束、决定、偏好、工具结果、检索到的证据、制品和安全边界,这些必须被未来响应保留。现有上下文管理方法通过截断、检索、摘要、记忆系统或token级提示压缩来减少长度,但很少明确指定哪些语义承诺必须在压缩中保留或如何衡量其保留。我们提出Context Codec,一种基于承诺的框架,用于压缩提示和聊天历史。Context Codec将对话状态表示为具有标准身份、等价性、冲突、置信度、风险和证据跨度的语义原子。它分离了五个关注点——提取、规范化、表示、渲染和验证,并引入了关键原子召回率、加权原子召回率、承诺密度和往返恢复性等指标。它还定义了语义压缩错误的分类学,一个具体的规范化程序,保守的回退规则用于低置信度和安全关键原子,以及Context Compression Language (CCL),一种以ASCII优先的紧凑表示法,用于标准JSON原子。在一项小规模诊断研究中,CCL-Core在结构化的散文和JSON之间占据了一个有用的中间位置:比散文更明确和可审计,通常比JSON更紧凑,且比高度压缩的符号更安全。结果不是声称缩写解决压缩问题,而是一个使上下文压缩可验证的框架:压缩对话,保持承诺。

英文摘要

LLM context is not just tokens; it is a set of commitments. Long-running conversations accumulate goals, constraints, decisions, preferences, tool results, retrieved evidence, artifacts, and safety boundaries that future responses must preserve. Existing context-management methods reduce length through truncation, retrieval, summarization, memory systems, or token-level prompt compression, but they rarely specify which semantic commitments must survive compression or how their preservation should be measured. We propose Context Codec, a commitment-level framework for compressing prompts and chat histories. Context Codec represents dialogue state as typed, source-grounded semantic atoms with canonical identity, equivalence, conflict, confidence, risk, and evidence spans. It separates five concerns - extraction, normalization, representation, rendering, and verification - and introduces metrics for Critical Atom Recall, Weighted Atom Recall, Commitment Density, and round-trip recoverability. It also defines a taxonomy of semantic compression errors, a concrete normalization procedure, conservative fallback rules for low-confidence and safety-critical atoms, and Context Compression Language (CCL), an ASCII-first compact rendering of canonical JSON atoms. In a small diagnostic study, CCL-Core occupies a useful middle ground between structured prose and JSON: more explicit and auditable than prose, usually more compact than JSON, and less risky than heavily minified notation. The result is not a claim that shorthand solves compression, but a framework for making context compression verifiable: compress the conversation, keep the commitments.
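The abstract names Critical Atom Recall and Weighted Atom Recall without defining them. One plausible reading, sketched below purely as an assumption, treats each atom as a record with a canonical identity, an importance weight, and a criticality flag, and measures recall by checking which atom identities survive compression. The `Atom` fields, both formulas, and the example atoms are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Atom:
    atom_id: str     # canonical identity (hypothetical field name)
    kind: str        # e.g. "goal", "constraint", "preference"
    weight: float    # importance weight (hypothetical)
    critical: bool   # must survive compression


def critical_atom_recall(original: Iterable[Atom], kept_ids: Set[str]) -> float:
    """Fraction of critical atoms whose identity survives compression."""
    critical = [a for a in original if a.critical]
    if not critical:
        return 1.0
    return sum(a.atom_id in kept_ids for a in critical) / len(critical)


def weighted_atom_recall(original: Iterable[Atom], kept_ids: Set[str]) -> float:
    """Share of total atom weight that survives compression."""
    atoms = list(original)
    total = sum(a.weight for a in atoms)
    if total == 0:
        return 1.0
    return sum(a.weight for a in atoms if a.atom_id in kept_ids) / total


history = [
    Atom("g1", "goal", 3.0, critical=True),
    Atom("c1", "constraint", 5.0, critical=True),
    Atom("p1", "preference", 2.0, critical=False),
]
kept = {"g1", "c1"}  # the compressed context dropped the preference
car = critical_atom_recall(history, kept)
war = weighted_atom_recall(history, kept)
```

Under these assumed definitions, dropping a low-weight preference leaves critical recall at 1.0 while weighted recall falls to 0.8, which shows why the two metrics are reported separately.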

2605.17295 2026-05-19 cs.LG cs.CL 版本更新

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

DISA: 分布匹配强化学习中的离线重要性采样

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Qwen Team, Alibaba Group(阿里集团Qwen团队) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The University of Science and Technology of China(中国科学技术大学) Nanjing University(南京大学) The University of Hong Kong(香港大学)

AI总结 本研究提出DISA方法,通过离线重要性采样解决分布匹配强化学习中的校准问题,分离了分区函数估计与策略学习,提高了策略多样性并在多个基准测试中表现出色。

Comments 21 pages, 7 figures, 7 tables. Abstract shortened to respect the arXiv limit of 1920 characters. Please see the PDF for the full abstract

详情
AI中文摘要

现代推理代理越来越多地被评估其在给定输入下生成多个有效解决方案路径、计划或工具使用轨迹的能力。标准奖励最大化强化学习倾向于崩溃到最容易强化的高奖励模式,而分布匹配强化学习旨在在整个奖励形状的解决方案集中分配概率质量。实现这一目标需要计算轨迹空间中依赖提示的分区函数。由于现有分布匹配方法在线学习这个分区函数,导致分区函数的校准误差直接扭曲策略更新且无法独立诊断。我们引入DISA(Decoupled Importance-Sampled Anchoring),通过离线绘制提案轨迹、通过重要性采样估计分区函数,并在策略优化开始前冻结所得的分区函数估计。这种解耦保持了分布匹配目标,同时严格分离分区函数估计与策略学习在数据、梯度、损失和诊断方面。实验表明,在六个数学和三个代码基准测试上,DISA与在线耦合的分布匹配基线FlowRL持平或超过,优于奖励最大化基线GRPO和GSPO在数学平均表现,并在相同离线轨迹上超过LoRASFT蒸馏方法多达13.8 Mean@8点。LLM-as-judge评估进一步显示DISA比奖励最大化基线保留了显著更多的策略多样性,提案强度和逆温度的敏感性研究遵循分析预测的偏差-方差模式。

英文摘要

Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
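The offline step described above, estimating a prompt-dependent partition function by importance sampling, can be sketched in a few lines. The sketch assumes a target of the form $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp(\beta\, r(x, y))$, so that $Z(x) = \mathbb{E}_{y \sim q}\big[\tfrac{\pi_{\mathrm{ref}}(y \mid x)}{q(y \mid x)} \exp(\beta\, r(x, y))\big]$ for a proposal $q$. That target form, the function name, and the toy numbers are assumptions made for illustration; the abstract does not give DISA's exact objective.

```python
import math


def log_partition_estimate(log_p_ref, log_q, rewards, beta):
    """Importance-sampling estimate of log Z(x) for a single prompt.

    log_p_ref[i]: log-probability of trajectory i under the reference policy
    log_q[i]:     log-probability of trajectory i under the proposal
    rewards[i]:   scalar reward of trajectory i
    beta:         inverse temperature
    Trajectories are assumed to be sampled from the proposal q.
    """
    log_w = [lp - lq + beta * r
             for lp, lq, r in zip(log_p_ref, log_q, rewards)]
    # log-mean-exp, shifted by the maximum for numerical stability
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w)) - math.log(len(log_w))


# Toy check: proposal equals reference and every reward is 1, so Z = exp(beta).
logp = [-1.0, -2.0, -3.0]
log_z = log_partition_estimate(logp, logp, [1.0, 1.0, 1.0], beta=2.0)
```

Once computed, an estimate like `log_z` would be stored per prompt and held fixed during policy optimization, which is the decoupling the abstract emphasizes.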

2605.17291 2026-05-19 cs.LG 版本更新

Step-wise Rubric Rewards for LLM Reasoning

分步评分奖励用于大语言模型推理

Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, Xinbo Xu, Kean Shi, Ruoyu Wu, Xiaoying Zhang, Wenqi Shao, Baobao Chang, Nan Duan, Jiaqi Wang

发表机构 * Peking University(北京大学) JD Explore Academy(京东探索学院) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出一种分步评分奖励方法,通过引入LLM判官对每个评分项进行归因,规范化每步评分,并结合优势估计器提高推理准确性和减少自我纠正循环。

Comments Code available at https://github.com/akarinmoe/SRaR

详情
AI中文摘要

可验证奖励的强化学习(RLVR)被广泛用于改进大语言模型的推理能力,但其奖励仅关注最终答案的正确性,而没有对中间步骤进行监督。基于评分细则(rubric)的方法,如Rubrics as Rewards(RaR),通过依据结构化标准对采样轨迹(rollout)打分来引入更细粒度的监督,但评分仍然被聚合为一个单一的标量并应用于整个响应,导致三个弱点:多标准结构的丢失、对正确和错误步骤的均匀监督,以及通过无界自我纠正进行的奖励黑客。在1000个问题上,我们发现,在答案正确的响应中,18.2%的步骤是错误的却仍被正向奖励;在答案错误的响应中,49.9%的步骤是正确的却被惩罚。我们引入了分步评分作为奖励(SRaR),一种RLVR框架,它(i)使用LLM判官将每个评分项归因于特定的推理步骤;(ii)在各rollout之间对每步评分进行归一化,使得只有质量存在差异的步骤产生学习信号;(iii)通过解耦的优势估计器结合每步奖励与结果奖励,保持结果基线的稳定性。我们进一步构建了一个16000个问题的评分数据集,通过对比性地从强模型中采样正确和有缺陷的推理路径来蒸馏评分项。在六个数学推理基准测试中,SRaR相对于RaR在Qwen3-8B上将平均准确率提高3.57个点,在Qwen3-32B上提高2.75个点,将AIME 2025的忠实推理率从34.5%提升到46.7%,并将自我纠正循环从48.1%降低到26.5%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
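Step (ii) above, normalizing rubric scores across rollouts so that only steps whose quality varies carry a signal, can be illustrated with a simple z-score over rollouts. The sketch assumes scores are arranged as `scores[r][i]` (rollout `r`, rubric item `i`, already attributed to a step by the judge) and that items with zero variance receive zero signal. The layout, the function name, and the use of a population standard deviation are assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def normalize_across_rollouts(scores, eps=1e-8):
    """Z-score each rubric item across rollouts of the same problem.

    scores[r][i] is the score of rubric item i in rollout r. Items whose
    score is identical in every rollout get 0.0, so they contribute no
    learning signal.
    """
    n_items = len(scores[0])
    out = [[0.0] * n_items for _ in scores]
    for i in range(n_items):
        column = [row[i] for row in scores]
        mu, sigma = mean(column), pstdev(column)
        if sigma > eps:
            for r, value in enumerate(column):
                out[r][i] = (value - mu) / sigma
    return out


# Two rollouts, two rubric items: item 0 is satisfied by both rollouts,
# item 1 only by the second one.
normalized = normalize_across_rollouts([[1.0, 0.0], [1.0, 1.0]])
```

Here the item every rollout satisfies is zeroed out, while the item that separates the rollouts yields a signed signal of -1 and +1.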

2605.17285 2026-05-19 cs.LG cs.AI 版本更新

UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models

UNR-Explainer: 为无监督节点表示学习模型生成反事实解释

Hyunju Kang, Geonhee Han, Hogun Park

发表机构 * Department of Artificial Intelligence(人工智能系)

AI总结 本文提出UNR-Explainer,一种基于蒙特卡洛树搜索的反事实解释生成方法,用于无监督节点表示学习模型,通过识别关键子图来提升对下游任务如链接预测和聚类的理解。

Comments Accepted at ICLR 2024

详情
AI中文摘要

节点表示学习,如图神经网络(GNNs),已成为机器学习中的关键方法。对可靠解释生成的需求日益增加,但无监督模型仍处于探索阶段。为此,我们提出了一种在无监督节点表示学习中生成反事实(CF)解释的方法。我们识别出在扰动后导致感兴趣节点k近邻显著变化的最重要子图。基于k近邻的反事实解释方法为理解无监督下游任务,如top-k链接预测和聚类,提供了简单但关键的信息。因此,我们引入UNR-Explainer,基于蒙特卡洛树搜索(MCTS)为无监督节点表示学习方法生成具有表现力的反事实解释。所提出的方法在多样化的数据集上对无监督的GraphSAGE和DGI表现出优越的性能。

英文摘要

Node representation learning, such as Graph Neural Networks (GNNs), has emerged as a pivotal method in machine learning. The demand for reliable explanation generation surges, yet unsupervised models remain underexplored. To bridge this gap, we introduce a method for generating counterfactual (CF) explanations in unsupervised node representation learning. We identify the most important subgraphs that cause a significant change in the k-nearest neighbors of a node of interest in the learned embedding space upon perturbation. The k-nearest neighbor-based CF explanation method provides simple, yet pivotal, information for understanding unsupervised downstream tasks, such as top-k link prediction and clustering. Consequently, we introduce UNR-Explainer for generating expressive CF explanations for Unsupervised Node Representation learning methods based on a Monte Carlo Tree Search (MCTS). The proposed method demonstrates superior performance on diverse datasets for unsupervised GraphSAGE and DGI.
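The quantity at the heart of the explanation above is how much a node's k-nearest neighbours in embedding space change after a perturbation. A minimal way to measure that is the fraction of the original k neighbours that are lost, sketched below. The function names, the Euclidean metric, and the toy embeddings are assumptions; the sketch does not include the MCTS subgraph search or the retraining of embeddings after perturbation.

```python
import math


def knn(embeddings, node, k):
    """Return the set of the k nodes closest to `node` (Euclidean distance)."""
    others = [n for n in embeddings if n != node]
    # Ties are broken by node name so the result is deterministic.
    others.sort(key=lambda n: (math.dist(embeddings[node], embeddings[n]), n))
    return set(others[:k])


def knn_change(before, after, node, k):
    """Fraction of the node's original k nearest neighbours that changed."""
    kept = knn(before, node, k) & knn(after, node, k)
    return 1.0 - len(kept) / k


# Toy embeddings before and after a hypothetical perturbation that
# moves node "b" far away from node "a".
before = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (5.0, 0.0)}
after = dict(before, b=(9.0, 0.0))
change = knn_change(before, after, "a", k=2)
```

A counterfactual explanation in this setting would be a perturbation (for example, a removed subgraph) for which this change is large; a value of 0.0 means the neighbourhood is untouched.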

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP:用于端到端自动驾驶的对比潜在空间提示优化

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

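Affiliations: University of Michigan; University of Arizona

AI summary: This paper proposes CLAP, which tackles rare but safety-critical long-tail scenarios in autonomous driving through contrastive latent-space prompt optimization. Per-roadblock soft prompts are optimized from crowdsourced data and retrieved via V2X communication to improve planning performance.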

Comments 9 pages + appendix

详情
AI中文摘要

端到端自动驾驶系统通过视觉-语言-动作(VLA)模型在常见驾驶场景中表现出色,但在罕见但安全关键的长尾场景如活跃施工区和复杂让行几何中表现脆弱。本文提出了一种方法,超越数据扩展和模型训练,解决长尾挑战场景。我们引入CLAP(对比潜在空间提示优化),一种位置感知的适应框架,通过车辆到一切(V2X)通信按需检索,将冻结的VLA驾驶模型与每条道路块的软提示相结合。我们的方法基于VLA潜在空间的两个观察:(i)在VLA的隐藏状态层,来自相同道路块的场景紧密聚集并占据潜在空间的紧凑区域;(ii)在单个道路块内,长尾和正常帧在潜在表示中高度混合,难以改进其中一个而不影响另一个。CLAP通过两阶段流程解决此问题:监督对比学习发现道路块特定的困难场景方向,随后方向性正则化提示优化选择性改进挑战帧同时保持正常帧性能。在NAVSIM基准上,使用各种最先进的VLA后端,CLAP将挑战场景规划错误减少了24%,在不回归正常帧的情况下显著提高了规划性能。

英文摘要

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.

2605.17282 2026-05-19 nlin.CD cs.LG math.DS 版本更新

FEG-Pro: Forecast-Error Growth Profiling for Finite-Horizon Instability Analysis of Nonlinear Time Series

FEG-Pro:用于非线性时间序列有限时间不稳定分析的预测误差增长谱

Andrei Velichko, N'Gbo N'Gbo, Bruno Carpentieri, Mudassir Shams

Affiliations: Institute of Physics and Technology, Petrozavodsk State University, Petrozavodsk, Russia; School of Science and Engineering, International University of Grand Bassam, Grand Bassam, Cote d'Ivoire; Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy; Department of Mathematics, Faculty of Arts and Science, Balikesir University, Balikesir, Turkey; Department of Mathematics and Statistics, Riphah International University, Islamabad, Pakistan

AI summary: This paper proposes FEG-Pro, a framework for analyzing finite-horizon instability in nonlinear time series. It builds autocorrelation-guided sparse histories, performs distance-weighted k-nearest-neighbor multi-horizon forecasting, and analyzes the logarithmic growth of geometrically averaged forecast errors to output the finite-horizon forecast-error growth slope lambda_FEG, along with several secondary descriptors for nonlinear signal analysis.

Comments 31 pages, 9 figures, 43 references

详情
AI中文摘要

从标量时间序列估计最大李雅普诺夫指数在治理方程、切线动力学和完整状态向量不可用时具有挑战性。我们提出FEG-Pro,一种用于非线性标量时间序列的预测误差增长谱框架。该方法构建自相关引导的稀疏历史,执行距离加权k近邻多时间跨度预测,并分析几何平均预测误差的对数增长。其主要输出是有限时间预测误差增长斜率lambda_FEG。当误差增长曲线支持准线性区域时,该斜率可以与参考最大李雅普诺夫指数比较,作为主导不稳定性率的估计。相同的流程也提取了正式拟合选择区域、曲率、二次去趋势后的残差粗糙度、单调性以及预测误差分布熵(FEDE)等次级描述符。这些次级描述符不仅作为斜率的诊断控制,还作为非线性信号分析的候选机器学习特征,因为它们编码了未被lambda_FEG单独捕捉的谱几何和分布不确定性。我们评估了该方法在已知或参考指数的混沌映射、Mackey-Glass延迟动力学和标量Lorenz-63观测者上的表现。全记录实验显示,在准线性情况下有良好的一致性,在曲率或弱谱情况下有有意义的曲线形状信息。对代表性Logistic、Mackey-Glass和Lorenz记录进行二进制长度减半实验显示,残差粗糙度和平均FEDE经常单调变化,并且在记录长度减小时仍保持可解释性,即使斜率变得偏倚或高度变化。结果支持将预测误差增长视为结构化谱和特征生成框架,而非单一数字估计器。

英文摘要

Estimating the largest Lyapunov exponent from a scalar time series is difficult when the governing equations, tangent dynamics, and full state vector are unavailable. We propose FEG-Pro, a forecast-error growth profiling framework for nonlinear scalar time series. The method constructs autocorrelation-guided sparse histories, performs distance-weighted k-nearest-neighbor multi-horizon forecasting, and analyzes the logarithmic growth of geometrically averaged forecast errors. Its primary output is the finite-horizon forecast-error growth slope, lambda_FEG. When the error-growth curve supports a quasi-linear regime, this slope can be compared with reference largest Lyapunov exponents as an estimate of the dominant instability rate. The same pipeline also extracts the formal fit-selection regime, curvature, residual roughness after quadratic detrending, monotonicity, and forecast-error distribution entropy (FEDE) from signed multi-horizon errors. These secondary descriptors are intended not only as diagnostic controls for the slope, but also as candidate machine-learning features for nonlinear signal analysis, because they encode profile geometry and distributional uncertainty not captured by lambda_FEG alone. We evaluate the method on chaotic maps, Mackey-Glass delay dynamics, and scalar Lorenz-63 observables with known or reference exponents. Full-record experiments show good agreement in quasi-linear cases and meaningful curve-shape information in curved or weak profiles. A dyadic length-halving experiment on representative logistic, Mackey-Glass, and Lorenz records shows that residual roughness and mean FEDE often change monotonically and remain interpretable as record length decreases, even when the slope becomes biased or highly variable. The results support treating forecast-error growth as a structured profile and feature-generation framework rather than a single-number estimator.
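
The core pipeline (k-NN multi-horizon forecasts, geometric-mean errors, slope of the log error) can be sketched on the logistic map. This is a simplified illustration, not FEG-Pro itself: it uses a one-dimensional history instead of autocorrelation-guided sparse histories, a plain least-squares fit over fixed horizons, and hypothetical function names. For the logistic map at r = 4 the reference exponent is ln 2 (about 0.693); a sketch this crude is only expected to return a positive slope of roughly that magnitude.

```python
import heapq
import math


def logistic_series(n, x0=0.3, r=4.0):
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def knn_forecast(train, query, horizon, k=4, eps=1e-12):
    """Distance-weighted k-NN forecast of the value `horizon` steps after `query`."""
    usable = range(len(train) - horizon)
    nearest = heapq.nsmallest(k, usable, key=lambda i: abs(train[i] - query))
    weights = [1.0 / (abs(train[i] - query) + eps) for i in nearest]
    return sum(w * train[i + horizon] for w, i in zip(weights, nearest)) / sum(weights)


def error_growth_slope(series, n_train, horizons, eps=1e-12):
    """Least-squares slope of log(geometric-mean forecast error) versus horizon."""
    train, test = series[:n_train], series[n_train:]
    log_gm = []
    for h in horizons:
        logs = [math.log(abs(knn_forecast(train, test[t], h) - test[t + h]) + eps)
                for t in range(len(test) - h)]
        log_gm.append(sum(logs) / len(logs))  # log of the geometric mean
    h_mean = sum(horizons) / len(horizons)
    l_mean = sum(log_gm) / len(log_gm)
    return (sum((h - h_mean) * (l - l_mean) for h, l in zip(horizons, log_gm))
            / sum((h - h_mean) ** 2 for h in horizons))


slope = error_growth_slope(logistic_series(1500), n_train=1200, horizons=[1, 2, 3, 4, 5])
```

The horizons are kept short so that the errors have not yet saturated at the attractor size, which is the quasi-linear regime the abstract refers to.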

2605.17278 2026-05-19 cs.AI cs.LG 版本更新

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

A2RBench: 一个用于形式可验证抽象推理基准生成的自动范式

Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

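Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education; Institute of Artificial Intelligence

AI summary: This paper proposes A2RBench, an automatic paradigm whose generation, expansion, evaluation, and analysis pipeline makes abstract reasoning benchmark generation more efficient. It finds that current LLMs have fundamental deficiencies in abstract reasoning and that inputs with higher information complexity can simplify the reasoning process.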

详情
AI中文摘要

抽象推理能力反映了LLM提取和应用抽象规则的智能和泛化能力。然而,准确测量这一能力仍然具有挑战性:现有基准要么依赖昂贵的手动标注,限制了其规模,要么有风险测量记忆而非真正的推理。为此,我们引入了一个名为A2RBench的自动化流程,包括生成、扩展、评估和分析。具体而言,在生成阶段,LLM创建多样化的任务,要求真正的推理;在扩展阶段,LLM重用已验证的规则并扩展新的输入空间以生成任务变体,实现扩展。然而,这一过程可能导致幻觉。为消除它,我们进一步建立了理论框架并证明,程序验证——测试逆操作是否完美地逆转正向操作(循环一致性)——保证了唯一解。通过在主流LLM上的广泛评估,我们发现:(1)当前LLM在抽象推理上存在根本缺陷,顶级模型在代表性子集上显著低于人类(39.8% vs. 68.5%)。(2)当前LLM在生成3D任务的复杂度上远低于2D和1D,揭示了其对高维任务的理解不足。(3)反直觉的是,信息复杂度更高的输入可以简化推理过程。

英文摘要

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)--guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) The 3D tasks that current LLMs generate are far less complex than their 2D and 1D tasks, revealing a limited understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
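
The cycle-consistency check mentioned in the abstract can be illustrated with a short sketch. The rules and names below are hypothetical toy examples, not tasks from A2RBench; the point is only that a rule passes when its inverse recovers every sampled input.

```python
def is_cycle_consistent(forward, inverse, inputs):
    """True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


def rotate_left(xs):
    return xs[1:] + xs[:1]


def rotate_right(xs):
    return xs[-1:] + xs[:-1]


def sort_rule(xs):
    # Lossy rule: the original order cannot be recovered from the output.
    return sorted(xs)


def identity(xs):
    return xs


samples = [[1, 2, 3], [4, 4, 1, 0], [7]]
ok = is_cycle_consistent(rotate_left, rotate_right, samples)
bad = is_cycle_consistent(sort_rule, identity, samples)
```

A lossy rule such as sorting fails the check because several inputs map to the same output, which is the ambiguity that a unique-solution guarantee has to rule out.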

2605.17276 2026-05-19 cs.LG cs.AI 版本更新

How Do Electrocardiogram Models Scale?

ECG模型如何扩展?

Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro

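Affiliations: Uppsala University; Griffith University

AI summary: This paper studies how ECG models scale. It finds that supervised models are data-bottlenecked in-distribution, whereas self-supervised models scale robustly across both model and data sizes, and that self-supervised Transformers overtake ResNets at very large model sizes.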

详情
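AI abstract (translated from Chinese)

Although scaling laws have established a basic framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies have not consistently shown downstream gains as ECG model size or pre-training dataset size increases, leaving unclear the exact roles of architectural inductive biases, pre-training paradigms, and the improvements to expect from scale. In this work, we systematically investigate neural and loss-to-loss scaling laws in the ECG domain. By pre-training more than 120 models (from 20K to 200M parameters) on the large-scale CODE dataset (2.3M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm (supervised learning, SL, vs. self-supervised learning, SSL). We find that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are 1.3 to 2.5 times more parameter-efficient than Transformers, while SSL is up to 16 times more data-efficient and achieves up to 7.6 times higher transfer efficiency on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, SSL dominates on unseen clinical tasks, and self-supervised Transformers overtake ResNets at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.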

英文摘要

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
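
Scaling-law studies of this kind typically fit a power law to loss versus model size. The sketch below assumes the simple form loss = a * N^(-b), which the abstract does not state, and uses synthetic points rather than results from the paper; `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return math.exp(y_mean - slope * x_mean), -slope  # (a, b)


# Synthetic, made-up points that follow loss = 2 * N**(-0.1) exactly.
params = [2e4, 2e5, 2e6, 2e7, 2e8]
loss = [2.0 * n ** -0.1 for n in params]
a, b = fit_power_law(params, loss)
```

Comparing the fitted exponents of two model families at a fixed target loss is one way to express statements such as "1.3 to 2.5 times more parameter-efficient".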

2605.17275 2026-05-19 q-fin.RM cs.LG 版本更新

A Hybrid Gaussian Process Regression Framework for Stable Volatility-Covariance Estimation: Evidence from Global Equity Indices

一种用于稳定波动-协方差估计的混合高斯过程回归框架:来自全球股票指数的证据

Ujjwala Vadrevu

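AI summary: This paper proposes a hybrid Gaussian Process Regression-Historical Simulation (GPR-HS) framework for estimating VaR and ES on a diversified portfolio of global equity indices. Individual asset volatilities are modelled dynamically, while cross-asset correlations come from stable historical covariance, improving the accuracy of tail-risk forecasts.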

Comments Working paper. Replication code available at: https://colab.research.google.com/drive/1nrlSqmG10DNerNmEqGIh3EB9CcLWIgH9

详情
AI中文摘要

准确预测波动-协方差矩阵(VCV)对于监管资本充足性过程,如内部资本充足性评估程序(ICAAP)和综合资本分析和审查(CCAR)至关重要。传统的计量经济模型,包括GARCH家族和指数加权移动平均(EWMA)方法,由于参数刚性和分布假设,在压力下导致数值不稳定,从而系统性低估尾部风险。本文提出并验证了一种新的混合高斯过程回归-历史模拟(GPR-HS)框架,用于估计多样化投资组合中七个主要全球股票指数的VaR和ES。该框架将VCV估计问题解耦:单个资产波动率通过具有Matern 5/2核的单变量GPR动态建模,而交叉资产相关性通过稳定的历史协方差估计。关键的方法论贡献是攻击性噪声初始化(ANI)策略,该策略将初始白噪声核方差设置为训练回报的实证方差,确保Gram矩阵正定性、正则化和保守、符合监管要求的预测。通过2020年6月至2025年6月的扩展窗口前向链交叉验证方案评估,GPR-HS框架在大多数测试分割中实现了监管合规性;包括投资组合层面100%的ES通过率,同时在71.4%的单变量案例中通过二次损失优于静态历史VaR基准,在100%的案例中通过违规次数。

英文摘要

Accurate forecasting of the Volatility-Covariance Matrix (VCV) is central to regulatory capital adequacy processes such as the Internal Capital Adequacy Assessment Process (ICAAP) and the Comprehensive Capital Analysis and Review (CCAR). Traditional econometric models, including GARCH-family and Exponentially Weighted Moving Average (EWMA) approaches, suffer from parametric rigidity, distributional assumptions, and numerical instability under stress, leading to systematic underestimation of tail risk. This paper proposes and validates a novel Hybrid Gaussian Process Regression-Historical Simulation (GPR-HS) framework for estimating Value-at-Risk (VaR) and Expected Shortfall (ES) across a diversified portfolio of seven major global equity indices. The framework decouples the VCV estimation problem: individual asset volatilities are modelled dynamically using Univariate GPR with a Matern 5/2 kernel, while inter-asset correlations are estimated via stable historical covariance. A key methodological contribution is the Aggressive Noise Initialization (ANI) strategy, which sets the initial White Noise kernel variance equal to the empirical variance of the training returns, ensuring Gram matrix positive-definiteness, regularization, and conservative, regulatory-compliant forecasts. Evaluated using an expanding-window forward-chaining cross-validation scheme over June 2020 to June 2025, the GPR-HS framework achieves regulatory compliance in the majority of test splits, including a 100% ES pass rate at the portfolio level, while outperforming the static Historical VaR benchmark in 71.4% of univariate cases by Quadratic Loss and in 100% of cases by violation count.
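
The decoupling described above can be sketched by recombining per-asset volatility forecasts with a historical correlation matrix. The form cov_ij = sigma_i * sigma_j * rho_ij is an assumption about how the two parts are joined (the abstract does not spell it out), the numbers are made up, and both function names are hypothetical.

```python
def combine_vol_and_corr(vols, corr):
    """Covariance matrix with entries sigma_i * sigma_j * rho_ij."""
    n = len(vols)
    return [[vols[i] * vols[j] * corr[i][j] for j in range(n)] for i in range(n)]


def portfolio_variance(weights, cov):
    """Quadratic form w' * cov * w."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


# Made-up inputs: two assets, forecast daily vols, one historical correlation.
vols = [0.01, 0.02]
corr = [[1.0, 0.5], [0.5, 1.0]]
cov = combine_vol_and_corr(vols, corr)
var_p = portfolio_variance([0.5, 0.5], cov)
```

In a setup like this, only `vols` would change from day to day (for example from a GPR forecast), while `corr` is re-estimated slowly from history.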

2605.17271 2026-05-19 math.OC cs.LG 版本更新

Scalable Bi-causal Optimal Transport via KL Relaxation and Policy Gradients

基于KL松弛和策略梯度的可扩展双因果最优传输

Haoyang Cao, Jesse Hoekstra, Renyuan Xu, Yumin Xu, Ruixun Zhang

Affiliations: Department of Applied Mathematics and Statistics, Data Science and AI Institute, and Mathematical Institute for Data Science, Johns Hopkins University; Department of Statistics, Oxford-Man Institute of Quantitative Finance, and Nuffield College, University of Oxford; Management Science & Engineering Department, Stanford University; School of Mathematical Sciences, Peking University

AI summary: This paper proposes a scalable method for bi-causal optimal transport based on KL relaxation and policy gradients. A KL penalty turns the hard constraints into tractable divergence penalties, which addresses the computational difficulty of enforcing bi-causal coupling constraints for continuous distributions and long horizons. The method is shown to be effective in applications such as robust subhedging and time series statistical downscaling.

详情
AI中文摘要

双因果最优传输(OT)是一种自然的框架,用于在非anticipative信息约束下比较和耦合随机过程,具有在稳健金融、顺序不确定性量化和多阶段随机优化中的重要应用。特别是,一个学习到的双因果耦合自然地充当生成联合样本路径的模拟器,这些样本路径既尊重预定的边际律,又尊重底层的信息流。然而,其实际应用受限于在路径空间上强制双因果耦合约束的计算难度,尤其是对于连续分布和长时域。我们开发了一种可扩展的随机优化框架,用于在一般边际下计算双因果OT耦合。我们的方法引入了Kullback-Leibler(KL)-惩罚松弛,将硬边际约束替换为可处理的散度惩罚,同时保持问题的递归结构。我们为原始和松弛的 formulations 建立动态规划原理,证明当惩罚项增大时,松弛问题收敛到原始双因果OT问题,并推导出松弛目标的显式策略梯度表示。基于这些结果,我们提出了一种实用的策略梯度算法,具有无偏小批量估计器、方差缩减和非渐近性后悔保证。数值实验表明,该方法能够准确捕捉边际律和时间依赖性,并在稳健子对冲和时间序列统计降缩等应用中表现良好。这些结果提供了一种可扩展的双因果OT计算方法,并在非anticipative信息约束至关重要的设置中扩展了其应用范围。

英文摘要

Bi-causal optimal transport (OT) is a natural framework for comparing and coupling stochastic processes under nonanticipative information constraints, with important applications in robust finance, sequential uncertainty quantification, and multistage stochastic optimization. In particular, a learned bi-causal coupling naturally serves as a simulator for generating joint sample paths that respect both prescribed marginal laws and the underlying information flow. Its practical use, however, is limited by the computational difficulty of enforcing bi-causal coupling constraints over path space, especially for continuous distributions and long horizons. We develop a scalable stochastic-optimization framework for computing bi-causal OT couplings under general marginals. Our approach introduces a Kullback--Leibler (KL)-penalized relaxation that replaces hard marginal constraints with tractable divergence penalties while preserving the recursive structure of the problem. We establish dynamic programming principles for both the original and relaxed formulations, prove that the relaxed problem converges to the original bi-causal OT problem as the penalty grows, and derive explicit policy-gradient representations for the relaxed objective. Building on these results, we propose a practical policy-gradient algorithm with unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees. Numerical experiments show that the method accurately captures marginal laws and temporal dependence, and performs well in applications including robust subhedging and time series statistical downscaling. These results provide a scalable computational approach to bi-causal OT and broaden its applicability in settings where nonanticipative information constraints are essential.

2605.17269 2026-05-19 cs.LG stat.ML 版本更新

Calibeating for general proper losses: A Bregman divergence approach

基于Bregman散度的方法:一般恰当损失的校准

Maximilian Fichtl, Cristóbal Guzmán, Nishant A. Mehta

Affiliations: Independent researcher; Institute for Mathematical and Computational Engineering, Faculty of Mathematics and School of Engineering, Pontificia Universidad Católica de Chile; Department of Computer Science, University of Victoria

AI summary: This paper proposes a general framework for calibeating based on regret minimization, covering a broad family of proper losses that includes α-Tsallis losses (α ∈ [1, 2]) and Lipschitz losses. It also shows a new regret equality for Be The Regularized Leader.

Comments 31 pages

详情
AI中文摘要

本文介绍了一种基于懊悔最小化的通用校准框架。与Foster和Hart的开创性校准工作相比,后者专门处理Brier分数(平方损失)和log损失,我们考虑了一类包含α-Tsallis损失(α∈[1,2])和Lipschitz损失的广泛恰当损失家族。我们的结果对于Tsallis损失也适用于未缩放的Tsallis损失,该损失恢复log损失。我们的分析围绕恰当损失的Bregman散度观点展开。技术上,我们考虑的Tsallis损失家族的结果是U-calibration结果,同时在所有损失家族中获得对数懊悔,同时与先前结果相比具有更弱的维度依赖性。潜在的独立兴趣点是,我们还展示了新的关于Be The Regularized Leader的懊悔等式。该懊悔等式适用于一般恰当损失,并且本身基于两个与广义方差的在线更新公式相关的结果,后者是基于Bregman散度的方差泛化。

英文摘要

This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart's seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes $α$-Tsallis losses (for $α\in [1, 2]$) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.
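
The Bregman divergence view of a proper loss can be checked numerically for the two classical cases named in the abstract. This is a sketch of the standard textbook identities, not code from the paper: with the squared norm as generator the divergence is the squared Euclidean distance (Brier score), and with negative Shannon entropy it is the KL divergence (log loss). All function names are illustrative.

```python
import math


def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner


def sq_norm(v):
    return sum(x * x for x in v)


def sq_norm_grad(v):
    return [2.0 * x for x in v]


def neg_entropy(v):
    return sum(x * math.log(x) for x in v)


def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]


p, q = [0.7, 0.3], [0.4, 0.6]
d_brier = bregman(sq_norm, sq_norm_grad, p, q)        # squared distance
d_log = bregman(neg_entropy, neg_entropy_grad, p, q)  # KL(p || q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The same `bregman` helper accepts any differentiable convex generator, which is what makes a single analysis across a family of proper losses plausible.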

2605.17265 2026-05-19 cs.LG 版本更新

When Molecular Similarity Works: Property Cliffs Reveal Hidden Errors

当分子相似性起作用:属性悬崖揭示隐藏的错误

Di Hu, Kun Li, Haojie Rao, Longtao Hu, Jiameng Chen, Wenbin Hu, Yizhen Zheng, Jiajun Yu, Duanhua Cao

Affiliations: School of Economics and Management; School of Computer Science; Department of Data Science and Artificial Intelligence; College of Computer Science and Technology; School of Life Sciences and Technology

AI summary: This study uses property cliffs to reveal where molecular similarity fails, and proposes CliffSplit and CliffLoss to evaluate and mitigate model errors in local neighborhoods.

Comments Preprint, 22 pages, 10 figures, 11 tables. Di Hu and Kun Li contributed equally

详情
AI中文摘要

准确预测分子性质是药物发现和材料设计的基础,然而即使最先进的模型仍容易在局部失效模式中出现错误,这些错误无法通过聚合指标检测。属性悬崖暴露了这一差距:结构相似的分子在目标性质上可能有显著差异,因此性能表现优秀的模型可能在高风险的局部区域失效。为揭示并缓解这种失效模式,引入了CliffSplit,一种能够构建局部支持、暴露悬崖的评估协议,以及CliffLoss,一种对悬崖敏感错误具有通用性的训练-only缓解机制。在三个QM9目标和三个MoleculeNet任务上,五个backbones的实验表明,CliffSplit在QM9区域揭示至少15%更高的错误,而CliffLoss在疏水性上将悬崖到平滑错误差距减少了30%,并整体将MAE提高了9.7%。这些结果将分子相似性失效从描述性异常转变为分子机器学习的基准评估问题。代码可在https://anonymous.4open.science/r/Cliff_Loss获取。

英文摘要

Accurate prediction of molecular properties underpins drug discovery and material design, yet even state-of-the-art models remain vulnerable to localized failure modes that aggregate metrics cannot detect. The places where molecular similarity should be most helpful are also places where standard evaluation can be most misleading. Property cliffs expose this gap: structurally similar molecules can still differ sharply in target property, so models with competitive overall performance may fail in high-risk local neighborhoods. To expose and mitigate this failure mode, we introduce CliffSplit, a cliff-aware evaluation protocol that constructs locally supported, cliff-exposed test cases, and CliffLoss, a model-agnostic, training-only mitigation mechanism for cliff-sensitive errors. Experiments on three QM9 targets and three MoleculeNet tasks across five backbones show that CliffSplit reveals at least 15% higher error in cliff-heavy QM9 regions, while CliffLoss reduces the cliff-to-smooth error gap by up to 30% on Lipophilicity and improves overall MAE by 9.7%. Together, these results turn molecular similarity failure from a descriptive anomaly into a benchmarked evaluation problem for molecular machine learning. The code is available at https://anonymous.4open.science/r/Cliff_Loss.
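
A property cliff, as described above, is a pair of molecules that are structurally similar but far apart in the target property. The sketch below assumes Tanimoto similarity on fingerprint bit sets and arbitrary thresholds; the definitions actually used by CliffSplit are not given in the abstract, and all names and values here are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cliff_pairs(fingerprints, values, sim_min=0.8, gap_min=1.0):
    """Index pairs that are similar in structure but far apart in property."""
    return [(i, j) for i, j in combinations(range(len(values)), 2)
            if tanimoto(fingerprints[i], fingerprints[j]) >= sim_min
            and abs(values[i] - values[j]) >= gap_min]


# Toy fingerprints (sets of feature ids) and property values, all made up.
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {7, 8, 9}]
vals = [0.2, 2.1, 0.3]
pairs = cliff_pairs(fps, vals)
```

Molecules that appear in such pairs mark the local neighborhoods where an aggregate metric can hide large errors.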

2605.17256 2026-05-19 eess.SY cs.AI cs.LG cs.SY 版本更新

Latency-Aware Deep Learning Benchmark for Real-Time Cyber-Physical Attack and Fault Classification in Inverter-Dominated Power Grids

面向实时机电攻击和故障分类的延迟感知深度学习基准测试

Emad Abukhousa, Saman Zonouz, A. P. Sakis Meliopoulos

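AI summary: This paper proposes a latency-aware benchmarking framework for deep learning models that detect power-system anomalies from high-fidelity time-domain signals in inverter-dominated grids. Eight neural network architectures, from MLPs to Transformers, are evaluated on streaming datasets of physical faults and cyber-attacks. All models classify two representative multi-event sequences in real time with sub-cycle response times below 15 ms, but end-to-end inference latency consistently exceeds three cycles (50 to 90 ms), exposing a gap between algorithmic capability and protection-grade deployment.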

详情
AI中文摘要

本文介绍了一种延迟感知的基准测试框架,用于评估在电力系统异常检测中使用高保真时域信号生成的深度学习模型。通过系统评估从物理故障和网络攻击中生成的流数据集,评估了八种神经网络架构,包括MLP到Transformer。所有模型都能在亚周期响应时间低于15毫秒的情况下实时分类两种代表性多事件序列,但端到端推理延迟始终超过三个周期,范围从50到90毫秒。这些结果突显了算法能力与保护级部署之间的关键差距,指出了进一步优化和硬件加速的必要性。研究结果建立了可重复的亚周期异常检测基准,并为将机器学习方法从研究原型过渡到实际保护应用提供了指导。

英文摘要

This work introduces a latency-aware benchmarking framework for evaluating deep learning models in power system anomaly detection using high-fidelity, time-domain signals generated from an industry-grade electromagnetic transient simulator. Eight neural network architectures, ranging from MLPs to Transformers, were systematically evaluated on streaming datasets representing both physical faults and cyber-attacks in inverter-dominated networks. All models successfully classified two representative multi-event sequences in real time with sub-cycle response times below 15 ms. However, although classification decisions occurred within one cycle, the end-to-end inference latency consistently exceeded three cycles, ranging from 50 to 90 ms. These results highlight a critical gap between algorithmic capability and protection-grade deployment, pointing to the need for further optimization and hardware acceleration. The findings establish a reproducible benchmark for sub-cycle anomaly detection and provide guidance for transitioning machine learning methods from research prototypes to real-world protection applications.
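
The cycle-based latency figures can be converted to milliseconds with a one-line helper. The sketch assumes a 60 Hz grid, which the abstract does not state but which matches its numbers (one cycle is about 16.7 ms, three cycles are 50 ms); `cycles_to_ms` is a hypothetical name.

```python
def cycles_to_ms(cycles, grid_hz=60.0):
    """Duration of `cycles` power-frequency cycles in milliseconds."""
    return 1000.0 * cycles / grid_hz


one_cycle_ms = cycles_to_ms(1)     # about 16.7 ms at 60 Hz
three_cycles_ms = cycles_to_ms(3)  # 50 ms at 60 Hz

# The reported sub-cycle decision time (below 15 ms) fits inside one cycle,
# while the reported end-to-end latency (50 to 90 ms) spans three cycles or more.
decision_within_one_cycle = 15.0 < one_cycle_ms
```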

2605.17251 2026-05-19 cs.DS cs.LG 版本更新

Iterative Chow Filtering for Learning with Distribution Shift

迭代 Chow 过滤用于分布偏移学习

Gautam Chandrasekaran, Georgios Gkrinias, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

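Affiliations: UT Austin

AI summary: This paper proposes Iterative Chow Filtering to make learning with distribution shift more efficient. It gives a quasipolynomial-time PQ learning algorithm for DNF formulas and exponential improvements for several function classes.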

Comments 30 pages

详情
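AI abstract (translated from Chinese)

Recent work by Goel et al. gave the first efficient algorithms for learning with distribution shift in the challenging PQ framework. In this setting, the learner receives labeled training examples and unlabeled test examples and must predict correctly on the test set, but may abstain on out-of-distribution points. Their results rely on L2 sandwiching approximations, a strong requirement that leads to poor bounds for basic function classes such as DNF formulas. Here, we show that the weaker notion of L1 sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution, essentially matching the guarantees known for ordinary PAC learning. More broadly, our bounds give exponential improvements for several classes, including constant-depth circuits and constant-degree polynomial threshold functions. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.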

英文摘要

Recent work due to Goel et al. gave the first efficient algorithms for learning with distribution shift in the challenging PQ framework. In this setting, a learner receives labeled training examples, unlabeled test examples, and must make correct predictions on the test set but is allowed to abstain from predicting on out-of-distribution points. Their results rely on ${\cal L}_2$ sandwiching approximations, a strong requirement that leads to poor bounds for several basic function classes such as DNF formulas. Here, we show that the weaker notion of ${\cal L}_1$ sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. More broadly, our bounds provide exponential improvements for several classes including constant depth circuits and constant degree polynomial threshold functions. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.
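
For readers unfamiliar with Chow parameters, the sketch below computes the degree-0 and degree-1 parameters of a Boolean function under the uniform distribution on {-1, 1}^n, namely E[f(x)] and E[f(x) x_i]. It illustrates the definition only; it is not the filtering procedure from the paper, and `chow_parameters` is a hypothetical helper.

```python
from itertools import product


def chow_parameters(f, n):
    """Degree-0 and degree-1 Chow parameters of f, uniform on {-1, 1}^n."""
    points = list(product([-1, 1], repeat=n))
    mean = sum(f(x) for x in points) / len(points)
    corr = [sum(f(x) * x[i] for x in points) / len(points) for i in range(n)]
    return mean, corr


def majority(x):
    return 1 if sum(x) > 0 else -1


bias, weights = chow_parameters(majority, 3)
```

For the majority of three bits the function is balanced and every coordinate has correlation 1/2 with the output. In the abstract's setting, low-degree statistics of this kind serve as the signal for flagging test points that are incompatible with the training distribution.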

2605.17250 2026-05-19 cs.LG 版本更新

Towards Principled Test-Time Adaptation for Time Series Forecasting

面向时间序列预测的原理性测试时间适应方法

Haochun Wang, Ruichen Xu, Georgios Kementzidis, Karen Cho, Sebastian Ramirez Villarreal, Yuefan Deng

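Affiliations: Stony Brook University

AI summary: This paper proposes FAC, a lightweight frequency-domain calibration method that improves the adaptation of time series forecasting under distribution shift. It analyzes the prediction corrections of existing adapters in the frequency domain and delivers more efficient and consistent performance across diverse datasets and forecasting horizons.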

详情
AI中文摘要

测试时间适应(TTA)最近作为一种有前景的方法,用于在分布偏移下改进时间序列预测(TSF)。现有的TSF-TTA方法在利用揭示的目标方面存在差异,导致适应协议异质且缺乏明确统一的公式。为了解决这个问题,我们从协议清洁度的角度重新审视TSF-TTA,并提出了一种仅基于成熟地面真实值的适应协议,从而获得更原理化的适应设置。在该协议下,我们进一步在频域中诊断现有适配器,并发现其预测修正通常表现出有限且弱结构化的频谱修改。受此诊断启发,我们提出了频率感知校准(FAC),一种轻量级校准方法,直接在频域中参数化预测修正。在多种数据集、预测时间跨度和源预测器上,FAC实现了竞争性和一致性的性能,同时所需可训练参数显著少于比较的TSF-TTA适配器。

英文摘要

Test-time adaptation (TTA) has recently emerged as a promising approach for improving time series forecasting (TSF) under distribution shift. Existing TSF-TTA methods differ in how they utilize revealed targets, yet the resulting adaptation protocols remain heterogeneous and lack a clearly unified formulation. To address this issue, we revisit TSF-TTA from the perspective of protocol cleanliness and propose an adaptation protocol based solely on matured ground truth, yielding a more principled setting for adaptation. Under this protocol, we further diagnose existing adapters in the frequency domain and find that their prediction corrections often exhibit limited and weakly structured spectral modifications. Motivated by this diagnosis, we propose Frequency-Aware Calibration (FAC), a lightweight calibration method that directly parameterizes prediction corrections in the frequency domain. Across diverse datasets, forecasting horizons, and source forecasters, FAC achieves competitive and consistent performance while requiring substantially fewer trainable parameters than the compared TSF-TTA adapters.

2605.17246 2026-05-19 cs.LG cs.AI 版本更新

Fidelity Probes for Specification--Code Alignment

规范-代码对齐的保真度探针

Ferhat Erata, Hao Zhou, Luke Huan

Affiliations: AWS Agentic AI

AI summary: This paper introduces fidelity probes: natural-language questions generated from a reference artifact, with code-derived ground-truth answers, that are answered from a candidate specification. Fidelity, the fraction of agreeing probes, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), frozen-test fidelity rises from 0.63 to 0.94 over eight iterations, and the plateau is predicted by a two-state Markov fixed point $F^\dagger$ from four iterations of rate data. The method applies to any pair of artifacts meant to describe the same behaviour.

Comments 29 pages, 14 figures, 11 tables

详情
AI中文摘要

我们引入了保真度探针:从参考artifact生成的自然语言问题,其代码派生的地面真实答案由候选规范回答。保真度是同意探针的比例,分解为矛盾率和覆盖缺口率,驱动针对性的规范编辑以达到收敛。在15个程序、约12000行COBOL基准(AWS CardDemo)上,我们通过八次迭代将冻结测试规范的保真度从0.63提升到0.94,其中平台位置由仅需四次速率数据的两状态马尔可夫固定点$F^\dagger$预测。探针来自LLM读取代码或静态分析管道对其控制流、数据流和系统依赖图的处理,具有可调混合比例。一个带有冻结留出集的探针重采样协议提供了Hoeffding有界的过拟合判别;我们测量的训练/测试差距保持在该包络线下一个数量级。三种基于图的混合提升了保真度16到30分;跨分布评估显示LLM和符号通道在经验上互补。在五个独立LLM家族(Anthropic、DeepSeek、Google、Alibaba、OpenAI)上进行的跨家族生成器扫描确认了收敛行为不依赖于任何单一模型家族:五个非Claude生成器中有三个产生了与马尔可夫固定点预测一致的轨迹,而冻结测试协议主动否定了两个探针分布随迭代变化的生成器。该方法适用于任何应描述相同行为的artifact对。

英文摘要

We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point $F^\dagger$ from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.
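
A two-state Markov model of probe agreement has a closed-form fixed point, sketched below. The abstract does not give the exact form of $F^\dagger$, so this uses the textbook stationary distribution of a two-state chain as an assumption; the rates are arbitrary illustrative values, not data from the paper, and the function names are hypothetical.

```python
def markov_fixed_point(gain_rate, loss_rate):
    """Stationary share of agreeing probes for a two-state chain.

    gain_rate: per-iteration probability that a disagreeing probe starts agreeing.
    loss_rate: per-iteration probability that an agreeing probe stops agreeing.
    """
    return gain_rate / (gain_rate + loss_rate)


def iterate_fidelity(f0, gain_rate, loss_rate, steps):
    """Apply the one-step update F <- F * (1 - loss) + (1 - F) * gain."""
    f = f0
    for _ in range(steps):
        f = f * (1.0 - loss_rate) + (1.0 - f) * gain_rate
    return f


# Illustrative rates only.
f_star = markov_fixed_point(0.3, 0.1)
f_20 = iterate_fidelity(0.5, 0.3, 0.1, steps=20)
```

Under this reading, estimating the two rates from a few early iterations is enough to predict where the fidelity curve will plateau.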

2605.17244 2026-05-19 cs.LG cs.AI 版本更新

Drift Flow Matching

漂移流匹配

Chenrui Ma, Xi Xiao, Lin Zhao, Tianyang Wang, Ferdinando Fioretto, Yanning Shen

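Affiliations: University of California, Irvine; University of Virginia; University of Alabama at Birmingham; Northeastern University

AI summary: This paper proposes Drift Flow Matching, a framework that combines drifting generative models with flow-based iterative generation, enabling efficient generation with multi-step refinement and adapting to different quality-efficiency requirements.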

详情
AI中文摘要

迭代生成模型如流匹配和扩散模型在测试时表现出强大的扩展性,额外的推理计算可以提高生成质量。相比之下,漂移模型提供高效的单步生成,但其直接生成范式限制了灵活性。在本文中,我们提出Drift Flow Matching (DFM),一个连接漂移生成建模与基于流的迭代生成的框架。DFM保留了直接传输映射的效率,同时在需要时通过多个推理步骤细化生成。这填补了单步漂移模型与多步流匹配方法之间的空白,并提供了一种新的生成范式,可以适应不同的质量-效率需求。在不同任务和数据集上的广泛实验验证了所提框架的有效性和通用性。

英文摘要

Iterative generative models such as Flow Matching and Diffusion models have demonstrated strong test-time scaling behavior, where additional inference computation can improve generation quality. In contrast, Drift Models offer efficient one-step generation, but their direct generation paradigm limits such flexibility. In this work, we propose Drift Flow Matching (DFM), a framework that connects drifting generative modeling with flow-based iterative generation. DFM preserves the efficiency of direct transport maps while enabling generation to be refined through multiple inference steps when desired. This bridges the gap between one-step Drift Models and multi-step Flow Matching methods, and provides a novel generative paradigm that can adapt sampling computation to different quality--efficiency requirements. Extensive experiments across different tasks and datasets demonstrate the effectiveness and generality of the proposed framework.

2605.17238 2026-05-19 cs.LG stat.ML 版本更新

Learning in Position-Aware Multinomial Logit Bandits: From Multiplicative to General Position Effects

基于位置感知的多项逻辑带宽学习:从乘法位置效应到一般位置效应

Xi Chen, Shibo Dai, Jiameng Lyu, Yuan Zhou

Affiliations: Leonard N. Stern School of Business, New York University; Qiuzhen College, Tsinghua University; Department of Management Science, School of Management, Fudan University; Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University

AI summary: This paper studies dynamic joint assortment selection and positioning under a Multinomial Logit (MNL) choice framework, where each product's attraction depends on its intrinsic appeal and its display position. It covers both multiplicative and general position effects and gives round-based learning algorithms with the first regret-optimal characterization. For the multiplicative model, P2MLE-UCB attains $\tilde{O}(\sqrt{NT})$ regret, matching the lower bound and closing the $\sqrt{K}$ gap left by prior epoch-based analyses; for the general model, GP2-UCB matches a minimax lower bound. Experiments on synthetic data and the Expedia dataset show consistent gains over state-of-the-art benchmarks.

详情
AI中文摘要

我们研究了动态联合品类选择与排列问题,其中每个产品的吸引力取决于其内在吸引力和显示位置,在多项逻辑(MNL)选择框架下。我们的研究从乘法位置效应模型开始,其中每个产品的吸引力由位置特定因子缩放,扩展到一般位置效应模型,该模型为每个产品-位置对分配独立吸引力参数以捕捉异质协同效应。对于两种模型,我们设计了基于轮次的学习算法,在每次反馈后更新决策,并建立了首个最优后悔分析。此外,我们的基于轮次算法为现代平台提供了必要的实时操作。对于乘法模型,我们开发了具有截断机制的交叉位置成对最大似然估计器,并证明我们的算法P2MLE-UCB达到$\tilde{O}(\sqrt{NT})$的后悔,匹配下限并弥补了先前基于周期的分析留下的$\sqrt{K}$差距。对于一般模型,我们建立了最小最大下界并提出了GP2-UCB算法,具有匹配的上界。此外,我们设计了基于Dinkelbach方法和最大权二分图匹配的高效子程序,用于每轮联合品类和排列优化。在合成数据和Expedia数据集上的数值实验表明,我们的算法在性能上始终优于最先进的基准。

English abstract

We study the dynamic joint assortment selection and positioning problem, where the attraction of each product depends on both its intrinsic appeal and its display position under a Multinomial Logit (MNL) choice framework. Our study ranges from the multiplicative position effects model, in which each product's attraction is scaled by a position-specific factor, to a general position effects model assigning independent attraction parameters to every product--position pair to capture heterogeneous synergies. For both models, we design round-based learning algorithms that update decisions after every single feedback, and establish the first regret-optimal characterization. Besides, our round-based algorithms provide the prompt operations needed by modern platforms. For the multiplicative model, we develop a cross-position pairwise maximum likelihood estimator with a clipping mechanism, and prove that our algorithm P2MLE-UCB attains a regret of $\tilde{O}(\sqrt{NT})$, matching the lower bound and closing the $\sqrt{K}$ gap left by prior epoch-based analyses. For the general model, we establish a minimax lower bound and propose GP2-UCB with a matching upper bound. Moreover, we design an efficient subroutine for the per-round joint assortment and positioning optimization based on Dinkelbach's method and maximum-weight bipartite matching. Numerical experiments on synthetic data and the Expedia dataset show that our algorithms consistently outperform state-of-the-art benchmarks.
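To make the multiplicative position effects model concrete, the following minimal Python sketch computes MNL choice probabilities when the product shown in slot k has attraction lambda_k * v_i. The function name, the data layout, and the example numbers are illustrative assumptions rather than anything taken from the paper, and the learning algorithm P2MLE-UCB itself is not reproduced.

```python
def choice_probabilities(appeal, position_factor, ranking):
    """MNL choice probabilities under multiplicative position effects.

    appeal[i]          : intrinsic attraction v_i of product i (assumed > 0)
    position_factor[k] : multiplicative factor lambda_k of display slot k
    ranking[k]         : product displayed in slot k
    """
    # Attraction of the product in slot k is lambda_k * v_i.
    weights = [position_factor[k] * appeal[i] for k, i in enumerate(ranking)]
    denom = 1.0 + sum(weights)  # the constant 1 is the no-purchase option
    probs = {i: w / denom for i, w in zip(ranking, weights)}
    probs["no_purchase"] = 1.0 / denom
    return probs


# Hypothetical example: two products, the second slot is half as visible.
example = choice_probabilities({"a": 2.0, "b": 1.0}, [1.0, 0.5], ["a", "b"])
```

Swapping the two products in `ranking` changes every probability, which is why assortment and positioning have to be optimized jointly.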

2605.17231 2026-05-19 cs.LG cs.CL Version update

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

FishBack: 用于Transformer最优激活引导的拉回费舍尔几何

Sihan Wang, Jiayi Zhao

Affiliations: Sihan Wang, 1 Jiayi Zhao(1 王世涵,1 赵佳怡)

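AI summary: This paper proposes FishBack, a framework that optimizes activation steering in transformers using pullback Fisher geometry. It addresses the failure of the Euclidean assumption on activation space by deriving a closed-form steering equation that can be applied iteratively.

Comments Preprint. 20 pages, 9 figures, 5 tables

Details
AI Chinese abstract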

激活引导方法通过修改语言模型的中间表示来控制输出行为,但普遍假设激活空间是欧几里得空间。我们证明这一假设严重失效:模型自身输出行为诱导的局部几何——即softmax层的费舍尔信息度量通过后续层的雅可比矩阵拉回后的度量——在GPT-2上相对于欧几里得度量的谱相对范数偏离超过97%,其有效维度仅为环境空间的2-17%。基于此拉回费舍尔度量,我们推导出一个闭式引导方程,确定任何目标概念的最小失真方向,从而在每个点上获得闭式最优方向,可迭代应用而无需曲面拟合或数据驱动的几何估计。我们称该框架为FishBack。该度量允许逐层递归分解,揭示现有方法——CAA、ActAdd、ITI等——各自隐式采用特定近似度量,其性能差距可通过单个谱诊断量化:其隐式度量成本与费舍尔最优成本的比率。在GPT-2上,迭代拉回引导在三个动词形态学概念和四层上均优于所有欧几里得基线,其偏离目标的KL减少量相对于欧几里得梯度上升为1.3×-2.5×,相对于CAA在匹配的概念概率下为1.5×。

English abstract

Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.
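The pullback construction can be illustrated on the smallest possible case. The sketch below assumes a single linear map `W` from an activation `h` to the logits, so the Jacobian of the "subsequent layers" is just `W`; the softmax Fisher matrix in logit space is `diag(p) - p p^T`, and the pullback metric is `G = W^T (diag(p) - p p^T) W`. The matrix, the activation, and the function names are hypothetical, and this is not the paper's closed-form steering equation.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def pullback_fisher(W, h):
    """Return G = W^T (diag(p) - p p^T) W for logits z = W h."""
    n_out, n_in = len(W), len(h)
    z = [sum(W[i][j] * h[j] for j in range(n_in)) for i in range(n_out)]
    p = softmax(z)
    # Fisher information of a categorical distribution w.r.t. its logits.
    F = [[(p[i] if i == k else 0.0) - p[i] * p[k] for k in range(n_out)]
         for i in range(n_out)]
    # Pull back through the Jacobian dz/dh = W.
    FW = [[sum(F[i][k] * W[k][j] for k in range(n_out)) for j in range(n_in)]
          for i in range(n_out)]
    return [[sum(W[k][a] * FW[k][b] for k in range(n_out)) for b in range(n_in)]
            for a in range(n_in)]


def metric_cost(G, d):
    """Squared length of direction d under the metric G."""
    n = len(d)
    return sum(d[a] * G[a][b] * d[b] for a in range(n) for b in range(n))


# Hypothetical 3-class head on a 2-dimensional activation.
W = [[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]]
G = pullback_fisher(W, [0.3, -0.2])
```

Here the direction `[1, 0]` shifts all three logits equally, so it leaves the output distribution unchanged and has (numerically) zero cost under `G`, while `[0, 1]` has the same Euclidean length but a strictly positive cost. This is the kind of mismatch between Euclidean and output-induced geometry that the abstract describes.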

2605.17230 2026-05-19 quant-ph cond-mat.dis-nn cond-mat.stat-mech cs.LG Version update

Maximum Likelihood Decoding of Quantum Error Correction Codes

量子纠错码的最大似然解码

Hanyan Cao, Ge Yan, Yuxuan Du, Feng Pan

Affiliations: Science, Mathematics and Technology Cluster, Singapore University of Technology and Design(新加坡科技设计大学科学、数学与技术学群); College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院); School of Physical and Mathematical Sciences, Nanyang Technological University(南洋理工大学物理与数学科学学院)

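AI summary: This topical review covers maximum likelihood decoding of quantum error correction codes, surveying approaches from statistical mechanics, tensor networks, and artificial intelligence, and discussing their connections, their application to simulated and experimental quantum hardware, and open challenges.

Comments An invited topical review. Comments are welcome

Details
AI Chinese abstract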

量子纠错(QEC)对于实现容错量子计算至关重要,但其有效性取决于解释噪声综合征测量的经典解码算法。在所有可能的解码策略中,最大似然解码(MLD)被证明是最优的,因为它通过求和所有可能的错误,识别具有最大似然的逻辑组。尽管其最优性,MLD在一般情况下计算上是不可行的(#P-难),这推动了大量精确和近似算法的发展。在本文的专题综述中,我们通过统计力学、张量网络和人工智能三种互补的视角,提供对MLD的统一观点。从统计力学的角度,MLD问题映射到评估无序自旋模型的配分函数,使某些编码和噪声模型能够获得精确解,并通过相变分析进行阈值估计。从张量网络的角度,对编码因子图上的张量网络进行近似收缩,可以得到接近MLD精度且具有多项式计算成本的解码器。从人工智能的角度,基于神经网络的解码器,包括自回归生成模型和循环变压器,通过从数据中学习来近似MLD分布,利用现代硬件加速器的并行性实现高精度。我们讨论了这三种方法之间的联系,回顾了它们在模拟和实验量子硬件中的应用,并概述了开放挑战,包括实时解码、扩展到大码距以及推广到高率量子低密度奇偶校验码。

English abstract

Quantum error correction (QEC) is indispensable for realizing fault-tolerant quantum computation, yet its effectiveness hinges critically on the classical decoding algorithm that interprets noisy syndrome measurements. Among all possible decoding strategies, maximum likelihood decoding (MLD) is provably optimal, since it identifies the logical class with the largest likelihood by summing over all possible errors within each logical class consistent with the observed syndrome. Despite its optimality, MLD is computationally intractable in general (#P-hard), motivating a rich landscape of exact and approximate algorithms. In this topical review, we provide a unified perspective on MLD by surveying recent advances through three complementary lenses: statistical mechanics, tensor networks, and artificial intelligence. From the statistical mechanics viewpoint, the MLD problem maps onto evaluating partition functions of disordered spin models, enabling exact solutions for certain codes and noise models as well as threshold estimation via phase-transition analysis. From the tensor network perspective, approximate contraction of tensor networks on the code's factor graph yields decoders that closely approach MLD accuracy with polynomial computational cost. From the artificial intelligence perspective, neural-network-based decoders, including autoregressive generative models and recurrent transformers, learn to approximate the MLD distribution from data, achieving high accuracy with the parallelism afforded by modern hardware accelerators. We discuss the connections among these three approaches, review their application to both simulated and experimental quantum hardware, and outline open challenges including real-time decoding, scalability to large code distances, and generalization to high-rate quantum low-density parity-check codes.
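A brute-force version of MLD fits in a few lines for a small code. The sketch below assumes independent phase-flip (Z) errors with probability `p` on the 9-qubit Shor code: the two X-type stabilizers give the syndrome, Z errors that differ by a Z-type stabilizer share the same block parities and therefore the same logical class, and the decoder sums probabilities per class. The noise model and the function name are assumptions chosen for illustration, and the enumeration is exponential in the number of qubits, in line with the intractability noted in the abstract.

```python
from itertools import product


def mld_decode_shor_z(syndrome, p):
    """Maximum likelihood decoding of Z errors on the 9-qubit Shor code.

    syndrome : (s1, s2), outcomes of the X-type checks on qubits 1-6 and 4-9
    p        : independent phase-flip probability per qubit (assumed noise model)
    Returns the most likely class (labelled by block parities) and all class totals.
    """
    totals = {}
    for e in product((0, 1), repeat=9):
        # Parity of the Z error inside each block of three qubits.
        blocks = tuple(sum(e[3 * i:3 * i + 3]) % 2 for i in range(3))
        if (blocks[0] ^ blocks[1], blocks[1] ^ blocks[2]) != tuple(syndrome):
            continue  # inconsistent with the observed syndrome
        w = sum(e)
        # Errors with equal block parities differ by Z stabilizers: same class.
        totals[blocks] = totals.get(blocks, 0.0) + p ** w * (1 - p) ** (9 - w)
    best = max(totals, key=totals.get)
    return best, totals


best_class, class_totals = mld_decode_shor_z((1, 0), 0.1)
```

For each syndrome exactly two classes survive, and they differ by a logical operator; the decoder returns the one with the larger summed probability rather than the single most likely error.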

2605.17217 2026-05-19 quant-ph cs.LG Version update

Toward Near-Real-Time Marine Oil Spill Detection in SAR Imagery using Quantum-Assisted SVM

向SAR影像中实现近实时海洋油污检测的量子辅助SVM

Joseph Strauss, Jyotsna Sharma

Affiliations: Division of Computer Science & Engineering, Louisiana State University, Baton Rouge, United States; Department of Petroleum Engineering, Louisiana State University, Baton Rouge, United States

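AI summary: This study proposes a quantum-assisted SVM bagging ensemble for near-real-time detection of marine oil spills in SAR imagery, using quantum annealing to optimize the support vectors of weak SVMs trained on small data subsets for efficient and accurate detection.

Details
AI Chinese abstract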

海洋油污需要快速检测以减轻严重的生态和经济损失。虽然基于卫星的合成孔径雷达(SAR)提供全天候监控,但分析这些数据仍然具有挑战性。深度学习模型通常需要大量数据集并导致高延迟。为此,开发了一种像素级量子辅助支持向量机(QSVM)集成方法。利用量子退火优化单个弱SVM的支持向量,这些支持向量随后在经典系统中进行聚合。该方法在Sentinel-1影像上使用量子模拟和物理量子退火硬件进行评估。量子辅助流程在性能上与严格的经典基线相当,实现了交并比(IoU)为0.60和平衡精度为0.89。使用基于门的量子计算的补充实验显示了相似的分割精度,尽管退火方法在推理效率上更优。在霍尔木兹海峡的独立油污影像上的泛化评估进一步证明了训练管道对地理不同的油污事件的潜在转移性。这些结果确立了量子辅助分割管道在近实时环境监测中的可行性。

English abstract

Marine oil spills require rapid detection to mitigate severe ecological and economic damage. While satellite-based Synthetic Aperture Radar (SAR) provides essential all-weather monitoring, analyzing this data remains challenging. Deep learning models often require massive datasets and incur high latency. To address this, a pixel-wise quantum-assisted Support Vector Machine (QSVM) bagging ensemble is developed. Quantum annealing is leveraged to optimize the support vectors of individual weak SVMs on small data subsets, which are then classically aggregated. The approach is evaluated on Sentinel-1 imagery using both quantum simulation and physical quantum annealing hardware. The quantum-assisted pipeline achieved performance comparable to a rigorous classical baseline, yielding an Intersection-over-Union (IoU) of 0.60 and a balanced accuracy of 0.89. Complementary experiments with gate-based quantum computing demonstrated similar segmentation accuracy, although the annealing approach offered superior inference efficiency. Generalization was further assessed on independent oil spill imagery from the Strait of Hormuz, demonstrating the potential transferability of the trained pipeline to geographically distinct spill events. These results establish the feasibility of quantum-assisted segmentation pipelines for near-real-time environmental monitoring.
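The classical aggregation step and the two reported metrics can be sketched as follows. This is a minimal illustration that assumes binary per-pixel labels (1 for oil, 0 for background) and simple majority voting; the abstract does not state the aggregation rule, so the voting scheme, the function names, and the toy data are assumptions.

```python
def majority_vote(member_predictions):
    """Combine per-pixel 0/1 predictions from several weak classifiers."""
    n_members = len(member_predictions)
    return [1 if 2 * sum(votes) > n_members else 0
            for votes in zip(*member_predictions)]


def iou_and_balanced_accuracy(pred, truth):
    """Intersection-over-Union and balanced accuracy for binary labels.

    Assumes both classes occur in `truth`, so no denominator is zero.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    iou = tp / (tp + fp + fn)
    balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return iou, balanced


# Toy example: three weak classifiers voting on six pixels.
members = [[1, 1, 0, 0, 1, 0],
           [1, 0, 0, 0, 0, 1],
           [1, 1, 1, 0, 0, 0]]
ensemble = majority_vote(members)
scores = iou_and_balanced_accuracy(ensemble, [1, 1, 0, 0, 0, 1])
```

Balanced accuracy averages the recall of the two classes, which matters for spill imagery where background pixels usually far outnumber oil pixels.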

2605.17201 2026-05-19 cs.CR cs.LG Version update

Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks

Filter-then-Verify: 一种多阶段图神经网络和现代BERT框架用于电子邮件网络中的社会工程检测

Barsat Khadka, Prasant Koirala, Kshitiz Neupane, Nick Rahimi

Affiliations: School of Computing Sciences and Computer Engineering, The University of Southern Mississippi(南密西西比大学计算科学与计算机工程学院)

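AI summary: This paper proposes a multi-stage framework combining Graph Neural Networks and ModernBERT to detect social engineering attacks in email networks, pairing structural anomaly detection with content verification to reach high recall and high precision.

Comments Under review at Elsevier's Computers & Security journal

Details
AI Chinese abstract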

社会工程攻击利用人类信任而非软件漏洞,使传统过滤器难以检测。我们提出一种两阶段的Filter-then-Verify框架,结合归纳图神经网络(GNN)进行结构异常检测和共注意ModernBERT模型进行内容验证。GNN识别异常的发件人-收件人模式,而BERT分析信息上下文以减少误报。使用增强的Enron数据集和现实合成活动,我们显示该框架在结构过滤中达到86%的召回率,在BERT优化后超过92%的精确度,有效检测外部攻击和内部威胁。我们的结果表明,结合结构和内容分析可以实现对电子邮件网络中多阶段社会工程攻击的实用、可扩展检测。

English abstract

Social engineering attacks exploit human trust rather than software vulnerabilities, making them difficult to detect using conventional filters. We propose a two-stage filter-then-verify framework combining inductive Graph Neural Networks (GNNs) for structural anomaly detection with a co-attention ModernBERT model for content verification. The GNN identifies anomalous sender-receiver patterns, while BERT analyzes message context to reduce false positives. Using the Enron dataset augmented with realistic synthetic campaigns, we show that the framework achieves 86% recall in structural filtering and over 92% precision after BERT refinement, effectively detecting both external attacks and insider threats. Our results demonstrate that combining structural and content analysis allows practical, scalable detection of multi-stage social engineering attacks in email networks.

2605.17197 2026-05-19 cs.LG cs.CV Version update

OPTNet: Ordering Point Transformer Network for Post-disaster 3D Semantic Segmentation

OPTNet:用于灾后3D语义分割的排序点Transformer网络

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

Affiliations: Computer Science and Engineering, Lehigh University, Bethlehem PA 18015, US(里海大学计算机科学与工程系,伯利恒 PA 18015,美国); Civil and Environmental Engineering, Lehigh University, Bethlehem PA 18015, US(里海大学土木与环境工程系,伯利恒 PA 18015,美国)

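AI summary: This paper proposes OPTNet, a network whose learnable point-ordering module dynamically predicts an optimal permutation to improve the locality of the attention mechanism, for semantic segmentation of post-disaster 3D point clouds.

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

Details
AI Chinese abstract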

灾后损害评估需要快速且准确地对3D点云进行语义分割,以识别受损的基础设施,如损坏的建筑和道路。早期的点变换(如PTv1、PTv2)依赖于计算成本高的邻居搜索(k-NN)和最远点采样(FPS)。为了提高效率,最近的架构如Point Transformer V3(PTv3)采用了静态序列化方法,如Hilbert曲线或Z-order,来组织无序点以进行基于窗口的注意力。然而,这些固定顺序并不利于捕捉灾难场景的复杂几何结构。在本文中,我们提出了OPTNet(Ordering Point Transformer Network),它引入了一个可学习的点排序模块。OPTNet利用自监督的排序损失动态预测最优排列,以最大化注意力机制的局部性。我们在3DAeroRelief数据集上评估了我们的方法,显著优于最先进的基线。

English abstract

Post-disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k-NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z-order, to organize unstructured points for window-based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self-supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. We evaluate our method on the 3DAeroRelief dataset, significantly outperforming state-of-the-art baselines.
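For reference, the static Z-order serialization that the abstract contrasts with the learnable Point Sorter can be sketched as below. Points are quantized to an integer grid and sorted by the Morton code obtained by interleaving the bits of the three grid indices. The function names and the `grid_size` parameter are illustrative, coordinates are assumed non-negative, and nothing here reproduces OPTNet's learned ordering.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative grid indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order(points, grid_size, bits=10):
    """Return point indices sorted along the Z-order curve.

    points    : list of (x, y, z) tuples with non-negative coordinates
    grid_size : edge length of one voxel used for quantization
    """
    def key(i):
        ix, iy, iz = (int(c // grid_size) for c in points[i])
        return morton_code(ix, iy, iz, bits)
    return sorted(range(len(points)), key=key)


order = z_order([(3.0, 3.0, 3.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], grid_size=1.0)
```

Consecutive entries of the resulting order are then grouped into fixed-size windows for attention. Because the curve is fixed in advance, spatially close points can still land in different windows, which is the limitation a learned ordering aims to reduce.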

2605.17180 2026-05-19 cs.LG math.OC stat.ML Version update

The Geometry of Projection Heads: Conditioning, Invariance, and Collapse

投影头的几何学:条件性、不变性与坍缩

Faris Chaudhry

Affiliations: Department of Computing, Imperial College London, United Kingdom(伦敦帝国学院计算机系,英国)

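AI summary: This paper develops a geometric theory of projection heads, modeling the head as a trainable Riemannian metric to study conditioning, invariance, and collapse in self-supervised learning, and shows how head depth affects adaptability and stability.

Comments Accepted at ICML 2026. 29 pages, 8 figures, 7 tables

Details
AI Chinese abstract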

我们通过将头建模为可训练的黎曼度量来发展投影头在自监督学习中的几何理论。我们证明线性头执行隐式的子空间白化,而非线性头适应局部度量以满足损失函数的特定拓扑约束,且头的深度经验上决定了这种能力。通过分析维度坍缩,我们证明平滑的非线性头在坍缩平衡点会自然诱导Hessian矩阵的负特征值,使其不稳定。我们通过连续跟踪训练过程中的优化几何来验证这一点,发现Swish等平滑激活函数可以生成显式的负曲率以逃离坍缩,而线性和ReLU头在连续时间梯度流中无法做到这一点,而是依赖于离散时间优化动态和BatchNorm。最后,我们从几何上表征了度量退化如何支配信息不变性之间的权衡,解释了为什么必须丢弃头。在基础模型上对比和去相关目标的评估表明,投影头起到通用几何缓冲器的作用,将语义骨干与预训练目标的刚性破坏约束解耦。

English abstract

We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.

2605.17177 2026-05-19 math.OC cs.LG math.ST stat.ML stat.TH Version update

High-dimensional Limit of SGD for Diagonal Linear Networks

SGD在对角线线性网络中的高维极限

Begoña García Malaxechebarría, Courtney Paquette, Maryam Fazel, Dmitriy Drusvyatskiy

Affiliations: University of Washington(华盛顿大学); McGill University(麦吉尔大学); Google DeepMind(谷歌DeepMind); University of Washington & Amazon Inc.(华盛顿大学与亚马逊公司); University of California, San Diego(加州大学圣地亚哥分校)

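AI summary: This paper studies SGD on diagonal linear networks in the high-dimensional regime. It derives a stochastic differential equation that approximates the SGD dynamics and a partial differential equation describing the time evolution of the iterate state and of observable statistics, and proves that under a suitable parametrization the SGD dynamics are globally well posed and converge exponentially fast to zero risk.

Comments 91 pages, 5 figures

Details
AI Chinese abstract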

理解随机梯度方法的行为是现代机器学习中的核心问题。最近的研究强调了对角线线性网络作为一种简化且具有表现力的设置,用于分析神经模型的优化和泛化特性。在本文中,我们证明在高维情况下,对角线线性网络上的随机梯度下降可以被由随机微分方程(SDE)控制的连续动力学近似,该方程显式地将漂移与梯度噪声分离。我们进一步推导了一个确定性偏微分方程,其解传播迭代状态并描述了广泛可观测统计量的时间演化,包括风险、曲率和其他最优性度量。最后,我们证明在合适的参数化下,随机动力学是全局良好的,并以高概率指数收敛到零风险,从而得到其长时间行为的完全显式非渐近描述。数值模拟验证了我们的理论发现。

English abstract

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.
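As a point of reference for the setting, the sketch below runs one-sample SGD on a diagonal linear network with weights `w = u * v` (elementwise). It assumes isotropic Gaussian features, noiseless labels, a non-negative target, and equal initialization of `u` and `v`; these choices, the step size, and the function name are illustrative assumptions and are not the parametrization or scaling analyzed in the paper.

```python
import random


def sgd_diagonal_linear(target, steps=5000, lr=0.01, init=0.5, seed=0):
    """One-sample SGD on the model y_hat = sum_i x_i * u_i * v_i.

    Returns (initial_risk, final_risk), where the risk is the population
    squared error 0.5 * ||u*v - target||^2 for x ~ N(0, I), noiseless labels.
    """
    rng = random.Random(seed)
    d = len(target)
    u = [init] * d
    v = [init] * d

    def risk():
        return 0.5 * sum((u[i] * v[i] - target[i]) ** 2 for i in range(d))

    initial = risk()
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = sum(x[i] * target[i] for i in range(d))
        residual = sum(x[i] * u[i] * v[i] for i in range(d)) - y
        for i in range(d):
            # Gradients of 0.5 * residual^2 with respect to u_i and v_i.
            grad_u = residual * x[i] * v[i]
            grad_v = residual * x[i] * u[i]
            u[i] -= lr * grad_u
            v[i] -= lr * grad_v
    return initial, risk()


initial_risk, final_risk = sgd_diagonal_linear([1.0, 0.0, 0.0, 0.5, 0.0])
```

Tracking `risk()` along the trajectory gives the kind of observable whose high-dimensional limit the SDE and PDE descriptions in the abstract are meant to capture.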

2605.17173 2026-05-19 cs.CL cs.AI cs.LG Version update

Why Do Safety Guardrails Degrade Across Languages?

为何安全护栏在不同语言中会退化?

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

Affiliations: Stanford University(斯坦福大学)

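AI summary: Using a Multi-Group Item Response Theory framework, this study separates language-agnostic safety robustness, intrinsic prompt hardness, global language processing difficulty, and prompt-specific cross-lingual safety gaps. It finds that safety degradation is not confined to low-resource languages and that cultural and conceptual mismatches also affect safety performance.

Details
AI Chinese abstract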

大型语言模型在非英语语言中表现出安全退化。标准评估依赖于禁令成功率(JSR),但将多个安全驾驶因素合并为一个,掩盖了安全失败的具体原因。我们引入了一个潜在变量模型,即多组项目反应理论(IRT)框架,将安全驾驶因素如语言无关的安全鲁棒性(θ)、内在提示难度(β)、全球语言处理难度(γ)和提示特定的跨语言安全差距(τ)分离。使用MultiJail数据集,我们评估了61种模型配置在5个闭源模型家族和10种资源各异的语言中的安全鲁棒性,汇总了190万行数据集。探索性因子分析显示安全主要是一维的:模型拒绝不同危害类型主要通过共享机制。与预期趋势相反,22种模型配置在英语中比在低资源语言中更易受攻击。低资源语言产生更多不确定响应(高熵)比高资源语言。此外,高τ提示集中在如盗窃和武器等物理危害类别和低资源语言中,趋势通过跨数据集泛化得到验证。虽然全球翻译质量与τ相关性低,但严重翻译错误驱动高偏置异常值,通过本地说话者验证。文化与概念基础不匹配也会影响τ。在预测验证中,IRT框架实现了AUC=0.940,优于更简单的基线,在预测不安全提示的安全拒绝方面表现更优。我们的框架揭示了概念-语言脆弱性,这些指标汇总后被掩盖,使公平的跨语言安全评估和目标改进数据集建设成为可能。

English abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($θ$), intrinsic prompt hardness ($β$), global language processing difficulty ($γ$), and a prompt-specific cross-lingual safety gap ($τ$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$τ$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $τ$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $τ$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
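One way to read the four latent factors is as terms of a logistic item response model. The abstract names the factors but does not give the link function, so the additive form and sign conventions below are assumptions made purely for illustration: higher $β$, $γ$, and $τ$ push toward an unsafe response, and higher $θ$ pushes toward a safe refusal.

```python
import math


def unsafe_probability(theta, beta, gamma, tau):
    """Probability of an unsafe response under an assumed additive logistic IRT.

    theta : language-agnostic safety robustness of the model
    beta  : intrinsic hardness of the prompt
    gamma : global processing difficulty of the language
    tau   : prompt-specific cross-lingual safety gap
    """
    logit = beta + gamma + tau - theta  # assumed functional form
    return 1.0 / (1.0 + math.exp(-logit))


# Same model and prompt, two hypothetical languages that differ only in tau.
baseline = unsafe_probability(theta=1.0, beta=0.2, gamma=0.3, tau=0.0)
shifted = unsafe_probability(theta=1.0, beta=0.2, gamma=0.3, tau=0.8)
```

Separating the terms this way is what lets a single aggregate such as JSR be attributed to the model, the prompt, the language, or a specific prompt-language pair.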

2605.17172 2026-05-19 cs.LG cs.AI cs.CL Version update

OpenJarvis: Personal AI, On Personal Devices

OpenJarvis: 个人AI,本地设备上

Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini

Affiliations: OpenClaw Hermes Agent PinchBench GAIA

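AI summary: This paper presents OpenJarvis, a decomposed personal AI stack that narrows the local-cloud performance gap by optimizing five primitives on-device (Intelligence, Engine, Agents, Tools & Memory, and Learning) while preserving the properties of local models.

Comments Code: https://github.com/openjarvis/openjarvis Website: https://open-jarvis.github.io/OpenJarvis/

Details
AI Chinese abstract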

个人AI堆栈,如OpenClaw和Hermes Agent,正在成为日常工作的核心,但它们几乎将每一个查询(通常涉及敏感的本地数据)都路由到云托管的前沿模型。用现有的堆栈中替换前沿模型为本地模型并不奏效:将Claude Opus 4.6换成Qwen3.5-9B,在个人AI任务如PinchBench和GAIA上会降低25-39个百分点的准确性。现有堆栈围绕特定的云模型捆绑代理提示、工具描述、内存配置和运行时设置。只有提示可以进行调优,而最先进的提示优化器只能自行关闭5个百分点的本地-云差距。这促使了分解的个人AI堆栈:一种能够暴露个体原语,可以单独或联合优化以缩小本地-云差距的堆栈。我们提出了OpenJarvis,一种将个人AI系统表示为五种原语的类型规范的架构:智能、引擎、代理、工具与记忆、学习。每个原语都是独立可编辑的字段,使堆栈能够端到端优化,并且可以针对准确性、成本和延迟进行测量。为了在不牺牲本地模型特性的情况下缩小本地-云差距,OpenJarvis引入了LLM引导的规范搜索,这是一种本地-云协作,在搜索时前沿云模型提出规范的编辑,只有非退化的编辑被接受,最终的规范在推理时完全在设备上运行。通过LLM引导的规范搜索,设备上的规范在8个基准中的4个上匹配或超过了云准确性,并且平均在最佳云基线基础上减少了3.2个百分点。它们还减少了边际API成本约800倍,并将端到端延迟减少了4倍。

English abstract

Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.
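The accept-only-non-regressing rule can be sketched as a greedy loop over a spec with one field per primitive. In the sketch, `propose_edit` stands in for the frontier cloud model and `evaluate` for a benchmark run; both callbacks, the field names, and the scoring table are hypothetical and do not reflect the OpenJarvis API.

```python
def spec_search(spec, propose_edit, evaluate, rounds):
    """Greedy search that keeps an edit only if the score does not regress."""
    best_spec = dict(spec)
    best_score = evaluate(best_spec)
    for _ in range(rounds):
        field, value = propose_edit(best_spec)   # stand-in for a cloud-model proposal
        candidate = {**best_spec, field: value}  # edit exactly one primitive
        score = evaluate(candidate)
        if score >= best_score:                  # accept non-regressing edits only
            best_spec, best_score = candidate, score
    return best_spec, best_score


# Toy stand-ins (hypothetical): a scripted proposer and a lookup-table scorer.
start = {"intelligence": "local-9b", "engine": "default", "agents": "react",
         "tools_memory": "none", "learning": "off"}
script = iter([("tools_memory", "retrieval"), ("agents", "untested"), ("learning", "on")])
points = {("agents", "react"): 3, ("tools_memory", "retrieval"): 2, ("learning", "on"): 1}


def toy_evaluate(spec):
    return sum(points.get(item, 0) for item in spec.items())


final_spec, final_score = spec_search(start, lambda s: next(script), toy_evaluate, rounds=3)
```

The cloud model is consulted only inside the loop; the returned spec contains no reference to it, which matches the abstract's point that the resulting spec runs entirely on-device at inference time.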

2605.17170 2026-05-19 cs.LG Version update

TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

TriAxialKV: 向极低精度KV缓存量化迈进以应对代理推理任务

Hanzhang Shen, Haoran Wu, Yiren Zhao, Robert Mullins

Affiliations: University of Cambridge(剑桥大学); Imperial College London(伦敦帝国学院)

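AI summary: This paper proposes TriAxialKV, a mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget to improve the efficiency and throughput of agentic inference tasks.

Details
AI Chinese abstract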

代理工作负载已成为LLM推理中的主要工作负载。它们与仅聊天的工作负载有显著不同,要求长上下文处理、处理多模态输入以及支持结构化的多轮交互和工具调用能力。因此,其上下文表现出结构,可以沿三个关键轴携带不同的重要性:时间最近性、模态(如文本或图像标记)以及语义角色(如用户查询、工具调用、观察或推理)。这些轴捕捉了不同的标记行为,并导致不同的对KV缓存压缩的敏感性。然而,现有的KV缓存量化方法通常是同质的或仅在单一维度上利用异质性,如时间接近性或模态,忽略了它们之间的相互作用。为此,我们引入TriAxialKV,一种新的混合精度KV缓存量化方案,为每个token分配三轴标签,校准每种标签的敏感性,并在固定内存预算下分配INT2/INT4位宽。我们实现了TriAxialKV作为端到端的服务系统,包括校准、混合精度量化和内存管理,并定制了融合的Triton解码内核。当使用Qwen3-VL-32B-Thinking作为计算机使用代理操作OSWorld时,TriAxialKV在BF16 KV缓存的准确性与SGLang相当,同时支持4.5倍的KV缓存大小,并在真实GPU系统上实现了30%更高的端到端吞吐量。

English abstract

Agentic workloads have emerged as a major workload for LLM inference. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit only heterogeneity on a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget. We implement TriAxialKV as an end-to-end serving system, comprising calibration, mixed-precision quantization and memory management, and custom fused Triton decode kernels. When using Qwen3-VL-32B-Thinking as a computer-use agent operating the OSWorld, TriAxialKV matches the accuracy of SGLang with BF16 KV cache while supporting 4.5$\times$ KV cache size and achieving 30% higher end-to-end throughput, when running on real GPU systems.
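The budgeted INT2/INT4 assignment can be illustrated with a simple greedy rule: start every tag at INT2 and upgrade tags to INT4 in order of decreasing sensitivity while the budget allows. The abstract does not describe the actual allocator, so the greedy heuristic, the tag tuples, the sensitivity values, and the budget unit (bits per cached element, summed over tokens) are all assumptions.

```python
def allocate_bits(tag_tokens, tag_sensitivity, budget):
    """Assign 2 or 4 bits to each tag under a total budget.

    tag_tokens      : {tag: number of tokens carrying that tag}
    tag_sensitivity : {tag: calibrated sensitivity, higher = more fragile}
    budget          : total allowed sum of (bits * tokens)
    """
    bits = {tag: 2 for tag in tag_tokens}
    used = sum(2 * n for n in tag_tokens.values())
    if used > budget:
        raise ValueError("budget is below the all-INT2 footprint")
    # Upgrade the most sensitive tags first (greedy stand-in for the real allocator).
    for tag in sorted(tag_tokens, key=lambda t: tag_sensitivity[t], reverse=True):
        extra = 2 * tag_tokens[tag]
        if used + extra <= budget:
            bits[tag] = 4
            used += extra
    return bits, used


# Hypothetical tags: (recency, modality, semantic role).
tokens = {("recent", "text", "user_query"): 100,
          ("recent", "text", "tool_call"): 200,
          ("old", "image", "observation"): 1000}
sensitivity = {("recent", "text", "user_query"): 0.9,
               ("recent", "text", "tool_call"): 0.5,
               ("old", "image", "observation"): 0.1}
assignment, used_budget = allocate_bits(tokens, sensitivity, budget=3000)
```

In this toy run only the recent user-query tokens are kept at INT4; the large block of old image observations stays at INT2, which is where most of the memory saving comes from.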

2605.17165 2026-05-19 cs.CV cs.LG Version update

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

视频JEPA中的因子化潜在动态:辅助目标的实证研究

Santosh Premi

Affiliations: Adhikari(阿迪卡里)

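AI summary: This empirical study of auxiliary objectives in Video-JEPA compares multiple objective variants and finds that gains in one capability often come with losses in another. In the mixed-dataset setting, the latent-factorization objective FWM-HW-LD improves performance on ImageNet-100 and SSv2.

Details
AI Chinese abstract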

联合嵌入预测架构(JEPA)是自监督视频表示学习的一个有前景的框架,但小型规模的视频JEPA训练中辅助目标的行为尚未得到充分表征。我们报告了在两个预训练阶段(单一数据集(UCF-101)和混合数据集(UCF-101 + Something-Something V2 + ImageNet-100))下对18种辅助目标变体进行的小规模实证研究。我们评估了冻结表示在三个互补基准上的表现:Diving-48(细粒度运动)、SomethingSomething V2(时间推理)和ImageNet-100(外观)。我们的实验表明,许多辅助目标表现出能力取舍:在一种下游能力上的收益往往伴随着另一种能力的退化。我们随后研究了FWM-HW-LD(带有硬区域加权的因子化世界模型与潜在动态),这是一种训练时的目标,将潜在表示分为外观和动态子空间,并对JEPA预测误差和潜在动态误差应用硬区域加权。在我们的混合数据集设置中,FWM-HW-LD相比参考基线在ImageNet-100上提高了+5.92个百分点,在SSv2上提高了+3.21个百分点,同时在Diving-48上保持在0.30个百分点以内。这些结果表明,潜在因子化是研究视频JEPA中辅助目标取舍的有效方向。

English abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), SomethingSomething V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

2605.17162 2026-05-19 cs.AI cs.LG Version update

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

从模仿到交互:利用浅层强化学习掌握斯纳普森游戏

Ján Klačan, Sizhong Zhang

Affiliations: Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

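AI summary: This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge RdeepBot, a strong search-based baseline that uses Monte Carlo sampling and lookahead search. Following a progressively more complex experimental design, it first evaluates a supervised learning agent (MLPBot) trained on replay data, then a reinforcement learning agent (RLBot) trained through asynchronous Monte Carlo updates and experience replay. Supervised imitation is not enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces stronger agents. In the setting that focuses on RdeepBot's depth parameter, the best performance comes from combining the learned value function with deeper lookahead during gameplay, giving RLBot statistically significantly higher win rates against the strongest RdeepBot baseline. In the sample-based setting the gains are more conditional: the strongest performance appears at a relatively low training num_samples value rather than increasing uniformly with stronger sampling.

Comments 17 pages, 8 figures

Details
AI Chinese abstract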

本文研究浅层神经网络代理是否能掌握纸牌游戏斯纳普森,并挑战使用蒙特卡洛采样和前瞻搜索的强搜索基线RdeepBot。通过逐步更复杂的实验设计,首先评估了基于回放数据训练的监督学习代理(MLPBot)以及通过异步蒙特卡洛更新和经验回放训练的强化学习代理(RLBot)。结果表明,监督模仿不足以击败强RdeepBot对手,而强化学习产生了更强的代理。在聚焦RdeepBot深度参数的设置中,最佳性能是在学习的价值函数与游戏过程中更深层次的前瞻搜索结合时实现的,使RLBot在最强的RdeepBot基线下实现了统计显著更高的胜率。在基于样本的设置中,收益更具条件性:最强性能出现在相对较低的训练num_samples参数下,而不是随着更强采样均匀增加。

English abstract

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significant higher winning rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively lower training num_samples parameter rather than increasing uniformly with stronger sampling.

2605.17160 2026-05-19 cs.LG cs.AI cs.CV Version update

When Bits Break Recourse: Counterfactual-Faithful Quantization

当比特破坏追索:反事实忠实量化

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi

Affiliations: Mohammed First University(穆罕默德第一大学)

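AI summary: This paper studies how quantization affects counterfactual recourse. It proposes Counterfactual-Faithful Quantization (CFQ), defines two metrics, Validity Drop and Counterfactual Recourse Gap, to measure the effect, and shows on several datasets that CFQ maintains accuracy while improving recourse stability.

Comments 57 pages, 32 tables, 26 figures

Details
AI Chinese abstract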

量化可以在低比特部署下保持预测准确性,但会无声地破坏算法可逆性:一个在量化前可以执行的操作在量化后可能失效,或变得显著更昂贵。我们通过有效性、成本和方向稳定性来形式化量化下的反事实敏感性,并引入两个指标:有效性下降(VD)和反事实可逆差距(CRG),以揭示准确性无法检测到的可逆失败。我们提出反事实忠实量化(CFQ),通过训练量化参数和混合精度位分配,在全局位预算下强制在教师可逆点上保持目标结果,以保留反事实行为。基于边界的分析给出了在受限制的量化扰动下可逆转移的充分条件。在Adult、德国信贷和COMPAS数据集上的实验表明,与准确性匹配的基线相比,CFQ在保持准确性的同时显著提高了VD和CRG。

English abstract

Quantization can preserve predictive accuracy under low-bit deployment while silently breaking algorithmic recourse: an actionable change that flips a decision before quantization may fail after quantization, or become substantially more costly. We formalize counterfactual sensitivity under quantization through validity, cost, and direction stability, and introduce two metrics: Validity Drop (VD) and Counterfactual Recourse Gap (CRG) that reveal recourse failures invisible to accuracy. We propose Counterfactual-Faithful Quantization (CFQ), which trains quantizer parameters and mixed-precision bit allocation to preserve counterfactual behavior by enforcing the target outcome at teacher recourse points under a global bit budget. A margin-based analysis gives a sufficient condition for recourse transfer under bounded quantization perturbations. Experiments on Adult, German Credit, and COMPAS show that accuracy-matched baselines can significantly degrade recourse stability, while CFQ maintains accuracy and substantially improves VD and CRG across bit budgets.

2605.17153 2026-05-19 cs.LG cs.LO math.OC Version update

Stress-Testing Neural Network Verifiers with Provably Robust Instances

用可证明稳健实例压力测试神经网络验证器

David Troxell, Yulia Alexandr, Sofia Hunt, Stephanie Lei, Guido Montúfar

Affiliations: Department of Statistics & Data Science, University of California, Los Angeles(加州大学洛杉矶分校统计与数据科学系); Department of Mathematics, University of California, Los Angeles(加州大学洛杉矶分校数学系); Max Planck Institute for Mathematics in the Sciences, Leipzig(马克斯·普朗克科学数学研究所,莱比锡)

AI总结 本文提出了一种生成具有已知真实稳健标签的验证实例的框架,揭示了现有验证器的数值容忍度问题和实现错误,并引入了验证难度轮廓以系统研究验证器失败模式,评估了五种最先进的验证器并展示了不同实例对验证流程不同方面的压力测试。

Details
Abstract

Neural network verifiers aim to provide formal guarantees on model behavior, but existing verification benchmarks are fundamentally limited by their lack of ground-truth labels. As a result, verifier evaluation relies on indirect heuristics, which prevents exact scoring and systematic study of verifier failure modes. We address this gap by introducing a reusable framework for generating verification instances whose ground-truth robustness labels are known a priori through analytic construction. Our framework led to the discovery of multiple numeric tolerance concerns and an implementation bug in popular verifiers, highlighting the need for ground-truth labels. Additionally, to systematically study verifier failure modes, we introduce the verification Difficulty Profile, a collection of estimable quantities capturing distinct sources of instance hardness. Using our framework and these profiles, we evaluate five state-of-the-art verifiers and show that different instances stress distinct aspects of the verification pipeline. We show that these results can aid the future development of verifiers as they provide actionable targets for improving numerical reliability, relaxation quality, and search behavior. Our code is publicly available: https://github.com/dtroxell19/VeriStressGT.git.

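2605.17151 2026-05-19 cs.LG Updated version

An Analytical Multiple Criteria Framework for Temporal and Dynamic Business-to-Business Customer Segmentation in Manufacturing

Muhammad Raees, Konstantinos Papangelis, Vassilis Javed Khan

Affiliations * Rochester Institute of Technology; Independent Researcher

AI summary This paper proposes a dynamic multi-criteria decision-making (MCDM) method that extends the RFM model with stability and growth dimensions, integrates an adaptive analytical hierarchical process, and evaluates multivariate time-series clustering models to make B2B customer segmentation in manufacturing more robust.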

Details
Abstract

In sales and marketing, customer segmentation is an important tool for formulating strategies for customer treatment and supply chain management. Most segmentation implementations rely on limited criteria, such as recency, frequency, and monetary (RFM) modeling, which often fail to capture complex business interactions. In this work, we design and evaluate a dynamic multi-criteria decision-making (MCDM) method in a business-to-business (B2B) manufacturing context by 1) extending RFM to dimensions of stability and growth, 2) integrating an adaptive and analytical hierarchical process to match business objectives, and 3) evaluating multivariate time-series clustering models. We then measure customer stability, tracking between-segment transitions, and volatility over time, and apply a graph-based consensus model to further strengthen the analysis. We test the efficacy of the proposed method using a real-world manufacturing company dataset to segment more than 3,000 B2B customers, showing strong robustness to temporal shifts. The implementation enables domain experts with preferential analytics to devise their strategies, providing effective decision support for B2B customer segmentation.

2605.17148 2026-05-19 cs.NE cs.AI cs.LG Updated version

Evolutionary Extreme Learning Machine of ab-initio Energy Landscapes for Crystal Structure Prediction using Manta Ray Optimization with Levy Flight

Adrian Rubio-Solis

Affiliations * Hamlyn Centre for Robotic Surgery, Imperial College London (ICL)

AI summary This paper proposes an improved Manta Ray Foraging Optimization (MRFO) algorithm with Levy Flight for training extreme learning machines (ELMs), applied to predicting the unrelaxed and relaxed formation energies of compounds in binary systems relative to the ground-state crystal structures of the pure components.

Comments 8 pages, 4 figures

Details
Abstract

The Manta Ray Foraging Optimization algorithm (MRFO) has proven to be a powerful heuristic strategy in the optimal solution of a large number of engineering problems. In this paper, an improvement of MRFO with Levy Flight is suggested for the training of extreme learning machines (ELMs) whose basic model is a Single Layer Feedforward Network (SLFN). The proposed methodology that we called Evolutionary EELM-MRFO-LF for short is implemented to the prediction of unrelaxed and relaxed formation energy compounds relative to ground state crystal structure of pure components in binary systems. EELM-MRFO-LF follows the learning procedure of traditional Evolutionary ELMs in which first MRFO with LF is used to select the input weights and Moore-Penrose (MP) generalized inverse is applied to analytically determine the output weights. Levy Flight trajectory is implemented for increasing the diversity of the population of ELMs against premature convergence and the ability of avoiding getting trapped in a local optima. The performance of the suggested EELM-MRFO-LF is compared with other well-known nature-inspired algorithms under similar conditions.
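The abstract does not say how the Levy Flight steps are drawn. A common choice in nature-inspired optimizers is Mantegna's algorithm, so the following is only a minimal sketch under that assumption. The names `mantegna_sigma`, `levy_step`, and `perturb`, and the `beta`, `scale`, and `seed` defaults, are hypothetical and not taken from the paper.

```python
import math
import random


def mantegna_sigma(beta: float) -> float:
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta: float, rng: random.Random) -> float:
    """Draw one heavy-tailed step as u / |v|^(1/beta)."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def perturb(weights, beta=1.5, scale=0.01, seed=0):
    """Add an independent Levy step to each candidate input weight.

    Assumption: in an evolutionary ELM, each population member encodes
    the hidden-layer input weights, and occasional long jumps help keep
    the population diverse.
    """
    rng = random.Random(seed)
    return [w + scale * levy_step(beta, rng) for w in weights]
```

Most steps are small, but the heavy tail occasionally produces a long jump, which is the property the abstract relies on to counter premature convergence. The output weights would still be obtained analytically with the Moore-Penrose generalized inverse, as the abstract states.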

2605.17146 2026-05-19 cs.CE cs.LG cs.SY eess.SY Updated version

Weighted Flow Matching and Physics-Informed Nonlinear Filtering for Parameter Estimation in Digital Twins

Yasar Yanik, Himadri Basu, Ricardo G. Sanfelice, Daniele Venturi

Affiliations * Department of Applied Mathematics, University of California, Santa Cruz, CA 95064, USA; Department of Electrical and Computer Engineering, University of California, Santa Cruz, CA 95064, USA

AI summary This paper proposes a framework that combines Weighted Flow Matching (WFM) with physics-informed nonlinear filtering to improve parameter estimation in digital twins. In a spacecraft digital-twin architecture it achieves stable moment-of-inertia estimation under uncertain and noisy sensing.

Comments 14 pages, 5 figures

Details
Abstract

Digital twins (DTs) rely on continuous synchronization between physical systems and their virtual counterparts through online parameter estimation under uncertainty. In many practical settings, however, this task is challenged by low observability, weak excitation, nonlinear dynamics, and noisy or biased measurements. In this work, we develop a new mathematical framework that integrates Weighted Flow Matching (WFM) generative modeling with physics-informed nonlinear filtering to enhance parameter estimation in DTs. WFM relies on dynamic reweighting of training samples, which guides the generative model toward parameter regimes most informative of the evolving system state. This generative component is tightly coupled with a physics-informed filtering architecture based on the Unscented Kalman Filter (UKF), yielding a unified DT framework that combines data-driven probability transport with physically consistent state and parameter estimation. The effectiveness of the new integrated framework is demonstrated within a spacecraft DT architecture, where stable moment of inertia estimation is achieved under uncertain and noisy sensing, with significant performance improvements over established approaches such as Extended Kalman Filtering (EKF) and Ensemble Kalman Filtering (EnKF). These results highlight the potential of weighted generative modeling as a core mechanism for real-time DT synchronization in operational and mission-critical systems.

2605.17144 2026-05-19 cs.RO cs.AI cs.LG Updated version

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar

Affiliations * University of Pennsylvania

AI summary This paper proposes COAST, which improves Vision-Language-Action (VLA) models on robotic tasks by identifying success-critical subspaces. It uses conceptor projections to steer the model's latents toward its own success distribution, raising task success rates.

Comments Submitted to NeurIPS 2026

Details
Abstract

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20 and 40% respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
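For readers unfamiliar with conceptors, the usual definition from the conceptor literature can be written down in a few lines. This is a sketch of the standard construction only. The symbols $X$, $R$, $\alpha$, and $h$ are notation chosen here, and the abstract does not specify COAST's contrastive combination of success and failure conceptors or its exact steering rule.

```latex
% X \in \mathbb{R}^{n \times d}: n activation vectors collected from rollouts (rows)
% \alpha > 0: the "aperture", controlling how soft the projection is
\begin{align}
  R &= \tfrac{1}{n} X^{\top} X
      && \text{(activation correlation matrix)} \\
  C &= R \left( R + \alpha^{-2} I \right)^{-1}
      && \text{(conceptor)} \\
  R &= U \,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\, U^{\top}
      \;\Longrightarrow\;
      C = U \,\mathrm{diag}(s_1, \dots, s_d)\, U^{\top},
      \quad s_i = \frac{\sigma_i}{\sigma_i + \alpha^{-2}} \in [0, 1) \\
  h' &= C h
      && \text{(soft projection of a hidden state } h\text{)}
\end{align}
```

Directions with large variance in the collected activations get $s_i$ close to 1 and pass through almost unchanged, while low-variance directions are damped toward 0. That is the sense in which a conceptor "soft-projects" data onto the principal components of a target distribution.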

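2605.17125 2026-05-19 cs.CV cs.LG Updated version

Principal Component Analysis for Lunar Crater Detection

Travis Driver, John A. Christian

Affiliations * School of Aerospace Engineering, Georgia Institute of Technology

AI summary This paper proposes an automated crater template generation method based on principal component analysis for image-based crater identification, and demonstrates better detection and position estimation than hand-picked templates on simulated lunar imagery.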

Details
Abstract

Optical navigation is a critical component for lunar orbiter and lander missions. Image-based crater identification has emerged as a promising technology for optical navigation due to the abundance of craters on the lunar surface and the availability of extensive crater catalogs. Moreover, due to the relative morphological homogeneity among lunar craters, template matching has been identified as a promising approach for identification. In this paper, we propose EigenCrater, an automated crater template generation method based on principal component analysis of crater digital elevation maps (DEMs). We demonstrate superior detection and position estimation performance relative to hand-picked templates on simulated lunar imagery.

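2605.17118 2026-05-19 cs.LG stat.CO stat.ML Updated version

Differentiable Optimization Layers for Guaranteed Fairness in Deep Learning

David Troxell, Noah Roemer, Guido Montúfar

Affiliations * Department of Statistics & Data Science, University of California, Los Angeles, USA; Department of Mathematics, University of California, Los Angeles, USA; Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany

AI summary This paper proposes a differentiable optimization layer, called a "fairness layer", that guarantees a chosen notion of output parity when integrated into a neural network. It also introduces an online primal-dual inference algorithm that provides provable aggregate fairness guarantees for streaming predictions, even with arbitrarily small batch sizes.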

Comments To be published in International Conference on Machine Learning (ICML), 2026

Details
Abstract

Differentiable optimization layers are traditionally integrated in predict-then-optimize frameworks where a neural model estimates parameters that subsequently serve as fixed inputs to downstream decision-making optimization problems. In this work, we introduce the concept of a "fairness layer": a differentiable optimization layer appended to a model's output layer that guarantees a chosen notion of output parity is satisfied when integrated into a neural network. Additionally, we introduce an online primal-dual inference algorithm that provides provable aggregate fairness guarantees for streaming predictions with arbitrarily small batch sizes, where traditional per-batch constraints become overly restrictive. Numerical experiments demonstrate the effectiveness of the fairness layer and associated algorithm, and theoretical analysis characterizes the layer's differentiability and stability properties during model training and backpropagation. Our code for these experiments is publicly available on GitHub (https://github.com/dtroxell19/FairDL-ICML-2026.git) and our public Python package documentation can be found online: https://dtroxell19.github.io/fairness_training/.

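2605.17108 2026-05-19 cs.LG Updated version

Parallel Recursive LSTM

Tristan Gaudreault, Yongyi Mao

Affiliations * School of Electrical Engineering and Computer Science; University of Ottawa

AI summary This paper proposes the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition. This reduces recurrent parallel depth from linear to logarithmic while retaining nonlinear gated state representations, and yields strong sequence-length generalization on formal-language benchmarks.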

Comments 13 pages, 5 figures. Code available at https://github.com/tristangaudreault/pr-lstm

Details
Abstract

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.
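The balanced-tree schedule described in the abstract can be illustrated with a small sketch. It assumes adjacent states are merged pairwise level by level and that an unpaired state at the end of a level is carried up unchanged, which is an assumption made here and not a detail from the abstract. `tree_reduce` and `toy_merge` are hypothetical names, and `toy_merge` is only a stand-in for the learned gated composition block.

```python
import math
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def tree_reduce(states: List[S], merge: Callable[[S, S], S]) -> Tuple[S, int]:
    """Merge adjacent states pairwise until one remains.

    Returns the root state and the number of levels, i.e. the sequential
    depth when every merge within a level runs in parallel.
    """
    if not states:
        raise ValueError("need at least one state")
    level = list(states)
    depth = 0
    while len(level) > 1:
        nxt = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # unpaired last state is carried up unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def toy_merge(left: float, right: float) -> float:
    """Nonlinear, non-associative stand-in for a learned gated merge."""
    return math.tanh(left + 2.0 * right)
```

Because the tree shape is fixed by the sequence length, `merge` does not have to be associative. The schedule alone yields the logarithmic depth, so 8 tokens need 3 levels where a left-to-right recurrence needs 8 sequential steps.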

2605.17107 2026-05-19 stat.ML cs.LG math.OC math.PR Updated version

Diffusion-Based Stochastic Operator Networks for Uncertainty Quantification in Stochastic Partial Differential Equations

Phuoc-Toan Huynh, Richard Archibald, Feng Bao

Affiliations * Department of Mathematics, Florida State University; Computer Science and Mathematics Division, Oak Ridge National Laboratory

AI summary This paper proposes the Stochastic Operator Network (SON), a framework for uncertainty quantification of solution operators of stochastic partial differential equations (SPDEs). SON combines the structure of the Deep Operator Network (DeepONet) with Stochastic Neural Networks (SNNs), learns directly from noisy data, and outputs both a mean solution field and a quantification of uncertainty. It is trained by minimizing a Hamiltonian-type loss using the Stochastic Maximum Principle, and experiments on benchmark SPDEs under multiple uncertainty sources demonstrate its accuracy and robustness.

Details
Abstract

We introduce a novel framework for uncertainty quantification of solution operators associated with stochastic partial differential equations (SPDEs). Although SPDEs play a central role in modeling complex physical systems under uncertainty, their practical use typically requires specifying the magnitude and structure of model uncertainties that are often unknown and difficult to infer from noisy measurements. To address this challenge, we develop a stochastic operator-learning framework that learns directly from noisy data and outputs both a mean solution field and a quantification of uncertainty. The proposed method, namely the Stochastic Operator Network (SON), is constructed by combining the structure of the Deep Operator Network (DeepONet) with Stochastic Neural Networks (SNNs) to model stochasticity and enable probabilistic prediction. The training procedure is carried out by minimizing a Hamiltonian-type loss and optimizing the resulting objective using the Stochastic Maximum Principle. Numerical experiments on benchmark SPDEs under multiple uncertainty sources demonstrate the accuracy and robustness of the proposed method in capturing solution structure and quantifying predictive uncertainty.

2605.17095 2026-05-19 cs.CV cs.AI cs.LG Updated version

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

Affiliations * Rochester Institute of Technology; Rochester Police Department; Office of Business Intelligence; School of Mathematics and Statistics

AI summary This paper presents a method that processes body-worn camera (BWC) video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled under a privacy-conscious protocol, to make incident review and officer training workflows more efficient.

Comments 13 pages, 10 figures, 9 tables

Details
Abstract

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this remains operationally opaque. That is, analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach to process BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled with two dimensions of information: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows for which insufficient evidence exists due to darkness, blur or occlusion. We train models to classify windows based on these two axes using frames sampled from each window encoded using CLIP model and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also included integrity audits to show the results and how the visual timeline representations support faster incident review and make the officer training workflow more practical.
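The windowing step can be sketched as follows. The abstract fixes the window length at 10 seconds but does not say how a trailing partial window is handled, so keeping it as a shorter final window is an assumption here. `fixed_windows` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fixed_windows(duration_s: float, window_s: float = 10.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows.

    Assumption: a trailing remainder shorter than window_s is kept as a
    final, shorter window instead of being dropped or padded.
    """
    if duration_s < 0 or window_s <= 0:
        raise ValueError("duration must be >= 0 and window length > 0")
    count = math.ceil(duration_s / window_s)
    # Start times come from the index, which avoids accumulating float error.
    return [
        (i * window_s, min((i + 1) * window_s, duration_s))
        for i in range(count)
    ]
```

Each resulting window would then receive its two labels, operational context and motion intensity, with the low-evidence label reserved for windows that are too dark, blurred, or occluded.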

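2605.17091 2026-05-19 cs.LG Updated version

Mechanism Learning: Prototype-Anchored Mechanism Inference for Scientific Forecasting

Qian Jiang, Liping Sun

Affiliations * School of Computing; The Australian National University; iHuman Institute

AI summary This paper proposes mechanism learning, a framework that forecasts future states by estimating the currently active local mechanism. It compresses local spatiotemporal fragments into mechanism descriptors and uses prototype anchors to ground a data-driven mechanism space, improving robustness and stability in scientific forecasting.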

Details
Abstract

Scientific forecasting typically relies on direct state prediction, an approach that grows brittle under data scarcity, extended horizons, non-stationary dynamics, or high-dimensional complexity. While raw state trajectories are highly sensitive in these regimes, underlying local evolution rules often exhibit robust reusability. We introduce mechanism learning, a framework that forecasts future states by estimating the currently active local mechanism. Our method compresses local spatiotemporal fragments into mechanism descriptors, forming a data-driven, structured mechanism space where proximity reflects similar local evolution rules. To ground these estimates in observed data, we utilize prototype anchors, a set of representative mechanisms that sparsely cover the space of local rules. We evaluate this approach on Burgers dynamics, WeatherBench2, and Lorenz96. Empirically, the learned mechanism spaces resist collapse and maintain strong local consistency. Compared to direct prediction and other models including FNO, NODE, LSTM, and reservoir-family methods, our framework demonstrates predictive gains in fragile regimes: it significantly improves switching stability in Burgers dynamics and achieves state-of-the-art performance both under the scarce-data fixed-horizon WeatherBench2 protocol and in intermediate-complexity Lorenz96. Ablation studies and drift diagnostics confirm that these improvements are driven by finite prototype anchoring rather than sheer latent capacity. Together, these results establish mechanism learning as a principled, robust alternative to direct state prediction in forecasting complex systems.

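2605.17085 2026-05-19 cs.SD cs.LG eess.AS Updated version

Taming Audio VAEs via Target-KL Regularization

Prem Seetharaman, Rithesh Kumar

Affiliations * Adobe Research

AI summary This paper trains audio VAEs at specific bitrates via target-KL regularization to study the trade-off between over-regularized and under-regularized latent representations in audio generation, and constructs rate-distortion curves for audio VAEs.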

Comments Accepted at ICASSP 2026 (Barcelona, Spain, 3-8 May 2026). 5 pages, 1 figure, 3 tables

Details
Journal ref
Proc. ICASSP 2026
Abstract

Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low frame rate continuous representation that is conducive for downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.

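2605.17084 2026-05-19 cs.LG cs.CL Updated version

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Weilun Xu

Affiliations * School of Computer and Communication Sciences; École Polytechnique Fédérale de Lausanne

AI summary This study asks whether representation geometry in language models is organized for prediction. Using the Subspace PGA metric, it finds that the degree of organization depends on scale: small models progressively lose it at late layers during training, while large models preserve it throughout.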

Details
Abstract

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.

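2605.17058 2026-05-19 cs.LG Updated version

Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen

Affiliations * Department of Electrical Engineering and Automation, Aalto University, Finland

AI summary This paper proposes a model-based hierarchical framework for sequential stochastic combinatorial decision-making. A multi-timescale objective structures the latent dynamics to enable efficient lookahead planning, and a subgoal-conditioned budget policy is learned jointly with the world model to support context-aware resource allocation.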

Comments 34 pages, 8 figures, 23 tables

Details
Abstract

The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.

2605.17045 2026-05-19 eess.SY cs.LG cs.SY Updated version

Empirical evaluation of Time Series Foundation Models for Day-ahead and Imbalance Electricity Price Forecasting in Belgium

Chi Bui, Maria Margarida Mascarenhas, Arnaud Verstraeten, Hussain Kazmi

Affiliations * ELECTA-ESAT (KU Leuven); Gridual Leuven, Belgium; Leuven, Belgium; KU Leuven

AI summary This paper evaluates Time Series Foundation Models (TSFMs) for forecasting Belgian day-ahead and imbalance electricity prices. Chronos-2 in ARX mode performs best: its MAE is 5% lower than the best ensemble of other machine learning methods in the day-ahead market, but 10% higher for imbalance prices at all horizons except two hours ahead. The TSFMs nonetheless show genuine zero-shot forecasting skill.

Details
Abstract

Recent advances in Time Series Foundation Models (TSFMs) promise zero-shot forecasting capabilities with minimal task-specific training. While these models have shown strong performance across generic benchmarks, their applicability in volatile, complex electricity markets remains underexplored. Addressing this gap, this study provides a systematic empirical evaluation of several TSFMs, specifically Chronos-2 and Chronos-Bolt (developed by Amazon), and TimesFM 2.5 (provided by Google), for forecasting Belgian day-ahead and imbalance electricity prices. For both considered markets, Chronos-2 in ARX mode produces the most accurate forecasts. Compared with the best ensemble prediction from other machine learning methods, Chronos-2's Mean Absolute Error (MAE) is 5% lower for the day-ahead market. In contrast, the model yields 10% higher MAE predicting imbalance prices across all forecast horizons, except for the two-hour-ahead horizon. Moreover, we find that TSFMs exhibit genuine zero-shot forecasting skills but still struggle under extreme market conditions.
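The "5% lower" and "10% higher" figures are relative differences in Mean Absolute Error. A minimal sketch of that arithmetic follows. The helper names `mae` and `relative_change` are hypothetical, and any inputs passed to them in an example would be made-up toy values, not results from the study.

```python
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error between two equally long series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_change(candidate_mae: float, baseline_mae: float) -> float:
    """Signed relative difference; -0.05 means the candidate is 5% lower."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (candidate_mae - baseline_mae) / baseline_mae
```

Under this convention the day-ahead result in the abstract corresponds to a value of about -0.05 for Chronos-2 against the best ensemble baseline, and the imbalance result to about +0.10.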

2605.17039 2026-05-19 cs.LG cs.CE 版本更新

Privacy-Preserving Generation Fraud Detection for Distributed Photovoltaic Systems: A Solar Irradiance-Fused Federated Learning Framework

隐私保护的分布式光伏系统发电欺诈检测:一种融合太阳能辐照度的联邦学习框架

Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang

Affiliations * School of Computer Science and Technology, University of Electronic Science and Technology of China Department of Data Science and AI, Faculty of IT, Monash University Monash Energy Institute, Monash University Shenzhen Institute for Advanced Study of UESTC

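AI summary This paper develops a privacy-preserving, federated-learning framework for detecting generation fraud in distributed photovoltaic systems. Each community's local model fuses PV generation and weather data through a co-attention mechanism to detect discrepancies, and prototype alignment addresses class imbalance. Experiments on a real-world residential PV dataset validate the method.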

Comments 15 pages

详情
Journal ref
IEEE Transactions on Smart Grid, 2026
AI中文摘要

住宅光伏(PV)系统的广泛应用引入了新的发电欺诈检测(FD)挑战。与传统电力盗窃检测不同,光伏发电欺诈检测(PVG-FD)因光伏发电的固有间歇性和不确定性而更加复杂。由于可扩展性和隐私问题,分布式光伏系统的集中式PVG-FD方法面临进一步挑战。本文开发了一种基于联邦学习(FL)的隐私保护分布式PVG-FD框架。在此框架中,电力公司管理多个家庭社区,每个社区都配备有本地检测器。该框架集成了新颖的检测模型架构与隐私保护的全局协作。每个社区的本地模型通过共注意机制融合光伏发电和天气数据以检测对PVG-FD至关重要的异常。FL框架通过聚合模型参数和原型实现跨社区协作,利用全局知识共享与本地细化,同时保护隐私。它还使用原型对齐来解决类别不平衡问题,通过增强欺诈样本的表示。在真实世界住宅PV数据集上的广泛实验验证了所开发方法的有效性,并证明其在各种场景中优于最先进的FL方法。结果还显示了其在不同社区规模下的可扩展性和对类别不平衡的强鲁棒性。

英文摘要

The wide adoption of residential photovoltaic (PV) systems introduces new challenges for generation fraud detection (FD). Unlike traditional electricity theft detection, which focuses on electricity consumption-side behavior, PV generation fraud detection (PVG-FD) is complicated by the inherent intermittency and uncertainty of PV generation. The distributed nature of PV systems poses further challenges for centralized PVG-FD approaches due to scalability and privacy concerns. This paper develops a privacy-preserving distributed PVG-FD framework based on federated learning (FL). In this framework, a utility company manages multiple household communities, each of which is equipped with a local detector. The framework integrates a novel detection model architecture with privacy-preserving global collaboration. Each community's local model fuses PV generation and weather data via a co-attention mechanism to detect discrepancies critical for PVG-FD. The FL framework enables cross-community collaboration by aggregating model parameters and prototypes, leveraging global knowledge sharing with local refinement while preserving privacy. It also uses prototype alignment to address class imbalance by enhancing fraud sample representation. Extensive experiments on a real-world residential PV dataset validate the effectiveness of the developed method and demonstrate that it outperforms state-of-the-art FL methods across various scenarios. The results also show its scalability across varying community sizes and strong robustness to class imbalance.

2605.17037 2026-05-19 cs.LG cs.AI cs.CL 版本更新

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D$^2$Evo: 双重难度感知的自进化方法用于数据高效的强化学习

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

发表机构 * Zhejiang University(浙江大学) AMAP, Alibaba Group(AMAP,阿里巴巴集团)

AI总结 本文提出D$^2$Evo方法,通过双重难度感知的自进化机制,解决强化学习中有效数据稀缺和动态难度变化的问题,从而在数学推理基准上以少于2K真实数学样本实现优于现有方法的性能。

Comments Accepted by ICML 2026. First two authors contributed equally

详情
AI中文摘要

强化学习(RL)在增强大型语言模型(LLMs)推理能力方面展现出潜力。然而,需要中等难度训练样本的有效RL训练面临两个根本性挑战:有效数据稀缺和动态难度变化,其中中等难度样本稀缺且随着模型提升变得简单。现有方法在一定程度上缓解了这种稀缺性,通过生成训练样本。然而,这些方法存在无锚点生成、忽略共进化和难度不匹配的问题。为了解决这些问题,我们提出了D$^2$Evo,一种双重难度感知的自进化RL框架。在每次迭代中,我们的方法基于当前求解器的能力挖掘中等难度锚点,训练提问者生成不同难度层级的多样化问题,并共同优化两个组件以实现渐进式的推理提升。广泛实验表明,D$^2$Evo在数学推理基准上以少于2K真实数学样本优于现有方法,并在通用推理基准上表现出强大的泛化能力。

英文摘要

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.

2605.17026 2026-05-19 cs.LG 版本更新

Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road

为什么推理模型会失去覆盖能力?数据和道路中的分支在其中的作用

Ngoc-Hieu Nguyen, Parshin Shojaee, Phuc Minh Nguyen, Nan Zhang, Chandan K Reddy, Khoa D Doan, Rui Zhang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) Virginia Tech(弗吉尼亚理工大学) VinUniversity(文大学)

AI总结 本文研究了推理模型覆盖能力下降的原因,发现训练数据中决策点的普遍存在是导致覆盖缩小的关键因素,并提出通过数据合成和解码机制改进来缓解这一问题。

Comments 22 pages, 13 figures

详情
AI中文摘要

近年来,大语言模型的进展催生了推理模型,这些模型通过专门的微调过程在复杂任务上表现出色。尽管这些方法能可靠地提高pass@1准确率,但先前的研究发现它们表现出覆盖缩小行为,即pass@k相对于基模型会退化。在本文中,我们调查了基于SFT的后训练过程中推理缩小现象的出现原因。我们假设这种行为是由微调数据的特性驱动的,特别是与决策点或

英文摘要

Recent progress in large language models has led to the emergence of reasoning models, which have shown strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior work has observed that they exhibit coverage shrinkage, where pass@k degrades relative to the base model. In this paper, we investigate why this shrinkage arises under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically related to decision points or "forks in the road" scenarios where the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage behavior can be partially mitigated through targeted data synthesis design of decision points and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.
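The gap between pass@1 and pass@k described in this abstract can be made concrete with a small sketch. The estimator below is the commonly used unbiased pass@k formula (1 - C(n-c, k) / C(n, k) for n samples with c correct); the abstract does not say which estimator the authors use, so treat that choice as an assumption. The per-problem correct counts are made-up illustrative numbers, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n samples containing c correct ones, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 10 samples on four problems.
BASE_COUNTS = [3, 2, 1, 2]    # base model: sometimes right on every problem
TUNED_COUNTS = [10, 9, 0, 0]  # tuned model: confident on two, never right on two

base_p1 = mean_pass_at_k(BASE_COUNTS, 10, 1)
tuned_p1 = mean_pass_at_k(TUNED_COUNTS, 10, 1)
base_p5 = mean_pass_at_k(BASE_COUNTS, 10, 5)
tuned_p5 = mean_pass_at_k(TUNED_COUNTS, 10, 5)
```

With these illustrative counts the tuned model has the higher pass@1 but the lower pass@5, which is the coverage-shrinkage pattern the abstract refers to.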

2605.17017 2026-05-19 cs.LG cs.AI 版本更新

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

当动态变化时,鲁棒任务推断胜出:重新审视具有行为基础模型的离线模仿学习

Rishabh Agrawal, Rahul Jain, Ashutosh Nayyar

发表机构 * University of Southern California(南加州大学)

AI总结 本文提出了一种基于行为基础模型(BFM)的框架,通过将任务推断建模为鲁棒最小最大优化问题,以应对动态变化,从而在不修改预训练的情况下实现对最坏动态扰动的适应。该方法在动态变化下显著优于标准BFM和鲁棒离线模仿学习基线。

详情
AI中文摘要

行为基础模型(BFM)通过预训练任务无关的表示,实现了可扩展的模仿学习(IL)。然而,现有BFM假设环境动态固定,限制了其在现实世界变化(如摩擦力、驱动或传感器噪声变化)下的鲁棒性。我们通过将BFM的任务推断建模为鲁棒最小最大优化问题来解决这一问题,从而能够在不修改预训练的情况下适应最坏情况的动态扰动。到目前为止,这是首个仅依赖单个名义环境的离线数据的BFM框架,能够在动态变化下实现鲁棒性。我们的方法在动态变化下显著优于标准BFM和鲁棒离线IL基线。这些结果表明,鲁棒策略可以完全在任务推断时间实现,提高了BFM在动态环境中的实用性。

英文摘要

Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policies can be obtained entirely at task-inference time, improving the practicality of BFMs in dynamic settings.

2605.17011 2026-05-19 cs.GR cs.CV cs.LG 版本更新

Topo-GS: Continuous Volumetric Embedding of High-Dimensional Data via Topological Gaussian Splatting

Topo-GS: 通过拓扑高斯散射实现高维数据的连续体积分嵌入

João Paulo Gois, Luis Gustavo Nonato

发表机构 * Universidade Federal do ABC (UFABC)(巴西圣安德烈大学)

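AI summary This paper proposes Topo-GS, which repurposes 3D Gaussian Splatting to turn high-dimensional data into a continuous volumetric representation. Optimization is driven by local geometric constraints and a topology-aware loss, preserving local topological fidelity while making projection distortions explicitly visible.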

Comments 7 pages, 2 figures

详情
AI中文摘要

降维算法将高维数据映射到可可视化的2D或3D空间,但传统上依赖于离散点云范式。这种离散抽象容易受到视觉遮挡和人工不连续性的影响,往往无法表示底层流形的连续密度。为了解决这些限制,我们引入Topo-GS,一个框架,重新利用3D高斯散射(3DGS)将多维投影作为无网格体积分重建过程。与标准光度损失不同,Topo-GS由局部几何约束驱动。通过解决正交Procrustes目标,优化强制了As-Rigid-As-Possible先验,同时显式对齐每个高斯的空间协方差到局部切空间。认识到解卷不同内在维数的数据需要不同的空间处理,我们利用拓扑感知策略,将损失公式定制以保持连续1D轨迹或连贯2D表面。定量和视觉评估表明,Topo-GS成功地将离散散点图转换为连续体积分表示,其中固有的投影扭曲显式表现为可观察的几何变化,同时保持与离散基线相当的局部拓扑保真度。

英文摘要

Dimensionality reduction algorithms map high-dimensional data into visualizable 2D or 3D spaces, but traditionally rely on a discrete point-cloud paradigm. This discrete abstraction is susceptible to visual occlusion and artificial discontinuities, often failing to represent the continuous density of the underlying manifold. To address these limitations, we introduce Topo-GS, a framework that repurposes 3D Gaussian Splatting (3DGS) to cast multidimensional projection as a meshless volumetric reconstruction process. Instead of standard photometric losses, Topo-GS is driven by local geometric constraints. By solving orthogonal Procrustes targets, the optimization enforces an As-Rigid-As-Possible prior while explicitly aligning the spatial covariance of each Gaussian to the local tangent space. Recognizing that unrolling data of varying intrinsic dimensionalities requires distinct spatial treatments, we utilize a topology-aware strategy that tailors the loss formulation to preserve either continuous 1D trajectories or cohesive 2D surfaces. Quantitative and visual evaluations demonstrate that Topo-GS successfully transforms discrete scatter plots into continuous volumetric representations, where inherent projection distortions explicitly manifest as observable geometric variations, while preserving local topological fidelity comparable to discrete baselines.

2605.17000 2026-05-19 cs.LG cs.AI 版本更新

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

BoLT:一个民主化黑盒优化研究的基准,用于昂贵的LLM任务

Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出BoLT基准,旨在通过提供真实LLM优化问题,促进黑盒优化方法在昂贵的大型语言模型任务中的研究和评估。

详情
AI中文摘要

优化大型语言模型(LLM)的训练和推理配置,如超参数、数据混合和提示,对性能至关重要,但在实践中往往采用启发式方法,导致可能的次优结果。通过将它们视为噪声、昂贵且无导数的优化问题,贝叶斯优化(BO)和其他黑盒优化(BBO)方法提供了一个有前途但尚未充分探索的方向,用于原则性、样本效率高的方法。然而,LLM训练和推理成本对大多数BBO研究社区来说过高,新方法往往仅在合成测试函数和小规模数据集上进行评估,这些数据集无法捕捉现代LLM优化问题的挑战。这阻碍了BBO方法的发展,并使评估这些方法在现代LLM任务上的有效性变得困难。我们介绍了BoLT,这是首个以LLM为中心的基准,旨在民主化LLM研究,服务于BBO社区。BoLT在https://github.com/chewwt/bolt上发布。BoLT涵盖了广泛且有动机的LLM优化问题,包括多保真度、多目标、异方差噪声和高维搜索空间。BoLT中的每个问题都基于真实的实验数据,并通过轻量级的替代模型,基于成千上万的真实LLM实验结果,使其完全可重复和可访问。我们对BoLT进行了广泛的BO和BBO方法的评估,显示选定的BO方法在各种任务上持续优于其他方法,突显了现有BBO方法在LLM任务上的不足,强调了为BBO社区现代化基准的必要性。

英文摘要

Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

2605.16999 2026-05-19 cs.LG 版本更新

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

基于排名的校准:可靠多模态强化学习

Peng Cui, Boyao Yang, Jun Zhu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University, Beijing(计算机科学与技术系,人工智能研究院,BNRist中心,清华大学-博世联合机器学习中心,THBI实验室,清华大学,北京) Dept. of Automation, Tsinghua University, Beijing(自动化系,清华大学,北京)

AI总结 本文提出Ranking-Aware Calibration方法,通过利用组内强化学习自然产生的比较信号,提升多模态强化学习的校准能力,从而提高任务准确性和校准效果。

详情
AI中文摘要

强化学习后训练显著提高了视觉-语言模型的推理准确性,但由此产生的策略仍然校准不足。终端正确性奖励无法提供梯度来惩罚置信度更高的错误比不确定的错误更严厉,也无法提供将置信度与视觉证据质量联系起来的信号,这一差距在损坏或模糊输入下尤为严重,此时模型仍会报告高置信度的错误答案。我们引入Ranking-Aware Calibration (RAC),一种训练时框架,利用组内强化学习已经产生的两种比较信号来监督置信度。排名感知组损失强制在同一提示下,更优的回放获得更高的置信度。清洁-损坏成对损失强制置信度随着视觉证据的退化而减弱。由于排名信号迫使策略区分正确和错误的推理路径,它还超越了仅靠正确性奖励所能达到的任务准确性。这两种损失都不需要外部置信度注释,并自然地与组内强化学习后训练整合。我们将在Qwen2.5-VL和InternVL-3.5基础上实例化RAC,并在六个多模态推理基准测试中评估清洁和损坏输入下的表现。实验证明,排名感知损失通过教政策区分更好和更差的推理路径显著提高了任务准确性,而成对损坏损失在退化输入下减少了校准误差。它们的结合在所有测试的backbone上实现了最佳的校准,同时在大多数设置中提高了准确性。

英文摘要

Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean--corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.
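The two comparison signals described in this abstract can be illustrated with a small sketch. The abstract only states what the losses enforce, not their functional form, so the hinge form, the `margin` parameter, and the function names below are assumptions chosen for illustration rather than the paper's definitions.

```python
def ranking_group_loss(rewards, confidences, margin=0.1):
    """Pairwise hinge loss over the rollouts of one prompt.

    For every pair where rollout i has a strictly higher reward than rollout j,
    penalize the policy unless confidence_i exceeds confidence_j by `margin`.
    """
    pairs = [
        (i, j)
        for i in range(len(rewards))
        for j in range(len(rewards))
        if rewards[i] > rewards[j]
    ]
    if not pairs:
        return 0.0  # all rollouts tie, so there is no ranking signal
    total = sum(
        max(0.0, margin - (confidences[i] - confidences[j])) for i, j in pairs
    )
    return total / len(pairs)


def clean_corrupted_loss(conf_clean, conf_corrupted, margin=0.0):
    """Penalize confidence that fails to drop when the visual input is degraded."""
    return max(0.0, conf_corrupted - conf_clean + margin)


# A well-ordered group incurs no loss; a confidently wrong rollout does.
well_ordered = ranking_group_loss([1.0, 0.0], [0.9, 0.2])
confidently_wrong = ranking_group_loss([1.0, 0.0], [0.2, 0.9])
```

Both functions need only rewards and confidences that group-based RL already computes, which matches the abstract's point that no external confidence annotations are required.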

2605.16998 2026-05-19 quant-ph cs.LG 版本更新

$\mathcal{O}(n)$ alternative to Quantum Fourier Transform with efficient neural net classical post-processing

$\mathcal{O}(n)$ 量子傅里叶变换的替代方案:具有高效的神经网络经典后处理

Kaiming Bian, Zujin Wen, Oscar Dahlsten

发表机构 * Shenzhen Institute for Quantum Science and Engineering(深圳量子科学与工程研究院) Southern University of Science and Technology(南方科技大学) International Quantum Academy(国际量子学院) City University of Hong Kong(香港城市大学)

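AI summary This paper proposes an $\mathcal{O}(n)$ alternative to the Quantum Fourier Transform: HP-$L$ circuits built from Hadamard and controlled-phase gates. The circuits are proven to preserve shift invariance and are shown numerically to retain exponentially growing discrete Fisher information. With an efficient neural network for classical post-processing, the HP-$1$ circuit can replace the $\mathcal{O}(n^2)$ QFT in Shor's algorithm.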

详情
AI中文摘要

量子傅里叶变换(QFT)是隐子群问题(HSP)算法所必需的,包括用于因数分解的Shor算法。QFT的电路深度对于近期硬件仍然具有挑战性。为了寻找更浅的替代方案,我们识别出QFT用于实现HSP的两个特性。首先,QFT的移位不变性允许移除随机的整体移位。其次,QFT保留了关于隐藏子群生成器的信息,该信息可通过测量结果访问。我们通过离散Fisher信息量化了该信息。我们构造了一组浅层电路,使用Hadamard和受控相位门,称为HP-$L$电路,证明这些电路保留了移位不变性。数值分析显示这些电路保留了指数增长的Fisher信息。$\mathcal{O}(n)$的HP-$1$电路可以替代传统的$\mathcal{O}(n^2)$ QFT在Shor算法中,如数值所示,通过高效的神经网络实现经典后处理。

英文摘要

The Quantum Fourier Transform (QFT) is required by hidden subgroup problem (HSP) algorithms, including Shor's algorithm for factoring. The circuit depth of the QFT remains challenging for near-term hardware. To find shallower alternatives we identify two properties that are exploited by the QFT to enable HSP. Firstly, the shift invariance of the QFT allows for the removal of a random overall shift. Secondly, the QFT retains information about the hidden subgroup generator accessible in the measurement outcomes. We quantify that information via the discrete Fisher information. We construct a family of shallow circuits using Hadamards and controlled-Phase gates, HP-$L$ circuits, that we prove preserve shift invariance. Numerical analysis shows these circuits retain exponentially growing Fisher information. The $\mathcal{O}(n)$ HP-$1$ can replace the $\mathcal{O}(n^2)$ QFT in Shor's algorithm, as demonstrated numerically, with an efficient neural network implementing classical post-processing.
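The shift-invariance property mentioned in this abstract can be checked classically on a small example. The sketch below applies a unitary discrete Fourier transform (standing in for the QFT on a 16-dimensional state vector) to a uniform superposition over a periodic set with an unknown offset, and shows that the measurement probabilities do not depend on the offset. The dimension, period, and function names are illustrative assumptions; the paper's HP-$L$ circuits are not reproduced here.

```python
import cmath
import math


def dft(state):
    """Unitary DFT with the e^{+2*pi*i*x*k/N} convention used for the QFT."""
    n = len(state)
    return [
        sum(state[x] * cmath.exp(2j * cmath.pi * x * k / n) for x in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def periodic_state(n, period, offset):
    """Uniform superposition over offset, offset + period, ... below n."""
    support = range(offset, n, period)
    amplitude = 1.0 / math.sqrt(len(support))
    state = [0j] * n
    for x in support:
        state[x] = amplitude
    return state


def outcome_probabilities(n, period, offset):
    """Measurement probabilities after transforming the periodic state."""
    return [abs(a) ** 2 for a in dft(periodic_state(n, period, offset))]


# The offset only changes phases, so every offset gives the same distribution.
probs_by_offset = [outcome_probabilities(16, 4, offset) for offset in range(4)]
```

For dimension 16 and period 4, each distribution is concentrated on outcomes that are multiples of 4, regardless of the offset; this is the random overall shift that the transform removes.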

2605.16993 2026-05-19 cs.CY cs.AI cs.LG 版本更新

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

临床AI中的对抗脆弱性与语言脆弱性:在低资源医疗环境中对诊断崩溃的系统审计及不可察觉扰动和跨语言漂移的影响

Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf

发表机构 * Centre for Clinical Intelligence & Safety(临床智能与安全中心) Tomorrow University of Applied Sciences(明天应用科学大学)

AI总结 本文系统地审计了临床AI在不可察觉扰动和跨语言漂移下的诊断崩溃问题,揭示了对抗脆弱性和语言脆弱性对低资源医疗环境中的临床AI系统的影响。

Comments 23 pages, 9 figures, 3 tables. Code and data available at https://github.com/anthoniooladimeji11-coder/clinical-ai-safety-audit

详情
AI中文摘要

当前的临床人工智能(AI)系统几乎只在干净、标准化的英语输入条件下进行评估,这些条件无法反映低资源环境中的医疗实践现实。本研究首次系统地对临床AI的两种正交安全漏洞进行了双重审计:对抗图像脆弱性和跨语言诊断漂移。使用DenseNet121,这是CheXNet架构的基础,经过在COVID-QU-Ex胸部X光数据集(85,318张图像;COVID-19、非COVID肺炎、正常)上微调,我们证明在Fast Gradient Method(FGM)扰动下,epsilon=0.021时,诊断准确率从89.3%下降到62.0%,这种幅度对人眼来说是不可察觉的。标准防御策略,包括高斯平滑和投票集成,未能恢复临床安全。在平行的语言脆弱性实验中,我们测试了Llama3.1:8b和NatLAS(N-ATLAS)在Standard English、Nigerian Pidgin(Naija)和Yoruba-inflected English中的20例COVID-19临床病例。两种模型均表现出显著的准确性下降:Llama3.1:8b在Pidgin上从80.0%下降到65.0%;NatLAS,一个非洲语境模型,从85.0%下降到55.0%,诊断一致性下降到50%。这些发现为尼日利亚初级卫生中心(PHC)部署中代表性的临床AI系统建立了定量失败范围,并促使对对抗性强、语言包容的临床AI架构的紧急呼吁。

英文摘要

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
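To make the perturbation in this abstract concrete, here is a minimal sketch of a fast-gradient attack on a toy logistic classifier. It is not the DenseNet121 setup from the study: the 100-dimensional linear model, the inputs, and the use of the sign (L-infinity) variant are assumptions for illustration. Only the budget epsilon = 0.021 is taken from the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(value):
    return (value > 0) - (value < 0)


def fgm_perturb(weights, bias, x, label, epsilon):
    """Move each input feature by epsilon in the direction that raises the loss.

    For binary cross-entropy with a logistic model, d(loss)/dx = (p - y) * w.
    Features are clipped to [0, 1] as for normalized pixel intensities.
    """
    p = predict(weights, bias, x)
    gradient = [(p - label) * w for w in weights]
    return [
        min(1.0, max(0.0, xi + epsilon * sign(g))) for xi, g in zip(x, gradient)
    ]


# Toy setup (hypothetical): 100 features with alternating weight signs.
WEIGHTS = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
BIAS = 0.0
X_CLEAN = [0.51 if i % 2 == 0 else 0.50 for i in range(100)]
EPSILON = 0.021

X_ADV = fgm_perturb(WEIGHTS, BIAS, X_CLEAN, label=1, epsilon=EPSILON)
p_clean = predict(WEIGHTS, BIAS, X_CLEAN)
p_adv = predict(WEIGHTS, BIAS, X_ADV)
```

No feature moves by more than epsilon, yet the predicted class flips, because many small coordinated changes add up across a high-dimensional input.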

2605.16989 2026-05-19 cs.LG 版本更新

Decision-Aware Proximal Bridge Learning for Optimal Treatment Selection

面向决策的近端桥学习用于最优治疗选择

Tomàs Garriga, Alejandro Almodóvar, Axel Brando, Gerard Sanz, Eduard Serrahima de Cambra, Juan Parras

发表机构 * Novartis(诺华) Barcelona Supercomputing Center(巴塞罗那超级计算中心) Universidad Politécnica de Madrid(马德里理工大学)

AI总结 本文提出了一种面向决策的近端桥学习方法,通过强调决策相关治疗区域并保留全局稳定性,解决了在近端因果推断中治疗选择和最优决策的不足。

详情
AI中文摘要

在需要连续动作的个性化治疗选择中,必须在决策相关区域中准确估计因果响应,而不是在整个动作空间中均匀估计。因此,估计全局因果响应面并选择最大化它的治疗可能不最优,因为标准估计目标根据观察到的治疗分布分配建模努力,而不是决定最优决策的区域。虽然在无偏设定中已经研究了面向决策的方法,但在近端因果推断中,这一问题仍处于探索阶段,其中代理变量和桥函数在存在隐藏混杂的情况下能够通过合适假设进行识别。尽管有最近的进展,近端方法主要集中在治疗效应和潜在结果估计,而不是治疗选择和最优决策。为弥合这一差距,我们引入了一种面向政策的加权桥损失,强调决策相关治疗区域的同时保留全局稳定性。我们证明了一个后悔界,表明所提出的加权桥损失通过加权不恰当常数控制治疗选择的后悔。我们将在几种近端桥求解器的决策意识变体中实例化该框架,得到交替进行加权桥估计、响应面投影、策略更新和权重细化的实用算法。经验上,我们发现面向决策的加权方法在多个桥求解器中减少了后悔,表明在近端设置中改进了治疗选择。

英文摘要

Individualized treatment selection with continuous actions requires accurate causal response estimation in decision-relevant regions, rather than uniformly over the entire action space. Estimating a global causal response surface and then choosing the treatment that maximizes it can therefore be suboptimal, since standard estimation objectives allocate modeling effort according to the observed treatment distribution rather than the regions that determine the optimal decision. While decision-aware approaches have been studied in unconfounded settings, this problem remains underexplored in proximal causal inference, where proxy variables and bridge functions enable identification under suitable assumptions even in the presence of hidden confounding. Despite recent progress, proximal methods have primarily focused on treatment-effect and potential-outcome estimation rather than treatment selection and optimal decision-making. To bridge this gap, we introduce a policy-targeted weighted bridge loss that emphasizes decision-relevant treatment regions while retaining global stabilization. We prove a regret bound showing that the proposed weighted bridge loss controls treatment-selection regret through a weighted ill-posedness constant. We instantiate the framework in decision-aware variants of several proximal bridge solvers, yielding practical algorithms that alternate between weighted bridge estimation, response-surface projection, policy update, and weight refinement. Empirically, we find that decision-aware weighting reduces regret across several bridge solvers, suggesting improved treatment selection in proximal settings.

2605.16975 2026-05-19 cs.LG cs.AI 版本更新

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

扩展预训练的10秒ECG基础模型以适应更长的时域

Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu

发表机构 * City University of Hong Kong(香港城市大学) Imperial College London(伦敦帝国学院) Wake Forest University(威克森林大学) University of Nottingham(诺丁汉大学) Lingnan University(岭南大学) University of Oxford(牛津大学)

AI总结 本文提出了一种参数高效的框架,通过在不重新训练基础模型的情况下扩展预训练的10秒ECG基础模型,使其能够处理更长和可变长度的ECG信号,解决了结构不兼容和语义挑战问题,实验表明其在多个长时域ECG任务中优于滑动窗口和池化基线方法。

详情
AI中文摘要

预训练在典型诊断10秒ECG片段上的ECG基础模型已在多种临床应用中展示了强大的迁移能力。然而,许多实际应用产生的记录通常更长,且在推理过程中持续时间各异。这些10秒模型缺乏整合时间信息的内置方法。将其扩展到更长的时域引入了两个挑战:由于输入长度差异导致的结构不兼容性,以及限制有意义时间聚合的语义挑战。我们提出了一种参数高效的框架,通过冻结预训练的10秒模型,引入一个轻量级插件模块,以两种互补的方式扩展模型:(i) 结构兼容的长序列处理,(ii) 语义指导的时间建模。在多个长时域ECG任务、数据集和基础模型背骨上的实验表明,我们的方法能够从预训练的快照模型中实现稳健的长时域扩展,一致优于滑动窗口和池化基线方法,具有强大的参数效率。

英文摘要

Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are longer and vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.

2605.16973 2026-05-19 cs.CV cs.LG 版本更新

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

SHED: 风格均质化嵌入对齐用于领域泛化

Kai Gan, Tong Wei

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing 210096, China(1 东南大学计算机科学与工程学院,南京 210096,中国) Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China(2 教育部计算机网络与信息集成重点实验室(东南大学),中国)

AI总结 本文提出SHED方法,通过均质化嵌入对齐来解决领域泛化中的信息不对称问题,实验表明其在多个基准测试中取得了最先进的性能。

详情
AI中文摘要

领域泛化旨在通过嵌入分布偏移增强模型对未见领域的鲁棒性。尽管像CLIP这样的大规模视觉-语言模型表现出色,但其直接的图像-文本嵌入对齐却受到固有信息不对称的限制:图像编码了类别语义和领域特定的风格,而文本提示主要传达基本的类别线索。这种不对称性阻碍了在现实场景中对新领域的泛化。为此,我们提出了SHED,一种基于CLIP的新方法,通过对齐风格均质化的嵌入而不是CLIP编码器的原始表示。在训练过程中,SHED从图像嵌入(按源领域计算)和文本嵌入(在多样化的提示模板下平均并去除全局质心)中移除领域特定的风格质心。在推理过程中,考虑到目标领域信息的缺乏,SHED将多样化的文本领域质心投影到视觉空间,并通过成员加权聚合预测。在五个基准测试上的广泛实验表明,SHED在多个基准测试中取得了最先进的性能,显著优于先前方法(例如,在DomainNet上比标准微调高出+4.0%)

英文摘要

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from the CLIP encoders. During training, SHED removes domain-specific style centroids from both image embeddings, computed per source domain, and text embeddings, which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
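The per-domain centroid removal described for the image branch can be sketched in a few lines. This is an illustrative reading of "removes domain-specific style centroids", assuming the centroid is the mean embedding of each source domain; the function name and the two-dimensional toy embeddings are hypothetical, and the text branch and inference-time projection are not covered.

```python
from collections import defaultdict


def remove_domain_centroids(embeddings, domains):
    """Subtract each domain's mean embedding from the samples of that domain."""
    grouped = defaultdict(list)
    for vector, domain in zip(embeddings, domains):
        grouped[domain].append(vector)
    # One centroid per domain: the coordinate-wise mean of its embeddings.
    centroids = {
        domain: [sum(column) / len(vectors) for column in zip(*vectors)]
        for domain, vectors in grouped.items()
    }
    return [
        [value - center for value, center in zip(vector, centroids[domain])]
        for vector, domain in zip(embeddings, domains)
    ]


# Two toy domains whose embeddings differ only by a constant "style" offset.
EMBEDDINGS = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]]
DOMAINS = ["photo", "photo", "sketch", "sketch"]

homogenized = remove_domain_centroids(EMBEDDINGS, DOMAINS)
```

In this toy case the "photo" and "sketch" samples coincide after centering, so what remains is the within-domain variation that would carry class information.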

2605.16929 2026-05-19 cs.LG 版本更新

Emulating the Forced Response of Climate Models with Flow Matching

用流匹配模拟气候模型的强迫响应

Graham Clyne, Julia Kaltenborn, Peer Nowack, Claire Monteleoni, Anastase Charantonis

发表机构 * INRIA MILA Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院)

AI总结 本文提出利用深度学习模型模拟气候模型对多种气候强迫的响应,通过训练多个SSP情景生成未见过的场景,并验证了该模型在土地表面温度方面的有效性。

详情
AI中文摘要

全球气候模型是模拟过去和潜在未来气候变化路径以及相关气候影响的关键工具。共享社会经济路径(SSPs)描述了全球经济和人口发展的各种未来情景。这些SSPs本质上与气候强迫的变化相关,这些强迫是外部驱动因素,如温室气体和气溶胶排放,从而导致地球能量平衡随时间的变化。这些强迫是气候模型中的基本边界条件,以了解这些变化对气候影响的潜在影响。然而,运行气候模型计算成本极高,与需要大量模拟集以获得更稳健估计的需求相冲突(考虑内部变异性和情景不确定性)。最近的研究表明,可以利用机器学习捕捉气候模型的动力学,当条件于不同气候情景的强迫。我们在此训练了一个深度学习(DL)模型在多个SSP上,并成功生成训练期间未见过的场景。我们的模拟器验证了MESMER-M,一个土地表面温度的统计模拟器。我们的研究展示了生成对多种同时气候强迫(如二氧化碳、甲烷、一氧化二氮、硫酸气溶胶和臭氧)响应的气候变化状态的能力。特别是,我们的消融研究强调需要包括多种不同强迫以用DL模拟器表示长期大气趋势。

英文摘要

Global climate models are essential tools to simulate past and potential future pathways of climate change, as well as associated climate impacts. Shared Socioeconomic Pathways (SSPs) describe a range of future scenarios of global economic and demographic development. These SSPs are intrinsically linked to changes in climate forcings, the external drivers, such as greenhouse gas and aerosol emissions, which in turn lead to the human impact on the energy balance of the Earth over time. These forcings are fundamental boundary conditions in climate models in order to gain insight into the potential climatic impacts of these changes described by each SSP. Running a climate model, however, is extremely computationally expensive, conflicting with the need for large ensembles of simulations for each model to give, e.g., more robust estimates in the presence of internal variability (the inherent, chaotic fluctuations within the climate system) and scenario uncertainty. Recent research has demonstrated the ability to capture climate model dynamics using machine learning when conditioned on forcings from different climatic scenarios. We here train a Deep Learning (DL) model on multiple SSPs and successfully generate scenarios unseen during training. Our emulator is validated against MESMER-M, a statistical emulator of land surface temperature. Our research demonstrates the capacity to generate such changing climate states in response to a variety of simultaneous climate forcings (e.g., carbon dioxide, methane, nitrous oxide, sulphate aerosols, and ozone). In particular, our ablation studies underline a need to include a range of different forcings to represent long-term atmospheric trends with a DL emulator.

2605.16919 2026-05-19 stat.ML cs.LG 版本更新

CAST: Causal Anchored Simplex Transport for Distribution-Valued Time Series

CAST:基于简单集的因果传输用于分布值时间序列

Jiecheng Lu, Jieqi Di, Runhua Wu, Yuwei Zhou

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Indiana University(印第安纳大学)

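AI summary This work proposes CAST (Causal Anchored Simplex Transport) for causal forecasting of distribution-valued time series on the probability simplex. It identifies latent transition-kernel aliasing as a structural failure mode and reports the best average rank across eleven benchmarks.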

详情
AI中文摘要

许多面向决策的随机系统是通过聚合分布而非标量轨迹观测的:队列占用、移动份额、公共卫生混合、发电源份额、生态组成和空气质量严重程度剖面都生活在概率简单集上并随时间演变。我们研究这些分布值时间序列的因果(在线)预测,并认为过渡算子本身应围绕简单集进行结构化。我们引入CAST(因果锚定简单集传输),一种 successor-local 操作符,它(i)从因果上下文中检索经验后继,(ii)通过持久锚稳定它们,(iii)在有序支持上应用有界的局部随机传输;每一步都通过构造保持简单集。我们识别出一种结构性失效模式,即潜在的转换核别名,其中相似的观测分布在不同的上下文制度下演变不同,且证明任何仅依赖于别名总结的预测者都会遭受不可约的加权Jensen-Shannon超额风险下界,而CAST假设类包含制度-aware的贝叶斯后继;对于有序支持,当传输后继位于无传输锚壳体外时,额外存在Pinsker分离。在覆盖生态、能源、饮食、死亡率、就业、空气质量、恶劣天气、移动和G/G/1,G_t/G/1队列占用的11个公共和模拟基准上,CAST在一步KL(1.27)和自回归滚动JSD(1.91)上获得最佳平均排名,战胜了广泛的统计、组成、递归、卷积和Transformer基线集,并在所有11个部分中取得前两名的离线KL。组件消融和受控合成别名实验验证了理论。

英文摘要

Many decision-facing stochastic systems are observed through aggregate distributions rather than scalar trajectories: queue occupancies, mobility shares, public-health mixtures, generation-source shares, ecological compositions, and air-quality severity profiles all live on the probability simplex and evolve over time. We study causal (online) forecasting for these distribution-valued time series and argue that the transition operator itself should be structured around the simplex. We introduce CAST (Causal Anchored Simplex Transport), a successor-local operator that (i) retrieves empirical successors from causal context, (ii) stabilizes them with a persistence anchor, and (iii) applies a bounded local stochastic transport on ordered supports; every stage preserves the simplex by construction. We identify a structural failure mode, latent transition-kernel aliasing, where similar observed distributions evolve differently under different contextual regimes, and prove that any forecaster depending only on an aliased summary incurs an irreducible weighted Jensen-Shannon excess-risk lower bound, while the CAST hypothesis class contains the regime-aware Bayes successor; for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull. On eleven public and simulated benchmarks spanning ecology, energy, diet, mortality, employment, air quality, severe weather, mobility, and G/G/1, G_t/G/1 queue occupancy, CAST attains the best average rank on both one-step KL (1.27) and autoregressive rollout JSD (1.91), winning 8/11 sections on each metric against a broad statistical, compositional, recurrent, convolutional, and Transformer baseline set, and top-2 on all 11 sections for offline KL. Component ablations and a controlled synthetic aliasing experiment corroborate the theory.
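Two ingredients named in this abstract, forecasts that stay on the probability simplex and evaluation by Jensen-Shannon divergence (JSD), can be sketched briefly. The convex mix below is only one simple way to read "stabilizes them with a persistence anchor"; the mixing rule, the `weight` parameter, and the example distributions are assumptions, not the operator defined in the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two distributions (at most ln 2)."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def anchor_mix(retrieved, current, weight):
    """Convex combination of a retrieved successor and the current distribution.

    A convex combination of two points on the simplex is again on the simplex,
    so the output is a valid distribution by construction.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1 - weight) * r + weight * c for r, c in zip(retrieved, current)]


# Hypothetical three-category shares (for example, generation-source shares).
CURRENT = [0.6, 0.3, 0.1]
RETRIEVED = [0.2, 0.5, 0.3]
forecast = anchor_mix(RETRIEVED, CURRENT, weight=0.25)
```

Because every entry stays non-negative and the total stays 1, no renormalization or clipping step is needed after the mix.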

2605.16913 2026-05-19 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG math.PR 版本更新

A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights

从样本复杂性到机理洞察的神经网络学习动态的傅里叶视角

Fabiola Ricci, Claudia Merger, Sebastian Goldt

发表机构 * SISSA(国际高等研究院)

AI总结 本文从傅里叶视角研究神经网络的学习动态,将自然图像的近似平移不变性和幂律谱纳入分析,展示了简单神经网络在图像分类任务中先依赖幅度信息、再利用相位信息的学习过程,证明了在各向同性的高维输入下仅基于相位信息的分类任务的难度,并说明幂律谱如何加速相位信息的学习。

详情
Journal ref
ICML 2026
AI中文摘要

通过基于梯度的方法训练的神经网络表现出强烈的简单性偏差:它们先学习数据中较简单的统计特征,再转向更复杂的特征。以往对这一现象的分析主要集中在(准)各向同性输入的设置上。在本文中,我们从傅里叶视角研究简单性偏差,从而能够将自然图像的两个关键特性纳入分析:近似平移不变性和幂律谱。我们首先通过实验表明,在图像分类任务上训练的简单神经网络先依赖幅度信息(与像素之间的成对相关性有关),然后才利用相位信息,后者编码边缘和高阶相关性。有鉴于此,我们为平移不变输入引入了一个合成数据模型,它允许对幅度和相位进行精确控制,同时保持可解析性。我们严格证明,对于各向同性的高维输入,仅基于相位信息的分类确实是一项困难的任务:在线随机梯度下降(SGD)在 $n \ll N^3$ 步内无法将结构化输入与噪声区分开,而至少需要 $n \gg N^3 \log^2{N}$ 步。相比之下,我们通过实验和理论表明,幂律谱可以显著加快相位信息的学习速度,即使谱本身对分类没有帮助。在纹理上训练的两层网络以及在ImageNet和CIFAR100上训练的深度卷积网络的模拟证实了幅度与相位之间这种非平凡的相互作用,为深度神经网络如何高效学习自然图像分布提供了机理洞察。

英文摘要

Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within $n \ll N^3$ steps, but needs at least $n \gg N^3 \log^2{N}$ steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.
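The statement that Fourier amplitudes are "related to pair-wise correlations between pixels" is the Wiener-Khinchin relation, and a small sketch can make it concrete. The code below is an illustration on a 1D periodic signal, not the paper's data model: it uses a naive stdlib DFT, and the signal `x` is an arbitrary assumption. It shows that the circular autocorrelation is recovered from the squared amplitudes alone, and that a circular shift changes only the phases.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_autocorrelation(x):
    """Pairwise correlations r[lag] = sum_t x[t] * x[(t + lag) mod n]."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]

def autocorrelation_from_amplitudes(x):
    """Same quantity via the inverse DFT of the power spectrum |X[k]|^2."""
    n = len(x)
    power = [abs(c) ** 2 for c in dft(x)]
    return [(sum(power[k] * cmath.exp(2j * cmath.pi * k * lag / n)
                 for k in range(n)) / n).real
            for lag in range(n)]

# Hypothetical 1D "image row"; any real sequence works.
x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]
shifted = x[2:] + x[:2]                      # circular translation by two pixels

direct = circular_autocorrelation(x)
via_spectrum = autocorrelation_from_amplitudes(x)
amplitudes = [abs(c) for c in dft(x)]
amplitudes_shifted = [abs(c) for c in dft(shifted)]
```

Everything the phases encode (edges, higher-order structure) is therefore invisible to second-order statistics, which is why a phase-only task isolates the harder part of the learning problem.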

2605.16905 2026-05-19 cs.LG cs.CV 版本更新

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

AIM:对抗性信息遮蔽用于显著图忠实性评估

Chia-Ying Hsieh, Hsin-Yuan Fang, Chun-Shu Wei

发表机构 * National Yang Ming Chiao Tung University(阳明交通大学)

AI总结 本文提出AIM方法,通过对抗性信息遮蔽框架评估显著图的忠实性及遮蔽操作的可靠性,通过对比互补遮蔽顺序下的退化效果,减少遮蔽诱导的偏差,并揭示不同模态下有符号与无符号归因之间的差异。

详情
AI中文摘要

事后显著性方法被广泛用于解释深度神经网络,但其忠实性难以可靠评估。现有评估方法按照显著性诱导的特征排序遮蔽特征并测量性能退化,但这种退化可能与遮蔽算子本身的影响相混淆:零值遮蔽可能产生分布外伪影,而基于插值的遮蔽可能保留残余的预测信息。我们提出对抗性信息遮蔽(AIM),一种显著性引导的对抗性特征替换框架,用于同时评估显著图的忠实性和遮蔽算子的可靠性。AIM用输入的对抗性对应样本中的值替换选定的特征,并在互补的遮蔽顺序下比较退化效果。我们通过随机归因偏差和解释方法忠实性排名的稳定性来评估可靠性。在图像、音频和EEG任务上的实验表明,与零值遮蔽和基于插值的遮蔽相比,AIM减少了遮蔽诱导的偏差,同时揭示了有符号与无符号归因之间依赖于模态的差异。

英文摘要

Post-hoc saliency methods are widely used to interpret deep neural networks, but their faithfulness is difficult to evaluate reliably. Existing evaluations mask features according to saliency-induced feature ordering and measure performance degradation, but this degradation can be confounded by the masking operator: zero masking may create out-of-distribution artifacts, while interpolation-based masking may preserve residual predictive information. We propose Adversarial Information Masking (AIM), a saliency-guided adversarial feature replacement framework for evaluating both saliency-map faithfulness and masking-operator reliability. AIM replaces selected features with values from an adversarial counterpart of the input and compares degradation under complementary masking orders. We assess reliability using random-attribution bias and stability of explanation-method faithfulness rankings. Experiments on image, audio, and EEG tasks suggest that AIM reduces masking-induced bias compared with zero and interpolation-based masking, while revealing modality-dependent differences between signed and unsigned attributions.

2605.16902 2026-05-19 cs.LG 版本更新

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

ArtifactLinker:链接科学制品以自动发现最先进成果

Haofei Yu, Jiaxuan You, Peter Clark, Bodhisattwa Prasad Majumder, Kyle Richardson

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文提出ArtifactLinker框架,利用图神经网络和大语言模型对模型-数据集链接进行排序,并由基于LLM的代理通过编码实验加以验证,以自动发现给定数据集上的最先进(SOTA)模型。

Comments 12 pages

详情
AI中文摘要

模型和数据集等科学制品是研究的基础。随着HuggingFace等平台的迅速发展,研究人员现在可以访问大量制品。然而,一个关键挑战依然存在:如何充分利用现有制品,自动发现给定数据集上的最先进(SOTA)模型?我们将HuggingFace建模为一个制品图,其中节点是模型或数据集,边表示评估,从而把这一任务形式化为自动SOTA发现。我们提出了ArtifactLinker,一个两阶段框架:(1)使用图神经网络(GNN)或图增强的大语言模型(LLM)对有前景的未观测模型-数据集链接进行排序;(2)由基于LLM的代理通过编码实验验证排名靠前的链接。我们进一步引入了名为ArtifactBench的基准,包含14,053个制品和51,337个关系,用于评估两个阶段的性能。结果表明:(1)现有制品之间的图结构对缺失链接预测有效;(2)使用ArtifactLinker进行端到端的排序和验证有助于发现潜在的SOTA结果和研究见解。

英文摘要

Scientific artifacts such as models and datasets are foundations for research. With the rapid growth of platforms like HuggingFace, researchers now have access to a large number of artifacts. Yet, a key challenge remains: how can we automatically discover the state-of-the-art (SOTA) model for a given dataset by fully leveraging existing artifacts? We formalize this task as automatic SOTA discovery by modeling HuggingFace as an artifact graph, where nodes are models/datasets and edges represent evaluations. We propose ArtifactLinker, a two-stage framework: (1) ranking promising unobserved model--dataset links using Graph Neural Networks (GNNs) or graph-augmented Large Language Models (LLMs), and (2) verifying top-ranked links via coding experiments with LLM-based agents. We further introduce a benchmark named ArtifactBench with 14,053 artifacts and 51,337 relations to evaluate the performance of both stages. Results show that (1) graph structures between existing artifacts are effective for missing link prediction; (2) end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results and research insights.

2605.16891 2026-05-19 cs.LG 版本更新

Tensor Channel Equivariant Graph Neural Networks for Molecular Polarizability Prediction

用于分子极化率预测的张量通道等变图神经网络

Jean Philip Filling, Daniel Franzen, Michael Wand

发表机构 * Institute for Computer Science, Johannes Gutenberg University Mainz, Germany(德国美因茨约翰内斯·古腾堡大学计算机科学研究所)

AI总结 本文提出了一种张量通道等变图神经网络,用于直接预测分子极化率张量;它在PaiNN架构的基础上,在消息传递过程中传播张量结构,从而在分子极化率预测任务中取得了更低的误差。

详情
AI中文摘要

我们介绍了一种张量通道等变图神经网络,用于直接预测分子极化率张量。在高效的PaiNN架构基础上,我们在隐藏表示中加入显式的对称二阶张量通道,这些通道与极化率分解为各向同性和各向异性分量的方式相对齐。与仅在读出阶段构建张量输出的方法不同,我们的模型利用具有几何动机的张量基,在整个消息传递过程中传播张量结构。这产生了一种面向张量值分子预测的目标对齐架构。在优化后的QM7-X几何结构上,在匹配的训练条件和几乎相同的参数量下,所提出的模型的全张量误差和各向异性误差均低于PaiNN风格的读出基线和介电MACE基线。在这一受控设置下,它的表现也优于MACE,同时推理速度明显更快。消融研究表明,这种增益并非仅来自容量的增加,而是来自显式张量传播与无迹目标参数化的结合,后者与极化率张量的各向异性部分相匹配。在所考虑的张量基中,最强的结果来自学习到的方向特征之间的相互作用,表明这些特征对建模分子极化率特别有效。旋转等变性测试进一步确认所有被比较的模型在数值上都是等变的,因此观测到的改进可归因于对目标张量本身更好的学习。总体而言,我们的结果表明,对于结构化的张量值目标,在当前训练设置下,传播目标对齐的张量特征可以优于仅在读出阶段构建张量的方法,也优于更一般的高阶等变模型。

英文摘要

We introduce a tensor-channel equivariant graph neural network for direct prediction of molecular polarizability tensors. Building on the efficient PaiNN architecture, we augment the hidden representation with explicit symmetric rank-2 tensor channels aligned with the decomposition of polarizability into isotropic and anisotropic components. In contrast to approaches that construct tensor outputs only at readout, our model propagates tensor structure throughout message passing using geometrically motivated tensor bases. This yields a target-aligned architecture for tensor-valued molecular prediction. On optimized QM7-X geometries, the proposed model achieves lower full-tensor and anisotropic error than both a PaiNN-style readout baseline and a dielectric MACE baseline under matched training conditions and at nearly identical parameter count. In this controlled setting, it also outperforms MACE while remaining substantially faster at inference. Ablation studies show that the gain does not arise from increased capacity alone, but from the combination of explicit tensor propagation and a traceless target parameterization matched to the anisotropic part of the polarizability tensor. Among the tensor bases considered, the strongest results are obtained from interactions between learned directional features, indicating that these are particularly effective for modeling molecular polarizability. Rotational equivariance tests further confirm that all compared models are numerically equivariant, so the observed improvements are attributable to better learning of the target tensor itself. Overall, our results show that for structured tensor-valued targets, propagating target-aligned tensor features can outperform both readout-only tensor construction and a more general higher-order equivariant model in the present training setting.
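The "decomposition of polarizability into isotropic and anisotropic components" and the "traceless target parameterization" refer to a standard split of a symmetric rank-2 tensor. The sketch below shows only that split, as plain arithmetic on nested lists; it is not the network, and the example tensor values are made up for illustration.

```python
def decompose(alpha):
    """Split a symmetric 3x3 tensor into (isotropic scalar, traceless anisotropic part).

    alpha = iso * I + aniso, with trace(aniso) == 0.
    """
    iso = (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0
    aniso = [[alpha[i][j] - (iso if i == j else 0.0) for j in range(3)]
             for i in range(3)]
    return iso, aniso

def recompose(iso, aniso):
    """Rebuild the full tensor from its two parts."""
    return [[aniso[i][j] + (iso if i == j else 0.0) for j in range(3)]
            for i in range(3)]

# Hypothetical symmetric polarizability tensor (arbitrary units).
alpha = [[10.0, 0.5, 0.0],
         [0.5, 12.0, -0.3],
         [0.0, -0.3, 8.0]]

iso, aniso = decompose(alpha)
rebuilt = recompose(iso, aniso)
aniso_trace = aniso[0][0] + aniso[1][1] + aniso[2][2]
```

Under rotations the scalar `iso` is invariant while `aniso` transforms as a rank-2 object with five independent components, which is why it is natural to give the two parts separate channels and a traceless target.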

2605.16887 2026-05-19 cs.CV cs.LG 版本更新

Mind the Gap: Learning Modality-Agnostic Representations with a Cross-Modality UNet

注意间隙:利用跨模态UNet学习模态无关表示

Xin Niu, Enyi Li, Jinchao Liu, Yan Wang, Margarita Osadchy, Yongchun Fang

发表机构 * Tianjin Key Laboratory of Intelligent Robotics, College of Artificial Intelligence, Nankai University, China(天津智能机器人重点实验室,人工智能学院,南开大学,中国) Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, Nankai University, China(可信行为智能工程研究中心,教育部,南开大学,中国) Department of Computer Science, Haifa University, Israel(计算机科学系,海法大学,以色列) VisionMetric Ltd, Canterbury, Kent, UK(VisionMetric Ltd,坎特伯雷,肯特,英国)

AI总结 本文提出了一种紧凑的编码器-解码器神经模块(cmUNet),通过跨模态转换和模态内重建,学习模态无关的表示,同时保留身份相关的信息。此外,作者提出了MarrNet,通过将cmUNet连接到标准特征提取网络,实现跨模态匹配,并在多个挑战性任务上验证了其优越性能。

Comments Published in IEEE Transactions on Image Processing. See full abstract in the PDF file

详情
Journal ref
IEEE Transactions on Image Processing, vol. 33, pp. 655-670, 2024

英文摘要

Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences between representations of different modalities, learning indistinguishable representations, and explicit modality transfer. The first two approaches lose discriminant information while removing modality-specific variations. The third relies heavily on successful modality transfer and can suffer a catastrophic performance drop when explicit modality transfer is impossible or difficult. To tackle this problem, we propose a compact encoder-decoder neural module (cmUNet) that learns modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet, in which cmUNet is connected to a standard feature extraction network that takes the modality-agnostic representations as input and outputs similarity scores for matching. We validate our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet shows superior performance compared to state-of-the-art methods. Furthermore, we observe that a cross-modality matching method can be biased toward extracting discriminant information from partial or even wrong regions because it cannot handle the modality gap, which subsequently leads to poor generalization. We show that robustness to occlusions can indicate how well a method bridges the modality gap.

2605.16883 2026-05-19 cs.LG 版本更新

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

SE-GA:面向GUI代理的记忆增强自进化

Shilong Jin, Lanjun Wang, Zhuosheng Zhang

发表机构 * College of Intelligence and Computing, Tianjin University, Tianjin, China(天津大学智能与计算学院) School of New Media and Communication, Tianjin University, Tianjin, China(天津大学新媒体与传播学院) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院)

AI总结 本文提出SE-GA框架,通过整合分层记忆结构和迭代自我改进机制,解决GUI代理在多步骤任务中因上下文窗口受限和静态策略无法适应动态环境的问题,实验表明其在多个基准测试中均达到领先性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

自主图形用户界面(GUI)代理在多步骤任务中常常表现不佳,原因在于上下文窗口受限,以及静态策略无法适应动态环境。为解决这些限制,本文提出了自进化GUI代理(SE-GA),一种将分层记忆结构与迭代自我改进机制相结合的新框架。我们方法的核心是测试时记忆扩展(TTME),它通过动态检索情景记忆、语义记忆和经验记忆,在推理过程中提供关键上下文,从而支持长期规划。为确保持续学习,我们引入了记忆增强自进化(MASE),这是一种训练流程,利用TTME收集的数据来稳定和增强代理的基础策略。在离线和在线基准上的广泛评估表明,SE-GA达到了最先进的性能,在ScreenSpot上取得89.0%的成功率,在具有挑战性的AndroidControl-High数据集上取得75.8%的成功率。此外,在AndroidWorld基准上的显著提升凸显了其对动态环境更强的泛化能力。开源代码:https://github.com/jinshilong-dev/SE-GA

英文摘要

Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0\% on ScreenSpot and 75.8\% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: https://github.com/jinshilong-dev/SE-GA

2605.16863 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划,后扩散:用于长视距扩散规划的外在图引导

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

发表机构 * Technion(以色列理工学院) Harvard(哈佛大学)

AI总结 本文提出了一种外在搜索引导的扩散模型(XDiffuser),通过在状态空间图上先规划再引导扩散过程,以提高长视距规划的效率和效果,尤其在低质量数据和未见任务中表现优异。

详情
AI中文摘要

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解,为长视距规划提供了一条有前途的路线。然而,强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性,但代价是重复评估已经计算密集的模型。在本文中,我们主张在去噪过程之外进行外在搜索,为长视距规划提供更有效的探索模式,同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser(XDiffuser)首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪,有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线,特别是在低质量数据领域和超出目标到达的未见任务中,包括多智能体协调和TSP风格推理。项目网站:https://yanivhass.github.io/XDiffuser-site/

英文摘要

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/

2605.16860 2026-05-19 cs.LG cs.AI q-bio.QM 版本更新

PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes

PhysioSeq2Seq:一种混合生理数字孪生和序列到序列LSTM的长周期1型糖尿病葡萄糖预测方法

Phat Tran, Neville Mehta, Clara Mosquera-Lopez, Robert H. Dodier, Lizhong Chen, Peter G. Jacobs

发表机构 * Oregon State University(俄勒冈州立大学) Oregon Health & Science University(俄勒冈健康与科学大学)

AI总结 本文提出了一种结合患者特定生理建模与序列到序列LSTM的混合架构PhysioSeq2Seq,用于长周期1型糖尿病葡萄糖预测,通过消除递归误差累积并注入患者匹配的生理状态,提高了预测精度和临床意义。

详情
AI中文摘要

准确的长周期葡萄糖预测对自动胰岛素输送系统至关重要,这类系统帮助1型糖尿病(T1D)患者管理血糖并避免危险的低血糖。然而,标准的递归长短期记忆(LSTM)网络由于误差累积,在较长的预测周期上存在系统性负偏差,而纯机理的常微分方程(ODE)模型在按群体水平参数化时无法跨个体泛化。我们提出PhysioSeq2Seq,一种将患者特定的生理建模与序列到序列(Seq2Seq)LSTM相结合的混合架构。对于每个葡萄糖片段,孪生匹配在由300个参数化数字孪生组成的群体中搜索,根据3小时的连续葡萄糖监测(CGM)历史找出拟合最佳的生理匹配。匹配到的孪生的10个内部ODE状态变量作为外生协变量同时注入Seq2Seq LSTM的编码器和解码器。这种一次性输出48步的预测策略消除了递归误差累积,而ODE特征提供了基于物理的约束,将长周期漂移限制在生理上合理的范围内。PhysioSeq2Seq使用1型糖尿病运动倡议(T1DEXI)数据集中348名参与者的CGM和胰岛素数据进行训练,并在74名留出的参与者上进行评估。在240分钟的预测范围上,PhysioSeq2Seq的平均绝对误差为39.28 mg/dL,平均误差为-10.62 mg/dL,相比递归LSTM偏差减少了13.89 mg/dL,相比基于ODE的数字孪生平均绝对误差减少了28.62 mg/dL。这些结果表明,消除架构中的反馈并注入与患者匹配的生理状态,是1型糖尿病长周期葡萄糖预测的一种有效且具有临床意义的策略。

英文摘要

Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考:通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD&CG(CAD&CG国家重点实验室)

AI总结 本文提出通过模式诱导的方法,利用模式推理和模式诱导策略,使视觉语言模型在视觉规划任务中实现更高效和准确的感知与推理,解决传统模型在复杂输入下的感知瓶颈问题。

详情
AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型(VLMs)构成重大挑战,当输入复杂度超出其一步感知能力时。受最近在图像思考(TWI)中的进展启发,一种合理的解决方案是通过迭代获取和整合局部视觉证据,将感知过程分解为更简单的步骤。然而,尽管当前VLMs在一般TWI能力上训练良好,但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战,我们将TWI视为一种工具,逐步构建并反映一个准确的内部世界模型。我们发现,由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务,但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率,我们提出模式推理,一种新的TWI策略,使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式,我们提出模式诱导,一种在线归纳学习策略,将视觉模式视为复合且可重用的专家,这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明,我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

2605.16836 2026-05-19 stat.ML cs.LG 版本更新

HYVINT: Intensity-Driven Hypergraph Generation with Variational Representations

HYVINT: 基于变分表示的强度驱动超图生成

Xinyi Hong, Shuntuo Xu, Zhou Yu

发表机构 * School of Statistics(统计学院) East China Normal University(华东师范大学)

AI总结 本文提出HYVINT框架,通过强度驱动的超图生成机制和变分估计器,解决超图生成中节点-超边关系的建模问题,实现高保真且具有多样性的生成。

详情
AI中文摘要

超图为建模多元交互提供了一个有原则的框架,应用于推荐系统、社交网络和分子建模等领域。超图生成仍然具有挑战性,因为关联结构是离散、稀疏的,并且受异质的高阶交互支配。现有的生成器通常依赖隐式的潜在空间或连续的关联解码器,对节点-超边关联如何产生只能提供有限的机理解释。为了解决这些限制,我们提出HYVINT,一种强度驱动的超图生成框架。我们的关键创新有两点:(i)我们为超图开发了一种强度驱动的关联形成机制,将潜在的交互强度与二值关联联系起来;(ii)我们推导出一个可处理的变分下界估计器,用于学习潜在表示。我们给出了带有渐近收敛速率的生成误差界,并通过实验表明,HYVINT在合成和真实世界超图上实现了较强的保真度,同时保持了可观的新颖性和多样性。

英文摘要

Hypergraphs provide a principled framework for modeling polyadic interactions, with applications in recommendation systems, social networks, and molecular modeling. Hypergraph generation remains challenging because incidence structures are discrete, sparse, and governed by heterogeneous higher-order interactions. Existing generators often rely on implicit latent spaces or continuous incidence decoders, which provide limited mechanistic interpretation of how node-hyperedge incidences arise. To address these limitations, we propose HYVINT, an intensity-driven hypergraph generative framework. Our key innovations are twofold: (i) we develop an intensity-driven incidence formation mechanism for hypergraphs that links latent interaction strength to binary incidence, and (ii) we derive a tractable lower-bound variational estimator for learning latent representations. We provide generation error bounds with asymptotic convergence rates and empirically show that HYVINT achieves strong fidelity while maintaining substantial novelty and diversity on synthetic and real-world hypergraphs.

2605.16834 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

基于有限数据的细粒度多模态对齐的相对表示学习

Shiwon Kim, Yu Rang Park

发表机构 * Yonsei University(延世大学)

AI总结 本文提出了一种基于相对表示的学习方法,用于在有限数据条件下实现细粒度多模态对齐,通过学习token级别的跨模态结构来提升零样本分类、跨模态检索和零样本分割任务的性能。

详情
AI中文摘要

多模态预训练展示了强大的泛化性能,但在配对数据稀缺的领域中,这种范式往往不切实际。一种有前景的替代方案是事后多模态对齐,它使用数量有限的配对示例来对齐分别预训练的单模态编码器。然而,现有方法主要关注全局表示的对齐,忽略了图像块与token之间的关系。这可能妨碍向那些需要超越粗粒度样本级语义的细粒度跨模态匹配的任务迁移。为了解决这个问题,我们提出了一种事后对齐方法,利用相对表示学习token级别的跨模态结构。具体来说,我们用图像和文本与各自模态空间中一组可学习锚点之间的token级相似度来表示它们,这些锚点经过训练,使匹配的样本对产生一致的跨模态相似度模式。尽管只学习锚点而不使用繁重的投影层,我们的方法在零样本分类、跨模态检索和零样本分割上均以较大优势稳定优于现有方法。这凸显了在配对数据有限的情况下,建模细粒度跨模态结构对有效的事后多模态对齐的重要性。

英文摘要

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
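The idea of representing tokens "through their token-level similarities to a set of learnable anchors" can be sketched in a few lines. The version below is an illustration, not the paper's method: the anchors are fixed rather than learned, cosine similarity is an assumed choice of similarity, and the 2D vectors are toy values. It shows the property that makes relative representations attractive for aligning separately trained encoders: they do not change when the whole embedding space is rotated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(tokens, anchors):
    """Describe each token by its similarity to every anchor (one row per token)."""
    return [[cosine(t, a) for a in anchors] for t in tokens]

def rotate(v, theta):
    """Rotate a 2D vector; stands in for an arbitrary change of embedding basis."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Toy token embeddings and anchors in one modality space (hypothetical values).
tokens = [[1.0, 0.2], [0.3, 0.9], [-0.5, 0.4]]
anchors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

rel = relative_representation(tokens, anchors)

# The same space after a rotation, as a differently initialised encoder might produce.
theta = 1.1
rel_rotated = relative_representation([rotate(t, theta) for t in tokens],
                                      [rotate(a, theta) for a in anchors])
```

Each token ends up with one coordinate per anchor, so two modalities with different native embedding sizes can be compared as long as they use the same number of anchors.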

2605.16828 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

Prediction-Intervention Games and Invariant Sets

预测-干预博弈与不变集

Linus Kühne, Felix Schur, Jonas Peters

发表机构 * Seminar for Statistics and ETH AI Center(统计研究所和ETH人工智能中心) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文研究了预测-干预博弈中领导者如何选择预测函数以应对跟随者的干预,证明了对于两类常见的跟随者目标,基于稳定毯的预测器总是优于或不劣于基于因果父节点的预测器,并讨论了实际应用中的策略。

详情
AI中文摘要

我们考虑如下的两人博弈:领导者利用观测数据,根据给定的协变量为响应变量 $Y$ 选择一个预测函数;随后跟随者在潜在的结构因果模型中对某些协变量进行干预,以最大化自身目标。领导者知道干预目标,但可能对跟随者的目标了解有限。我们称这种设置为预测-干预博弈,它是Stackelberg博弈的一种特殊情况。为领导者找到最优策略通常很困难。为避免严重的性能损失,领导者可以基于 $Y$ 的因果父节点,或者更一般地基于协变量的某个不变子集来进行预测。我们证明,对于两类常见的跟随者目标,基于稳定毯(一种特定的不变子集)的预测器总是优于或不劣于基于因果父节点的预测器。我们进一步用允许干预下的最坏情况风险给出领导者干预后风险的上界,并加强现有的分布泛化结果来分析这一上界:我们给出了稳定毯预测器达到最坏情况最优的充分条件,并通过例子表明这些条件一般不能去掉。最后,我们讨论了图已知和图未知两种情形下的实用策略,并在模拟数据和真实数据上对其进行了测试。

英文摘要

We consider the following two-player game: using observational data, the leader chooses a prediction function for a response variable $Y$ from given covariates. The follower then reacts with an intervention on some covariates in the underlying structural causal model to maximize their own objective. The leader knows the intervention targets, but may have limited knowledge of the follower's objective. We call this setup a prediction-intervention game, a special case of a Stackelberg game. Finding an optimal strategy for the leader is generally difficult. To avoid severe performance loss, the leader may base their prediction on the causal parents of $Y$, or more generally on an invariant subset of covariates. We prove, for two common classes of follower objectives, that predictors based on the stable blanket, a specific invariant subset, are always better or as good as those based on the causal parents. We further upper bound the leader's post-intervention risk by a worst-case risk over allowed interventions and strengthen existing distribution generalization results to analyze this bound: we give sufficient conditions under which stable-blanket predictors are worst-case optimal, and show by examples that these conditions cannot in general be dropped. Finally, we discuss practical strategies for settings with known and unknown graph, and test them on simulated and real-world data.

2605.16826 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

解耦KL与轨迹:为LLM蒸馏中的SFT、DAgger、离线RL和OPD提供统一视角

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen

发表机构 * Eastern Institute of Technology(宁波东方理工大学) The Hong Kong Polytechnic University(香港理工大学) Shanghai Jiao Tong University(上海交通大学) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)人工智能学域)

AI总结 本文探讨了LLM蒸馏中前缀来源与token级KL方向之间的耦合问题,通过解耦这两个轴得到四种有效的蒸馏目标,并通过实验揭示了KL方向、前缀来源和训练长度带来的权衡,提出了KL混合和熵门控长度课程等实用方法。

Comments Code available at https://github.com/EIT-NLP/Decoupled-Distill

详情
AI中文摘要

知识蒸馏是LLM后训练的核心,但其设计空间仍未被充分理解,尤其是在与强化学习(RL)配合使用时。我们表明,主流范式,即离策略蒸馏(off-policy distillation)和同策略蒸馏(OPD),隐式地耦合了两个正交的选择:前缀来源和token级KL方向。这源于对自回归响应分布上的序列级KL进行分解:前向KL将教师前缀与token级前向KL配对,反向KL将学生前缀与token级反向KL配对。我们认为这种耦合并非内在的:解耦这两个轴可得到四个有效的目标。我们建立了梯度层面的恒等式,表明前向KL给出以教师软目标进行的SFT风格交叉熵匹配,而反向KL给出带有稠密的教师-学生对数比奖励的RL风格策略梯度目标,从而将它们与离策略SFT、DAgger风格的同策略SFT、离线RL风格的蒸馏和OPD联系起来。我们在数学推理上进行了广泛的受控研究,将四个目标既作为独立方法、也作为后续RL的初始化进行评估。结果揭示了三个权衡:KL方向带来准确率-熵的权衡,前缀来源带来质量-计算的权衡,训练长度带来准确率-稳定性的权衡。受这些发现启发,我们提出了KL混合和熵门控长度课程。KL混合表明,长序列蒸馏需要相当大的前向KL权重,才能在不牺牲准确率的情况下防止熵崩溃和长度膨胀。与固定的长时域训练相比,熵门控长度课程将Avg@k和Pass@k分别提高了3.6个点和最多5.8个点,并将平均响应长度减少约3倍。我们的结果为设计兼顾准确率、多样性、计算和RL行为的推理蒸馏目标提供了一个框架和实用方法。

英文摘要

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
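The two axes the abstract decouples, prefix source and token-level KL direction, can be laid out as a small sketch. The code is only an illustration of the definitions: the next-token distributions are toy values, and the table pairing each (prefix source, KL direction) cell with a named method is this digest's reading of the abstract's ordering, not a mapping quoted from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_loss(teacher, student, direction):
    """Token-level distillation loss at one position of a given prefix."""
    if direction == "forward":      # KL(teacher || student): mass-covering
        return kl(teacher, student)
    if direction == "reverse":      # KL(student || teacher): mode-seeking
        return kl(student, teacher)
    raise ValueError(f"unknown direction: {direction}")

# Assumed pairing of the four decoupled objectives (prefix source, KL direction).
OBJECTIVES = {
    ("teacher", "forward"): "off-policy SFT",
    ("student", "forward"): "DAgger-style on-policy SFT",
    ("teacher", "reverse"): "offline-RL-style distillation",
    ("student", "reverse"): "OPD",
}

# Toy next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

forward = token_loss(teacher, student, "forward")
reverse = token_loss(teacher, student, "reverse")
```

The prefix source decides which model generated the context at which these distributions are evaluated, while the direction decides which argument of the KL is the teacher; a mixed objective of the kind the abstract calls KL mixing would combine `forward` and `reverse` with some weight.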

2605.16824 2026-05-19 cs.LG cs.CL 版本更新

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning

置信几何揭示大语言模型推理中的轨迹级正确性

Shuo Liu, Ding Liu, Shi-Ju Ran

发表机构 * School of Computer Science and Technology, Tiangong University(天津工业大学计算机科学与技术学院) Center for Quantum Physics and Intelligent Sciences, Department of Physics, Capital Normal University(首都师范大学量子物理与智能科学中心)

AI总结 本文研究了大语言模型推理正确性与置信轨迹之间的关系,提出通过置信几何分析来区分正确与错误推理轨迹,并展示了NeuralConf方法在提高推理准确性方面的有效性。

Comments 11 pages, 9 figures, 1 table. Code is available at https://github.com/QML-TGU/NeuralConf

详情
AI中文摘要

大语言模型(LLMs)不仅生成推理文本,还生成token级置信轨迹,记录推理过程中不确定性的演变。这些轨迹是否与推理正确性相关尚不清楚。本文表明,置信轨迹编码了一种与内容无关的置信几何,它与轨迹级的最终答案正确性相关。仅使用token级置信值,而不访问输入问题、推理文本、隐藏状态或外部验证器,我们发现置信轨迹的低维表示能够将正确与错误的推理轨迹区分开。在GSM8K、MATH和MMLU上,这种几何分离与下游可预测性存在定量关联:正确与错误轨迹的聚类越强(以Davies-Bouldin指数衡量),正确性判别AUC就越高。我们进一步发现,与正确性相关的信息在推理尾部更为富集,表明后期的置信动态携带关键的正确性信号。我们提出NeuralConf,一个从置信轨迹中学习以评估正确性的轻量级估计器。在固定轨迹预算下,基于NeuralConf得分的置信加权答案聚合优于多数投票、尾部置信度和其他静态基线。这些结果表明,LLMs通过自身的置信动态暴露了轨迹内在的正确性统计信号,为利用生成过程中已有的信息改进推理提供了途径。

英文摘要

Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies--Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
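
A minimal sketch of confidence-weighted answer aggregation, compared with majority voting, may help make the last step concrete. It assumes each sampled trace already has a correctness score in [0, 1]; in the paper that score would come from NeuralConf, whereas the numbers below are made up for illustration and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most frequent final answer among the traces."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, scores):
    """Sum per-trace scores for each distinct answer and return the top one."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Five traces for one question: final answers and illustrative scores.
answers = ["12", "12", "15", "15", "15"]
scores = [0.9, 0.8, 0.2, 0.3, 0.1]

print(majority_vote(answers))
print(confidence_weighted_vote(answers, scores))
```

With these toy values the two rules disagree: three low-scoring traces outvote two high-scoring ones under majority voting, while the weighted rule prefers the answer backed by the higher total score.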

2605.16819 2026-05-19 cs.CL cs.AI cs.LG 版本更新

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena: 面向GPU核优化代理的泛化感知基准测试

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

发表机构 * AMD

AI总结 本文提出AgentKernelArena,一个用于评估GPU核优化代理的开源基准,通过隔离工作区和集中评分机制,测试代理在不同任务和硬件目标上的性能和泛化能力,发现在大多数任务类别上编译成功率接近完美、正确率较高,但在PyTorch到HIP的转换任务中,未见配置下的正确性显著下降。

详情
AI中文摘要

GPU核优化对于高效深度学习系统日益关键,但编写高性能核仍然需要大量的底层专业知识。最近的AI编码代理可以迭代地阅读代码、调用编译器和性能分析器并改进实现,但现有的核基准测试评估的是单次LLM调用而非完整的代理工作流程,且没有一个同时包含核到核优化和未见配置的泛化测试。我们提出了AgentKernelArena,一个开源的基准测试,用于衡量AI编码代理在GPU核优化上的能力。该基准测试包含196个任务,涵盖HIP到HIP的优化、Triton到Triton的优化以及PyTorch到HIP的转换,并在隔离的工作区中评估完整的代理工作流程,使用门控的编译、正确性和性能检查、集中评分,以及一个未见配置泛化协议,用于测试优化是否能迁移到代理从未见过的输入配置。在包括Cursor Agent、Claude Code和Codex Agent在内的生产级代理中,我们发现在大多数任务类别上编译成功率接近完美、正确率较高,最强配置的平均加速在PyTorch到HIP任务中最高达6.89倍,在HIP到HIP任务中最高达6.69倍,在Triton到Triton任务中最高达2.13倍。我们的未见配置评估显示,HIP到HIP和Triton到Triton的优化大多能迁移到未见过的输入形状,而PyTorch到HIP的转换则表现出显著的正确性下降,表明从零开始生成核的代理经常硬编码与特定形状相关的假设。AgentKernelArena被设计为一个模块化、可扩展的框架,用于跨代理、任务和硬件目标严格评估代理式GPU核优化。

英文摘要

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

2605.16809 2026-05-19 cs.LG 版本更新

Informative Graph Structure Learning

信息导向的图结构学习

Shen Han, Zhiyao Zhou, Jiawei Chen, Sheng Zhou, Canghong Jin, Hai Lin, Da Zhong Li, Bingde Hu, Can Wang

发表机构 * Zhejiang University(浙江大学) Hangzhou City University(浙大城市学院) China Mobile Communications Group Co.,Ltd(中国移动通信集团有限公司)

AI总结 本文提出了一种信息导向的图结构学习方法(InGSL),通过结合相似性和多样性来优化图结构,减少边数并提高性能。

详情
AI中文摘要

图结构数据的质量对现代图分析技术如图神经网络(GNNs)的成功至关重要。然而,现实中的图数据往往质量不佳,存在噪声和连接不完整等问题。图结构学习(GSL)作为一种适应性优化节点连接的技术已崭露头角。然而,我们发现GSL的效果常常以边数大幅增加为代价,导致存储和计算开销显著增加。在本工作中,我们揭示这一限制源于广泛使用的基于相似性的边构造方法,该方法主要基于嵌入连接高度相似的邻居,引入了大量结构冗余。为了解决这一问题,我们提出了一种新颖的信息导向图结构学习方法(InGSL),通过引入互信息引导的学习策略,同时考虑相似性和多样性进行边构造。值得注意的是,InGSL作为一种可插拔模块,能够无缝集成到现有的GSL框架中。通过在六个代表性GSL方法上的广泛实验,我们证明InGSL在减少边数的同时实现了显著的性能提升。

英文摘要

The quality of graph-structured data is fundamental to the success of modern graph analysis techniques such as Graph Neural Networks (GNNs). However, real-world graph data is often suboptimal, suffering from issues such as noise and incomplete connections. Graph Structure Learning (GSL) has emerged as a promising technique that adaptively optimizes node connections. However, we observe that the effectiveness of GSL often comes at the cost of a dramatic expansion in edge count, resulting in significant storage and computational overhead. In this work, we reveal that this limitation stems from the prevalent use of similarity-based edge construction, which predominantly connects highly similar neighbors based on their embeddings, introducing substantial structure redundancy. To address this, we propose a novel Informative Graph Structure Learning method (InGSL), which jointly considers both similarity and diversity in edge construction by incorporating a mutual-information-guided learning strategy. Notably, InGSL serves as a plug-in module that can be seamlessly integrated into existing GSL frameworks. Through extensive experiments on six representative GSL methods, we demonstrate that InGSL achieves significant performance improvements at a reduced number of edges.

2605.16806 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

跨模态亲和对齐的多模态学习分析用于预测基于游戏的学习中学生协作满意度

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

发表机构 * Institute of Education, National Cheng Kung University(国立成功大学教育研究所) Institute of Intelligent System, National Yang Ming Chiao Tung University(阳明交通大学智能系统研究所) Department of Computer Science, University at Albany, State University of New York(纽约州立大学奥尔巴尼分校计算机科学系)

AI总结 本文提出了一种跨模态亲和对齐的多模态学习分析框架,通过建模模态间关系和对比学习来增强学生协作满意度预测的鲁棒性和可解释性。

Comments Accepted by CVPR 2026 CVxEdu Workshop

详情
AI中文摘要

协作式基于游戏的学习环境为小组知识构建提供了丰富的机遇,但自动预测学生协作满意度仍具挑战性。关键障碍是模态退化:在教育部署中,个体模态如眼动在学生群体间表现出不一致的信息量,导致基于隐式注意力的融合产生脆弱的多模态表示。我们提出了亲和对齐多模态学习分析(AAMLA)框架,其核心贡献是跨模态亲和引导的模态对齐(CAMA)模块,该模块通过亲和矩阵显式建模模态间关系,并通过对比学习强制跨模态一致性,从而实现对无信息模态的自适应抑制而不丢弃它们。AAMLA进一步应用模态特定的投影层,将异构特征,包括面部动作单元、头部姿态、眼动和交互痕迹日志,映射到统一的语义空间,然后再进行对齐。在EcoJourneys协作学习环境中的50名中学生实验表明,在标准和模态退化条件下,AAMLA在单模态基线和先前跨注意力方法上均表现出一致的改进,SHAP和t-SNE分析证实CAMA能够产生稳健且可解释的跨模态表示,用于学生协作建模。

英文摘要

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.

2605.16800 2026-05-19 cs.LG cs.CL 版本更新

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

FIM-LoRA: 通过校准时间梯度方差估计实现任务信息的秩分配

Ramakrishnan Sathyavageeswaran

发表机构 * Intuit

AI总结 本文提出FIM-LoRA,通过校准时间梯度方差估计来分配任务信息的秩,以优化LoRA的秩分配,从而提高模型性能。

Comments 10 pages, 1 figure

详情
AI中文摘要

低秩适应(LoRA)为每个适应的权重矩阵分配一个统一的秩——一种实用的便利,但忽略了一个基本现实:不同层对任务适应的贡献不均。我们通过一种轻量级的工程解决方案来解决这个问题:在微调开始之前,运行八次校准反向传递,计算每个LoRA-B矩阵的梯度方差作为层信息度的代理,并按比例重新分配秩预算。所得到的适配器是一个标准的LoRA,具有每层的秩模式——没有新的参数,没有训练开销,没有对服务基础设施的更改。我们通过高效地近似经验 Fisher 信息矩阵(eFIM)对角线,仅限于 LoRA 适配器矩阵,来实现这一点,这将内存成本降低了大约256倍相比完整的模型 Fisher 估计。在 GLUE 上使用 DeBERTa-v3-base 时,FIM-LoRA 在相同参数预算下与 LoRA 相当(88.6 vs. 88.7),在常识推理上使用 LLaMA-3-8B 时达到 68.5 vs. 68.7。每层的秩映射是可解释的:值投影和早期到中期层一致获得更高的秩,与已建立的 transformer 层角色研究结果一致。

英文摘要

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.
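
The redistribution step described above can be sketched as follows. This is only an illustration under stated assumptions: the per-layer gradient variances are taken as given (the calibration passes themselves are not shown), every layer keeps a hypothetical minimum rank of 1, and rounding uses the largest-remainder rule so the total budget is preserved. The function name `allocate_ranks` is not from the paper.

```python
def allocate_ranks(variances, total_rank, min_rank=1):
    """Split a rank budget across layers in proportion to gradient variance."""
    n = len(variances)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("budget too small for the minimum rank")
    total_var = sum(variances)
    if total_var == 0:
        shares = [spare / n] * n          # fall back to a uniform split
    else:
        shares = [spare * v / total_var for v in variances]
    # Floor every share, then hand leftover units to the largest remainders.
    ranks = [min_rank + int(s) for s in shares]
    leftover = total_rank - sum(ranks)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks

# Four adapted matrices with a uniform-LoRA budget of 4 x 8 = 32.
variances = [4.0, 1.0, 1.0, 2.0]          # illustrative values only
ranks = allocate_ranks(variances, total_rank=32)
print(ranks, sum(ranks))
```

Because the sum of ranks equals the uniform budget, the parameter count stays comparable to standard LoRA, which matches the "same parameter budget" comparison in the abstract.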

2605.16790 2026-05-19 cs.LG cs.AI cs.CL 版本更新

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

TIER: 用于多步工具组合的轨迹不变执行奖励

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

发表机构 * UC San Diego(加州大学圣迭戈分校) Cisco Research(思科研究)

AI总结 本文提出TIER,一种基于函数模式和运行时执行的奖励框架,能够提供密集且可解释的序列级反馈,支持多种解决方案策略并适应变化的工具接口,在DepthBench等基准上实现了高准确率。

Comments Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at https://github.com/anaykulkarni/TIER

详情
AI中文摘要

工具使用使大语言模型能够通过一系列API调用解决复杂任务,但现有的强化学习方法无法扩展到多步骤组合设置。基于结果的奖励只能提供稀疏反馈,而轨迹监督的奖励依赖于注释的参考解决方案,惩罚有效的替代方案并限制可扩展性。我们提出TIER:轨迹不变执行奖励,一种奖励框架,其监督直接来自函数模式和运行时执行,而非参考轨迹。该奖励分解为格式有效性、模式遵守、执行成功和答案正确性,提供来自细粒度验证的单个步骤工具使用反馈。这种设计允许任何有效的执行路径获得信用,自然支持多种解决方案策略并适应变化的工具接口。在DepthBench,一个按深度(1到6步)分层的组合基准上,TIER在所有步骤中实现了>90%的准确率,其中轨迹监督的奖励在第4步之后崩溃。我们进一步在BFCL v3和NestFUL等基准上展示了持续的提升。消融研究确认所有奖励组件都是必要的,突显了多级监督对于组合推理的重要性。

英文摘要

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
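
The four-part decomposition can be sketched with a toy tool registry. Everything below is hypothetical: the `TOOLS` table, the JSON call format, the `"$prev"` placeholder for chaining results, and the per-component scoring are illustrative choices, not the paper's implementation. The point is that the score depends only on schemas, execution, and the final answer, so two different valid call sequences receive the same credit.

```python
import json

NUMBER = (int, float)
# Hypothetical tool registry: name -> (argument schema, implementation).
TOOLS = {
    "add": ({"a": NUMBER, "b": NUMBER}, lambda a, b: a + b),
    "mul": ({"a": NUMBER, "b": NUMBER}, lambda a, b: a * b),
}

def score_trajectory(raw_calls, expected_answer):
    """Score a list of JSON tool calls without any reference trajectory."""
    fmt = schema = executed = 0
    prev = None
    for raw in raw_calls:
        try:                                   # 1. format validity
            call = json.loads(raw)
            name, args = call["tool"], call["args"]
            if not isinstance(args, dict):
                raise TypeError("args must be an object")
        except (ValueError, KeyError, TypeError):
            break
        fmt += 1
        if name not in TOOLS:                  # 2. schema adherence
            break
        spec, fn = TOOLS[name]
        args = {k: (prev if v == "$prev" else v) for k, v in args.items()}
        if set(args) != set(spec) or not all(isinstance(args[k], spec[k]) for k in spec):
            break
        schema += 1
        try:                                   # 3. execution success
            prev = fn(**args)
        except Exception:
            break
        executed += 1
    n = max(len(raw_calls), 1)
    finished = bool(raw_calls) and executed == len(raw_calls)
    answer = 1.0 if finished and prev == expected_answer else 0.0   # 4. answer
    return {"format": fmt / n, "schema": schema / n,
            "execution": executed / n, "answer": answer}

two_step = ['{"tool": "add", "args": {"a": 2, "b": 3}}',
            '{"tool": "mul", "args": {"a": "$prev", "b": 4}}']
one_step = ['{"tool": "mul", "args": {"a": 4, "b": 5}}']
broken = ['{"tool": "add", "args": {"a": 2, "b": 3}}',
          '{"tool": "mul", "args": {"a": "$prev"}}']

print(score_trajectory(two_step, 20))
print(score_trajectory(one_step, 20))
print(score_trajectory(broken, 20))
```

The broken trajectory still earns partial credit for its well-formed calls and its first valid step, which is the kind of dense signal an outcome-only reward would not provide.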

2605.16787 2026-05-19 cs.LG cs.CL 版本更新

The Unlearnability Phenomenon in RLVR for Language Models

在语言模型中RLVR的不可学习现象

Yulin Chen, He He, Chen Zhao

发表机构 * New York University(纽约大学)

AI总结 本文研究了RLVR在提升大语言模型推理能力中的学习动态,发现即使存在正确回放,某些难例仍无法学习,揭示了当前RL方法在推理任务中的根本限制。

Comments Accepted to ICML 2026

详情
AI中文摘要

可验证奖励强化学习(RLVR)已被证明在提高大语言模型(LLM)的推理能力方面是有效的。然而,RLVR的学习动态仍缺乏深入研究。在本文中,我们揭示了一个反直觉的现象:在模型最初难以处理的硬例中,一个显著子集即使在存在正确回放的情况下仍无法学习。为了理解这一现象,我们首先证明了现有的优化和采样技术无法解决不可学习性。通过跨例梯度分析,我们显示不可学习的例子具有根本性的表示问题,其特征是与其余例子的梯度相似性低且推理模式不可泛化。我们进一步表明,表示缺陷在RL中难以缓解,因为数据增强无法提高梯度相似性。本研究为RLVR训练中的不可学习数据提供了首次系统的表征,并揭示了当前RL方法在推理任务中的根本限制。代码和数据可在https://github.com/yulinchen99/unlearnability-rlvr获取。

英文摘要

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

2605.16786 2026-05-19 cs.LG 版本更新

Lever: Speculative LLM Inference on Smartphones

Lever:智能手机上的推测LLM推理

Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

发表机构 * Tsinghua University(清华大学) Beihang University(北航) University of Pittsburgh(匹兹堡大学)

AI总结 本文提出Lever系统,通过联合优化推测解码的三个阶段,在智能手机上实现高效的闪存支持的LLM推理,显著降低了推理延迟。

详情
AI中文摘要

大型语言模型(LLMs)在交互式移动应用中需求日益增加,但高质量模型超出了智能手机上有限的DRAM容量。闪存可以容纳更大的模型,但闪存支持的推理速度慢,因为自回归解码反复调用目标模型并产生昂贵的I/O。我们观察到推测解码非常适合这种环境:一个小型草稿模型可以保留在DRAM中,而一个更大的驻留于闪存的目标模型在每次调用中验证多个候选令牌。然而,现有方法假设服务器级加速器,并未考虑长时间I/O延迟、有限的计算并行性和不规则的推测执行。我们提出了Lever,一个用于智能手机上高效闪存支持LLM推理的端到端系统。Lever在移动约束下联合优化推测解码的三个阶段。在草稿阶段,它使用I/O和计算感知的增益-成本目标构建令牌树。在验证阶段,它通过早期退出预测修剪低价值分支以减少目标模型计算。在执行阶段,它将推测高效地映射到移动CPU-NPU硬件以提高利用率。全面评估显示,Lever将推理延迟降低了2.93倍于基准闪存卸载推理,1.50倍于传统推测解码,缩小了闪存支持与内存驻留LLM推理之间的延迟差距。

英文摘要

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
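
The basic draft-then-verify loop that Lever builds on can be sketched with toy models. This is plain greedy speculative decoding on a linear draft, not Lever's token trees, early-exit pruning, or CPU-NPU mapping. Both "models" are stand-in functions from a token prefix to the next token, and the target's batched verification is simulated position by position while being counted as a single invocation.

```python
def speculative_decode(draft_next, target_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, target invocations)."""
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft k candidate tokens with the small, memory-resident model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # One target invocation checks all drafted positions.
        target_calls += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + accepted)
            accepted.append(expected)          # keep the target's token
            if expected != draft[i]:
                break                          # first mismatch ends the round
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_calls

# Toy target: count upward modulo 10. Toy draft: same, but wrong after a 2.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

output, calls = speculative_decode(draft, target, prompt=[0], max_new_tokens=8)
print(output, calls)
```

The output is identical to decoding with the target alone, but the target is invoked far fewer times than once per token, which is what matters when every invocation has to stream weights from flash.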

2605.16776 2026-05-19 cs.LG cs.AI 版本更新

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

可区分删除:统一知识擦除与拒绝用于大语言模型去学习

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

发表机构 * Department of Natural Language Processing, MBZUAI. University of Oxford. RIKEN Center for Advanced Intelligence Project. TMLR Group, Department of Computer Science, Hong Kong Baptist University

AI总结 本文提出D^2方法,通过限制潜在表示中的响应分布来擦除不受欢迎的知识,同时区分保留知识,从而实现安全且一致的拒绝机制,以提高大语言模型去学习的效果。

Comments ICML2026 Accepted

详情
AI中文摘要

减轻敏感和有害输出对于确保大型语言模型(LLM)的安全部署至关重要。现有方法通常遵循两种范式:知识删除(KD),在训练期间擦除不受欢迎的信息,以及可区分拒绝(DR),在推理期间引导模型远离使用敏感知识。尽管进展迅速,基于KD的去学习在抑制特定令牌序列作为完整知识移除替代物时面临偏见删除的问题,而基于DR的去学习则因底层知识仍然完整而有重新出现有害知识的风险。为了解决这些问题,我们提出了可区分删除(D^2),一种通过限制潜在表示中的响应分布来擦除不受欢迎知识,同时区分保留知识的范式,从而能够安全且一致地处理去学习的输入。为了实现D^2,我们引入了一个能量指数,该指数量化了知识的存在以及去学习内容与保留内容之间的分离。数学和实证分析表明,能量既准确又高效,使能量基于去学习对齐(EUA)能够在训练期间强制执行能量边界去学习,并在推理时应用基于能量的拒绝机制。广泛的实验表明,EUA显著优于先前方法,表明D^2的优越性。我们的代码可在https://github.com/Puning97/EUA-for-LLM-Unlearning获取。

英文摘要

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.

2605.16775 2026-05-19 cs.CV cs.AI cs.LG 版本更新

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D: 基于3D体积分块对齐的脑MRI自监督学习

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

发表机构 * Institute of Health Informatics(健康信息学研究所) Sheikh Zayed Institute for Pediatric Surgical Innovation(谢赫扎耶德儿童外科创新研究所) School of Medicine and Health Sciences(医学与健康科学学院)

AI总结 本文提出VolTA-3D,一种用于脑MRI自监督学习的3D视觉Transformer框架,通过联合对齐全局类风格标记和局部块标记,增强体积分块表示的可迁移性,从而在多个下游任务中表现出更好的泛化能力和鲁棒性。

Comments Accepted at EMBC 2026

详情
AI中文摘要

自监督学习(SSL)通过利用大规模未标记数据推动了医学图像分析的发展。然而,在脑磁共振成像(MRI)中,大多数3D模型仍局限于分割或分类任务,限制了其在不同数据集、成像协议和下游任务中的泛化能力。这种缺乏可迁移性限制了3D MRI模型的临床应用,尽管存在大量未标记的体数据。我们提出了Volta-3D,一种自监督的3D视觉Transformer框架,旨在学习可迁移的体表示。Volta-3D在学生-教师范式中联合对齐全局类风格标记和局部块标记,并强制细粒度结构重建。这种联合全局-局部对齐解决了脑MRI中有限的语义多样性和细微解剖特征,这对现有SSL方法构成了挑战。我们在多个分布外下游任务上评估了Volta-3D,包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中,Volta-3D学习的表示均优于随机初始化的基线,证明了其在域偏移下的改进可迁移性和鲁棒性。因此,在预训练过程中联合强制全局语义一致性和局部结构学习,使模型能够从未标记的脑MRI数据中学习更广泛的概念。总体而言,VolTA-3D支持有效的多任务下游性能,具有任务特定的适应性,是迈向通用化和临床可行的3D模型的一步。

英文摘要

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.

2605.16755 2026-05-19 cs.LG cs.AI 版本更新

Learning Unbiased Permutations via Flow Matching

通过流匹配学习无偏排列

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science(计算机科学系) Cornell University(康奈尔大学)

AI总结 本文提出PermFlow框架,通过在具有单位行和列和的矩阵仿射子空间上直接操作,学习多模态排列分布,避免了基于熵正则化Sinkhorn方法在模糊性下的崩溃问题。

详情
AI中文摘要

学习排列对于排序、排名和匹配至关重要,但现有的基于熵正则化Sinkhorn的可微方法会产生单一的软解,并在模糊性下崩溃。我们提出了PermFlow,一种条件流匹配框架,直接在具有单位行和列和的矩阵仿射子空间上操作。一个闭式切线空间投影器通过构造而非迭代校正,精确保持这些约束沿每条轨迹。一个最近目标耦合将不同的噪声初始值引导到不同的有效排列。结果是一个能够捕捉多模态排列分布而非将其坍缩到单一模式的模型。在具有混合数字模糊性的视觉排序任务和对称线性分配问题上,PermFlow在无歧义输入上具有高精度,并在模糊性下恢复两个有效排列,而基于Sinkhorn的基线方法在结构上失败。

英文摘要

Learning permutations is fundamental to sorting, ranking, and matching, but existing differentiable methods based on entropy-regularized Sinkhorn produce a single softened solution and collapse under ambiguity. We present PermFlow, a conditional flow matching framework that operates directly on the affine subspace of matrices with unit row and column sums. A closed-form tangent-space projector preserves these constraints exactly along every trajectory, by construction rather than through iterative correction, and a nearest-target coupling routes distinct noisy initializations toward distinct valid permutations. The result is a model that captures multimodal permutation distributions rather than collapsing them to a single mode. On a visual sorting task with blended-digit ambiguity and a symmetric linear assignment problem, PermFlow achieves high accuracy on unambiguous inputs and recovers both valid permutations under ambiguity, where Sinkhorn-based baselines structurally fail.
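
One standard closed form for such a projector is double centering: subtract row means and column means, then add back the grand mean. The sketch below assumes square matrices stored as nested lists and is only an illustration of why the constraints can hold by construction; the paper's exact projector and velocity field may differ, and the names `project_tangent` and `euler_step` are hypothetical.

```python
def project_tangent(v):
    """Project a square matrix onto matrices with zero row and column sums."""
    n = len(v)
    row_mean = [sum(row) / n for row in v]
    col_mean = [sum(v[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[v[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
            for i in range(n)]

def euler_step(x, velocity, dt):
    """Move x along the projected velocity; row and column sums are unchanged."""
    p = project_tangent(velocity)
    n = len(x)
    return [[x[i][j] + dt * p[i][j] for j in range(n)] for i in range(n)]

n = 3
x0 = [[1.0 / n] * n for _ in range(n)]            # unit row and column sums
velocity = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 0.0, 1.0]]  # arbitrary
x1 = euler_step(x0, velocity, dt=0.1)

print([round(sum(row), 9) for row in x1])
print([round(sum(x1[i][j] for i in range(n)), 9) for j in range(n)])
```

Since the projected velocity has zero row and column sums, any step size leaves the unit-sum constraints intact, with no Sinkhorn-style iterative renormalization needed.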

2605.16748 2026-05-19 cs.GR cs.AI cs.CV cs.LG cs.MA cs.MM 版本更新

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Genflow Ad Studio:一种用于品牌一致、自我纠正视频生成的复合AI架构

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

发表机构 * Google(谷歌)

AI总结 本文提出Genflow Ad Studio,一种复合AI架构,通过品牌DNA提取模块和对抗性多代理质量控制循环,提高了品牌一致的视频生成效率,将合规率从42%提升到89%。

Comments 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links

详情
Journal ref
ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA
AI中文摘要

近期生成视频模型的进步展示了高水平的视觉保真度,但其在企业环境中的整合受到时间不一致性和严重的品牌不一致性的限制。当前的单体架构难以强制执行严格的品牌约束,经常产生未经批准的视觉资产。我们介绍了Genflow,一种复合AI系统,旨在生成媒体生产中强制执行品牌一致性。我们的架构集成了基于检索的'品牌DNA'提取模块,以参数化生成方式根据已确立的企业身份指南进行生成。此外,我们实现了对抗性多代理质量控制(QC)循环。与单次生成流程不同,此流程采用评估代理,反复批评生成的帧,与提取的参数进行比较,促使生成模型细化输出,直到达成确定性的一致性。通过转向多阶段、自我纠正的流程,Genflow将品牌合规视频生成的产量从42%提高到89%,建立了稳健的框架,用于可扩展的、企业级的生成系统。

英文摘要

Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.

2605.16747 2026-05-19 cs.LG math.AP math.OC math.PR math.ST stat.TH 版本更新

Propagation of Chaos in Contextual Flow Maps

上下文流映射中的混沌传播

Shi Chen, Zhengjiang Lin, Kaizhao Liu, Philippe Rigollet

发表机构 * Department of Mathematics, Massachusetts Institute of Technology(麻省理工学院数学系)

AI总结 本文采用上下文流映射(CFMs)这一抽象,为大上下文情形下的transformers建立定量统计理论,利用混沌传播工具给出有限上下文与无限上下文系统之间的前向界和后向界:一般CFMs达到最优Wasserstein速率n^{-1/d},包含transformers的受限CFM类达到参数速率n^{-1/2}。

Comments 31 pages, 1 figure

详情
AI中文摘要

我们采用上下文流映射(CFMs)这一抽象,为大上下文情形下的transformers建立定量统计理论:CFM是一类动力系统,在上下文测度存在的情况下,使一个特定的token沿着一叠注意力块演化。在此框架下,有限上下文模型近似于一个理想化的无限上下文系统,其中上下文测度被其底层总体分布取代,因此上下文长度n成为一种统计资源。利用该动力学的McKean-Vlasov结构和混沌传播的经典工具,我们建立了一个前向界,沿深度一致地控制有限上下文与无限上下文CFMs之间的偏差;并建立了一个后向界,在在线梯度下降的各次迭代中一致地控制相应训练轨迹之间的偏差。对于一般的CFMs,这两个界均达到最优Wasserstein速率n^{-1/d};对于一个以transformers为特例的受限CFM类,则达到参数速率n^{-1/2}。分析依赖于损失梯度的一种新的欧拉(Eulerian)伴随形式,以及由此得到的前向-伴随系统的稳定性估计,二者本身可能也具有独立的研究价值。

英文摘要

We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length $n$ becomes a statistical resource. Exploiting the McKean--Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general CFMs and parametric rate $n^{-1/2}$ for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward--adjoint system, both of which may be of independent interest.

2605.16746 2026-05-19 cs.AI cs.LG 版本更新

State Contamination in Memory-Augmented LLM Agents

内存增强型大语言模型代理中的状态污染

Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

发表机构 * Department of Computer Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机科学系)

AI总结 研究探讨了内存增强型大语言模型代理中由于状态污染导致的安全问题,通过分析内存总结中的毒性内容传播,提出了一种新的衡量指标,并指出在信息压缩前进行净化可以有效减少潜在影响。

详情
AI中文摘要

LLM代理越来越多地依赖持久化状态,包括转录文本、摘要、检索上下文和内存缓冲区,以支持长周期交互。这使得安全性不仅取决于个体模型输出,还取决于代理存储和后来重用的内容。我们研究了一种称为内存清洗的故障模式:有毒或对抗性上下文可以被压缩成内存摘要,这些摘要在标准检测器下不再显得有毒,但仍保留了影响未来生成的敌对框架或冲突结构。通过配对的反事实多代理模拟,我们证明有毒起源的内存摘要可以保持在常见毒性阈值以下,但相对于匹配的中性基线,仍会增加下游毒性。为了衡量这种隐藏影响,我们引入了子阈值传播间隙(SPG),它量化了在部署监控器视为安全的内存状态下,下游行为差异。我们的实验表明,毒性通过不同的状态通道传播:原始转录文本重用驱动显性下游毒性,而压缩的内存则携带隐藏的子阈值影响。我们进一步发现,缓解依赖于干预位置。在摘要前净化有毒状态可显著减少隐藏传播间隙,而仅清洁完成的摘要则可能保留被清洗的影响。这些结果表明,内存增强型代理的安全性应被视为对演进上下文的状态控制问题,净化应在不安全信息被压缩进持久内存之前应用。

英文摘要

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

2605.16735 2026-05-19 cs.NI cs.LG 版本更新

Transformer-Based MCS Prediction for 5G Multicast-Broadcast Services (MBS)

基于Transformer的5G多播广播服务(MBS)的MCS预测

Kasidis Arunruangsirilert, Jiro Katto

发表机构 * Department of Computer Science and Communications Engineering(计算机科学与通信工程系) Waseda University(早稻田大学)

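AI summary This paper proposes a lightweight Transformer-based framework that predicts the success probability of all 28 MCS indices over an upcoming video-segment horizon, in order to improve the reliability of 5G Multicast-Broadcast Services.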

Comments 2026 IEEE 104th Vehicular Technology Conference (VTC2026-Fall), 6-9 September 2026, Boston, Massachusetts, USA

详情
AI中文摘要

5G多播广播服务(MBS)的部署正在成为一种关键技术,用于高效频谱的超高清内容交付,并作为现代有线电视部署的有前途的解决方案。然而,与依赖RLC-AM和HARQ重传的单播网络不同,MBS广播在RLC无确认模式(RLC-UM)下运行,其中没有反馈环路意味着丢包是永久的,并立即影响用户QoE。传统链路自适应算法,设计用于单播,通常激进地最大化吞吐量,并在这一风险容忍度低的环境中失败,导致严重的视频卡顿和重新缓冲。为此,我们提出了一种轻量级的基于Transformer的框架,该框架预测即将到来的视频片段 horizon 上所有28个MCS指数的成功概率。利用一个独特的商业网络数据集,具有0.5毫秒的槽级粒度,我们使用一个定制的非对称安全性损失函数训练我们的模型,该函数惩罚信道过估计以优先考虑链路稳定性。实验结果表明,我们的方法在可靠性得分上达到86.89%,显著优于标准AI基线,这些基线优化于原始吞吐量(31.65%),同时保持安全的保守偏见。此外,该模型针对实时应用进行了优化,在COTS 5G时代的智能手机上展示了小于0.07毫秒的推理时间。

英文摘要

The deployment of 5G Multicast-Broadcast Services (MBS) is emerging as a critical technology for spectral-efficient UHD content delivery and serving as a promising solution to modernize CATV deployment. However, unlike unicast networks that rely on RLC-AM with HARQ retransmissions, MBS broadcast operates in RLC Unacknowledged Mode (RLC-UM), where the absence of a feedback loop means packet loss is permanent and immediately impacts user QoE. Conventional link adaptation algorithms, designed for unicast, typically aggressively maximize throughput and fail in this risk-intolerant environment, resulting in severe video stalls and rebuffering. To address this, we propose a lightweight Transformer-based framework that predicts the success probability of all 28 MCS indices over an upcoming video segment horizon. Utilizing a unique commercial network dataset with 0.5 ms slot-level granularity, we train our model using a custom Asymmetric Safety Loss function that penalizes channel overestimation to prioritize link stability. Experimental results show that our approach achieves a reliability score of 86.89%, significantly outperforming standard AI baselines optimized for raw throughput (31.65%) while maintaining a safe conservative bias. Furthermore, the model is optimized for real-time applications, demonstrating an inference time of less than 0.07 ms on COTS 5G-era smartphones.
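
The abstract does not give the form of the Asymmetric Safety Loss. The sketch below shows one plausible shape, assuming a squared error whose weight is larger when the model overestimates the channel, plus a hypothetical rule for choosing an MCS index from predicted success probabilities. The names `asymmetric_safety_loss`, `pick_mcs`, `over_penalty`, and `min_success` are assumptions for illustration.

```python
def asymmetric_safety_loss(predicted, actual, over_penalty=4.0):
    """Mean squared error where overestimates (predicted > actual) are
    weighted by over_penalty and underestimates by 1.0."""
    total = 0.0
    for p, a in zip(predicted, actual):
        err = p - a
        weight = over_penalty if err > 0 else 1.0
        total += weight * err * err
    return total / len(predicted)

def pick_mcs(success_probs, min_success=0.99):
    """Return the highest MCS index whose predicted success probability
    meets min_success; fall back to index 0 if none does."""
    safe = [i for i, p in enumerate(success_probs) if p >= min_success]
    return max(safe) if safe else 0

over = asymmetric_safety_loss([0.9], [0.7])   # overestimate by 0.2
under = asymmetric_safety_loss([0.5], [0.7])  # underestimate by 0.2
chosen = pick_mcs([1.0, 0.999, 0.995, 0.97, 0.6])  # 5 indices for brevity
```

With equal error magnitude, the overestimate costs `over_penalty` times more, which pushes a model trained this way toward conservative predictions.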

2605.16732 2026-05-19 cs.CV cs.LG 版本更新

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

DiRotQ:面向4位扩散变换器的旋转感知量化

Sayeh Sharify, Mahsa Salmani, Hesham Mostafa

发表机构 * d-Matrix

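AI summary This paper proposes DiRotQ, a W4A4 quantization framework that mitigates the quality degradation of Diffusion Transformers at 4-bit precision through rotation-aware activation quantization. It also introduces a VLM-as-a-Judge evaluation protocol and a custom Triton kernel for efficient inference under compression.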

详情
AI中文摘要

扩散变换器(DiTs)在图像生成质量上达到最先进的水平,但在推理过程中带来显著的内存和计算成本。尽管激进的后训练量化(PTQ)到4位精度能带来显著的效率提升,但通常会导致严重的质量下降。现有方法,包括基于平滑的方法、混合精度方案、旋转技术以及低秩残差方法,部分缓解了这一问题,但仍与FP16/BF16性能存在明显差距。在本工作中,我们引入DiRotQ,一种W4A4 PTQ框架,通过旋转感知的激活量化来缓解这种降级。DiRotQ通过主成分分析(PCA)识别出捕捉主导激活方差的低秩子空间,在该子空间中保留系数以较高精度,同时将剩余组件量化为4位。在推理时,通过校准得出的正交变换将激活旋转到PCA基底中,而逆旋转被融合到层权重中,离线。结合基于GPTQ的权重量化,DiRotQ在PixArt-Σ数据集上实现了FID(更低越好)为15.9和PSNR(越高越好)为19.1 dB,优于先前最先进的SVDQuant(FID 18.9,PSNR 17.6)在同一INT W4A4设置下的表现。除了标准指标外,我们引入了VLM-as-a-Judge评估协议,这是该设置下的首次此类评估,提供了更全面的感知质量和提示对齐评估。在系统层面,我们实现了基于Triton的定制内核,以实现高效的端到端推理,将12B FLUX.1-dev模型的内存使用减少了2.1倍,并在24 GB RTX 4090 GPU上实现了2.3倍的加速。

英文摘要

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
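
The idea of rotating activations online while fusing the inverse rotation into the weights offline rests on a simple identity: for an orthogonal matrix R, W x = (W Rᵀ)(R x). The sketch below checks this on a 2×2 example in plain Python. The matrices are arbitrary illustrative values, and the PCA calibration and 4-bit quantization steps described in the abstract are not reproduced.

```python
import math

def matvec(m, v):
    """Matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

theta = math.pi / 6
# A 2-D rotation is orthogonal, so R^T R = I.
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
W = [[2.0, -1.0], [0.5, 3.0]]   # layer weights (illustrative)
x = [1.5, -0.25]                # activation vector (illustrative)

W_fused = matmul(W, transpose(R))  # offline: fold the inverse rotation into W
x_rot = matvec(R, x)               # online: rotate the activation
y_ref = matvec(W, x)               # original layer output
y_rot = matvec(W_fused, x_rot)     # output computed in the rotated basis
```

Because the layer output is unchanged, the rotated activations can be quantized in whichever basis concentrates their variance, at no extra cost on the weight side at inference time.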

2605.16720 2026-05-19 cs.CV cs.LG 版本更新

Compositional Adversarial Training for Robust Visual Watermarking

组合对抗训练用于鲁棒的视觉水印

Anirudh Satheesh, Michael-Andrei Panaitescu-Liess, Andrew Xu, Georgios Milis, Heng Huang, Zikui Cai, Furong Huang

发表机构 * University of Maryland(马里兰大学)

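AI summary This paper proposes Compositional Adversarial Training (CAT), a framework that formulates watermark robustness as a min-max problem over a structured space of compositional transformations. Experiments show that it outperforms random-augmentation baselines across multiple attack settings.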

详情
AI中文摘要

鲁棒水印通常使用随机后处理增强进行训练,但随机采样无法覆盖真实攻击管道的组合空间,难以遇到真正破坏检测的稀有组合。这导致训练不稳定且样本效率低。我们将其水印鲁棒性建模为结构化组合转换空间上的min-max问题。我们提出组合对抗训练(CAT),一种插件框架,学习一个顺序可微的对抗者,观察当前水印图像并在每一步选择攻击家族以最大程度干扰信息恢复。CAT结合了直通Gumbel-Softmax攻击选择与熵正则化,使反向传播可端到端微分并聚合攻击家族的梯度信息,从而实现更快、更平滑的收敛,而不陷入单一攻击模式。我们评估CAT在生成后水印VideoSeal 0.0、VideoSeal 1.0和PixelSeal以及在生成WMAR下的单步和双步攻击套件,以及在分布内和多分布图像和视频基准测试中。CAT在单步攻击设置中将水印容量提高最高63.5%,在组合设置中提高13.0%;在自回归设置中,CAT在困难几何变换上将TPR@FPR=1%平均提高12%。这些结果表明,鲁棒视觉水印受益于对抗适应组合对抗者而非独立随机破坏。

英文摘要

Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and seldom encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and to aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on the post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and on the in-generation watermark WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations, improving overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.

2605.16708 2026-05-19 cs.LG stat.ML 版本更新

Isolating Nonlinear Independent Sources in fMRI with $β$-TCVAE Models

利用β-TCVAE模型在fMRI中分离非线性独立源

Qiang Li, Shujian Yu, Jesus Malo, Jingyu Liu, Tülay Adali, Vince D. Calhoun

发表机构 * Tri-Institutional Center for Translational Research in Neuroimaging and Data Science(转化神经影像与数据科学跨机构中心) Georgia State University(佐治亚州立大学) Georgia Institute of Technology(佐治亚理工学院) Emory University(埃默里大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) University of Valencia(瓦伦西亚大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

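AI summary This paper adapts the β-TCVAE model to nonlinear fMRI data to separate mixed spatial and temporal brain signals, recovers biologically meaningful nonlinear spatial components, and uses functional network connectivity to check the interpretability of the latent structure.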

Comments 6 pages, 2 figures

详情
AI中文摘要

从非线性fMRI数据中学习有意义的潜在表示仍然是神经影像分析中的基本挑战。传统独立成分分析(ICA)因其能估计可解释的功能脑网络而被广泛使用,但其依赖于线性混合假设,限制了其捕捉大脑动态内在非线性和复杂组织的能力。近年来,深度表示学习方法作为非线性潜在结构建模的有希望替代方案出现。然而,许多方法主要在模拟数据集或自然图像基准上评估,对真实世界神经影像数据如fMRI的验证相对有限。本文受β-TCVAE(总相关变分自编码器)的启发,这是β-VAE框架的改进,用于学习潜在表示而不引入额外超参数。我们调整并修改该模型以适应fMRI数据,旨在分离混合的空间和时间脑信号为可解释的成分。我们证明β-TCVAE框架可以恢复具有生物学意义的非线性空间成分,包括已建立的内在连接网络如默认模式网络。此外,我们通过功能网络连接性评估学习的表示,显示潜在结构捕捉了连贯且可解释的大脑组织模式。本研究提供了一项将非线性表示学习与fMRI分析连接的初步调查。

英文摘要

Learning meaningful latent representations from nonlinear fMRI data remains a fundamental challenge in neuroimaging analysis. Traditional independent component analysis, widely used due to its ability to estimate interpretable functional brain networks, relies on a linear mixing assumption for latent sources, limiting its ability to capture the inherently nonlinear and complex organization of brain dynamics. More recently, deep representation learning methods have emerged as promising alternatives for modeling nonlinear latent structure. However, many of these approaches have been evaluated primarily on simulated datasets or natural image benchmarks, with comparatively limited validation on real-world neuroimaging data such as fMRI. In this work, we are motivated by the $β$-TCVAE (Total Correlation Variational Autoencoder), a refinement of the $β$-VAE framework for learning latent representations without introducing additional hyperparameters during training. We adapt and modify this model to fMRI data for nonlinear source disentanglement, aiming to separate mixed spatial and temporal brain signals into interpretable components. We show that the $β$-TCVAE framework can recover meaningful nonlinear spatial components with biological relevance, including well-established intrinsic connectivity networks such as the default mode network. Furthermore, we evaluate the learned representations using functional network connectivity, showing that the latent structure captures coherent and interpretable brain organization patterns. This study provides a pilot investigation that bridges nonlinear representation learning and fMRI analysis.

2605.16707 2026-05-19 cs.CR cs.LG 版本更新

On-Device Interpretable Tsetlin Machine-Based Intrusion Detection for Secure IoMT

设备端可解释的基于Tsetlin机的入侵检测系统用于安全的物联网医疗设备

Rahul Jaiswal, Per-Arne Andersen, Linga Reddy Cenkeramaddi, Lei Jiao, Ole-Christoffer Granmo

发表机构 * Department of ICT, University of Agder, Norway(阿格德大学信息与通信技术系)

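AI summary This paper proposes an on-device, interpretable Tsetlin Machine-based intrusion detection system for IoMT environments that improves transparency through logical rules and feature-level contributions, reaching a classification performance of 97.83%.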

Comments 8 pages, 11 figures, 6 Tables, submitted to IEEE Intelligent Conference on Intelligence and Security Informatics (ISI-2026), Cambridge, UK

详情
AI中文摘要

数字健康技术的快速发展正在重新定义全球医疗服务。将无线通信和互联网医疗设备整合到物联网医疗设备(IoMT)网络中,使患者能够进行连续、实时的监测。然而,这种增加的连接性由于日益复杂的网络攻击而提高了网络安全和患者安全风险。本文提出了一种新的设备端、可解释的Tsetlin机(TM)入侵检测系统(IDS),以识别IoMT环境中的各种攻击阶段。TM是一种规则驱动且透明的机器学习(ML)方法,使用命题逻辑表示攻击模式。在MedSec-25数据集上的广泛评估显示,所提出的模型在分类性能上优于ML模型和现有方法,达到97.83%的准确率。此外,所提出的模型通过特征级贡献、类级投票分数和子句激活热图提供明确的决策解释,以增强透明度。边缘部署(树莓派)进一步支持实时设备推理和入侵检测。可解释性和高性能的结合使所提出的模型适合IoMT医疗领域,其中信任、可靠性、安全性和及时决策至关重要。

英文摘要

The rapid evolution of digital health technologies is redefining healthcare services worldwide. The integration of wireless communication and Internet-enabled medical devices within Internet of Medical Things (IoMT) networks enables continuous, real-time patient monitoring. However, this increased connectivity raises cybersecurity and patient safety risks due to increasingly sophisticated cyberattacks. This paper proposes a novel on-device, interpretable Tsetlin Machine (TM)-based Intrusion Detection System (IDS) to identify various phases of cyberattacks in IoMT environments. The TM is a rule-driven and transparent machine learning (ML) approach that represents attack patterns using propositional logic. Extensive evaluations on the MedSec-25 dataset, encompassing various phases of realistic cyberattacks, show that the proposed model outperforms ML models and state-of-the-art methods, attaining a classification performance of 97.83%. Moreover, the proposed model offers explicit explanations of its decisions to enhance transparency using feature-level contributions, class-wise vote scores, and clause activation heatmaps. Edge deployment (Raspberry Pi) further supports real-time on-device inference and intrusion detection. The combination of interpretability and high performance makes the proposed model well-suited for IoMT healthcare, where trust, reliability, safety, and timely decision-making are critical.

2605.16704 2026-05-19 cs.LG 版本更新

Convex Dataset Valuation for Post-Training

训练后凸集估值

Siqi Zeng, Christopher Jung, Rui Li, Zhe Kang, Ming Li, Nima Noorshams, Zhigang Wang, Fuchun Peng, Han Zhao, Xue Feng

发表机构 * Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA(伊利诺伊大学厄巴纳-香槟分校计算机科学系) Meta, Menlo Park, CA, USA(Meta)

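AI summary This paper studies how to select auxiliary datasets for LLM post-training and proposes a convex dataset valuation method based on kernel mean matching that accounts for redundancy among datasets. Experiments show that it outperforms existing valuation baselines at low computational overhead.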

Comments Published as a conference paper at ICML '26. 30 pages, 8 figures

详情
AI中文摘要

改进大语言模型在下游任务上的性能有时需要在训练后利用辅助数据集。然而,开发者在计算、标注和许可成本上面临限制,无法使用所有可用数据,需要有原则的数据集层面选择。这些限制日益受到数据集市场的影响,其中数据获取由预算和谈判决定。我们研究了数据集估值作为训练后大语言模型中的子集选择问题。我们的目标是识别并加权辅助数据集,以在受限制的预算下最大化目标任务性能。我们首先表明,常用梯度对齐分数提供了一个合理但不完整的估值信号,因为它们忽略了数据集间的冗余。为了解决这个问题,我们提出了一种基于梯度空间中核均值匹配(KMM)的可扩展凸数据集估值方法,该方法同时考虑了与目标任务的对齐和辅助数据集间的冗余。通过在多样化的训练后设置和任务中进行广泛实验,我们证明了我们的方法在低计算开销下一致优于现有估值基线,实现了更强的性能。我们的结果将数据集估值定位为一种实用的决策工具,用于受市场限制的大语言模型训练后数据选择。代码可在https://github.com/uiuctml/convex_data_valuation获取。

英文摘要

Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings. The code is available at https://github.com/uiuctml/convex_data_valuation.

2605.16699 2026-05-19 cs.LG q-fin.RM stat.ML 版本更新

Your SaaS Is an Insurance Product: A Modeling Framework

你的 SaaS 是一种保险产品:一种建模框架

Caio Gomes

发表机构 * Magalu

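AI summary This paper argues that capped-usage SaaS products share the structure of insurance products and proposes a modeling framework for SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy.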

Comments 23 pages, 2 figures, 7 tables. Companion code archived at DOI 10.5281/zenodo.20213155

详情
AI中文摘要

Capped-usage SaaS 产品——如 Claude Code 和 ChatGPT 等大语言模型订阅、Vercel 和 Cloudflare Workers 等云平台、企业福利平台、具有责任转移的身份验证服务——与保险产品有相同的结构性特征:固定保费与实际消费解耦、用户层面的随机需求具有厚尾严重性、非同质的上限在固定时间表重置、以及需要在尾部风险下具备充足储备的组合层面暴露。我们主张这不是类比,而是 actuarial science 已经几十年来试图解决的问题,用新的依赖变量(如 tokens、带宽字节、函数调用、健身房打卡)替代医疗索赔。本文提出一个基于频率-严重性分解、保费计算原理和蒙特卡洛储备充足性的建模框架,将其映射到两个领域(LLM 服务和云平台)的公开可观察的订阅层级,基于经典的健康保险经济学(Arrow 1963; Pauly 1968; Manning 等 1987; Brot-Goldberg 等 2017),并通过一个工作示例展示与传统单位经济的差异。贡献是操作性的而非理论性的:不是新的定理,而是目前缺失于 cs.LG/stat.ML 实践中的词汇和工具。

英文摘要

Capped-usage SaaS products -- LLM subscriptions such as Claude Code and ChatGPT, cloud platforms such as Vercel and Cloudflare Workers, corporate benefit platforms, identity-verification services with liability transfer -- share a structural signature with insurance products: a fixed premium decoupled from realized consumption, stochastic per-user demand with heavy-tailed severity, a non-fungible cap that resets on a fixed schedule, and a portfolio-level exposure that requires reserve adequacy under tail risk. We argue that this is not an analogy. It is the same operational problem actuarial science has been tooled for decades to address, restated with new dependent variables (tokens, bandwidth bytes, function-invocations, gym check-ins) in place of medical claims. This paper proposes a modeling framework for capped-usage SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy. We map the framework to publicly observable subscription tiers in two domains (LLM services and cloud platforms), ground it in canonical health-insurance economics (Arrow 1963; Pauly 1968; Manning et al. 1987; Brot-Goldberg et al. 2017), and demonstrate divergence from traditional unit economics through a worked example. The contribution is operational rather than theoretical: not a new theorem, but vocabulary and tools currently absent from cs.LG/stat.ML practice.
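
The three ingredients named in the abstract can be sketched in a few lines, under stated assumptions: Poisson session counts for frequency, lognormal usage per session for severity, a per-user cap, the expected-value premium principle (premium = (1 + loading) × expected cost), and an empirical 99.5% quantile as the reserve target. All parameter values and function names below are illustrative, not taken from the paper or its companion code, and one usage unit is assumed to cost one currency unit.

```python
import random

N_USERS = 50

def simulate_user_usage(rng, rate, mu, sigma, cap):
    """One billing period for one user: Poisson(rate) sessions, lognormal
    usage per session, total truncated at the subscription cap."""
    # Count Poisson events as exponential inter-arrival gaps within one period.
    sessions = 0
    elapsed = rng.expovariate(rate)
    while elapsed < 1.0:
        sessions += 1
        elapsed += rng.expovariate(rate)
    usage = sum(rng.lognormvariate(mu, sigma) for _ in range(sessions))
    return min(usage, cap)

def simulate_portfolio(n_users, n_trials, seed=0,
                       rate=20.0, mu=0.0, sigma=1.5, cap=200.0):
    """Return sorted total portfolio cost across n_trials simulated periods."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        totals.append(sum(simulate_user_usage(rng, rate, mu, sigma, cap)
                          for _ in range(n_users)))
    return sorted(totals)

totals = simulate_portfolio(n_users=N_USERS, n_trials=400)
expected_cost = sum(totals) / len(totals)
loading = 0.2  # safety loading for the expected-value premium principle
premium_per_user = (1 + loading) * expected_cost / N_USERS
reserve_995 = totals[int(0.995 * len(totals))]  # empirical 99.5% quantile
```

Comparing `reserve_995` with total premium income (`premium_per_user * N_USERS`) indicates whether the flat price covers tail periods, which is the question an average-cost unit-economics view does not ask.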

2605.16690 2026-05-19 cs.LG 版本更新

UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

UB-SMoE:面向资源自适应联邦微调的通用平衡稀疏专家混合模型

Van-Tuan Tran, Hong-Hanh Nguyen-Le, Marco Ruffini, Merim Dzaferagic

发表机构 * School of Computer Science and Statistics, Trinity College Dublin, Ireland(特里尼蒂学院都柏林分校计算机科学与统计学学院) School of Computer Science, University College Dublin, Ireland(都柏林大学学院计算机科学学院)

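AI summary This paper proposes UB-SMoE, which uses Dynamic Modulated Routing and a Universal Pseudo-Gradient to address expert utilization imbalance and the non-differentiability of Top-K routing in heterogeneous federated learning, reducing computation and improving performance on low-resource clients.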

Comments ICML 2026

详情
AI中文摘要

异构LoRA-rank方法通过根据计算能力分配客户端特定的秩来解决联邦微调基础模型中的系统异质性问题。然而,这些方法仅实现有限的计算节省,因为密集的前馈计算占主导地位。稀疏专家混合(SMoE)通过条件计算提供有前途的替代方案,但我们发现其在异构联邦设置中的直接应用引入了两个关键不一致:(i)专家利用率不平衡和(ii)Top-K路由的非可微性。我们的收敛分析表明,这些不一致导致了收敛性下降,特别是对资源受限客户端。为了解决这些挑战,我们提出了通用平衡稀疏专家混合(UB-SMoE),它引入了动态调节路由(DMR)来重新平衡专家利用率,并引入通用伪梯度(PG)来重建未激活专家的学习信号。这些机制形成一个自我强化的循环,使专家在异构客户端中保持活力。在基准测试中,UB-SMoE在低资源客户端上实现了高达45.0%的计算节省,同时相比现有异构LoRA-rank方法,其性能提高了8.7倍。

英文摘要

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to $45.0\%$ computational reduction on low-resource clients while improving their performance by $8.7 \times$ compared to existing heterogeneous LoRA-rank methods.

2605.16686 2026-05-19 cs.LG 版本更新

Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates

基于张量结构更新的混合专家LLM可扩展知识编辑

Roman Maksimov, Vladimir Aletov, Dmitry Bylinkin, Daniil Medyakov, Vladimir Solodkin, Aleksandr Beznosikov

发表机构 * OpenAI DeepSeek-AI Qwen Team(Qwen团队) Shazeer et al.(Shazeer等人) Molodtsov et al.(Molodtsov等人)

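AI summary This paper proposes a knowledge editing method for Mixture-of-Experts LLMs that exploits the tensor structure of MoE layers and the Woodbury matrix identity for efficient parameter updates, accelerating editing by up to 6x and extending closed-form knowledge editing beyond dense layers.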

Comments 17 pages, 3 architectures, 1 figure, 6 tables

详情
AI中文摘要

知识编辑(KE)为LLM提供了一种轻量级替代方案,避免重复微调。然而,现有KE方法多针对密集前馈层,而现代LLM越来越多采用混合专家(MoE)架构以提升内存效率和推理效率。本文提出MEMIT-like框架,利用MoE层的张量结构,在专家层面准确制定编辑目标,并通过Woodbury矩阵恒等式避免显式计算专家权重的全堆叠矩阵。所获更新仅需固定低秩矩阵的逆运算,无需额外反向传播。实验表明,该方法在主要KE指标上与强基线持平,但编辑过程加速达6倍,得益于批量MEMIT式公式和Woodbury恒等式带来的低维逆运算。这些结果表明,封闭形式的参数修改KE可有效扩展至密集层之外,为现代稀疏LLM架构的可扩展知识编辑开辟了新路径。

英文摘要

Knowledge editing (KE) provides a lightweight alternative to repeated fine-tuning of LLMs. However, most existing KE methods target dense feed-forward layers, while modern LLMs increasingly adopt Mixture-of-Experts (MoE) architectures for their superior memory footprint and inference efficiency. This mismatch leaves a growing class of production models without principled editing tools. We propose a MEMIT-like framework for knowledge editing in MoE-based LLMs. Our method exploits the tensor structure of MoE layers to formulate the editing objective faithfully at the per expert level, and applies the Woodbury matrix identity to avoid materializing or inverting the full stacked matrix of expert weights. The resulting update reduces to inversions of fixed low-rank matrices and requires no additional backward passes. Empirically, our approach matches the editing quality of strong baselines on the main KE metrics while accelerating the editing procedure by up to 6x, owing to the batched MEMIT-style formulation and the low-dimensional inversions enabled by the Woodbury identity. These results show that closed-form, parameter-modifying KE can be extended efficiently beyond dense layers, opening a path toward scalable knowledge editing in modern sparse LLM architectures.
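
For reference, the Woodbury matrix identity in its general form is given below. The symbols are generic ($A \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{k \times k}$ invertible, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$) and are not the paper's notation; how the identity is instantiated for stacked expert weights is not stated in the abstract.

```latex
\[
(A + U C V)^{-1}
  = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}
\]
```

When $k \ll n$ and $A^{-1}$ is cheap to apply, the only new inverse is the $k \times k$ matrix $C^{-1} + V A^{-1} U$, which is the kind of low-dimensional inversion the abstract credits for the speedup.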

2605.16682 2026-05-19 cs.LG 版本更新

Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure

识别后再投影:从部分观测中利用端-哈密顿结构进行对比学习

Peilun Li, Kaiyuan Tan, Daniel Moyer, Thomas Beckers

发表机构 * Department of Computer Science(计算机科学系)

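AI summary This paper proposes a two-stage framework that learns latent dynamics from partial observations via contrastive learning and then projects them onto a port-Hamiltonian submanifold to obtain a physically consistent realization.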

详情
AI中文摘要

在直接建模不可行的情况下,识别隐状态表示和动态至关重要,尤其是在部分和高维观测下。我们研究了隐式端-哈密顿系统,这是一种包含守恒和耗散动态的结构化类别。我们提出了一种两阶段识别-再投影框架。首先,对比教师从部分观测中学习连续时间隐动态。然后,学生将识别的教师表示和动态投影到端-哈密顿子流形上,通过学习的仿射图表,得到物理一致的实现。作为概念反事实,我们还考虑了单阶段变体,联合学习隐识别和端-哈密顿结构,但发现其可靠性较低,从而提出所提出的两阶段教师-学生框架。我们理论上证明仿射投影是连接对比隐识别的仿射度量和端-哈密顿系统之间的自然桥梁。经验上,我们展示了所提出的两阶段方法在保持教师动态的同时强制物理结构,并在耗散区域和高维视觉设置中比单阶段替代方法更可靠。

英文摘要

Identifying latent state representations and dynamics is essential when direct modeling in observation space is infeasible, particularly under partial and high-dimensional observations. In such settings, representation learning and physics-aware modeling are inherently coupled. We study this problem for latent port-Hamiltonian systems, a structured class encompassing both conservative and dissipative dynamics. We propose a two-stage identify-then-project framework. First, a contrastive teacher learns continuous-time latent dynamics from partial observations. Then, a student projects the identified teacher representation and dynamics onto a port-Hamiltonian submanifold via a learned affine chart, yielding a physically consistent realization. As a conceptual counterfactual, we also consider a single-stage variant that jointly learns latent identification and port-Hamiltonian structure, but find it to be less reliable, motivating the proposed two-stage teacher-student framework. We show theoretically that affine projection is the natural bridge between the affine gauge of contrastive latent identification and the port-Hamiltonian systems. Empirically, we demonstrate that the proposed two-stage approach preserves the teacher's dynamics while enforcing physical structure, and performs more reliably than the single-stage alternative, particularly in dissipative regimes and high-dimensional visual settings.

2605.16672 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-Object Tracking Consistently Improves Wildlife Inference

多目标跟踪一致地提升野生动物推断

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl

发表机构 * World Wide Fund (WWF)(世界自然基金会) Centre for Artificial Intelligence Research (CAIR)(人工智能研究中心)

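AI summary This paper uses multi-object tracking to make wildlife classification more robust, fusing class probabilities along each track to correct per-frame predictions. Experiments show gains on all three datasets.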

Comments Accepted for publication in IEEE 2026 29th International Conference on Information Fusion

详情
AI中文摘要

相机陷阱已成为生态研究和生物多样性保护中常用的野生动物监测工具。野生动物分类模型受益于野生动物视觉数据的增加,这些模型在经过整理的高质量数据集上能达到高水平的准确性。然而,其性能仍然易受现实环境约束的影响。在进行时间连续序列的推断时,它们常常产生不一致的预测。单个个体在帧之间的预测标签会迅速变化。本研究利用相机陷阱数据的时间特性来增强野生动物分类模型的推断预测。具体来说,我们采用几种标准的多目标跟踪(MOT)模型,将连续帧中的检测结果进行关联。经过整理的轨迹用于融合softmax类概率。融合的概率评分产生一个单一的共识类标签估计,以覆盖噪声引起的误分类。实验结果分析表明,我们的策略在所有数据集和每个指标上均优于独立分类器。具体而言,表现最好的MOT模型在三个MOT数据集上分别比分类器提高了5.1%、3.1%和2.0%的加权F1分数。

英文摘要

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences: the predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier across all datasets and for each metric. Specifically, the best-performing MOT models gain 5.1%, 3.1%, and 2.0% in weighted F1-score over the classifier across three MOT datasets.
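
A minimal sketch of track-level fusion, assuming the fusion rule is a plain average of per-frame softmax vectors along one track (the abstract does not specify the rule). The class names and probabilities are made up for illustration.

```python
def fuse_track(frame_probs):
    """Average per-frame class probabilities over a track and return
    (consensus_label_index, fused_probabilities)."""
    n_classes = len(frame_probs[0])
    fused = [sum(p[c] for p in frame_probs) / len(frame_probs)
             for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: fused[c])
    return label, fused

classes = ["impala", "kudu", "zebra"]  # hypothetical label set
# Softmax outputs for one tracked animal across four consecutive frames.
track = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],  # noisy frame: the classifier flips to "kudu"
    [0.65, 0.25, 0.10],
    [0.80, 0.15, 0.05],
]
per_frame = [max(range(len(classes)), key=lambda c: p[c]) for p in track]
label, fused = fuse_track(track)
```

Here the per-frame labels disagree on the second frame, while the fused distribution assigns the whole track to a single class.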

2605.16671 2026-05-19 cs.AI cs.CV cs.CY cs.LG 版本更新

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

野生环境中的可持续智能:通过知识自适应边缘专家代理实现生态监测民主化

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu, William I. Atlas, Katrina M. Connors, Mark A. Spoljaric

发表机构 * Simon Fraser University(西蒙弗雷泽大学) Wild Salmon Center(野生鲑鱼中心) Pacific Salmon Foundation(太平洋鲑鱼基金会) Haida Fisheries Program(海达渔业计划)

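AI summary This paper proposes a knowledge-adaptive edge agent architecture that separates visual perception from reasoning by combining a visual encoder with a dynamic knowledge base, supporting sustainable ecological monitoring and the co-development of ethical AI.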

Comments 10 pages

详情
AI中文摘要

快速的生物多样性丧失凸显了有效监测的紧迫性,但手动调查仍消耗资源。尽管设备上的AI提供了一种可扩展的替代方案,但野外环境中经常受到环境变化的挑战。当前方法依赖云资源,需要持续上传现场数据以重新训练模型。这种方法不适合远程部署,因为它消耗有限的电力和网络连接。为了解决这些限制,本研究提出从模型适应转向知识适应。我们介绍了一种架构,将视觉感知与推理分离,结合视觉编码器和动态知识库。我们使用显式知识库取代隐式编码专家知识到模型参数。这种方法还通过结构化形式保存专家见解来支持知识可持续性。通过跨学科合作与生物学家和原住民社区,这项工作推进了伦理AI的协同开发,促进负责任和文化知情的生态系统管理。

英文摘要

Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, requiring continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge in model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.

2605.16668 2026-05-19 cs.LG cs.AI 版本更新

GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance

GraViti:具有放松排列不变性的图级变分自编码器

Roman Bresson, Konstantinos Divriotis, Johannes F. Lutzeyer, Iakovos Evdaimon, Michalis Vazirgiannis

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) LIX, CNRS, École Polytechnique, IP Paris(巴黎理工学院LIX实验室,法国国家科学研究中心,巴黎理工学院,IP巴黎)

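AI summary GraViti is a graph-level variational autoencoder that maps entire graphs to compact latent vectors, supporting smooth interpolation and downstream tasks beyond the constraints of node-level embeddings.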

详情
AI中文摘要

我们介绍了GraViti,一种基于transformer的图级变分自编码器,将整个图映射到紧凑的潜在向量。这种设计产生了一个真正的图级潜在空间,支持平滑插值、属性引导搜索等下游任务,超越节点级嵌入的限制。在分子基准上,GraViti学会解码符合训练数据化学约束的有效样本,表明模型能直接从图级表示中恢复领域规则。我们还显示,在存在可靠规范节点顺序的领域(如分子或贝叶斯网络)中,强制排列不变性可能对一致重建有害。GraViti在大规模数据集上实现了最先进的重建准确性,并提供了坚实的生成性能。其单步解码提供了一种轻量级替代方案,同时保持实用的样本质量。

英文摘要

We introduce GraViti, a transformer-based graph-level variational autoencoder that maps entire graphs to compact latent vectors. This design produces a true graph-level latent space that supports smooth interpolation, property-guided search, and other downstream tasks beyond the constraints of node-level embeddings. On molecular benchmarks, GraViti learns to decode valid samples that follow the chemical constraints present in the training data, showing that the model recovers domain rules directly from graph-level representations. We also show that, in domains where a reliable canonical node ordering exists, such as molecules or Bayesian networks, enforcing permutation invariance can prove detrimental to consistent reconstruction. GraViti achieves state-of-the-art reconstruction accuracy on large datasets and provides solid generative performance. Its single-step decoding offers a lightweight alternative to more complex generation pipelines while maintaining practical sample quality.

2605.16665 2026-05-19 cs.LG physics.geo-ph 版本更新

In-context learning enables continental-scale subsurface temperature prediction from sparse local observations

上下文学习使稀疏本地观测能够预测大陆尺度的地下温度

Daniel O'Malley, Christopher W. Johnson, Javier E. Santos, Pablo Lara, Sandro Malusà, Bharat Srikishan, John Kath, Arnab Mazumder, Mohamed Mehana, David Coblentz, Nathan DeBardeleben, Earl Lawrence, Hari Viswanathan

发表机构 * Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

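AI summary This paper proposes In-Context Earth, a model that predicts continuous temperature-at-depth fields from sparse borehole observations, outperforms existing methods, and adapts to new regions without finetuning, with calibrated uncertainty and interpretable internal representations.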

详情
AI中文摘要

大陆尺度的地下温度知识受限于钻孔测量的成本和稀疏性,但此类信息对地热资源评估和浅层地壳热传输理解至关重要。热场反映了岩石类型、地壳结构、放射性产热和对流流体流动的相互作用,有时会产生尖锐异常,传统插值或物理模型难以捕捉。本文引入了基于Transformer的In-Context Earth模型,利用稀疏本地钻孔观测作为地质上下文,预测连续温度-深度场并校准不确定性。在美国大陆,该模型的平均绝对误差为4.7°C,优于物理指导的斯坦福热模型、AlphaEarth嵌入模型、多模态透明地球模型和通用克里格法,同时在地热省中解析更尖锐的热梯度。其不确定性估计校准良好,Kolmogorov-Smirnov统计量为2.5%。无需微调,模型能适应阿尔伯塔、澳大利亚和英国,仅使用20个本地观测,在地质上不同的测试区域保持高精度,平均绝对误差分别为阿尔伯塔2.2°C、澳大利亚6.2°C和英国5.4°C。可解释性分析显示,模型学习了其训练过程中未观察到的地下属性,包括地震速度、地球化学和地壳结构,并以物理一致的方式使用这些表示。更广泛地说,这项工作表明上下文学习可以利用稀疏钻孔观测进行大陆尺度地下特征刻画,无需密集测量或区域特定重训练。

英文摘要

Continental-scale knowledge of subsurface temperature is limited by the cost and sparsity of borehole measurements, but such information is essential for geothermal resource assessment and for understanding heat transport in the shallow crust. The thermal field reflects the interaction between lithology, crustal structure, radiogenic heat production, and advective fluid flow, sometimes producing sharp anomalies that are smoothed by conventional interpolation or difficult to capture with physical models. Here we introduce In-Context Earth, a transformer-based model that uses sparse local borehole observations as geological context to predict continuous temperature-at-depth fields with calibrated uncertainty. In the contiguous United States, the model achieves a mean absolute error of 4.7 °C, outperforming the physics-informed Stanford Thermal Model, a model based on AlphaEarth embeddings, the multimodal Transparent Earth model, and universal kriging, while resolving sharper thermal gradients in geothermal provinces. Its uncertainty estimates are well calibrated, with a Kolmogorov-Smirnov statistic of 2.5%. Without finetuning, the model adapts to Alberta, Australia, and the United Kingdom (UK) using only 20 local observations at inference time, maintaining high accuracy in geologically distinct test regions with a mean absolute error of 2.2 °C in Alberta, 6.2 °C in Australia, and 5.4 °C in the UK. Interpretability analyses show that the model learns internal representations of subsurface properties it never observes during training, including seismic velocities, geochemistry, and crustal structure, and uses these representations in physically consistent ways. More broadly, this work shows that in-context learning can use sparse borehole observations for continental-scale subsurface characterization, without requiring dense measurements or region-specific retraining.

2605.16647 2026-05-19 cs.CR cs.LG 版本更新

Public-Decay Homomorphic State Space Models for Private Sequence Inference

公共衰减同态状态空间模型用于隐私序列推断

Luis Brito

发表机构 * School of Technology and Management (ESTG-IPVC) Polytechnic Institute of Viana do Castelo(技术与管理学院(ESTG-IPVC)葡萄牙维亚纳多卡斯托尔理工大学)

AI总结 本文提出公共衰减同态状态空间模型(HSSMs),通过加密-明文公共衰减更新状态,实现隐私序列推断,在保持加密状态的同时提升效率和准确性。

Comments 19 pages, 3 figures

详情
AI中文摘要

完全同态加密(FHE)改变了序列模型的设计,因为旋转、加密乘积、密文物化、乘法深度和自举压力可能超过普通神经网络的计算成本而成为主导。本文提出了公共衰减同态状态空间模型(HSSMs),即一类循环/状态空间块,其携带状态通过密文-明文的公共衰减进行更新,而密文-密文乘法仅保留在本地写入路径上。该设计在整个序列中保持固定大小的加密状态。所评估的工作流将客户端侧的标记化、冻结的fastText查找、投影、裁剪、加密、解密和阈值处理,与服务器侧在有界投影特征上的加密评估分离开来。在完整的Rotten Tomatoes和SST-2验证集上,加密HSSM路径与明文分类结果完全一致,准确率分别为0.7505和0.7420。与相同fastText工作负载上对HE友好的多项式注意力相比,HSSM在全序列任务质量上持平或更优,同时速度约快5倍。配对的L40S操作级结果显示,其延迟比带缓存的最终标记多项式注意力低1.34-1.62倍,比全序列多项式注意力低30-258倍,且逻辑加密状态占用更小。一个T=16/32的比较实验采用加密的公共线性输入和Q/K/V投影,结果显示投影HSSM在深度8/环32768下成功,而投影注意力在深度10/环65536下成功。一条匹配的T=8 OpenFHE/FIDESlib轨迹在两个后端上均以最终层级3和噪声尺度阶数2结束。这些结果表明,对于基于有界投影特征的加密序列推断,公共衰减携带是一种实用的FHE协同设计手段。

英文摘要

Fully homomorphic encryption (FHE) changes sequence-model design because rotations, encrypted products, ciphertext materialization, multiplicative depth, and bootstrapping pressure can dominate ordinary neural-network costs. This paper presents public-decay homomorphic state space models (HSSMs), recurrent/state-space blocks whose carried state is updated through ciphertext-plaintext public decay while ciphertext-ciphertext multiplication remains on a local write path. The design keeps a fixed encrypted state across the sequence. The evaluated workflow separates client-side tokenization, frozen fastText lookup, projection, clipping, encryption, decryption, and thresholding from server-side encrypted evaluation over bounded projected features. On full Rotten Tomatoes and SST-2 validation splits, the encrypted HSSM path exactly matches plaintext classifications and reaches 0.7505 and 0.7420 accuracy. Against HE-friendly polynomial attention on the same fastText workloads, HSSM matches or exceeds full-sequence task quality while running about 5x faster. Paired L40S operation-level rows show 1.34-1.62x lower latency than cached final-token polynomial attention, 30-258x lower latency than full-sequence polynomial attention, and lower logical encrypted-state footprint. A T = 16/32 comparator with encrypted public-linear input and Q/K/V projections shows projected HSSM succeeding under depth 8/ring 32768, while projected attention succeeds under depth 10/ring 65536. A matched T = 8 OpenFHE/FIDESlib trace finishes at final level 3 and noise-scale degree 2 on both backends. These results make public-decay carry a practical FHE co-design lever for encrypted sequence inference from bounded projected features.
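The recurrence described above can be illustrated with a plaintext analogue. This is only a sketch under stated assumptions, not the paper's encrypted implementation: the update form `state = decay * state + u * v` and the names `public_decay_scan`, `decay`, `write_u`, and `write_v` are hypothetical. Under FHE, the `decay * state` product would correspond to the ciphertext-plaintext step and `u * v` to the ciphertext-ciphertext product on the local write path.

```python
def public_decay_scan(decay, write_u, write_v):
    """Plaintext analogue of a public-decay recurrence (illustrative only).

    decay            : public per-channel decay factors (plaintext under FHE)
    write_u, write_v : per-step lists of per-channel values; their product is
                       the local write term (ciphertext-ciphertext under FHE)
    """
    state = [0.0] * len(decay)  # fixed-size carried state
    for u_t, v_t in zip(write_u, write_v):
        # The carried state is only ever multiplied by the public decay factor.
        state = [a * h + u * v for a, h, u, v in zip(decay, state, u_t, v_t)]
    return state


final_state = public_decay_scan(
    decay=[0.5, 1.0],
    write_u=[[1.0, 1.0], [2.0, 1.0]],
    write_v=[[2.0, 3.0], [1.0, 1.0]],
)
```

In this toy form the state keeps the same length regardless of sequence length, which mirrors the fixed encrypted state mentioned in the abstract.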

2605.16645 2026-05-19 math.ST cs.IT cs.LG math.IT stat.ML stat.TH 版本更新

Statistical Unlearning of Distributions: A Hypothesis Testing Approach

分布统计遗忘:一种假设检验方法

Aaradhya Pandey, Sanjeev Kulkarni

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文提出一种分布统计遗忘框架,通过假设检验选择样本以减少不需要的分布影响,同时保持所需分布的性能,并分析了允许的编辑数据分布区域和帕累托前沿。

Comments Comments welcome

详情
AI中文摘要

机器学习系统越来越多地面临要求遗忘不仅单个数据点,还包括整个信息领域的需求,例如有毒语言、受版权保护的语料库或人口统计数据偏见。这提出了统计-计算权衡的根本困境:移除所有不需要领域的样本可能是计算上不可行的,而随机移除一部分可能无法提供分布层面的统计保证。我们提出了一种分布遗忘的统计框架,其中领域被建模为概率分布,目标是移除精心选择的样本子集,以减少不需要分布的影响,同时保持所需分布的性能。我们通过假设检验编辑数据与所需和不需要的领域,从而得到可解释且稳健的样本移除标准。在该统计框架中,我们表征了允许的编辑数据分布区域以及广泛分布族的移除-保留帕累托前沿。这包括参数族如任意维度的位移高斯分布、一维位置族带有对数凹噪声以及一维泊松族。它还包含非参数族,如高斯白噪声模型,这是非参数回归的通用模型。我们证明了组合规则,描述了分布遗忘在多模式不需要领域中的行为,并引入了当组合大量此类族时移除-保留基线的中心极限行为。最后,我们通过提供某些选择算法的帕累托前沿来提供有限样本保证,并观察到信息-计算差距。

英文摘要

Machine learning systems increasingly face requirements to forget not only individual data points, but entire domains of information, such as toxic language, copyrighted corpora, or demographic biases. This raises a fundamental dilemma of statistical-computational tradeoffs: removing all samples from an unwanted domain may be computationally prohibitive, while randomly removing a subset may not provide distribution-level statistical guarantees. We propose a statistical framework for distributional unlearning, in which domains are modeled as probability distributions, and the goal is to remove a carefully chosen subset of samples that reduces the effect of an unwanted distribution while preserving performance on a desired one. We formalize this using a hypothesis test of the edited data with the desired and unwanted domains, leading to an interpretable and robust criterion for selecting samples to remove. Within this statistical framework, we characterize the fundamental region of the allowable edited data distributions and the removal-preservation Pareto frontier for a broad class of distribution families. This includes parametric families such as shifted Gaussians of arbitrary dimension, a one-dimensional location family with log-concave noise, and the one-dimensional Poisson family. It also includes nonparametric families such as the Gaussian white noise model, a canonical model for nonparametric regression. We prove composition rules that describe how distributional unlearning behaves across multimodal unwanted domains, and introduce a central-limit behavior for the removal-preservation baselines when composing a large number of such families. Finally, we provide finite sample guarantees by providing Pareto frontiers for some selection algorithms, and observe an information-computation gap.

2605.16644 2026-05-19 eess.SY cs.LG cs.SY math.OC stat.ML 版本更新

The Score Kalman Filter

分数卡尔曼滤波器

Kaito Iwasaki, Anthony Bloch, Taeyoung Lee, Maani Ghaffari

发表机构 * Department of Mathematics University of Michigan(密歇根大学数学系) Department of Mechanical and Aerospace Engineering George Washington University(乔治华盛顿大学机械与航空航天工程系) Department of Naval Architecture & Marine Engineering and Department of Robotics University of Michigan(密歇根大学船舶与海洋工程系和机器人系)

AI总结 本文提出分数卡尔曼滤波器,通过结合分数匹配与斯坦因恒等式,避免了配分函数的计算,实现了非线性系统的高效滤波,适用于高维问题。

Comments 56 pages, 27 figures

详情
AI中文摘要

非线性贝叶斯滤波的核心难题在于如何表示信念分布。基于矩的滤波器通过传播多项式矩并由这些矩重建密度来解决这一问题。最近的工作通过最大熵(MaxEnt)原理完成预测-更新循环,但每一步都需要配分函数及其梯度,二者均为$n$维积分,计算成本呈指数增长,这使得已展示的最大熵矩滤波局限于$n \le 4$。我们通过将分数匹配与斯坦因恒等式(Stein's identity)相结合,完全避开了配分函数。在我们的设定中,分数匹配将密度拟合简化为一次线性求解,其系数直接由传播的矩组装而成。随后,同一组参数驱动斯坦因恒等式,在预测阶段封闭矩层次结构,并在每次贝叶斯更新后恢复后验矩,使完整的预测-更新循环无需任何配分函数计算。所得到的分数卡尔曼滤波器(SKF)在特殊情况下退化为经典的信息形式卡尔曼滤波器,且每一步都通过线性代数完成。在非线性耦合振荡器网络上,SKF可运行至$n=20$,并在所测试的合成基准上取得比EKF、UKF、EnKF和粒子滤波基线更低的RMSE。

英文摘要

A central obstacle in nonlinear Bayesian filtering is representing the belief distribution. Moment-based filters address this by propagating polynomial moments and reconstructing a density from them. Recent work completes the predict-update loop via the maximum-entropy (MaxEnt) principle, but each step requires the partition function and its gradient, both $n$-dimensional integrals whose cost scales exponentially, restricting the demonstrated MaxEnt moment filtering to $n \le 4$. We avoid the partition function entirely by combining score matching with Stein's identity. In our setting, score matching reduces the density fit to a single linear solve whose coefficients are assembled directly from the propagated moments. The same parameters then drive Stein's identity to close the moment hierarchy during prediction and to recover posterior moments after each Bayesian update, keeping the full predict-update loop free of partition function evaluation. The resulting Score Kalman Filter (SKF) reduces to the classical information-form Kalman filter as a special case and performs every step through linear algebra. On nonlinear coupled-oscillator networks, the SKF runs through $n=20$ and reports lower RMSE than the EKF, UKF, EnKF, and particle-filter baselines on the tested synthetic benchmarks.
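To see why score matching can reduce a density fit to a linear solve built from moments, here is a minimal one-dimensional sketch. It assumes the exponential-family form p(x) proportional to exp(t1*x + t2*x^2) and the standard score-matching normal equations; the function name `fit_gaussian_by_score_matching` is hypothetical and this is not the paper's filter, only the simplest instance of the idea.

```python
def fit_gaussian_by_score_matching(mean, second_moment):
    """Fit p(x) proportional to exp(t1*x + t2*x**2) from E[x] and E[x^2].

    Solves E[grad_phi grad_phi^T] theta = -E[laplacian_phi] with
    phi(x) = (x, x**2), so grad_phi = (1, 2x) and laplacian_phi = (0, 2).
    No partition function is evaluated anywhere.
    """
    # Gram matrix of feature gradients, assembled only from the two moments.
    a11, a12 = 1.0, 2.0 * mean
    a21, a22 = 2.0 * mean, 4.0 * second_moment
    b1, b2 = 0.0, -2.0
    det = a11 * a22 - a12 * a21
    # Cramer's rule for the 2x2 linear system.
    t1 = (b1 * a22 - a12 * b2) / det
    t2 = (a11 * b2 - a21 * b1) / det
    return t1, t2


# Moments of a Gaussian with mean 1 and variance 2: E[x] = 1, E[x^2] = 3.
t1, t2 = fit_gaussian_by_score_matching(1.0, 3.0)
```

For a Gaussian with mean m and variance v the solution is t1 = m/v and t2 = -1/(2v), which the toy call reproduces.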

2605.14381 2026-05-19 cs.LG cs.CL 版本更新

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

NodeSynth:面向AI评估的社会对齐合成数据

Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darlene Neal, Kshitij Pancholi, Jamila Smith-Loud

发表机构 * Google Research USA(谷歌研究美国分公司) Google USA(谷歌美国分公司) Google DeepMind USA(谷歌DeepMind美国分公司)

AI总结 NodeSynth通过结合现实证据的细粒度分类扩展,生成社会相关合成查询,提升AI模型在敏感领域评估的准确性与安全性。

详情
AI中文摘要

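生成式AI的最新进展使得为模型评估大规模生成合成数据成为可能。然而,若缺乏有针对性的方法,这些数据集往往缺少敏感领域所需的社会技术细节。我们提出NodeSynth,一种以证据为基础的方法,它利用一个锚定于现实世界证据的微调分类体系生成器(TaG)来生成具有社会相关性的合成查询。在四个主流LLM(例如Claude 4.5 Haiku)上的评估中,NodeSynth引发的失败率最高比人工编写的基准高五倍。消融研究证实,细粒度的分类体系扩展是这些失败率的重要驱动因素,而独立验证则揭示了主流防护模型(例如Llama-Guard-3)的关键缺陷。我们开源了端到端研究原型和数据集,以支持可扩展的高风险模型评估和有针对性的安全干预(https://github.com/google-research/nodesynth)。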

英文摘要

Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).

2605.14347 2026-05-19 cs.LG 版本更新

Exemplar Partitioning for Mechanistic Interpretability

基于示例的机制可解释性划分

Jessica Rumbelow

发表机构 * Leap Laboratories(Leap实验室)

AI总结 本文提出Exemplar Partitioning方法,通过更少的token构建可解释特征字典,展示其在不同层和模型间的可比性及因果干预能力。

Comments Code: https://github.com/jessicarumbelow/exemplar-partitioning. Pretrained dictionaries: https://huggingface.co/datasets/J-RUM/exemplar-partitioning

详情
AI中文摘要

我们提出Exemplar Partitioning(EP),一种无监督方法,用于从大型语言模型激活中构建可解释的特征字典,所用token数量比可比的稀疏自编码器(SAEs)少约$10^3$倍。EP字典是激活空间的Voronoi划分,通过在距离阈值内对流式激活进行领导者聚类(leader clustering)构建。每个区域由一个观察到的示例锚定,该示例同时充当成员判定标准和干预方向;字典大小无需预先指定,而由该阈值下的激活几何决定。由于示例是观察得到而非学习得到的,基于同一数据流构建的字典可在不同层、模型和训练检查点之间直接比较。我们通过针对性演示(展示这一构造新带来的性质)以及一项正面对比基准,将EP刻画为一种可解释性对象。在Gemma-2-2B中,EP字典区域是可解释的,并支持因果干预:指令微调版Gemma中的拒绝行为集中于一个区域,消融该区域的示例可使留出集上的拒绝行为崩溃。基础模型与指令微调模型字典之间的跨检查点匹配,可将微调保留的方向与微调引入的方向区分开来。EP区域与Gemma Scope SAE特征以不同方式分解激活空间,但在一个共享核心上一致:约20%的EP区域以$F_1 > 0.5$匹配某个SAE特征,且EP的one-hot探针在$\ell_0 = 1$时保留约97%的原始激活探针准确率。最近示例距离可在推理时提供零额外成本的分布外信号。在Gemma-2-2B-it L20上的AxBench潜在概念检测中,EP在$p_1$处达到平均AUROC 0.881,比GemmaScope SAE的标准榜单条目高0.126,与SAE-A的0.911相差不超过0.030,而构建计算量少约$10^3$倍。

英文摘要

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim$20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at $\sim 10^3\times$ less build compute.
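The construction step, leader clustering of streamed activations within a distance threshold, can be sketched as follows. This is a generic textbook-style version under assumptions: Euclidean distance, nearest-exemplar assignment, and the name `leader_cluster` are illustrative choices, not details confirmed by the abstract.

```python
import math


def leader_cluster(stream, threshold):
    """Assign each vector to its nearest exemplar, or promote it to a new one."""
    exemplars, assignments = [], []
    for x in stream:
        best_idx, best_dist = None, math.inf
        for i, e in enumerate(exemplars):
            d = math.dist(x, e)
            if d < best_dist:
                best_idx, best_dist = i, d
        if best_idx is None or best_dist > threshold:
            # No exemplar is close enough: x itself becomes an observed exemplar.
            exemplars.append(x)
            best_idx = len(exemplars) - 1
        assignments.append(best_idx)
    return exemplars, assignments


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (0.0, 0.2)]
exemplars, assignments = leader_cluster(points, threshold=1.0)
```

The number of exemplars is not passed in; it falls out of the threshold and the geometry of the stream, which matches the abstract's point that dictionary size is not prespecified.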

2605.14292 2026-05-19 cs.LG cs.CL 版本更新

Minimal-Intervention KV Retention via Set-Conditioned Diversity

通过集合条件多样性实现最小干预的KV保留

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

发表机构 * Department of Computer Science and Software Engineering(计算机科学与软件工程系) Department of Information Management(信息管理系) Auburn University(奥本大学) National Central University(国立中央大学)

AI总结 研究通过改进TriAttention保留评分器,在有限预算下提升KV缓存压缩效果,采用V空间冗余惩罚机制,验证了最小修改优于结构性重设计。

Comments 15 pages, 3 figures, 3 tables. Code and data: https://github.com/libophd/minimal-kv-retention

详情
AI中文摘要

小预算下的KV缓存压缩是一个拥挤的设计空间,涵盖缓存表示、逐头路由、压缩节奏、解码行为和预算内评分。我们在匹配平均缓存大小的条件下,针对长篇数学推理(MATH-500~\cite{hendrycks2021math}),使用两个蒸馏推理模型(DeepSeek-R1-Distill~\cite{deepseek2025r1}的Qwen-7B和Llama-8B变体),在预算$b \in \{64, 128\}$下研究了这五个家族中的七种机制。七种机制全部被拒绝。随后我们提出$\alpha$,一种对TriAttention~\cite{mao2026triattention}保留评分器的单函数修改:在由单一权重$\lambda$控制的V空间冗余惩罚下,用受设施选址启发的贪心选择替代argmax-top-$k$。一个预注册协议在冻结的开发集上调整$\lambda$,并在不相交的留出集上进行确认;当$\lambda = 0.5$时,$\alpha$在四个(模型,预算)单元中的两个(Qwen $b{=}128$和Llama $b{=}64$)上通过Bonferroni校正检验,没有单元显著为负,且预注册的Branch~A被触发。这一发现是不对称的:在该设定下,最小的评分修改胜过了更重的结构性重设计,而匹配内存、sympy评分、留出集确认相结合的协议,正是使这种不对称性得以显现的证据标准。

英文摘要

KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ= 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
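The contrast between argmax-top-k and greedy selection under a redundancy penalty can be sketched generically. The abstract does not give the exact objective, so everything here is an assumption: the gain `score - lam * max cosine similarity to already-kept values`, and the names `greedy_diverse_select`, `scores`, `values`, and `lam` are hypothetical stand-ins, not the paper's scorer.

```python
def greedy_diverse_select(scores, values, budget, lam):
    """Greedily pick `budget` indices by score minus a redundancy penalty."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    selected = []
    remaining = list(range(len(scores)))
    while remaining and len(selected) < budget:
        def gain(i):
            # Redundancy: similarity to the closest value vector already kept.
            redundancy = max(
                (cosine(values[i], values[j]) for j in selected), default=0.0
            )
            return scores[i] - lam * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


scores = [1.0, 0.9, 0.5]
values = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
top_k = greedy_diverse_select(scores, values, budget=2, lam=0.0)
diverse = greedy_diverse_select(scores, values, budget=2, lam=0.5)
```

With `lam=0.0` the procedure reduces to plain top-k by score; with a positive weight, the second slot goes to the lower-scored but non-redundant entry.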

2605.13900 2026-05-19 cs.MA cs.LG 版本更新

Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems

从第一天起即就绪:面向大规模约束多智能体系统的群体感知协调

Angel Wang, Dominique Perrault-Joncas, Alvaro Maggiar, Carson Eisenach, Dean Foster

发表机构 * Amazon(亚马逊)

AI总结 本文提出群体感知协调接口,即以紧凑群体摘要为条件、经学习得到的原始映射与对偶映射,供规划器在迭代循环中查询,以实现大规模多智能体系统的协调,并减少预测误差和容量违规。

Comments 30 pages, 16 figures. Submitted to NeurIPS 2026

详情
AI中文摘要

在具有共享资源约束的大规模多智能体系统中,上游规划器必须迭代评估候选资源计划(评估可行性、聚合响应和边际成本),然后才能确定其中一个。拉格朗日松弛通过广播成本信号来分离本地决策,但规划器仍需要成本-利用率响应映射来探索计划空间,而该映射依赖于在各规划周期间变化的群体组成。我们提出群体感知协调接口:以紧凑群体摘要为条件、经学习得到的原始映射和对偶映射,供规划器在其迭代循环中查询。原始映射预测给定成本轨迹下的聚合利用率;对偶映射预测目标计划所对应的成本轨迹。通过编码与响应相关的群体结构,这些映射在不断演变的群体中保持可靠,无需在每个周期重新训练,并支持利用紧凑子样本协调大规模群体。我们还将Sim2Real迁移构造为一个可回测的过程,从而能够在部署前进行评估。在供应链容量控制案例研究中,在组成发生偏移的情况下,群体感知接口相对于不感知群体的基线将预测误差降低了16-19%,容量违规减少了20-51%;2万智能体的群组足以支持对50万智能体群体的准确协调;模拟器训练的原始映射在真实观测上达到11.1%的MAPE,而基线为13-24%。

英文摘要

In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans -- assessing feasibility, aggregate response, and marginal cost -- before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost-to-utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose \emph{population-aware coordination interfaces}: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16--19\% and capacity violations by 20--51\% relative to population-unaware baselines under composition shift; 20K-agent cohorts support accurate coordination of 500K-agent populations; and simulator-trained primal maps achieve 11.1\% MAPE on real observations versus 13--24\% for baselines.

2605.13845 2026-05-19 cs.LO cs.LG 版本更新

Quantitative Linear Logic for Neuro-Symbolic Learning and Verification

量化线性逻辑用于神经符号学习与验证

Thomas Flinkow, Ekaterina Komendantskaya, Matteo Capucci, Rosemary Monahan

发表机构 * Maynooth University(梅诺斯大学) University of Southampton(南安普顿大学) University of Strathclyde(斯特拉思克莱德大学) Independent Researcher(独立研究员)

AI总结 本文提出量化线性逻辑(QLL),通过自然性原则设计,将逻辑约束转化为机器学习实践中常用的运算,如求和和对数求和指数,以解决神经符号学习与验证中的逻辑与语义之间的平衡问题。

Comments 23 pages, 2 figures, 13 tables

详情
AI中文摘要

可微逻辑被用于神经符号学习任务中,作为将逻辑约束嵌入神经网络训练目标的方式。可微逻辑由一种编写逻辑属性的语法和一种将其解释为实值函数的语义组成。该领域的一个核心权衡在于连接词的逻辑属性与语义的分析关注之间。模糊逻辑在代数和证明论上有良好基础,而Fischer的DL2等随意可微逻辑则专为深度学习应用设计。然而,尚未出现令人满意的理论基础。本文提出了解决这一长期矛盾的新逻辑,即量化线性逻辑(QLL),具有基础性的目标。我们的设计受自然性的驱动——即由于逻辑约束被转化为损失,连接词的语义应是机器学习实践中常用的操作(即求和和对数求和指数)在加性量(如logits)上。我们从两个方面评估结果:逻辑充分性——它们满足线性逻辑的大多数标准逻辑定律;以及经验有效性——测试时性能(通过对抗攻击测量)与实际验证逻辑约束(通过现成的神经网络验证器测量)之间有良好相关性,这使QLL在现有技术中脱颖而出。

英文摘要

Differentiable Logics are deployed in neuro-symbolic learning tasks as a way of embedding logical constraints in the training objective of neural networks. A differentiable logic consists of a syntax to write logical properties and a semantics to interpret them as real-valued functions to be folded in the loss function. A defining trade-off of the field is that between logical properties of the connectives, and analytic concerns for the semantics, with both aspects being relevant in applications. At one extreme we find fuzzy logics, that have well-established algebraic and proof-theoretic foundations, and at the other ad-hoc differentiable logics like Fischer's DL2, conceived for deep learning applications. However, no satisfactory foundation has emerged yet. We propose a resolution to this long-standing tension via a novel logic, Quantitative Linear Logic (QLL), with foundational ambitions. Our design is driven by naturality -- the idea that, since logical constraints are translated to losses, the semantics of the connectives should be pertinent operations used in ML practice (that is, sum and log-sum-exp) on additive quantities (like logits). We then judge the result on two aspects: logical adequacy -- that they satisfy most of the standard logical laws of Linear Logic; and empirical effectiveness -- test-time performance (as measured by adversarial attacks) is well-correlated to the actual verification of the logical constraints (as measured by off-the-shelf neural network verifiers), which makes QLL stand out among SoTA techniques.
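The two operations the abstract names, sum and log-sum-exp over additive quantities such as logits, are shown below in a numerically stable form. The abstract does not say which connective maps to which operation, so this sketch shows only the operations themselves; the helper name `log_sum_exp` is an illustrative choice.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    # Subtracting the maximum keeps every exponent at or below zero.
    return m + math.log(sum(math.exp(v - m) for v in values))


logits = [2.0, 0.0]
soft_max = log_sum_exp(logits)  # smooth upper bound on max(logits)
additive = sum(logits)          # plain sum of additive quantities
```

Log-sum-exp is bounded between `max(values)` and `max(values) + log(len(values))`, which is one reason it is commonly used as a differentiable stand-in for a hard maximum.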

2605.13322 2026-05-19 cs.CV cs.LG 版本更新

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench:一种基于语法规则的数据集,用于评估视觉-语言模型中的组合因子恢复

Richard Sproat, Stefano Peluchetti

AI总结 KamonBench通过20000个合成复合徽章及辅助组件示例,提供评估视觉-语言模型中稀疏组合识别和因子恢复的可控测试环境,支持程序代码因子度量和可控因子对重组。

Comments Preprint

详情
AI中文摘要

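家纹(kamon)是日本文化的重要组成部分,也是组合式视觉识别的天然测试案例:每个家纹由少量符号选择组合而成,但可能的描述空间是稀疏的。我们提出KamonBench,一个基于语法的图像到结构基准,包含20,000个合成的复合家纹以及辅助的组件示例。每个复合家纹都配有一条形式化家纹描述语言("kamon yōgo")的描述、一份分词后的日语分析、一份英文翻译和一段非语言的程序代码。由于每个合成家纹都由已知因子(即容器、修饰符和主题图案)生成,KamonBench支持超越描述级准确率的评估:直接的程序代码因子度量、受控的因子对重组划分、固定容器-修饰符上下文下的反事实图案敏感性分组,以及对因子可访问性的线性探针。我们给出了一个ViT编码器/Transformer解码器以及两个VGG n-gram解码器(带与不带学习的位置掩码)的基线结果。因此,KamonBench为视觉-语言模型中的稀疏组合视觉识别和因子恢复提供了一个受控的测试平台。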

英文摘要

Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a formal kamon description language - "kamon yōgo" - description, a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.

2605.12991 2026-05-19 cs.LG cs.AI 版本更新

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

不只是RLHF:为何仅靠对齐无法解决多智能体谄媚问题

Adarsh Kumarappan, Ananya Mujoo

发表机构 * California Institute of Technology(加州理工学院) Evergreen Valley College(常青谷学院)

AI总结 本文研究了多智能体系统在模拟同伴分歧下由正确答案翻转为错误答案的问题,发现预训练基础模型与指令模型表现出相同的替换模式,且平均屈服率更高。通过激活修补发现错误集中在狭窄的中间层窗口,对其修补可恢复大部分正确率差距。研究还指出压力抑制了干净推理特征,而非激活新的谄媚回路。

详情
AI中文摘要

基于LLM的多智能体管道在模拟的同伴分歧下会由正确答案翻转为错误答案,我们将其发生率称为屈服率(yield),这一漏洞被普遍归因于RLHF诱导的谄媚(sycophancy)。我们在四个模型家族上检验了这一归因,发现它在很大程度上是错误的:预训练基础模型表现出与其Instruct变体相同的替换模式,且平均屈服率高于Instruct变体。通过激活修补,我们将这种破坏定位到一个狭窄的中间层窗口,其中注意力承担因果权重,而MLP的贡献可忽略不计;在该窗口之上进行修补可恢复96%的干净条件与受压条件之间的P(correct)差距。攻击面可分解为两个独立因素(通道框架和共识强度),二者的交互在多数共识下产生47.5个百分点的屈服率差距,且该差距在陪审团规模$N \in \{4, 5, 6\}$下均保持。两种相互印证的激活空间干预表明,压力抑制的是干净推理特征,而非激活一个新的谄媚回路。单个论证正确的异议者可在所有测试的框架下将屈服率降低54-73个百分点,而最强的提示级防御在超出其设计范围的攻击变体上失效。缓解措施应针对这一机制,即在管道层面引入结构化异议,而不是依赖提示级防御。

英文摘要

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.

2605.12825 2026-05-19 cs.LG cs.AI 版本更新

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus:通过双视角扩散实现内存高效的并行令牌生成

Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen

发表机构 * University of Oregon(俄勒冈大学) Google DeepMind(谷歌DeepMind) Adobe Research(Adobe研究院)

AI总结 Orthrus结合自回归大语言模型的高保真生成与扩散模型的高速并行生成,通过双视角机制实现高效推理,提升速度7.8倍且内存开销极低。

详情
AI中文摘要

我们介绍Orthrus,一种简单高效的双架构框架,结合自回归大语言模型(LLM)的精确生成保真度与扩散模型的高速并行令牌生成。标准自回归解码的序列性是高吞吐推理的根本瓶颈。尽管扩散语言模型试图通过并行生成突破这一瓶颈,但存在显著的性能下降、高训练成本和缺乏严格的收敛保证。Orthrus原生解决这一二元对立。设计用于无缝集成到现有Transformer中,框架在冻结的LLM上添加一个轻量可训练模块,创建一个并行扩散视角与标准自回归视角。在统一系统中,两个视角均关注相同的高保真键值(KV)缓存;自回归头执行上下文预填充以构建准确的KV表示,而扩散头执行并行生成。通过在两个视角之间采用精确的一致性机制,Orthrus保证无损推理,仅以O(1)的内存缓存开销和极小的参数增加,即可实现高达7.8倍的速度提升。

英文摘要

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.

2605.12547 2026-05-19 econ.EM cs.LG q-fin.ST stat.AP 版本更新

The Payment Heterogeneity Index: An Integrated Unsupervised Framework for High-Volume Procurement Oversight and Decision Support

支付异质性指数:一种用于高 volume 采购监督和决策支持的集成无监督框架

Kyriakos Christodoulides

发表机构 * Philips University, Department of Computer Science(菲利普斯大学计算机科学系)

AI总结 本文提出支付异质性指数(PHI),通过整合高斯混合模型参数和非参数统计量,用于大批量采购监督和决策支持,刻画支付结构和潜在模式。

Comments Request category change from econ.EM -> stat.ML. Paper is methodological, introducing a new unsupervised ML/stat framework (SHI/PHI index) for distributional structure. Methodology is general; procurement is the application. stat.ML is more appropriate primary; econ.EM as cross-list

详情
AI中文摘要

公共采购容易受到错误、欺诈和腐败的影响,尤其是在高交易量超出监督能力时。尽管现有研究通常关注招标阶段的异常,但中标后的付款监控仍未得到充分研究。由于带标签的数据集稀缺,且本福特定律等方法受限于严格的假设,需要可解释的无监督框架来支持大批量采购监督和决策。本文引入结构异质性指数(SHI),一种针对一维样本的复合统计量,及其面向支付的具体实例支付异质性指数(PHI),用于刻画支付结构和潜在模式(regime)。它将高斯混合模型(GMM)参数与非参数统计量相结合,整合了四个可解释的组件:模态、不对称性、尾部行为和结构分散性。其独特之处在于,尾部行为组件同时捕捉分布的厚尾程度和极值集中度,而结构分散性组件结合了潜在支付模式的变异性、普遍性和分离度。应用于英国市政采购数据时,PHI识别出一个在财务上显著的供应商群体(占供应商的0.6%;占大批量供应商的10.1%),其支付模式在结构上明显不同。统计检验进一步支持了这些差异,有针对性的人工核查也确认了优先案例的合理性。对比分析表明,PHI揭示了被变异系数($ρ= 0.310$)掩盖的模式分离。PHI为采购廉洁监督和有针对性的审计优先级排序提供了一个透明、可分解且计算轻量的框架。

英文摘要

Public procurement is vulnerable to error, fraud, and corruption, particularly as high transaction volumes overwhelm oversight. While research often focuses on tender-stage anomalies, post-award payment monitoring remains underexplored. Since labelled datasets are rare and methods like Benford's Law face restrictive assumptions, there is a need for interpretable, unsupervised frameworks for high-volume procurement oversight and decision support. This paper introduces the Structural Heterogeneity Index (SHI), a composite statistic for one-dimensional samples, and its payment-specific instantiation, the Payment Heterogeneity Index (PHI), characterising payment structure and latent regimes. It incorporates Gaussian Mixture Model (GMM) parameters alongside non-parametric statistics, integrating four interpretable components: modality, asymmetry, tail behaviour, and structural dispersion. Uniquely, the tail-behaviour component captures both distributional heaviness and extreme-value concentration, while structural-dispersion combines the variability, prevalence, and separation of latent payment regimes. Applied to UK municipal procurement data, PHI identifies a financially significant cohort (0.6\% of suppliers; 10.1\% of high-volume vendors) with structurally distinct payment patterns. Statistical testing further supports these differences, and targeted human verification confirms the plausibility of prioritised cases. Comparative analysis shows PHI reveals regime separation obscured by the Coefficient of Variation ($ρ= 0.310$). PHI provides a transparent, decomposable, and computationally lightweight framework for procurement integrity oversight and targeted audit prioritisation.

2605.12070 2026-05-19 cs.LG cs.AI 版本更新

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

异步智能体强化学习中缺失的旧logits:语义不匹配及面向异策略修正的修复方法

Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Likang Wu, Xiong Jun Wu, Hongke Zhao

发表机构 * Tianjin University(天津大学) Tsinghua University(清华大学) Peking University(北京大学) JDT AI Infra(京东AI基础设施)

AI总结 本文研究了异步智能体强化学习中因缺失旧logits导致的语义不匹配问题,提出三种精确获取旧logits的策略及近似修正方法,并采用改进的PPO-EWMA方法,提升了训练速度和优化性能。

详情
AI中文摘要

异步强化学习通过将样本生成与策略优化解耦,提高了大语言模型智能体的rollout吞吐量,但也给PPO风格的异策略(off-policy)修正带来了一种关键的失效模式。在异构训练系统中,总重要性比率在理想情况下应分解为两个语义不同的因子:一个训练-推理差异项,用于在同一行为策略版本下对齐推理侧和训练侧的分布;以及一个策略陈旧项,用于约束从历史策略到当前策略的更新。我们发现,带有延迟更新和部分rollout的实际异步管道常常丢失所需的历史训练侧logits,即旧logits。这一缺失旧logits的问题使差异修复与陈旧修正纠缠在一起,破坏了解耦修正的预期语义,并使裁剪阈值与掩码阈值产生不良交互。为解决这一问题,我们同时研究了精确修正和近似修正两条路线。我们提出了三种精确获取旧logits的策略:基于快照的版本跟踪、专用的旧logits模型,以及通过中断部分rollout实现同步,并比较了它们的系统权衡。在近似修正方面,我们关注的是:当无法以低成本恢复精确的旧logits时,如何通过更合适的近似策略保留解耦修正的好处,同时不引入额外的系统开销。基于上述分析,我们采用了一种改进的PPO-EWMA方法,在训练速度和优化性能上均取得显著提升。

英文摘要

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emph{training--inference discrepancy term} that aligns inference-side and training-side distributions at the same behavior-policy version, and a \emph{policy-staleness term} that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance.
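The two-factor decomposition of the importance ratio can be checked numerically on a single token. This is a sketch under assumptions: the function `decoupled_ratios` and its three log-probability arguments are hypothetical names, and the example probabilities are made up for illustration.

```python
import math


def decoupled_ratios(logp_current, logp_old_train, logp_old_infer):
    """Split the total importance ratio into discrepancy and staleness factors."""
    # Same behavior-policy version, evaluated by training vs. inference engine.
    discrepancy = math.exp(logp_old_train - logp_old_infer)
    # Historical policy vs. current policy, both on the training side.
    staleness = math.exp(logp_current - logp_old_train)
    total = math.exp(logp_current - logp_old_infer)
    return discrepancy, staleness, total


d, s, total = decoupled_ratios(
    logp_current=math.log(0.30),
    logp_old_train=math.log(0.25),
    logp_old_infer=math.log(0.20),
)
```

The product of the two factors equals the total ratio. If `logp_old_train` is unavailable, only `total` can be computed, so the two factors can no longer be clipped or masked separately, which is the entanglement the abstract describes.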

2605.11970 2026-05-19 cs.LG 版本更新

NOFE - Neural Operator Function Embedding

NOFE - 神经算子函数嵌入

Lars Uebbing, Harald L. Joakimsen, Siyan Chen, Georgios Leontidis, Kristoffer K. Wickstrøm, Michael C. Kampffmeyer, Sébastien Lefèvre, Arnt-Børre Salberg, Robert Jenssen

发表机构 * UiT The Arctic University of Norway(UiT 北极大学) Norwegian Computing Center(挪威计算中心) University of South Brittany(南布列塔尼大学) University of Copenhagen(哥本哈根大学)

AI总结 NOFE是一种面向连续域的降维框架,通过图核算子学习函数到函数的映射,实现无网格评估,在局部结构保持和采样鲁棒性方面优于传统方法。

Comments 21 pages, 11 figures, 12 tables

详情
AI中文摘要

大多数降维方法将数据视为离散点云,忽视了许多现实过程固有的连续域结构。为弥合这一差距,我们引入神经算子函数嵌入(NOFE),一种域感知的连续降维框架。NOFE通过图核算子(Graph Kernel Operator)学习函数到函数的映射,能够在任意查询位置进行无网格评估,且与输入离散化方式无关。我们将NOFE确立为层到层(sheaf-to-sheaf)映射的一种近似,从而把层神经网络(Sheaf Neural Networks)推广到连续域。我们在不同数据集上评估NOFE,并与PCA、t-SNE和UMAP进行比较。结果表明,NOFE在局部结构保持方面显著优于基线:在ERA5气候再分析数据集上,其局部Stress为0.111,而PCA为0.398,t-SNE为0.773,UMAP为0.791。NOFE还表现出稳健的采样无关性,相对于UMAP将分块拼接误差(Patch Stitching Error)最多降低$20.0\times$(区域归一化下为59.0对267.6),并确保不相交域分块之间的一致性。在保持有竞争力的全局结构保持能力(Stress-1:0.379,PCA为0.268)的同时,NOFE能够解析细粒度结构,并产生平滑一致、可在不同采样密度间泛化的嵌入,从而解决了离散降维方法的关键局限。

英文摘要

Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent to many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of input discretization. We establish NOFE as approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE across different datasets, comparing it against PCA, t-SNE, and UMAP. Our results demonstrate that NOFE significantly outperforms baselines in local structure preservation, achieving a local Stress of 0.111 compared to 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP for the ERA5 climate reanalysis dataset. NOFE also exhibits robust sampling independence, reducing the Patch Stitching Error by up to $20.0\times$ relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and produces smooth, consistent embeddings that generalize across varying sample densities, addressing key limitations of discrete reduction methods.

2605.11599 2026-05-19 cs.LG 版本更新

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

面向LLM推理的定向测试:一种受审计约束的协议

Hongmin Li

Affiliations: School of Life Science and Technology, Institute of Science Tokyo; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences

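AI summary: This paper proposes an audit-constrained protocol for evaluating LLM reasoning under targeted prompt variation. It compares Component-Adaptive Prompt Sampling (CAPS) with equal-budget uniform sampling and shows that targeted prompt variation can be studied under a controlled, budget-matched protocol, although CAPS does not improve audited yield over uniform sampling.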

Comments 17 pages, 1 figure

详情
AI中文摘要

固定推理基准评估标准提示,但语义上有效的呈现变化仍可能改变模型行为。提示变化研究可揭示此类失败,但缺乏审计时可能混杂真实模型错误、无效扰动、提取伪影和不匹配的搜索过程。本文提出一种受审计约束的定向推理评估协议。提示变体由有限组件语法生成,确定性渲染,固定查询预算下评估,并在经过语义和提取审计后才视为模型错误。在此协议中,我们实例化了组件自适应提示采样(CAPS),一种基于得分的提示组件采样器,并在相同任务库、渲染器、模型接口、解码设置和审计程序下,与等预算均匀组件采样进行比较。在三个受审计的切片中,该协议确认了模型错误提示键,同时排除了格式和提取伪影,但匹配比较未显示CAPS在受控产量或唯一提示键发现上优于均匀采样。贡献是方法论的:定向提示变化可以在可重建、可审查、预算匹配的协议下研究,代理引导策略应通过受控产量而非原始不匹配计数或选定示例单独判断。

英文摘要

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
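
As a rough illustration of the audited-yield idea described in this abstract, the Python sketch below counts a prompt variant as a model error only when it passes both the semantic and the extraction audit, and normalizes by the fixed query budget so that two samplers can be compared on equal terms. The record fields (`mismatch`, `semantic_ok`, `extraction_ok`) and the function name are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Outcome:
    """One evaluated prompt variant (field names are illustrative assumptions)."""
    prompt_key: str      # identifier of the rendered prompt variant
    mismatch: bool       # model answer differed from the reference answer
    semantic_ok: bool    # audit: the variant preserved the task's meaning
    extraction_ok: bool  # audit: the model answer was parsed correctly


def audited_yield(outcomes: Iterable[Outcome], budget: int) -> Tuple[float, int]:
    """Return (confirmed errors per query, number of unique confirmed prompt keys)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # A raw mismatch only counts once both audits have passed.
    confirmed = [o for o in outcomes
                 if o.mismatch and o.semantic_ok and o.extraction_ok]
    return len(confirmed) / budget, len({o.prompt_key for o in confirmed})
```

Calling `audited_yield` on the outcomes of each sampler with the same `budget` gives the matched comparison; raw mismatch counts would also include variants that fail an audit.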

2605.11518 2026-05-19 cs.AI cs.CL cs.LG 版本更新

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch: 训练研究代理以自动化LLM实验配置 - 从低成本学习,优化高成本

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Affiliations: University of Notre Dame

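AI summary: This paper proposes AutoLLMResearch, an agentic framework that learns LLM configuration principles in a multi-fidelity experimental environment in order to automate the configuration of high-cost experiments, and demonstrates its effectiveness and generalization on large-scale LLM experiments.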

详情
AI中文摘要

有效配置可扩展的大规模语言模型(LLM)实验,涵盖架构设计、超参数调优等,对推进LLM研究至关重要,因为糟糕的配置选择会浪费大量计算资源并阻碍模型潜力的实现。以往的自动化方法适用于低成本环境,但可扩展的LLM实验成本过高,无法进行大量迭代。为了解决这一问题,我们提出AutoLLMResearch,一个模仿人类研究人员从低保真度实验中学习一般性原则并高效识别高成本LLM配置的代理框架。核心挑战是如何使代理通过与多保真度实验环境的交互学习LLM配置景观的结构。为此,我们提出一个系统框架,包含两个关键组件:1) LLMConfig-Gym,涵盖四个关键LLM实验任务的多保真度环境,支持超过一百万GPU小时的可验证实验结果;2) 一个结构化训练管道,将配置研究建模为长周期马尔可夫决策过程,并相应地激励跨保真度外推推理。在各种强基线上的广泛评估表明了我们框架的有效性、通用性和可解释性,支持其作为大规模现实LLM实验自动化的实用且通用解决方案的潜力。

英文摘要

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

2605.11480 2026-05-19 cs.LG 版本更新

Efficient Adjoint Matching for Fine-tuning Diffusion Models

高效对抗匹配用于扩散模型微调

Jeongwoo Shin, Dongsoo Shin, Yuchen Zhu, Wei Guo, Yongxin Chen, Joonseok Lee, Jaewoong Choi, Jaemoo Choi

Affiliations: Seoul National University; Georgia Institute of Technology; Sungkyunkwan University

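AI summary: This paper proposes Efficient Adjoint Matching (EAM), which reformulates the stochastic optimal control problem with a linear base drift and a modified terminal cost. This removes the computational bottlenecks of Adjoint Matching in diffusion model fine-tuning, converging up to 4x faster while matching or surpassing AM on multiple metrics.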

详情
AI中文摘要

奖励微调已成为对齐预训练扩散和流模型与人类偏好的常见方法。在基于奖励梯度的方法中,对抗匹配(AM)通过将奖励微调视为随机最优控制(SOC)问题提供了系统化的公式。然而,AM不可避免地需要显著的计算成本:它要求(i)在无记忆动态下对完整生成轨迹进行随机模拟,导致大量的函数评估,以及(ii)沿每个采样轨迹进行反向ODE模拟。在本工作中,我们观察到这两个瓶颈都与从预训练模型继承的非平凡基础漂移密切相关。受此启发,我们提出高效对抗匹配(EAM),通过将SOC问题改用线性基础漂移和相应修改的终端成本,大幅提高训练效率。此改写消除了两种无效来源;它使训练时采样能够使用几步确定性ODE求解器,并产生闭合形式的伴随解,从而消除反向伴随模拟。在标准的文本到图像奖励微调基准上,EAM比AM快4倍收敛,并在PickScore、ImageReward、HPSv2.1、CLIPScore和Aesthetics等各项指标上匹配或超越了AM。

英文摘要

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics, including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.

2605.10923 2026-05-19 cs.LG cs.CL 版本更新

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

动态技能生命周期管理用于代理强化学习

Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng

Affiliations: Database Group, The Chinese University of Hong Kong; Lanzhou University

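AI summary: This paper proposes SLIM, a framework that treats the active external skill set in agentic reinforcement learning as a dynamic optimization variable updated jointly with the policy, improving task performance.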

Comments Implementation code is available at https://github.com/ejhshen/SLIM

详情
AI中文摘要

大型语言模型代理越来越多地依赖外部技能来解决复杂任务,其中技能作为模块化单元扩展其能力。现有方法假设外部技能要么积累为持久指导或内化到策略中,最终导致零技能推断。本文认为这一假设过于限制,因为参数容量有限且不同技能的边际贡献不均,最优活跃技能集是非单调、任务和阶段依赖的。本文提出SLIM,一种动态技能生命周期管理框架,将活跃的外部技能集作为动态优化变量与策略学习共同更新。具体而言,SLIM通过留一技能验证估计每个活跃技能的边际外部贡献,然后应用三种生命周期操作:保留高价值技能、退役贡献变得微不足道的技能、以及在持续失败揭示缺失能力覆盖时扩展技能库。实验显示,SLIM在ALFWorld和SearchQA上平均比最佳基线高出7.1个百分点。结果进一步表明,策略学习和外部技能保留并非互斥:某些技能被吸收进策略,而其他技能继续提供外部价值,支持SLIM作为基于技能的代理强化学习更通用的范式。

英文摘要

Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or are internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive: with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic and both task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1 percentage points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
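
The leave-one-skill-out estimate and the retain/retire decision described in this abstract could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `evaluate` is a hypothetical callback returning a validation score for a given skill set, the thresholds `min_gain` and `min_exposure` are made-up values, and the third operation (expanding the skill bank after persistent failures) is not modeled.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple


def marginal_contributions(
    active: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Leave-one-skill-out: score with all skills minus score without each skill."""
    skills = frozenset(active)
    full_score = evaluate(skills)
    return {s: full_score - evaluate(skills - {s}) for s in skills}


def retain_or_retire(
    active: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    exposure: Dict[str, int],
    min_gain: float = 0.01,   # assumed threshold for a "negligible" contribution
    min_exposure: int = 3,    # assumed number of uses that counts as "sufficient"
) -> Tuple[Set[str], Set[str]]:
    """Return (retained, retired) skill sets for one lifecycle step."""
    gains = marginal_contributions(active, evaluate)
    # Retire only skills that are both low-value and sufficiently exposed.
    retired = {s for s, g in gains.items()
               if g < min_gain and exposure.get(s, 0) >= min_exposure}
    return set(gains) - retired, retired
```

A skill with low measured gain but little exposure is kept in this sketch, which mirrors the abstract's wording that retirement happens only "after sufficient exposure".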

2605.10759 2026-05-19 cs.LG cs.CV 版本更新

Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

强化共轭匹配:扩散和流匹配模型的后训练强化学习扩展

Andreas Bergmeister, Stefanie Jegelka, Nikolas Nüsken, Carles Domingo-Enrich, Jakiw Pidstrigach

Affiliations: MIT CSAIL; King's College London; Microsoft Research New England; University of Oxford

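AI summary: This paper proposes Reinforce Adjoint Matching (RAM), an RL post-training method for diffusion and flow-matching models that requires no SDE rollouts or reward gradients and improves generation quality and alignment with human preferences.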

详情
AI中文摘要

扩散和流匹配模型的扩展性源于预训练的监督回归:干净样本通过分析噪声,模型回归闭式目标。强化学习后训练将模型对齐于奖励。在图像生成中,这使样本正确组成物体、清晰渲染文本并匹配人类偏好。现有方法依赖于成本高的SDE回滚、奖励梯度或替代损失,牺牲了预训练的回归结构。我们证明结构可扩展至强化学习后训练。在KL正则化的奖励最大化下,最优生成过程使干净端点分布向奖励更高的样本倾斜,而噪声法则不变。结合此与共轭匹配最优条件和REINFORCE恒等式,我们推导出Reinforce Adjoint Matching(RAM):一种一致性损失,修正预训练目标与奖励。每一步,从当前模型抽样干净端点,评估其奖励,按预训练方式噪声化,并回归。无需SDE回滚、反向共轭扫描或奖励梯度。如同预训练目标,RAM简单且可扩展。在Stable Diffusion 3.5M上,RAM在可组合性、文本渲染和人类偏好方面达到最高奖励,达到Flow-GRPO的峰值奖励,训练步骤减少达50倍。

英文摘要

Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.

2605.09395 2026-05-19 cs.AI cs.LG cs.MA cs.MM 版本更新

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

通过定制代理推理增强VLMs在少样本多模态时间序列分类中的能力

Lin Li, Jiawei Huang, Qihao Quan, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Wenjie Feng, Jian Lou, See-Kiong Ng

Affiliations: Sun Yat-sen University; Xiaomi Corporation; University of Science and Technology of China; National University of Singapore

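AI summary: This paper proposes MarsTSC, a framework that uses a self-evolving knowledge bank and agentic reasoning to improve few-shot multimodal time series classification. Experiments show that it outperforms classical and foundation-model baselines across six VLM backbones.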

Comments 18 pages, 12 figures, 6 tables. Preprint

详情
AI中文摘要

本文提出首个VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning框架用于少样本多模态时间序列分类(MarsTSC),引入自演化知识库作为动态上下文,通过反思代理推理不断优化。框架包含三个协作角色:i) 生成器通过推理进行可靠分类;ii) 反射器诊断推理错误根源以获得判别性见解;iii) 修改器应用验证更新以防止上下文崩溃。进一步引入测试时更新策略以实现谨慎持续的知识库优化,缓解少样本偏差和分布偏移。在12个主流时间序列基准上的广泛实验表明,MarsTSC在六个VLM基础上均取得显著且一致的性能提升,优于传统和基础模型基线,并生成可解释的推理依据,使每个分类决策都基于人类可读的特征证据。

英文摘要

In this paper, we propose the first VLM agentic reasoning framework for few-shot multimodal Time Series Classification (MarsTSC), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that MarsTSC delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.

2605.09183 2026-05-19 cs.LG 版本更新

Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

学习何时停止:在任意动态偏移下的选择性模仿学习

Surbhi Goel, Jonathan Pei, James Wang

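AI summary: This paper studies selective imitation learning under dynamics shift and proposes the SeqRejectron algorithm, which achieves horizon-free sample complexity guarantees for both deterministic and stochastic policies.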

详情
AI中文摘要

行为克隆在训练和测试环境动态一致时提供强模仿学习保证。然而在许多部署场景中,测试环境的转移与训练不同,经典离线IL无能为力:学习者必须在每个状态采取行动,即使示范数据无信息且可能导致性能任意退化。本文研究选择性模仿,学习者可在无法可靠行动时停止。提出在任意动态偏移下的选择性模仿模型:给定训练环境的标记专家示范和测试环境的未标记状态轨迹,学习者输出选择性策略,该策略在训练中很少停止,在测试中产生低遗憾。算法SeqRejectron使用少量验证策略构建停止规则,其大小与 horizon 或策略类无关。对于确定性策略,此方法获得无 horizon 的样本复杂度 $\tilde{O}(\log|Π|/ε^2)$,假设稀疏成本。对于随机性策略,使用累积Hellinger停止时间获得类似无 horizon 保证。框架扩展到不规范专家和训练测试不同专家策略,结果随不规范程度优雅退化。

英文摘要

Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\varepsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and to different expert policies across train and test, and obtain results that gracefully degrade with the amount of misspecification.

2605.08794 2026-05-19 cs.LG cs.AI 版本更新

Deterministic Decomposition of Stochastic Generative Dynamics

确定性分解随机生成动力学

Xingyu Song, Yuan Mei, Naoya Takeishi

Affiliations: The University of Tokyo; Zhejiang University

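AI summary: This paper proposes Bridge Matching, a framework that decomposes generative dynamics into deterministic transport and stochastic, diffusion-induced effects, enabling controllable generative sampling.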

Comments 10 pages main text, 6 figures; appendix included. Code available at: https://github.com/xingyu-song/bridge_matching

详情
AI中文摘要

现代生成模型可视为从简单基础分布到目标数据分布的概率传输。确定性传输模型提供可计算的速度场参数化,而随机生成模型通过漂移和扩散捕捉更丰富的密度演变。然而,当随机动力学通过确定性速度场描述时,漂移和扩散的影响常被压缩为单一有效场,掩盖了确定性演化和随机波动的差异作用。本文表明,随机生成过程的确定性场$b_t$可自然分解为传输-渗透分解,分离确定性传输与随机扩散效应:$b_t = u_t + d_t$,其中$u_t$控制边际概率传输,$d_t$由扩散和边际分数决定。基于此分解,我们提出Bridge Matching框架,通过边际和条件形式学习分解的生成动力学。在生成模型实验中,我们重新组合学习的组件作为$b_t = u_t + λ_d d_t$,显示所提分解通过调整概率传输中的渗透贡献实现可解释和可控的采样。

英文摘要

Modern generative models can be understood as probability transport from a simple base distribution to a target data distribution. Deterministic transport models offer tractable velocity-field parameterizations, whereas stochastic generative models capture richer density evolution through drift and diffusion. Yet when stochastic dynamics are described through deterministic velocity fields, the effects of drift and diffusion are often compressed into a single effective field, obscuring the distinct roles of deterministic evolution and stochastic fluctuation. In this work, we show that the deterministic field \(b_t\) of a stochastic generative process admits a natural transport--osmotic decomposition that separates deterministic transport from stochastic, diffusion-induced effects: \(b_t = u_t + d_t\), where \(u_t\) governs marginal probability transport and \(d_t\) captures an osmotic effect induced by diffusion and determined by the marginal score. Based on this decomposition, we propose Bridge Matching, a flow-based framework for learning decomposed generative dynamics through both marginal and conditional formulations. In generative modeling experiments, we recombine the learned components as \(b_t = u_t + λ_d d_t\), showing that the proposed decomposition enables interpretable and controllable sampling by adjusting the osmotic contribution in probability transport.

2605.08550 2026-05-19 cs.LG stat.ML 版本更新

A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots

对拉格朗日作用的呼吁:从时间快照中学习群体动力学

Vincent Guan, Lazar Atanackovic, Kirill Neklyudov

Affiliations: University of British Columbia; Broad Institute of MIT and Harvard; University of Alberta; Alberta Machine Intelligence Institute; Mila - Quebec AI Institute

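AI summary: This paper proposes a new approach to learning population dynamics from temporal snapshots, based on minimizing a population-level action under a damped Wasserstein Lagrangian. The proposed WLM algorithm can forecast and interpolate unseen marginals and performs well across a variety of dynamics.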

Comments Accepted at ICML 2026 (spotlight)

详情
AI中文摘要

分子、细胞和生物体的群体动力学由若干未知力支配。过去十年中,群体动力学主要通过韦瑟斯特拉格梯度流建模。然而,由于梯度流最小化自由能,它们无法捕捉重要的动态特性,如周期性。本文提出通过考虑在阻尼韦瑟斯特拉格拉格朗日下最小化群体层面作用的动力学,推导对应的哈密顿方程,正式化韦瑟斯特拉格拉格梯度流力学,即一类包含经典力学、量子力学和梯度流的结构化二阶动力学。随后提出WLM作为首个能从观测边际中学习这些二阶动力学的算法,无需指定拉格朗日量。通过直接学习群体动力学,WLM能够预测和插值未见边际,并在广泛动态中优于现有梯度流和流匹配方法,包括涡旋动力学、胚胎发育和鸟群行为。

英文摘要

The population dynamics of molecules, cells, and organisms are governed by a number of unknown forces. In the last decade, population dynamics have predominantly been modeled with Wasserstein gradient flows. However, since gradient flows minimize free energy, they fail to capture important dynamical properties, such as periodicity. In this work, we propose a change in perspective by considering dynamics that minimize a population-level action under a damped Wasserstein Lagrangian. By deriving the corresponding Hamiltonian equations of motion, we formalize Wasserstein Lagrangian Mechanics, a structured class of second-order dynamics that encompasses classical mechanics, quantum mechanics, and gradient flows. We then propose WLM as the first algorithm that learns these second-order dynamics from observed marginals, without specifying the Lagrangian. By directly learning the population mechanics, WLM can both forecast and interpolate unseen marginals, and outperforms existing gradient flow and flow matching methods across a wide range of dynamics, including vortex dynamics, embryonic development, and flocking.

2605.07501 2026-05-19 cs.LG cs.CL 版本更新

ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

ExpThink:基于经验的强化学习用于自适应思路链压缩

Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng, Haiwei Wang, Wenyuan Jiang, Miaohui Wang

Affiliations: Baidu Inc.; Shenzhen University; Peking University; Tsinghua University; D-INFK, ETH Zürich

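AI summary: This paper proposes ExpThink, a framework that combines experience-guided reward shaping with a difficulty-adaptive advantage to enforce an accuracy-first, compression-second training objective for chain-of-thought compression. Experiments on multiple mathematical reasoning benchmarks show substantially shorter responses together with improved accuracy.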

Comments 39 pages, 18 figures. Code and model checkpoints will be released upon publication

详情
AI中文摘要

大型推理模型(LRMs)通过扩展的思路链(CoT)推理实现强性能,但面临过度的token消耗和高推理延迟。现有CoT压缩的强化学习(RL)方法依赖于统一的静态长度惩罚,忽略了模型能力动态和问题级别难度变化。本文提出ExpThink框架,通过两个互补机制解决这两个维度:首先,经验引导奖励塑造跟踪每个问题迄今为止找到的最短正确解决方案,并应用三级奖励:对简洁正确的响应给予全额信用,对冗长正确的响应给予折扣信用,对错误的响应给予零信用。阈值随着模型改进自动收紧,形成自我演进的课程,无需手动调度。其次,难度自适应优势将标准差归一化替换为正确计数归一化,产生单调难度缩放的梯度,放大对难题的学习以保持准确性,同时抑制对简单问题的梯度以促进简洁性。这些机制强制执行准确性优先、压缩次之的训练目标。在多个数学推理基准上的实验表明,ExpThink减少了平均响应长度高达77%,同时提高了准确性,实现了比基线模型高3倍的准确性-效率比,并在两个指标上优于现有基于RL的压缩方法。

英文摘要

Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to a $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
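
The three-tier reward in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the "concise" threshold is simply the shortest correct length seen so far for that problem, and the discount value `0.5` is a made-up placeholder. The difficulty-adaptive advantage is not modeled.

```python
from typing import Dict, Hashable


class ExperienceReward:
    """Three-tier reward with a per-problem record of the shortest correct answer."""

    def __init__(self, discount: float = 0.5) -> None:
        self.discount = discount                  # assumed credit for verbose correct answers
        self.best_len: Dict[Hashable, int] = {}   # problem id -> shortest correct length so far

    def __call__(self, problem_id: Hashable, correct: bool, length: int) -> float:
        if not correct:
            return 0.0                            # tier 3: incorrect responses get nothing
        best = self.best_len.get(problem_id)
        # Tier 1: at least as short as the best known correct answer (or the first one).
        concise = best is None or length <= best
        if best is None or length < best:
            self.best_len[problem_id] = length    # the threshold tightens as the model improves
        return 1.0 if concise else self.discount  # tier 2: correct but verbose
```

Because the stored length only ever decreases, a response that earned full credit early in training may earn only discounted credit later, which is one way to read the "self-evolving curriculum" mentioned above.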

2605.07364 2026-05-19 cs.LG 版本更新

FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI

FlightSense: 一个端到端的MLOps平台,通过旋转链传播特征和代理对话式AI实现实时航班延误预测

Aditi J. Shelke, Renuka J. Shelke, Yash M. Kamerkar, Nitin P. Hazarani

Affiliations: Stevens Institute of Technology; Axtria Inc.; Meta

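AI summary: This paper presents FlightSense, a platform for real-time flight delay prediction that combines rotation-chain propagation features with a conversational AI assistant. A three-version feature engineering framework progressively improves predictive performance, reaching a final test AUC of 0.879.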

Comments 12 pages, 5 figures, 9 tables; machine learning, MLOps, aviation delay prediction

详情
AI中文摘要

航班延误对航空网络造成连锁运营和财务负担,每年使美国经济损失数十亿美元。尽管已有方法表现良好,但多数将上游延误视为静态变量而非动态传播。本文提出FlightSense平台,通过三级特征工程框架实现端到端MLOps系统,整合天气数据和对话式AI,最终在测试集上达到AUC 0.879。

英文摘要

Flight delays impose cascading operational and financial burdens across the aviation network, costing the U.S. economy billions of dollars annually by disrupting interconnected aircraft rotation systems. While prior machine learning approaches have demonstrated strong predictive performance, most treat upstream delays as static input variables rather than explicitly modeling how delays propagate dynamically through aircraft rotation chains, and none have deployed such systems alongside a live weather-aware conversational AI interface for end-user interaction. This paper presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction built through a progressive three-version feature engineering framework. Version 1 trains an XGBoost classifier on 11 schedule-based features establishing a baseline ROC AUC of 0.732 on 7.07 million BTS 2018 On-Time Performance records. Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, yielding the dominant performance gain (AUC 0.732 to 0.875) and surpassing the single-stage XGBoost baseline reported by Zhou (2025). Version 3 integrates five NOAA meteorological features across 10 major U.S. airports, achieving a final test set AUC of 0.879. FlightSense is deployed as a production AWS MLOps pipeline incorporating live weather ingestion via Lambda, real-time SageMaker inference, an interactive Streamlit dashboard, and an Amazon Bedrock Nova Micro conversational assistant answering natural-language delay queries via a tool-use architecture.
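
To make the idea of rotation-chain propagation features concrete, the Python sketch below derives two simple features by following each aircraft (identified by its tail number) through its scheduled legs. The field names (`tail_number`, `sched_dep`, `arr_delay_min`) and the two output features are assumptions for illustration; they are not the paper's 11 propagation features or the BTS column names.

```python
from typing import Any, Dict, List


def add_rotation_features(flights: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the previous leg's arrival delay and the leg index within each aircraft's chain.

    Each input record needs: tail_number, sched_dep (any sortable value), arr_delay_min.
    Output rows are ordered by (tail_number, sched_dep).
    """
    last_delay: Dict[str, float] = {}   # tail number -> arrival delay of its latest leg
    leg_count: Dict[str, int] = {}      # tail number -> number of legs seen so far
    out = []
    for f in sorted(flights, key=lambda f: (f["tail_number"], f["sched_dep"])):
        tail = f["tail_number"]
        row = dict(f)                   # copy so the caller's records are not mutated
        # The first leg of a chain has no upstream delay to inherit.
        row["prev_leg_arr_delay"] = last_delay.get(tail, 0.0)
        row["leg_index"] = leg_count.get(tail, 0)
        last_delay[tail] = f["arr_delay_min"]
        leg_count[tail] = row["leg_index"] + 1
        out.append(row)
    return out
```

In a real-time setting, the previous leg's delay would only be usable if that leg has already landed (or has a live estimate) at prediction time; this sketch ignores that constraint.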

2605.07333 2026-05-19 cs.LG 版本更新

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

超越线性注意:Softmax变换器实现上下文强化学习

Zixuan Xie, Xinyu Liu, Claire Chen, Shuze Daniel Liu, Rohan Chandra, Shangtong Zhang

Affiliations: University of Virginia; California Institute of Technology; Purdue University

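AI summary: This paper studies in-context reinforcement learning agents that adapt to new tasks after pretraining by conditioning on context. For standard softmax attention, it shows that the layerwise forward pass of a Transformer is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm, and it proves that the corresponding parameters are a global minimizer of a pretraining loss.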

详情
AI中文摘要

上下文强化学习(ICRL)研究的是在预训练后,通过额外上下文条件适应新任务而无需参数更新的智能体。现有ICRL理论分析大多依赖线性注意力,即用身份映射替代标准注意力中的softmax函数。本文首次在不采用不现实的线性注意力简化的情况下,提供了ICRL的理论理解。特别地,我们考虑了实践中使用的标准softmax注意力。我们证明,在某些参数下,具有此类softmax注意力的Transformer的层间前向传递等价于加权softmax时序差分(TD)学习算法的迭代更新。这里,加权softmax TD是一种新的强化学习算法,它在核空间中进行策略评估,并采用线性TD和表格TD作为特殊情况。我们还证明,在某种收缩条件下,随着层数增加,策略评估误差会减小,且上述参数在此条件下成立。最后,我们证明这些参数是预训练损失的全局极小值,解释了它们在数值实验中的出现。

英文摘要

In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.

2605.07098 2026-05-19 cs.LG physics.comp-ph 版本更新

CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation

CarCrashNet:一个大规模数据集和分层神经求解器用于数据驱动的结构碰撞仿真

Mohamed Elrefaie, Dule Shu, Matthew Klenk, Faez Ahmed

Affiliations: MIT; Toyota Research Institute; Future Product Innovation

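AI summary: This paper introduces the CarCrashNet dataset and a hierarchical neural solver for data-driven structural crash simulation. The dataset contains more than 14,000 bumper-beam pole-impact simulations and 825 full-vehicle crash simulations. The open-source finite-element workflow is validated, and machine-learning solvers are benchmarked on the released data.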

详情
AI中文摘要

碰撞仿真是现代汽车开发的核心,因为它减少了昂贵物理原型的需求,加速了安全驱动的设计迭代,并越来越多地支持虚拟测试流程。同时,建模结构碰撞力学仍然极具挑战性:响应由非线性接触、大变形、材料塑性、失效和复杂多体相互作用在高分辨率有限元网格上空间和时间演化决定。在本工作中,我们介绍了CarCrashNet,一个公开的高保真开源基准,用于数据驱动的结构碰撞仿真。CarCrashNet结合了组件级和整车级别的仿真,在多模态格式中包含超过14000个保险杠-梁杆碰撞仿真,具有变化的几何形状、材料和边界条件,以及825辆整车碰撞仿真,基于三种行业标准车辆模型:Dodge Neon、Toyota Yaris和Chevrolet Silverado。为了建立基准的可靠性,我们验证了基于OpenRadioss的开源有限元工作流程,与实验碰撞数据和商业求解器Ansys LS-DYNA进行对比。我们还引入了CrashSolver,一种设计用于从高分辨率有限元碰撞数据预测整车碰撞的机器学习模型。我们进一步在发布的数据集上进行了广泛的基准测试,并评估了CrashSolver与最先进的几何深度学习和基于变压器的神经求解器。我们的结果将CarCrashNet定位为结构仿真、碰撞worthiness建模和AI驱动的虚拟碰撞测试可重复研究的基础。数据集可在https://github.com/Mohamedelrefaie/CarCrashNet获取。

英文摘要

Crash simulation is a cornerstone of modern vehicle development because it reduces the need for costly physical prototypes, accelerates safety-driven design iteration, and increasingly supports virtual testing workflows. At the same time, modeling structural crash mechanics remains exceptionally challenging: the response is governed by nonlinear contact, large deformation, material plasticity, failure, and complex multi-body interactions evolving over space and time on high-resolution finite-element meshes. In this work, we introduce CarCrashNet, a public high-fidelity open-source benchmark for data-driven structural crash simulation. CarCrashNet combines component-scale and full-vehicle simulations in a multi-modal format, including more than 14,000 bumper-beam pole-impact simulations with varying geometry, materials, and boundary conditions, together with 825 full-vehicle crash simulations built from three industry-standard vehicle models of increasing structural complexity: Dodge Neon, Toyota Yaris, and Chevrolet Silverado. To establish the reliability of the benchmark, we validate our open-source finite-element workflow based on OpenRadioss against both experimental crash data and the commercial solver Ansys LS-DYNA. We also introduce CrashSolver, a machine-learning model designed for full-vehicle crash prediction from high-resolution finite-element crash data. We further perform extensive benchmarking across the released datasets and evaluate CrashSolver against state-of-the-art geometric deep learning and transformer-based neural solvers. Our results position CarCrashNet as a foundation for reproducible research in structural simulation, crashworthiness modeling, and AI-driven virtual crash testing. The dataset is available at https://github.com/Mohamedelrefaie/CarCrashNet.

2605.07005 2026-05-19 cs.DS cs.LG 版本更新

Equivalence of Coarse and Fine-Grained Models for Learning with Distribution Shift

粗粒度与细粒度模型在存在分布偏移学习中的等价性

Adam R. Klivans, Shyamal Patel, Konstantinos Stavropoulos, Arsen Vasilyan

Affiliations: UT Austin; Columbia University

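AI summary: This paper establishes an equivalence between coarse-grained and fine-grained models of learning with distribution shift in the distribution-free setting, via an efficient black-box reduction from PQ learning to TDS learning. It also shows that membership queries sidestep the resulting hardness results, allowing efficient distribution-free PQ learning of halfspaces.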

Comments 26 pages, Accepted to COLT 2026

详情
AI中文摘要

最近关于能保证高效学习存在分布偏移的算法研究,集中在两种模型上:PQ学习(Goldwasser等人,2020)和TDS学习(Klivans等人,2024)。TDS学习算法允许在检测到分布偏移时完全拒绝测试集,而PQ学习者只能根据个体点是否被认为是分布外来拒绝。我们的主要结果是在无分布假设下,这两种模型之间存在令人惊讶的等价性。具体来说,我们为任何布尔概念类给出了高效的黑盒减少方法,将PQ学习转换为TDS学习。这种等价性意味着首次在无分布假设下对基本类如半空间的TDS学习的难度结果。我们等价性的主要技术贡献是通过分支程序提升TDS学习者拒绝目标域的弱区分能力。我们还展示,给学习器提供成员查询访问可以绕过这些难度结果,并允许高效地分布自由地学习半空间。我们的算法通过迭代地从训练数据上应用连续的Forster变换来恢复大边距分离器。

英文摘要

Recent work on provably efficient algorithms for learning with distribution shift has focused on two models: PQ learning (Goldwasser et al., 2020) and TDS learning (Klivans et al., 2024). Algorithms for TDS learning are allowed to reject a test set entirely if distribution shift is detected. In contrast, PQ learners may only reject points that are deemed out-of-distribution on an individual basis. Our main result is a surprising equivalence between these two models in the distribution-free setting. In particular, we give an efficient black-box reduction from PQ learning to TDS learning for any Boolean concept class. This equivalence implies the first hardness results for distribution-free TDS learning of basic classes such as halfspaces. The main technical contribution underlying our equivalence is a method for boosting, via branching programs, the weak distinguishing power of TDS learners that have rejected the target domain. We also show that giving a learner access to membership queries sidesteps these hardness results and allows for efficient, distribution-free PQ learnability of halfspaces. Our algorithm iteratively recovers large-margin separators obtained by applying successive Forster transforms on the training data.

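2605.06017 2026-05-19 cs.LG math.PR Updated version

Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards

Pei-Sen Li

Affiliations * School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100872, China

AI summary This paper proposes the Matrix-Decoupled Concentration (MDC) framework, which couples the strictly causal dependency resolvent with the target sensitivity vector through an exact matrix-vector product. This avoids separating dependency structure from target sensitivities in autoregressive models and yields a dimension-free variance proxy together with stability guarantees for long-context reasoning.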

Abstract

Sequence-level evaluations in autoregressive Large Language Models (LLMs) rely on highly dependent token generation. Establishing tight concentration bounds for these processes remains a challenge due to two fundamental bottlenecks in existing frameworks: (i) classical inequalities typically separate dependency structures from target sensitivities, leading to a scalar collapse that inflates the variance proxy to a suboptimal $\mathcal{O}(N)$ for sparse terminal rewards; (ii) conversely, while certain spatial methods achieve tighter bounds, they lack the strictly causal filtration required by sequential generation, rendering them inapplicable to the autoregressive setting. To resolve both bottlenecks, we establish a sharp McDiarmid-type inequality for dependent sequences, governed strictly by the exact matrix-vector multiplication of the causal dependency resolvent and the target sensitivity vector. This Matrix-Decoupled Concentration (MDC) framework natively recovers optimal constants for Markov chains and exploits directed $d$-separation to yield order-optimal bounds for causal trees. Crucially, by exactly preserving the coordinate-wise sparsity of rewards within a strictly causal framework, MDC mathematically prevents scalar collapse, guaranteeing a dimension-free $\mathcal{O}(1)$ variance proxy and providing a rigorous mathematical justification for the stability of long-context reasoning.

2605.05870 2026-05-19 cs.LG Updated version

QuadraSHAP: Stable and Scalable Shapley Values for Product Games via Gauss-Legendre Quadrature

Majid Mohammadi, Grigory Reznikov, Pavel Sinitcyn, Krikamol Muandet, Siu Lun Chau

Affiliations * AI Technology for Life, Information and Computing Sciences, Utrecht University, The Netherlands; Biomolecular Mass Spectrometry and Proteomics, Pharmaceutical Sciences, Utrecht University, The Netherlands; Rational Intelligence Lab, CISPA Helmholtz Center for Information Security, Germany; Epistemic Intelligence & Computation Lab, College of Computing & Data Science, Nanyang Technological University, Singapore

AI summary This paper proposes QuadraSHAP, which uses Gauss-Legendre quadrature to compute Shapley values for product games stably and scalably. It targets machine learning explainability settings in which the value function has a multiplicative structure.

Abstract

We study the efficient computation of Shapley values for product games: cooperative games in which the coalition value factorizes as a product of per-player terms. Such games arise in machine learning explainability whenever the value function inherits a multiplicative structure from the underlying model, as in kernel methods with product kernels and tree-based models. Our key result is that the Shapley value of each player in a product game admits an exact one-dimensional integral representation: the weighted sum over exponentially many feature coalitions collapses to the integral of a degree-$(d-1)$ polynomial over $[0,1]$, where $d$ is the total number of features. This yields a Gauss-Legendre quadrature scheme that is provably exact whenever the number of nodes satisfies $m_q \geq \lceil d/2 \rceil$, and otherwise provides a near-exact approximation with error provably decaying geometrically in $m_q$. In practice, a few hundred nodes can achieve highly precise estimates even with thousands of features. Building on this formulation, we derive a numerically stable implementation via log-space evaluation, together with an efficient parallel implementation based on associative scan primitives that achieves $O(d\,m_q)$ total work and $O(\log d)$ parallel time. Experiments show that QuadraSHAP is the fastest numerically stable method across all tested configurations.
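The integral representation described in this abstract can be illustrated with a small sketch. It assumes the simplest product game, $v(S) = \prod_{j \in S} a_j$ with $v(\emptyset) = 1$, for which the Beta-integral identity $\frac{|S|!\,(d-|S|-1)!}{d!} = \int_0^1 t^{|S|}(1-t)^{d-1-|S|}\,dt$ gives $\phi_i = (a_i - 1)\int_0^1 \prod_{j \neq i}(1 - t + t\,a_j)\,dt$. The function names and the per-player factors `a` are hypothetical, and this is a reconstruction from the abstract rather than the authors' implementation; the log-space and parallel-scan parts are not shown.

```python
import itertools
import math

import numpy as np


def shapley_product_quadrature(a, num_nodes=None):
    """Shapley values of v(S) = prod_{j in S} a_j via Gauss-Legendre quadrature.

    Assumes v(empty set) = 1. The integrand has degree d - 1, so
    ceil(d / 2) nodes integrate it exactly (up to floating point).
    """
    a = np.asarray(a, dtype=float)
    d = a.size
    if num_nodes is None:
        num_nodes = math.ceil(d / 2)
    x, w = np.polynomial.legendre.leggauss(num_nodes)
    t = 0.5 * (x + 1.0)  # map nodes from [-1, 1] to [0, 1]
    w = 0.5 * w          # rescale weights for the new interval
    # factors[k, j] = 1 - t_k + t_k * a_j
    factors = 1.0 - t[:, None] + t[:, None] * a[None, :]
    phi = np.empty(d)
    for i in range(d):
        others = np.delete(factors, i, axis=1).prod(axis=1)
        phi[i] = (a[i] - 1.0) * np.dot(w, others)
    return phi


def shapley_product_bruteforce(a):
    """Reference implementation: explicit sum over all coalitions."""
    a = [float(v) for v in a]
    d = len(a)
    phi = [0.0] * d
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for size in range(d):
            weight = (math.factorial(size) * math.factorial(d - size - 1)
                      / math.factorial(d))
            for subset in itertools.combinations(rest, size):
                v_s = math.prod(a[j] for j in subset)
                # marginal contribution v(S + i) - v(S) = (a_i - 1) * v(S)
                phi[i] += weight * (a[i] - 1.0) * v_s
    return np.array(phi)


if __name__ == "__main__":
    a = [0.5, 2.0, 1.5, 3.0]
    print(shapley_product_quadrature(a))
    print(shapley_product_bruteforce(a))
```

The brute-force routine is exponential in $d$ and is included only as a check for small inputs; the quadrature routine as written costs $O(d^2 m_q)$ because it recomputes each leave-one-out product from scratch.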

2605.05739 2026-05-19 cs.LG cs.AI cs.CL q-fin.CP Updated version

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Affiliations * School of Electrical Engineering and Computer Science

AI summary This paper proposes a multi-dimensional behavioral evaluation method in which large language model judges score the decision process of an agentic system, and uses closed-loop reinforcement learning feedback to improve prediction performance. The method is validated on stock prediction.

Comments 17 pages, 5 figures, 14 tables. Manuscript submitted to Applied Artificial Intelligence (Taylor and Francis)

Abstract

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.

2605.02167 2026-05-19 cs.LG cs.AI cs.CV Updated version

Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

Affiliations * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST)

AI summary This paper proposes Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder, reducing off-manifold noise and improving the reliability of feature attribution.

Comments 32 pages, 13 figures, 12 tables. Accepted to ICML 2026; includes appendix

Abstract

Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma-gig/.
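For readers unfamiliar with the baseline method, the following is a minimal sketch of plain Integrated Gradients along a straight-line path, approximated with a midpoint Riemann sum. It is not MA-GIG and not the guided variant: the latent-space path through a variational autoencoder is not shown. The toy model `toy_model`, its weights `W_TOY`, and the helper names are hypothetical and chosen only so that the gradient is available in closed form.

```python
import numpy as np

# Hypothetical toy model: f(x) = tanh(w . x), with an analytic gradient.
W_TOY = np.array([1.0, -2.0, 0.5])


def toy_model(x):
    return float(np.tanh(W_TOY @ x))


def toy_grad(x):
    return (1.0 - np.tanh(W_TOY @ x) ** 2) * W_TOY


def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Straight-line Integrated Gradients with a midpoint Riemann sum.

    IG_i = (x_i - b_i) * integral_0^1 df/dx_i(b + alpha * (x - b)) d alpha
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints in [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)  # points on the path
    grads = np.stack([grad_fn(p) for p in path])
    return (x - baseline) * grads.mean(axis=0)


if __name__ == "__main__":
    x = np.array([0.3, 0.1, -0.4])
    baseline = np.zeros(3)
    attributions = integrated_gradients(toy_grad, x, baseline, steps=200)
    # Completeness axiom: attributions sum to f(x) - f(baseline).
    print(attributions.sum(), toy_model(x) - toy_model(baseline))
```

The completeness check at the end is the axiomatic property the abstract alludes to; path-based variants change which points `path` visits while keeping this kind of accumulation.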

2605.00155 2026-05-19 cs.LG cs.CL math.OC stat.ML Updated version

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Yikai Wang, Shang Liu, Jose Blanchet

Affiliations * Department of Statistics and Operations Research, University of North Carolina; Imperial Business School, Imperial College London; Department of Management Science and Engineering, Stanford University

AI summary This paper proposes Wasserstein distributionally robust regret optimization (DRRO) for reinforcement learning from human feedback. Studying the promptwise problem through a simplex allocation model, it shows that under an $\ell_1$-ground-cost Wasserstein ambiguity set the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure, which leads to an efficient policy-gradient algorithm.

Abstract

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$-ground-cost Wasserstein ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.

2604.26904 2026-05-19 cs.CL cs.AI cs.LG Updated version

ClawGym: A Scalable Framework for Building Effective Claw Agents

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

Affiliations * Gaoling School of Artificial Intelligence; Renmin University of China; IQuest Research; Beihang University

AI summary This paper presents ClawGym, a framework covering the full lifecycle of Claw-style personal agent development, which improves agent effectiveness through synthesized verifiable training data and reinforcement learning.

Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.

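2604.25858 2026-05-19 cs.LG cs.AI Updated version

Investigation into In-Context Learning Capabilities of Transformers

Rushil Chandrupatla, Leo Bangayan, Sebastian Leng

AI summary This paper studies in-context learning on Gaussian-mixture binary classification tasks through systematic experiments, analyzing how input dimension, the number of in-context examples, and the number of pre-training tasks affect in-context test accuracy, and examining when benign overfitting arises.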

Abstract

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input-output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three fundamental factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still achieving strong generalization performance on clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide a comprehensive empirical map of scaling behavior in in-context classification, highlighting the critical role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails.

2604.23437 2026-05-19 cs.CR cs.LG Updated version

Scalable and Verifiable Federated Learning for Cross-Institution Financial Fraud Detection

Prajwal Panth, Nishant Nigam

Affiliations * School of Computer Engineering, KIIT Deemed to be University; School of Electronics Engineering, KIIT Deemed to be University

AI summary This paper proposes the DSFL framework, which uses Dynamic Stochastic Sharding and Linear Integrity Tags for efficient secure aggregation in cross-institution financial fraud detection. Experiments show lower aggregation latency at scale and higher fraud recall than locally trained models.

Comments 8 pages, 7 figures. Preprint

Abstract

Financial fraud increasingly exploits institutional boundaries: laundering networks distribute transactions across multiple banks because no single institution can observe the full pattern. Federated Learning (FL) enables collaborative detection without raw data sharing, yet practical deployment in banking environments remains constrained by three pressures. First, homomorphic encryption schemes impose high computational costs that limit real-time aggregation at scale. Second, mask-based protocols such as Google's SecAgg require O(N^2) pairwise key exchanges, which become inefficient as participant count grows. Third, existing protocols provide limited verification that submitted gradient updates are well-formed, leaving aggregation vulnerable to consistency attacks. This paper presents Dynamic Sharded Federated Learning (DSFL), a secure aggregation framework for cross-institution fraud detection. DSFL introduces Dynamic Stochastic Sharding, which partitions participants into small cryptographically ephemeral clusters of fixed size m, reducing communication complexity to O(N*m). Within each cluster, participants submit Linear Integrity Tags, additive-homomorphic commitments that allow the server to verify update consistency without decryption. The mechanism detects inconsistent updates rather than malicious gradients. An Active Neighborhood Recovery protocol handles mid-round dropouts by reconstructing orphaned masks. Experiments on the ULB Credit Card Fraud Detection dataset (284,807 transactions across 10 simulated banking nodes) show that DSFL achieves approximately 34x lower aggregation latency than Paillier-based secure aggregation at N=1000, based on analytical extrapolation from empirical baselines, while maintaining 99% recovery fidelity under a 20% dropout regime. Global fraud recall reached 91.2% (+/-0.8%), above the 68% average of locally trained models.
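The communication argument in this abstract can be made concrete with a short sketch. It assumes that each cluster runs standard pairwise additive masking, which is an assumption about the protocol rather than something the abstract states in detail, and it uses floating-point masks for readability where a real protocol would work over a finite field. All function names are hypothetical.

```python
import numpy as np


def pairwise_exchanges(n):
    """Number of participant pairs when all n participants exchange keys."""
    return n * (n - 1) // 2


def sharded_exchanges(n, m):
    """Number of pairs when n participants are split into clusters of size m."""
    if n % m != 0:
        raise ValueError("this sketch assumes m divides n")
    return (n // m) * pairwise_exchanges(m)


def masked_updates(updates, rng):
    """Add cancelling pairwise masks to the updates of one cluster.

    For every pair (i, j) a shared random mask is added to participant i
    and subtracted from participant j, so the masks vanish in the sum.
    """
    updates = np.asarray(updates, dtype=float)
    masked = updates.copy()
    k, dim = updates.shape
    for i in range(k):
        for j in range(i + 1, k):
            mask = rng.normal(size=dim)
            masked[i] += mask
            masked[j] -= mask
    return masked


if __name__ == "__main__":
    print(pairwise_exchanges(1000), sharded_exchanges(1000, 10))
    rng = np.random.default_rng(0)
    cluster = rng.normal(size=(5, 3))
    masked = masked_updates(cluster, rng)
    print(np.allclose(masked.sum(axis=0), cluster.sum(axis=0)))
```

With cluster size $m$ the pair count is $N(m-1)/2$, which is the $O(N \cdot m)$ scaling quoted above, against $N(N-1)/2$ for the unsharded case. Integrity tags and dropout recovery are not modeled here.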

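2604.20031 2026-05-19 math.OC cs.LG stat.ML Updated version

Decision-Focused Federated Learning Under Heterogeneous Objectives and Constraints

Konstantinos Ziliaskopoulos, Alexander Vinel

Affiliations * Auburn University

AI summary This paper studies decision-focused federated learning under heterogeneous objectives and constraints. Building on the SPO+ surrogate loss, it derives heterogeneity bounds, shows that federation is more robust under strongly convex feasible sets, and validates the findings experimentally.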

Abstract

We consider Decision-Focused Federated Learning (DFFL), a predict-then-optimize setting in which multiple clients collaboratively train predictive models for downstream linear optimization problems without exchanging raw data. Besides the data heterogeneity typical of standard federated learning, clients may also have different objective functions and feasible regions. Building on the SPO+ surrogate loss, we derive heterogeneity bounds that separate objective shift, measured through cost-vector distances, from feasible-set shift, measured through support-function and shape-distance terms. We show that, for general compact feasible sets, small objective perturbations can still induce nonvanishing decision-focused loss discrepancies, while strongly convex feasible regions yield sharper stability-based bounds. We then lift these pointwise bounds to a local-versus-federated excess-risk comparison, showing that federation is beneficial when the statistical advantage of pooling exceeds a client-specific heterogeneity penalty. Computational experiments on polyhedral and strongly convex problems confirm that federation is substantially more robust under strongly convex feasible regions. Finally, we evaluate a simple validation-based interpolation between local and federated DFFL models. This interpolation mitigates the theoretical tradeoff and reduces aggregate regret and worst-client harm in both synthetic experiments and a PJM energy-pricing case study.

2604.19219 2026-05-19 cs.CR cs.AI cs.DC cs.LG Updated version

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Daniel M. Jimenez-Gutierrez, Dario Pighin, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

AI summary This paper presents the Sherpa.ai multi-party PSU protocol for privacy-preserving entity alignment in vertical federated learning. It supports both exact and noisy matching while hiding intersection membership, with applications such as multi-institutional healthcare disease detection.

Abstract

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.

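2604.11922 2026-05-19 math.PR cs.LG math.CO Updated version

Spectral Structure in Finite Free Information Inequalities and $p$-Stam Phase Transitions

Baran Hashemi

Affiliations * Max Planck Institute for Mathematics in the Sciences

AI summary Using the FlowBoost framework, this paper studies $\ell^p$-generalizations of the finite free Stam inequality, reveals the spectral structure at the extremal point, states a conjecture on the singular values, and examines how the $p$-Stam inequality behaves for different values of $p$.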

Abstract

Using FlowBoost, a closed-loop deep generative optimization framework for extremal structure discovery, we investigate $\ell^p$-generalizations of the finite free Stam inequality for real-rooted polynomials under finite free additive convolution $\boxplus_n$. At $p=2$, FlowBoost finds the Hermite pair as the unique equality case and reveals the spectral structure of the linearized convolution map at this extremal point. As a result, we conjecture that the singular values of the doubly stochastic coupling matrix $E_n$ on the mean-zero subspace are $\{2^{-k/2}:k=1,\ldots,n-1\}$, independent of $n$. Conditional on this conjecture, we obtain a sharp local stability constant and the finite free CLT convergence rate, both uniform in $n$. We introduce a one-parameter family of $p$-Stam inequalities using $\ell^p$-Fisher information and prove that the Hermite pair itself violates the inequality for every $p>2$, with the sign of the deficit governed by the $\ell^p$-contraction ratio of $E_n$. Systematic computation via FlowBoost supports the conjecture that $p^*\!=2$ is the sharp critical exponent. For $p<2$, the extremal configurations undergo a bifurcation, meaning that they become non-matching pairs with bimodal root structure, converging back to the Hermite diagonal only as $p\to 2^-$. Our findings demonstrate that FlowBoost can be an effective tool of mathematical discovery in infinite-dimensional extremal problems.

2604.04202 2026-05-19 cs.LG cs.AI cs.CL Updated version

ClawArena: Benchmarking AI Agents in Evolving Information Environments

Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Affiliations * UNC-Chapel Hill; University of California, Santa Cruz; University of California, Berkeley

AI summary ClawArena evaluates AI agents as their information environment changes. It is organized around three challenges (multi-source conflict reasoning, dynamic belief revision, and implicit personalization) and tests agents across multi-channel sessions, workspace files, and staged updates.

Abstract

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. ClawArena comprises 12 multi-turn scenarios spanning 337 evaluation rounds with 45 dynamic updates, evaluated across five agent frameworks and 18 language models from proprietary, community-accessible, and self-hosted sources. Experiments show that model capability accounts for a 29-point score range across models while framework design accounts for up to a 24-point range, that MetaClaw's skill overlay reliably improves score without degrading accuracy, and that belief revision difficulty is determined by update design strategy rather than update volume. Code is available at https://github.com/aiming-lab/ClawArena.

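2604.02511 2026-05-19 cs.LG q-bio.GN q-bio.MN Updated version

Re-analysis of the Human Transcription Factor Atlas Recovers TF-Specific Signatures from Pooled Single-Cell Screens with Missing Controls

Arka Jain, Umesh Sharma

Affiliations * Mayo Clinic

AI summary By re-analyzing the human TF Atlas dataset with an external baseline and artifact removal, this paper recovers TF-specific signatures for 59 of 61 testable transcription factors, showing that the deposited data remain usable when paired with external controls and reproducible computation.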

Abstract

Public pooled single-cell perturbation atlases are valuable resources for studying transcription factor (TF) function, but downstream re-analysis can be limited by incomplete deposited metadata and missing internal controls. Here we re-analyze the human TF Atlas dataset (GSE216481), a MORF-based pooled overexpression screen spanning 3,550 TF open reading frames and 254,519 cells, with a reproducible pipeline for quality control, MORF barcode demultiplexing, per-TF differential expression, and functional enrichment. From 77,018 cells in the pooled screen, we assign 60,997 (79.2%) to 87 TF identities. Because the deposited barcode mapping lacks the GFP and mCherry negative controls present in the original library, we use embryoid body (EB) cells as an external baseline and remove shared batch/transduction artifacts by background subtraction. This strategy recovers TF-specific signatures for 59 of 61 testable TFs, compared with 27 detected by one-vs-rest alone, showing that robust TF-level signal can be rescued despite missing intra-pool controls. HOPX, MAZ, PAX6, FOS, and FEZF2 emerge as the strongest transcriptional remodelers, while per-TF enrichment links FEZF2 to regulation of differentiation, EGR1 to Hippo and cardiac programs, FOS to focal adhesion, and NFIC to collagen biosynthesis. Condition-level analyses reveal convergent Wnt, neurogenic, EMT, and Hippo signatures, and Harmony indicates minimal confounding batch effects across pooled replicates. Our per-TF effect sizes significantly agree with Joung et al.'s published rankings (Spearman $\rho = -0.316$, $p = 0.013$; negative because lower rank indicates stronger effect). Together, these results show that the deposited TF Atlas data can support validated TF-specific transcriptional and pathway analyses when paired with principled external controls, artifact removal, and reproducible computation.

2603.25860 2026-05-19 stat.ML cs.LG 版本更新

On the Expressive Power of Contextual Relations in Transformers

Transformer中上下文关系的表达能力

Demián Fraiman

发表机构 * Instituto de Cálculo(计算研究所) Universidad de Buenos Aires(布宜诺斯艾利斯大学)

AI总结 本文提出一种测度理论框架,将上下文关系建模为概率对象,揭示了softmax注意力与熵正则化最优传输的联系,并证明Transformer能近似任意上下文关系规则。

详情
AI中文摘要

Transformer架构在建模上下文关系方面取得了显著的实证成功,但对其表达能力的理解仍不清晰。本文引入一种测度理论框架,将上下文关系建模为概率对象,无论是条件分布还是联合分布(耦合)。这一视角揭示了标准softmax注意力与熵正则化最优传输之间的自然联系,为注意力提供了一种统一的视图,即作为底层亲和函数的归一化。在此框架内,我们利用标准softmax注意力和交替Sinkhorn归一化建立了上下文系统的通用近似定理。这些结果表明,Transformer架构能够近似任意上下文关系规则,且归一化的选择决定了这些关系的表示方式。此外,它们还提供了Transformers在建模上下文关系上有效的原因的原理性解释。

英文摘要

Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relations rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.
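
The contrast between the two normalizations named in the abstract can be illustrated with a minimal sketch. It assumes a small, made-up score matrix and uniform target marginals; the function names `softmax_rows` and `sinkhorn` are illustrative and not taken from the paper. Row-wise softmax turns each row of the exponentiated affinities into a conditional distribution, while alternating row and column normalization (Sinkhorn iterations) drives the same kernel toward a matrix whose rows and columns both sum to one, which is the form a coupling takes in entropy-regularized optimal transport.

```python
import math


def softmax_rows(scores):
    """Row-normalize exp(scores): each row becomes a conditional distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sinkhorn(scores, n_iters=200):
    """Alternately normalize rows and columns of exp(scores).

    For a square, strictly positive kernel this converges to a doubly
    stochastic matrix (uniform marginals are an assumption of this sketch).
    """
    m = max(max(row) for row in scores)
    k = [[math.exp(s - m) for s in row] for row in scores]
    for _ in range(n_iters):
        k = [[v / sum(row) for v in row] for row in k]  # rows sum to 1
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]  # columns sum to 1
    return k


# Hypothetical affinity scores between three queries and three keys.
scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.0]]
attn = softmax_rows(scores)  # conditional view: rows sum to 1
plan = sinkhorn(scores)      # joint view: rows and columns sum to 1
```

Both functions start from the same affinity kernel; only the normalization differs, which mirrors the abstract's point that the choice of normalization determines how the relation is represented.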

2603.23566 2026-05-19 cs.LG cs.AI 版本更新

AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

AscendOptimizer: 一种用于Ascend NPU运算优化的经验型智能体

Jiehao Wu, Zixiao Huang, Wenhao Li, Chuyun Shen, Junjie Sheng, Xiangfeng Wang

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shanghai University of International Business and Economics(上海对外经贸大学) Key Lab of Mathematics and Engineering Applications (MoE), East China Normal University(华东师范大学数学与工程应用教育部重点实验室) School of Mathematical Sciences, East China Normal University(华东师范大学数学科学学院) Shenzhen Loop Area Institute (SLAI)(深圳河套学院)

AI总结 本文提出AscendOptimizer,通过自身执行构建缺失的优化知识,结合主机侧和内核侧优化,实现AscendC运算的加速,达到1.21倍的几何平均速度提升。

详情
AI中文摘要

本文提出AscendOptimizer,一种用于Ascend NPU运算优化的经验型智能体。通过自身执行构建缺失的优化知识,结合主机侧和内核侧优化,实现AscendC运算的加速,达到1.21倍的几何平均速度提升。

英文摘要

Optimizing AscendC (Ascend C) operators for Ascend NPUs is difficult for two reasons. First, unlike CUDA, the ecosystem offers few public kernels to learn from. Second, performance depends on a coupled two-part implementation: a host-side tiling program that controls data movement and a kernel program that schedules and pipelines computation. We present AscendOptimizer, an episodic agent that builds missing optimization knowledge from execution itself. For kernel optimization, AscendOptimizer rewinds strong implementations by removing optimizations in a controlled way, then keeps the changes whose removal measurably hurts performance as reusable experience for later rewriting. For host-side optimization, it runs profiling-in-the-loop evolutionary search to find valid, fast tiling and data-movement configurations directly from hardware feedback. This combination lets the agent improve kernel structure and host-side scheduling together. On a benchmark of 101 real AscendC operators, AscendOptimizer achieves a 1.21x geometric-mean speedup over the open-source baseline, and 53.47% of operators run faster than their references. Given the same budget of evaluations per operator, AscendOptimizer consistently outperforms Best-of-N sampling and OpenEvolve in terms of geometric mean speedup, fast_p tail speedup ratios, and overall optimization progress across varying budgets.
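
The two headline metrics in this abstract, a geometric-mean speedup and the share of operators that beat their reference, can be computed as in the following sketch. The latencies are made-up numbers, and the helper names `geometric_mean_speedup` and `fraction_faster` are hypothetical; this is not the paper's evaluation code. As a consistency check on the reported figure, 54 of 101 operators corresponds to 53.47%.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-operator speedups (baseline time / optimized time)."""
    ratios = [b / o for b, o in zip(baseline_times, optimized_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def fraction_faster(baseline_times, optimized_times):
    """Share of operators whose optimized version is strictly faster."""
    faster = sum(1 for b, o in zip(baseline_times, optimized_times) if o < b)
    return faster / len(baseline_times)


# Made-up latencies in milliseconds for four operators.
baseline = [10.0, 8.0, 4.0, 2.0]
optimized = [5.0, 8.0, 5.0, 1.0]

gm = geometric_mean_speedup(baseline, optimized)  # fourth root of 2 * 1 * 0.8 * 2
frac = fraction_faster(baseline, optimized)       # 2 of 4 operators are faster
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out, which an arithmetic mean would not reflect.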

2603.20873 2026-05-19 cs.LG math.OC 版本更新

Incentive-Aware Federated Averaging with Performance Guarantees under Strategic Participation

策略性参与下具有性能保证的激励感知联邦平均

Fateme Maleki, Krishnan Raghavan, Farzad Yousefian

发表机构 * Department of Industrial and Systems Engineering, Rutgers University(工业与系统工程系,罗格斯大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 本文提出一种激励感知联邦平均方法,通过客户端传输模型参数和数据集大小,利用纳什均衡规则动态调整数据集规模,确保在凸和非凸目标下实现性能保证,并在单调博弈下实现福利损失最小化。

详情
AI中文摘要

联邦学习(FL)是一种通信高效的协作学习框架,使多个代理能够在私有本地数据集上进行模型训练。尽管FL在提高全局模型性能方面的益处已被广泛证实,但个体代理可能会战略性地平衡学习收益与贡献本地数据的成本。受需要成功保留参与代理的FL框架的启发,我们提出了一种激励感知的联邦平均方法,在每次通信轮次中,客户端向服务器传输本地模型参数和更新的训练数据集大小。数据集大小通过寻求纳什均衡(NE)的更新规则动态调整,以捕捉战略数据参与。我们分析了所提出的方法在凸和非凸全局目标设置下的性能保证,并建立了由此产生的激励感知FL算法的性能保证。此外,在仅仅单调博弈设置下,我们考虑了福利损失最小化框架,并建立了该方案的渐近收敛性。在MNIST和CIFAR-10数据集上的数值实验表明,代理在实现稳定的数据参与策略的同时,能够获得具有竞争力的全局模型性能。

英文摘要

Federated learning (FL) is a communication-efficient collaborative learning framework that enables model training across multiple agents with private local datasets. While the benefits of FL in improving global model performance are well established, individual agents may behave strategically, balancing the learning payoff against the cost of contributing their local data. Motivated by the need for FL frameworks that successfully retain participating agents, we propose an incentive-aware federated averaging method in which, at each communication round, clients transmit both their local model parameters and their updated training dataset sizes to the server. The dataset sizes are dynamically adjusted via a Nash equilibrium (NE)-seeking update rule that captures strategic data participation. We analyze the proposed method under convex and nonconvex global objective settings and establish performance guarantees for the resulting incentive-aware FL algorithm. Furthermore, under a merely monotone game setting, we consider a welfare loss minimization framework and establish asymptotic convergence of the scheme. Numerical experiments on the MNIST and CIFAR-10 datasets demonstrate that agents achieve competitive global model performance while converging to stable data participation strategies.
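
The server-side aggregation step implied by the abstract, where clients report both parameters and their current dataset sizes, might look like the sketch below. It assumes standard dataset-size-weighted federated averaging; the function name `fedavg` and the toy values are illustrative. The Nash-equilibrium-seeking rule that updates the dataset sizes is not specified in the abstract and is deliberately left out.

```python
def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by reported dataset sizes."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * p[j] for p, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]


# Toy round: three clients, two parameters each, sizes reported this round.
params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sizes = [100, 300, 600]
global_params = fedavg(params, sizes)
```

Because the weights come from the sizes transmitted in the same round, a client that strategically shrinks its contribution also shrinks its influence on the global model.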

2603.12676 2026-05-19 cs.LG 版本更新

Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs

解耦潜在动态流形融合用于求解参数化偏微分方程

Zhangyong Liang

发表机构 * National Center for Applied Mathematics, Tianjin University(应用数学国家中心,天津大学)

AI总结 本文提出DLDMF框架,通过解耦空间、时间和参数,利用连续时间潜在方法和动态流形融合机制,提升参数泛化和时间外推的稳定性与准确性。

详情
AI中文摘要

通用神经代理模型在不同PDE参数下泛化困难,因为PDE系数变化使学习更困难且优化不稳定。当模型必须预测超出训练时间范围时,问题更加严重。现有方法通常无法同时处理参数泛化和时间外推。标准参数化模型将时间视为另一个输入,因此无法捕捉内在动态,而近期连续时间潜在方法通常依赖昂贵的测试时间自解码,效率低且可能破坏参数化解空间的连续性。为此,我们提出解耦潜在动态流形融合(DLDMF),一种物理指导框架,明确分离空间、时间和参数。代替不稳定自解码,DLDMF通过前馈网络将PDE参数直接映射到连续潜在嵌入。该嵌入初始化并条件化一个潜在状态,其演变由参数条件的神经ODE控制。我们进一步引入动态流形融合机制,使用共享解码器结合空间坐标、参数嵌入和时间演化的潜在状态以重建相应的时空解。通过将预测建模为潜在动态演变而非静态坐标拟合,DLDMF减少参数变化与时间演变之间的干扰,同时保持平滑且一致的解流形。因此,它在未见参数设置和长期时间外推中表现良好。在多个基准问题上的实验表明,DLDMF在准确性、参数泛化和外推鲁棒性方面均优于最先进基线。

英文摘要

Generalizing neural surrogate models across different PDE parameters remains difficult because changes in PDE coefficients often make learning harder and optimization less stable. The problem becomes even more severe when the model must also predict beyond the training time range. Existing methods usually cannot handle parameter generalization and temporal extrapolation at the same time. Standard parameterized models treat time as just another input and therefore fail to capture intrinsic dynamics, while recent continuous-time latent methods often rely on expensive test-time auto-decoding for each instance, which is inefficient and can disrupt continuity across the parameterized solution space. To address this, we propose Disentangled Latent Dynamics Manifold Fusion (DLDMF), a physics-informed framework that explicitly separates space, time, and parameters. Instead of unstable auto-decoding, DLDMF maps PDE parameters directly to a continuous latent embedding through a feed-forward network. This embedding initializes and conditions a latent state whose evolution is governed by a parameter-conditioned Neural ODE. We further introduce a dynamic manifold fusion mechanism that uses a shared decoder to combine spatial coordinates, parameter embeddings, and time-evolving latent states to reconstruct the corresponding spatiotemporal solution. By modeling prediction as latent dynamic evolution rather than static coordinate fitting, DLDMF reduces interference between parameter variation and temporal evolution while preserving a smooth and coherent solution manifold. As a result, it performs well on unseen parameter settings and in long-term temporal extrapolation. Experiments on several benchmark problems show that DLDMF consistently outperforms state-of-the-art baselines in accuracy, parameter generalization, and extrapolation robustness.

2603.08145 2026-05-19 cs.LG cs.AI 版本更新

DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

DARC:通过风险约束解码实现的分歧感知对齐

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

发表机构 * Fudan University, Shanghai, China(复旦大学,上海,中国) Independent Researcher(独立研究者) Meta AI Incubation Institute, Fudan University, Shanghai, China(创新与孵化院,复旦大学,上海,中国)

AI总结 DARC通过风险约束解码方法,在不重新训练的情况下,通过最大化KL-鲁棒满意度目标来缓解分歧和尾部风险,保持高质量输出。

详情
AI中文摘要

基于偏好的对齐方法(如RLHF、DPO)通常优化单一标量目标,隐式地对异质的人类偏好取平均。在实践中,标注者和用户群体之间的系统性分歧使得最大化平均奖励变得脆弱,且容易出现代理指标过度优化。我们提出了**通过风险约束解码实现的分歧感知对齐(DARC)**,一种无需重新训练的推理时方法,将响应选择表述为分布鲁棒、风险敏感的决策问题。给定多个偏好样本或可扩展的分歧代理指标,DARC通过最大化*KL鲁棒(熵)*满意度目标对候选响应重新排序,并提供简单的部署控制手段,对相应的熵风险溢价(相对于均值)加以上限约束或惩罚,从而无需重新训练即可设定显式的风险预算。我们给出了理论刻画,将该解码规则与有原则的悲观主义以及基于KL的分布鲁棒优化联系起来。在对齐基准上的实验表明,在有噪声的异质反馈下,DARC在降低分歧和尾部风险的同时保持了有竞争力的平均质量。

英文摘要

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
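
The entropic reranking idea can be sketched as follows. This assumes the standard entropic risk functional $-\frac{1}{\beta}\log \mathbb{E}[e^{-\beta r}]$, which is the KL-penalized worst case of the mean; the candidate names, scores, `beta`, and the `max_premium` cap are all illustrative, and the paper's exact objective and controls may differ.

```python
import math


def entropic_value(scores, beta):
    """-(1/beta) * log(mean(exp(-beta * s))): a pessimistic, soft-min style average."""
    m = min(scores)
    # Shift by the minimum so the exponentials stay in (0, 1].
    mean_exp = sum(math.exp(-beta * (s - m)) for s in scores) / len(scores)
    return m - math.log(mean_exp) / beta


def rerank(candidates, beta=2.0, max_premium=None):
    """Rank candidates by entropic value, optionally capping the risk premium."""
    ranked = []
    for name, scores in candidates.items():
        mean = sum(scores) / len(scores)
        value = entropic_value(scores, beta)
        premium = mean - value  # gap between mean and pessimistic value
        if max_premium is None or premium <= max_premium:
            ranked.append((value, name))
    return [name for value, name in sorted(ranked, reverse=True)]


# Hypothetical satisfaction scores from four annotators per candidate response.
candidates = {
    "safe": [0.70, 0.70, 0.70, 0.70],
    "divisive": [1.00, 1.00, 1.00, 0.00],
}
order = rerank(candidates)
```

In this toy case the divisive response has the higher mean (0.75 versus 0.70) but ranks second, because the one strongly dissatisfied annotator dominates the entropic value.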

2603.04737 2026-05-19 cs.AI cs.CL cs.LG 版本更新

Interactive Benchmarks

交互式基准测试

Baoqing Yue, Zihan Zhu, Yutong Han, Brian Fan, Qian Sun, Jichen Feng, Hufei Yang, Yifan Zhang, Mengdi Wang

发表机构 * InteractiveBench Princeton University(普林斯顿大学)

AI总结 本文提出交互式基准测试,通过预算化的多轮交互评估模型推理能力,改进传统基准和偏好评估的局限性,揭示模型在交互场景中的改进空间。

Comments Project Page: https://github.com/interactivebench/interactivebench

详情
AI中文摘要

现有的推理评估范式存在不同局限:固定基准日益饱和且易受污染,而基于偏好的评估依赖主观判断。我们主张智能的核心在于决定获取哪些信息以及如何有效使用它们。我们提出了交互式基准,一种统一的评估范式,通过预算化的多轮交互评估模型的推理能力。我们在两种设置中评估模型:交互证明,其中模型与裁判互动解决逻辑、UI2Html和数学任务,在客观反馈下;以及交互游戏,其中模型战略推理以最大化长期效用。我们的结果表明,交互式基准提供了更稳健的评估,揭示了模型在交互场景中的显著改进空间。

英文摘要

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

2603.02667 2026-05-19 cs.CV cs.LG 版本更新

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

统一对比学习与生成目标以实现视觉理解和文本到图像生成

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

发表机构 * MIT Computer Science & Artificial Intelligence Laboratory Meta AI

AI总结 本文提出DREAM框架,通过Masking Warmup解决对比学习与文本到图像生成的矛盾,提升模型在多个任务上的性能。

详情
AI中文摘要

将文本-图像对比学习与文本到图像(T2I)生成统一到单个端到端模型中颇具挑战,因为两个目标需要相反的掩码机制:对比对齐需要几乎完全可见的token,而掩码生成建模需要高强度的破坏。我们提出DREAM,一个统一框架,通过Masking Warmup化解这一冲突:该调度在训练过程中逐步移动掩码分布的中心,使低掩码率和高掩码率在每一步都同时存在。这种共同暴露使单个联合训练的编码器能够同时服务于两个目标。由此获得的稳定优化使推理时的语义对齐解码(Semantically Aligned Decoding)成为可能:文本编码器在所有掩码率下都针对视觉嵌入训练过,因此可以对部分生成的图像打分并选出最佳轨迹,最少只需解码图像的12.5%,从而同时改善FID和吞吐量。DREAM优于其单目标基线CLIP和FLUID:相对于CLIP,在ImageNet线性探测(+1.1%)、5-shot迁移(+4.1%)、ADE20K分割(+1.9%)和NYU深度估计(+6.25%)上均有提升;相对于FLUID,在CC12M FID上提升6.2%,同时保持CLIP Score。这些增益表明,文本-图像对比目标与生成目标在恰当统一后是协同的,而非相互竞争。

英文摘要

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.

2603.02531 2026-05-19 cs.LG cs.AI 版本更新

Geometry-Aware Attention Guidance for Diffusion Models via Modern Hopfield Dynamics

基于现代Hopfield动力学的扩散模型几何感知注意力引导

Kwanyoung Kim

发表机构 * Department of AI Convergence(人工智能融合学院)

AI总结 本文提出几何感知注意力引导方法,通过现代Hopfield动力学分析注意力外推,证明了稀疏-密集差异的两个方向性性质,并据此给出一种无需训练、即插即用的外推规则,提升扩散模型的生成质量。

详情
AI中文摘要

无分类器引导(CFG)能提高扩散模型的样本质量,但其两次前向推理以及对空条件训练的依赖限制了它在少步采样场景中的应用。注意力空间引导作为一种互补范式弥补了这一缺口,但此前的稀疏-密集注意力引导为何有效仍不清楚。我们通过现代Hopfield动力学分析注意力外推,证明了共享条件下稀疏-密集差异的两个方向性性质,二者共同保证该差异是方向一致的加速信号。在此基础上,我们提出几何感知注意力引导(GAG),一种无需训练、即插即用的外推规则:它将该差异分解为相对于检索方向的平行分量和正交分量,放大与收敛方向一致的分量,同时抑制偏离流形的噪声;其稳定性来自弱收缩性质。我们进一步将这种外推解释为注意力空间中的一阶Anderson加速,为各类注意力外推方法提供了统一视角。GAG是一种通用方法,可泛化到不同架构(UNet、MMDiT)和采样场景(多步、少步),在包括FLUX.1、最新的FLUX.2和Qwen-Image在内的多种骨干网络上一致地提升生成质量,且计算开销极小。

英文摘要

Classifier-Free Guidance (CFG) improves sample quality in diffusion models, but its dual-pass inference and reliance on null-condition training limit its use in few-step regimes. Attention-space guidance has emerged as a complementary paradigm that addresses this gap, yet why prior sparse-vs-dense attention guidance works remains elusive. We address this by analyzing attention extrapolation through Modern Hopfield dynamics, proving two directional properties of the sparse-dense discrepancy under shared conditioning that together certify it as a directionally consistent acceleration signal. Building on this, we propose Geometry-Aware Attention Guidance (GAG), a training-free, plug-and-play extrapolation rule that decomposes the discrepancy into parallel and orthogonal components relative to the retrieval direction, amplifying the convergence-aligned component while suppressing off-manifold noise; stability follows from a weak contraction property. We further provide an interpretation of this extrapolation as first-order Anderson Acceleration in attention space, offering a unified perspective on attention extrapolation methods. GAG is a universal method that generalizes across architectures (UNet, MMDiT) and sampling regimes (multi-step, few-step), consistently improving generation quality on diverse backbones, including FLUX.1, the recent FLUX.2, and Qwen-Image, with minimal computational overhead.
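
The geometric step described in the abstract, splitting a discrepancy vector into components parallel and orthogonal to a reference direction and then reweighting them, can be sketched generically. Everything specific here is an assumption: the abstract does not define how the retrieval direction is obtained, which output serves as the base, or what the weights are, so `direction`, `w_par`, and `w_orth` are hypothetical placeholders rather than the GAG rule itself.

```python
def decompose(d, u):
    """Split vector d into components parallel and orthogonal to direction u."""
    norm_sq = sum(x * x for x in u)
    coef = sum(a * b for a, b in zip(d, u)) / norm_sq
    parallel = [coef * x for x in u]
    orthogonal = [a - p for a, p in zip(d, parallel)]
    return parallel, orthogonal


def guided_output(dense, sparse, direction, w_par=2.0, w_orth=0.5):
    """Extrapolate from the dense output along a reweighted sparse-dense discrepancy.

    Using the dense output as the base and these weight values are
    assumptions of this sketch.
    """
    d = [s - t for s, t in zip(sparse, dense)]
    par, orth = decompose(d, direction)
    return [t + w_par * p + w_orth * o for t, p, o in zip(dense, par, orth)]


# Toy 2-D example with a hand-picked reference direction.
out = guided_output(dense=[1.0, 0.0], sparse=[2.0, 1.0], direction=[1.0, 0.0])
```

Setting `w_par` above 1 amplifies the part of the discrepancy aligned with the reference direction, while `w_orth` below 1 damps the remainder, matching the amplify-and-suppress description in the abstract at a purely schematic level.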

2603.01388 2026-05-19 cs.LG stat.ML 版本更新

Invariant-Stratified Propagation for Expressive Graph Neural Networks

不变量分层传播用于表达性图神经网络

Asela Hevapathige, Ahad N. Zehmakan, Asiri Wijesinghe, Saman Halgamuge

发表机构 * Department of Mechanical Engineering University of Melbourne(墨尔本大学机械工程系) School of Computing Australian National University(澳大利亚国立大学计算机学院)

AI总结 本文提出不变量分层传播框架,通过改进的WL变体和高效神经网络实现,提升图神经网络的表达能力,解决结构异质性捕捉问题。

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

图神经网络(GNNs)在表达性和捕捉结构异质性方面存在根本限制。标准消息传递架构受限于1维Weisfeiler-Leman(1-WL)测试,无法区分超过度序列的图,并且从邻居均匀聚合信息,无法捕捉节点在更高阶模式中的不同结构性位置。尽管存在实现更高表达性的方法,但它们带来了不可接受的计算成本,并缺乏统一的框架来灵活编码多样的结构属性。为了解决这些限制,我们引入不变量分层传播(ISP),该框架包括一种新的WL变体(ISP-WL)及其高效的神经网络实现(ISPGNN)。ISP根据图不变量分层节点,处理它们在层次结构中揭示的结构差异,这些差异对1-WL不可见。通过层次结构异质性编码,ISP量化节点在更高阶模式中的结构性位置差异,区分参与者占据不同角色的相互作用与参与者参与均匀的相互作用。我们提供了正式的理论分析,证明了超越1-WL的增强表达性,收敛保证以及固有的抗过平滑性。在图分类、节点分类和影响估计的广泛实验中,ISP在标准架构和最先进的表达性基线中表现出一致的改进。

英文摘要

Graph Neural Networks (GNNs) face fundamental limitations in expressivity and capturing structural heterogeneity. Standard message-passing architectures are constrained by the 1-dimensional Weisfeiler-Leman (1-WL) test, unable to distinguish graphs beyond degree sequences, and aggregate information uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. While methods exist to achieve higher expressivity, they incur prohibitive computational costs and lack unified frameworks for flexibly encoding diverse structural properties. To address these limitations, we introduce Invariant-Stratified Propagation (ISP), a framework comprising both a novel WL variant (ISP-WL) and its efficient neural network implementation (ISPGNN). ISP stratifies nodes according to graph invariants, processing them in hierarchical strata that reveal structural distinctions invisible to 1-WL. Through hierarchical structural heterogeneity encoding, ISP quantifies differences in nodes' structural positions within higher-order patterns, distinguishing interactions where participants occupy different roles from those with uniform participation. We provide formal theoretical analysis establishing enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.

2603.01092 2026-05-19 cs.AI cs.LG 版本更新

The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

科学的异类空间:采样连贯但认知不可用的研究方向

Alejandro H. Artiles, Martin Weiss, Levin Brinkmann, Iyad Rahwan, Bernhard Schölkopf, Christopher Pal, Hugo Larochelle, Anirudh Goyal, Nasim Rahaman

发表机构 * Max Planck Institute for Human Development(马克斯·普朗克人类发展研究所) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Polytechnique Montreal(蒙特利尔理工学院) CIFAR AI Chair(CIFAR人工智能主席) Mila – Quebec AI Institute(魁北克人工智能研究所) Tiptree Systems(Tiptree系统)

AI总结 本文提出一种框架,通过分解论文为概念单元并学习两个互补模型,采样出连贯但认知不可用的研究方向,扩展了LLM生成的潜在词汇库。

Comments 10 main pages, 42 appendix pages, 29 figures

详情
AI中文摘要

科学发现不仅受真理限制,还受研究人员当前探索领域认知可用性限制。许多方向在文献中是连贯的,但因没有现有社区占据正确的概念、方法和直觉组合而不被提出。现代语言模型继承这种偏见,当被提示生成新想法时会重新组合文献的高密度区域。我们引入了一个框架,旨在针对互补区域,称为科学的异类空间,其中方向在现有知识结构下是可能的,但在现有研究人员分布下不太可能。我们的方法首先将论文分解为细粒度的概念单元,并将它们聚类为共享的词汇概念原子。然后在该词汇上学习两个互补模型。一个连贯性模型评分原子组合是否形成可行的研究方向,另一个可用性模型评分是否任何现有作者社区能够产生给定组合。采样异类方向则减少为排名原子组合,以最大化连贯性同时最小化可用性。在包含16,068篇经同行评审的LLM论文的语料库上,所得到的采样器在不牺牲连贯性的前提下,探索出比前沿LLM生成基线大3.5至7倍的有效原子词汇库,并在盲LLM、人类和下游实验评估中产生匹配或超过基线的想法。通过将科学合理性与社区可用性分开,我们的框架指向AI生成想法,补充而非仅仅加速人类科学,扩展探索到当前社区可能忽视的连贯方向。

英文摘要

Scientific discovery is constrained not only by what is true, but by what is cognitively available to the researchers currently exploring a field. Many directions are coherent in light of the literature yet unlikely to be proposed because no existing community occupies the right combination of concepts, methods, and intuitions. Modern language models inherit this bias, recombining high-density regions of the literature when prompted for novel ideas. We introduce a framework that targets the complementary region, which we call the alien space of science, where directions are plausible under the structure of existing knowledge but unlikely under the distribution of existing researchers. Our method first decomposes papers into granular conceptual units and clusters them into a shared vocabulary of idea atoms. It then learns two complementary models over this vocabulary. A coherence model scores whether a combination of atoms forms a viable research direction, and an availability model scores whether any existing author community is positioned to produce a given combination. Sampling alien directions then reduces to ranking atom combinations that maximize coherence while minimizing availability. On a corpus of 16,068 peer-reviewed LLM papers from NeurIPS, ICLR, ICML, and major NLP venues, the resulting sampler explores a 3.5-7x broader effective atom vocabulary than frontier LLM ideation baselines without sacrificing coherence, and produces ideas that match or exceed those baselines under blind LLM, human, and downstream experimental evaluation. By separating scientific plausibility from community availability, our framework points toward AI ideation that complements rather than merely accelerates human science, expanding exploration into coherent directions that the current community may overlook.

2603.00975 2026-05-19 cs.LG cs.AI 版本更新

Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

遗忘是竞争:重新思考扩散模型中的去学习作为表征干扰

Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal

发表机构 * TCS Research(塔塔咨询服务公司研究院) Kalinga Institute of Industrial Technology(卡林加工业技术学院)

AI总结 本文提出SurgUn方法,通过可控竞争而非直接删除或一对一重分配来实现扩散模型的去学习,有效平衡遗忘与保留,提升模型在版权、安全等场景下的表现。

详情
AI中文摘要

部署的文本到图像扩散模型日益需要事后概念去学习以应对版权主张、艺术家退出、安全更新和受保护内容缓解,而无需完全重新训练。核心挑战是擦除-保留失衡,激进更新抑制目标但损害共享能力,而保守或基于锚点的更新保留质量但使概念可通过相关、组合、改写或对抗性提示恢复。受反向干扰启发,我们提出SurgUn,将遗忘视为受控竞争而非直接删除或一对一重分配。SurgUn通过干扰条件梯度竞争实现反向概念干扰:目标梯度上升削弱目标条件的去噪或流匹配行为,而下降于语义多样的干扰集引入竞争非目标轨迹。这将输出分布在多个非目标模式而非坍缩到单一代理。为通过共享路径限制意外遗忘,SurgUn添加像素基础的权重空间局部化,轻量级诊断通过生成图像擦除-保留行为选择注意力块,利用抑制广泛可行而保留块选择性的不对称性。在UnlearnCanvas、IP-character擦除、Holistic Unlearning、EraseBench和Ring-A-Bell上,SurgUn在Stable Diffusion v1.5、SDXL和SANA-1.5中实现了比基线更强的擦除-保留平衡。消融实验显示,多样干扰、对比竞争和局部化对于稳健抑制同时保留相关和不相关概念都是必要的。

英文摘要

Deployed text-to-image diffusion models increasingly require post-hoc concept unlearning for copyright claims, artist opt-outs, safety updates, and protected-content mitigation without full retraining. A central challenge is erase-retain imbalance: aggressive updates suppress targets but damage shared capabilities, while conservative or anchor-based updates preserve quality yet leave concepts recoverable through related, compositional, paraphrased, or adversarial prompts. Inspired by retroactive interference, we propose SurgUn, which treats forgetting as controlled competition rather than direct deletion or one-to-one reassignment. SurgUn instantiates retroactive concept interference via distractor-conditioned gradient competition: target-gradient ascent weakens target-conditioned denoising or flow-matching behavior, while descent over a semantically diverse distractor set introduces competing non-target trajectories under the same prompt context. This redistributes outputs across multiple non-target modes instead of collapsing to a single proxy. To limit collateral forgetting through shared pathways, SurgUn adds pixel-grounded weight-space localization, a lightweight diagnostic that selects attention blocks by generated-image erase-retain behavior, exploiting the asymmetry that suppression is broadly achievable whereas retention is block-selective. Across UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell on Stable Diffusion v1.5, SDXL, and SANA-1.5, SurgUn achieves a stronger erase-retain balance than baselines. Ablations show that diverse distractors, contrastive competition, and localization are all necessary for robust suppression while preserving related and unrelated concepts.

2602.22801 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学)

AI总结 本文通过大规模实车数据和道路测试,系统研究了扩散模型在端到端自动驾驶中的规划能力,提出Hyper Diffusion Planner框架,实现10倍性能提升。

详情
AI中文摘要

扩散模型已成为机器人决策任务中的流行选择,近年来也开始被考虑用于解决自动驾驶任务。然而,其在自动驾驶中的应用和评估仍局限于模拟或实验室环境。本研究通过大规模实车数据和道路测试,系统研究了扩散模型作为端到端自动驾驶规划器的潜力。通过全面而受控的研究,我们识别了扩散损失空间、轨迹表示和数据缩放等关键洞察,显著影响端到端规划性能。此外,我们还提供了一种有效的强化学习后训练策略,进一步提升学习规划器的安全性和鲁棒性。所提出的扩散学习框架Hyper Diffusion Planner (HDP)在真实车辆平台上部署,并在6个城市驾驶场景和200公里的真实世界测试中,实现了相对于基模型的10倍性能提升。本文证明了当正确设计和训练时,扩散模型可以作为有效且可扩展的端到端自动驾驶规划器,用于复杂的真实世界自动驾驶任务。

英文摘要

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety and robustness of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.

2602.21185 2026-05-19 cs.LG 版本更新

The Diffusion Duality, Chapter II: $Ψ$-Samplers

扩散对偶性,第二章:Ψ-采样器

Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo

发表机构 * EPFL, Lausanne, Switzerland(洛桑联邦理工学院,瑞士洛桑) Microsoft AI(微软人工智能) Cornell Tech, NY(康奈尔科技)

AI总结 本文提出了一种通用的预测-校正采样器,用于离散扩散模型,提升了语言和图像建模的生成质量,并展示了其在训练效率上的优势。

详情
AI中文摘要

均匀状态离散扩散模型凭借其自我校正能力,在少步生成和引导方面表现出色,因此在这些场景中比自回归或掩码扩散模型更受青睐。然而,使用祖先采样器时,其采样质量会随着步数增加而趋于停滞。我们提出了一类用于离散扩散的预测-校正(PC)采样器,它们推广了已有方法,并适用于任意噪声过程。与均匀状态扩散结合时,我们的采样器在语言和图像建模上均优于祖先采样:在OpenWebText上以相同的一元(unigram)熵取得更低的生成困惑度,在CIFAR10上取得更好的FID/IS分数。关键的是,与传统采样器不同,我们的PC方法随着采样步数增加而持续改进。综合来看,这些发现对“掩码扩散是扩散语言建模必然的未来”这一假设提出了质疑。除采样外,我们还为高斯松弛训练阶段设计了一种内存高效的课程学习方案,与Duo相比训练时间减少25%、内存减少33%,同时在OpenWebText和LM1B上保持相当的困惑度,并具有强劲的下游性能。我们发布了代码、检查点和视频教程:https://s-sahoo.com/duo-ch2

英文摘要

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: https://s-sahoo.com/duo-ch2

2602.19710 2026-05-19 cs.CV cs.LG cs.RO 版本更新

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

面向通用视觉-语言-动作策略的通用姿态预训练

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

发表机构 * Tencent Robotics X(腾讯机器人X) Futian Laboratory(福田实验室) The Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Shanghai Innovation Institute(上海创智学院)

AI总结 本文提出Pose-VLA,通过分离预训练和后训练阶段,解决视觉-语言-动作模型中的特征坍塌和训练效率问题,实现通用3D空间先验提取与机器人特定动作空间的高效对齐。

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project website: https://hetolin.github.io/PoseVLA

详情
Journal ref
Robotics: Science and Systems, 2026
AI中文摘要

现有视觉-语言-动作(VLA)模型将高层感知与稀疏的、特定于具身形态的动作监督纠缠在一起,因而常出现特征坍塌和训练效率低的问题。由于这些模型通常依赖为视觉问答(VQA)优化的VLM主干,它们擅长语义识别,却常忽视决定不同动作模式的细微3D状态变化。为解决这些错位,我们提出Pose-VLA,一种解耦范式,将VLA训练分为两个阶段:预训练阶段在统一的以相机为中心的空间中提取通用3D空间先验,后训练阶段在机器人特定的动作空间内高效完成具身对齐。通过引入离散姿态token作为通用表示,Pose-VLA将来自多种3D数据集的空间定位(grounding)与机器人演示中的几何级轨迹无缝整合。我们的框架采用两阶段预训练流程,先通过姿态建立基本的空间定位,再通过轨迹监督实现运动对齐。大量评估表明,Pose-VLA在RoboTwin 2.0上以79.5%的平均成功率取得最先进的结果,并在LIBERO上取得96.0%的有竞争力的表现。真实世界实验进一步表明,每个任务仅用100个演示即可对多样物体实现鲁棒泛化,验证了我们预训练范式的效率。

英文摘要

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

2602.18584 2026-05-19 cs.LG cs.AI cs.CV 版本更新

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST: 通过耦合优化几何进行指令微调的目标数据选择

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

发表机构 * Department of Computer Science, University of Virginia, Charlottesville, USA(弗吉尼亚大学计算机科学系)

AI总结 本文提出GIST方法,通过子空间对齐替代轴对齐缩放,解决参数高效微调中参数耦合问题,实现更高效的目标数据选择。

Comments ICML 2026; 27 pages, 8 figures, 11 tables

详情
AI中文摘要

目标数据选择已成为高效指令微调中的关键范式,旨在为特定目标任务识别一小部分有影响力的训练示例。在实践中,影响力通常通过示例对参数更新的影响来衡量。为了使选择可扩展,许多方法利用优化器统计量(如Adam状态)作为更新几何的轴对齐替代(即对角预条件),隐式地将参数视为按坐标相互独立。我们证明,在LoRA等参数高效微调(PEFT)方法中,这一假设不再成立。在这种情况下,诱导的优化几何表现出强烈的跨参数耦合和非平凡的非对角交互,而任务相关的更新方向被限制在低维子空间中。受此不匹配的启发,我们提出GIST(梯度等距子空间变换),一种简单但有原则的替代方法,用稳健的子空间对齐替代轴对齐缩放。GIST通过奇异值分解(SVD)从验证梯度中恢复任务特定的子空间,将训练梯度投影到该耦合子空间,并根据示例与目标方向的对齐程度对其评分。大量实验表明,在相同的选择预算下,GIST仅用0.29%的存储和25%的计算时间,即可达到或超过最先进基线的性能。

英文摘要

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.
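The three steps named in the abstract (recover a subspace from validation gradients via SVD, project training gradients into it, score by alignment) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `subspace_scores`, the use of the mean validation gradient as the target direction, and cosine similarity as the alignment score are all assumptions made here.

```python
import numpy as np

def subspace_scores(train_grads, val_grads, rank):
    """Score training examples by alignment with a validation-gradient subspace.

    train_grads: (n_train, d) array, one flattened gradient per training example.
    val_grads:   (n_val, d) array of validation gradients.
    rank:        number of singular directions to keep (hypothetical parameter).
    """
    # Rows of vt are right singular vectors; the top ones span the task subspace.
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    basis = vt[:rank]                              # (rank, d)

    # Project training gradients and the assumed target direction into the subspace.
    train_proj = train_grads @ basis.T             # (n_train, rank)
    target = val_grads.mean(axis=0) @ basis.T      # (rank,)

    # Cosine similarity inside the subspace (assumed scoring rule).
    denom = np.linalg.norm(train_proj, axis=1) * np.linalg.norm(target) + 1e-12
    return (train_proj @ target) / denom

# Toy usage: validation gradients lie along the first coordinate axis.
val = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
train = np.array([[1.0, 5.0, 0.0], [-1.0, 0.0, 3.0], [0.0, 1.0, 0.0]])
scores = subspace_scores(train, val, rank=1)
top = np.argsort(-scores)  # indices of training examples, best aligned first
```

Only the components of a training gradient that fall inside the recovered subspace affect its score, so the second coordinate of the first toy example is ignored.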

2602.17679 2026-05-19 cs.LG math.OC 版本更新

Joint Parameter and State-Space Bayesian Optimization: Using Process Expertise to Accelerate Manufacturing Optimization

联合参数与状态空间贝叶斯优化:利用过程专业知识加速制造优化

Saksham Kiroriwal, Julius Pfrommer, Jürgen Beyerer

发表机构 * Cognitive Industrial Systems, Fraunhofer IOSB(弗劳恩霍夫工业系统认知研究所) Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院)

AI总结 本文提出POGPN-JPSS框架,结合POGPN与联合参数状态空间建模,利用专家知识提取低维特征,提升多阶段生物乙醇生产过程优化效率。

Comments This paper is under review and has been submitted for CIRP CMS 2026

详情
AI中文摘要

贝叶斯优化(BO)是优化黑盒制造过程的强大方法,但在处理可以观察到中间输出的高维多阶段系统时,其性能往往受限。标准BO将过程视为黑盒,忽略中间观测和底层过程结构。部分可观测高斯过程网络(POGPN)将过程建模为有向无环图(DAG)。然而,当观测是高维状态空间时间序列时,利用中间观测具有挑战性。过程专家知识可用于从高维状态空间数据中提取低维潜在特征。我们提出了POGPN-JPSS框架,将POGPN与联合参数和状态空间(JPSS)建模相结合,以利用提取出的中间信息。我们在一个具有挑战性的高维多阶段生物乙醇生产过程模拟上展示了POGPN-JPSS的有效性。结果表明,POGPN-JPSS显著优于最先进的方法:达到所需性能阈值的速度是后者的两倍,且可靠性更高。更快的优化直接转化为时间和资源的显著节省。这突显了将专家知识与结构化概率模型相结合以实现快速过程成熟的重要性。

英文摘要

Bayesian optimization (BO) is a powerful method for optimizing black-box manufacturing processes, but its performance is often limited when dealing with high-dimensional multi-stage systems, where we can observe intermediate outputs. Standard BO models the process as a black box and ignores the intermediate observations and the underlying process structure. Partially Observable Gaussian Process Networks (POGPN) model the process as a Directed Acyclic Graph (DAG). However, using intermediate observations is challenging when the observations are high-dimensional state-space time series. Process-expert knowledge can be used to extract low-dimensional latent features from the high-dimensional state-space data. We propose POGPN-JPSS, a framework that combines POGPN with Joint Parameter and State-Space (JPSS) modeling to use intermediate extracted information. We demonstrate the effectiveness of POGPN-JPSS on a challenging, high-dimensional simulation of a multi-stage bioethanol production process. Our results show that POGPN-JPSS significantly outperforms state-of-the-art methods by achieving the desired performance threshold twice as fast and with greater reliability. The fast optimization directly translates to substantial savings in time and resources. This highlights the importance of combining expert knowledge with structured probabilistic models for rapid process maturation.

2602.15405 2026-05-19 cs.LG 版本更新

Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

信号与logits的耦合扩散模型联合增强与分类

Gilad Nurko, Roi Benita, Yehoshua Dissen, Tomohiro Nakatani, Marc Delcroix, Shoko Araki, Joseph Keshet

发表机构 * Technion -- Israel Institute of Technology, Haifa, Israel(以色列理工学院,以色列海法) NTT, Inc., Japan(日本NTT公司)

AI总结 本文提出一种集成信号和logits扩散模型的框架,通过相互指导提升分类鲁棒性,有效应对噪声环境下的分类挑战。

详情
AI中文摘要

在噪声环境中实现稳健分类仍是机器学习的基本挑战。传统方法通常将信号增强和分类视为独立的顺序阶段:首先增强信号,然后应用分类器。这种方法未能利用分类器输出中的语义信息进行去噪。本文提出一种通用、领域无关的框架,整合两个相互作用的扩散模型:一个处理输入信号,另一个处理分类器的输出logits,无需重新训练或微调分类器。这种耦合形式使两者相互指导,其中增强的信号细化类别估计,反之,演化的类别logits引导信号重建朝着流形的判别区域发展。我们引入了三种策略来有效建模输入和logit的联合分布。我们评估了所提出的联合增强方法用于图像分类和自动语音识别。所提出的框架超越了传统顺序增强基线,在多样的噪声条件下实现了稳健且灵活的分类准确率提升。

英文摘要

Robust classification in noisy environments remains a fundamental challenge in machine learning. Standard approaches typically treat signal enhancement and classification as separate, sequential stages: first enhancing the signal and then applying a classifier. This approach fails to leverage the semantic information in the classifier's output during denoising. In this work, we propose a general, domain-agnostic framework that integrates two interacting diffusion models: one operating on the input signal and the other on the classifier's output logits, without requiring any retraining or fine-tuning of the classifier. This coupled formulation enables mutual guidance, where the enhancing signal refines the class estimation and, conversely, the evolving class logits guide the signal reconstruction towards discriminative regions of the manifold. We introduce three strategies to effectively model the joint distribution of the input and the logit. We evaluated our joint enhancement method for image classification and automatic speech recognition. The proposed framework surpasses traditional sequential enhancement baselines, delivering robust and flexible improvements in classification accuracy under diverse noise conditions.

2602.08169 2026-05-19 cs.LG cs.CL 版本更新

Spherical Steering: Geometry-Aware Activation Rotation for Language Models

球面操控:面向语言模型的几何感知激活旋转

Zejia You, Chunyuan Deng, Hanjie Chen

发表机构 * Rice University(莱斯大学) Tufts University(塔夫茨大学)

AI总结 本文提出球面操控方法,通过激活旋转而非加法实现无训练的推理控制,有效避免隐藏状态幅度变化,提升模型在多项选择基准上的表现,同时保持开放生成能力。

Comments ICML 2026

详情
AI中文摘要

推理时操控为在不重新训练的情况下控制语言模型(LMs)提供了一种有前景的途径。然而,标准方法通常依赖激活加法,这不可避免地会改变隐藏状态的幅度,引发对表示崩溃和开放式生成质量下降的担忧。本文探讨了球面操控,一种无需训练的基本操作,通过激活旋转解决这一权衡。我们的方法不使用固定向量平移激活,而是沿测地线将激活朝目标方向旋转,从而在指向目标概念的同时保持信号完整性。为进一步增强自适应性,我们引入了一个置信度门,根据输入不确定性动态调节操控强度。在多项选择基准上的广泛实验表明,球面操控显著优于基于加法的基线(尤其是在TruthfulQA、COPA和Storycloze上提升10%),同时保持模型通用的开放式生成质量。这项工作强调了几何一致性的价值,表明保持范数的旋转是一种稳健而有效的基本操作,可用于精确的推理时控制。代码可在 https://github.com/chili-lab/Spherical-Steering 获取。

英文摘要

Inference-time steering offers a promising way to control language models (LMs) without retraining. However, standard approaches typically rely on activation addition, which inevitably alters the hidden-state magnitudes raising concerns about representation collapse and degraded open-ended generation. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, preserving signal integrity while steering toward the target concept. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control. The code is available at: https://github.com/chili-lab/Spherical-Steering.

2602.08167 2026-05-19 cs.RO cs.AI cs.CV cs.LG 版本更新

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

自监督自举的动作预测具身推理

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

发表机构 * Stanford(斯坦福大学) UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达)

AI总结 本文提出R&B-EnCoRe方法,通过自监督细化使模型从互联网知识中自推导具身推理策略,提升动作执行和导航性能,减少碰撞率。

Comments Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

具身链式思维(CoT)推理显著提升了视觉-语言-动作(VLA)模型,但当前方法依赖刚性模板来指定推理原语(如场景中的物体、高层计划、结构可供性)。这些模板可能迫使策略处理无关信息,干扰关键的动作预测信号。这造成了一个瓶颈:没有成功的策略,就无法验证推理质量;没有高质量的推理,就无法构建稳健的策略。我们引入R&B-EnCoRe,使模型能够通过自监督细化,从互联网规模的知识中自举具身推理。通过将推理视为重要性加权变分推断中的潜变量,模型可以在没有外部奖励、验证器或人工标注的情况下,生成并蒸馏出一个包含具身特定策略的精炼推理训练数据集。我们在操作(仿真中的Franka Panda、硬件上的WidowX)、腿式导航(双足、轮式、自行车、四足)和自动驾驶等具身形态上验证了R&B-EnCoRe,使用了参数规模为1B、4B、7B和30B的多种VLA架构。与不加区分地对所有可用原语进行推理的模型相比,我们的方法使操作成功率提升28%,导航得分提高101%,碰撞率指标降低21%。R&B-EnCoRe使模型能够蒸馏出可预测成功控制的推理,绕过人工标注工程,同时将互联网规模的知识落地于物理执行。

英文摘要

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

2602.07730 2026-05-19 cs.LG cs.AI 版本更新

The Laplacian Keyboard: Beyond the Linear Span

拉普拉斯键盘:超越线性张成

Siddarth Chandrasekar, Marlos C. Machado

发表机构 * Department of Computing Science, University of Alberta, Canada(加拿大阿尔伯塔大学计算科学系) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席)

AI总结 本文提出拉普拉斯键盘框架,通过构建行为库超越线性张成的限制,提升零样本控制的表达能力与样本效率。

Comments 31 pages, 17 figures

详情
AI中文摘要

在各个科学领域,从信号处理到量子力学,拉普拉斯特征向量都是简化复杂系统的基本基底。在强化学习(RL)中,它们同样构成状态空间上的一组基,使奖励函数可以通过投影到少量特征向量上来近似。这种投影使零样本控制成为可能,但也带来了根本性的限制:诱导出的策略的表达能力受限于所选特征向量的线性张成。我们引入了拉普拉斯键盘(LK),一种超越这一线性张成的分层框架。LK从这些特征向量中构建与任务无关的行为库,形成一组行为基,保证包含线性张成内任意奖励的最优策略。元策略学习动态地拼接这些行为,从而能够高效学习原始线性约束之外的策略。我们建立了零样本近似误差的理论界,并通过实验表明,LK在零样本解的基础上有所改进,同时样本效率优于标准RL方法。

英文摘要

Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), they similarly form a basis over the state space, enabling reward functions to be approximated by projection onto a small set of eigenvectors. This projection makes zero-shot control possible, but it also imposes a fundamental limitation: the induced policies are only as expressive as the linear span of the chosen eigenvectors. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond this linear span. LK constructs a task-agnostic library of behaviors from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these behaviors dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK improves over the zero-shot solution while achieving better sample efficiency compared to standard RL methods.
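The projection step the abstract starts from can be illustrated on a toy state space. The sketch below assumes a five-state chain whose states are connected like a path graph and uses the combinatorial graph Laplacian; both choices, and the helper names `path_graph_laplacian` and `project_reward`, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of an n-state chain (assumed toy state space)."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def project_reward(reward, laplacian, k):
    """Approximate a reward vector by its projection onto the k smoothest eigenvectors."""
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors as columns.
    _, vecs = np.linalg.eigh(laplacian)
    basis = vecs[:, :k]
    coeffs = basis.T @ reward
    return basis @ coeffs

# Toy usage: a sparse goal reward is only partly captured by a small basis.
lap = path_graph_laplacian(5)
goal_reward = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
errors = [np.linalg.norm(goal_reward - project_reward(goal_reward, lap, k))
          for k in range(1, 6)]
```

The residual left for small `k` is the part of the reward outside the linear span, which is the limitation the abstract says the hierarchical framework is designed to go beyond.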

2602.07715 2026-05-19 cs.LG 版本更新

Analyzing and Guiding Zero-Shot Posterior Sampling in Diffusion Models

分析和指导扩散模型中的零样本后验采样

Roi Benita, Michael Elad, Joseph Keshet

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Department of Computer Science(计算机科学系) Technion, Haifa, Israel(以色列理工学院,以色列海法)

AI总结 本文分析了扩散模型中零样本后验采样的方法,提出基于高斯假设的框架,通过频域分析实现参数设计,提升感知质量和信号保真度。

详情
AI中文摘要

从退化测量中恢复信号一直是科学和工程的挑战。最近,零样本扩散方法被提出用于此类逆问题,提供基于先验知识的后验采样解决方案。此类算法通过推理整合观测,通常依赖手动调参和启发式方法。本文提出对这些近似后验采样器的严格分析,基于先验的高斯性假设。在此条件下,我们证明理想后验采样器和扩散重建算法可以表示为闭式形式,从而在频域中进行彻底分析和比较。基于这些表示,我们引入一种系统的方法来设计参数,取代以往的启发式选择策略。所提方法具有方法无关性,产生定制化的参数选择,共同考虑先验、退化信号和扩散动态的特性。我们显示,我们的频域推荐在结构上不同于标准启发式方法,并随扩散步长变化,从而在感知质量和信号保真度之间实现一致的平衡。

英文摘要

Recovering a signal from its degraded measurements is a long standing challenge in science and engineering. Recently, zero-shot diffusion based methods have been proposed for such inverse problems, offering a posterior sampling based solution that leverages prior knowledge. Such algorithms incorporate the observations through inference, often leaning on manual tuning and heuristics. In this work we propose a rigorous analysis of these approximate posterior samplers, relying on a Gaussianity assumption of the prior. Under this regime, we show that both the ideal posterior sampler and diffusion-based reconstruction algorithms can be expressed in closed-form, enabling their thorough analysis and comparisons in the spectral domain. Building on these representations, we introduce a principled framework for parameter design, replacing heuristic selection strategies used to date. The proposed approach is method-agnostic and yields tailored parameter choices that jointly account for the characteristics of the prior, the degraded signal, and the diffusion dynamics. We show that our spectral recommendations differ structurally from standard heuristics and vary with the diffusion step size, resulting in a consistent balance between perceptual quality and signal fidelity.
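For readers unfamiliar with why a Gaussian prior yields closed forms, the textbook linear-Gaussian case is a useful reference point. The degradation model below (a linear operator $H$ with additive white Gaussian noise) and all symbols are assumptions chosen for illustration; the abstract does not specify the model used in the paper.

```latex
% Assumed setting: prior x ~ N(\mu, \Sigma), measurement y = Hx + n, n ~ N(0, \sigma^2 I).
\begin{aligned}
p(x \mid y) &= \mathcal{N}\!\left(x;\ \mu_{\mathrm{post}},\ \Sigma_{\mathrm{post}}\right),\\
\mu_{\mathrm{post}} &= \mu + \Sigma H^{\top}\left(H \Sigma H^{\top} + \sigma^{2} I\right)^{-1}\left(y - H\mu\right),\\
\Sigma_{\mathrm{post}} &= \Sigma - \Sigma H^{\top}\left(H \Sigma H^{\top} + \sigma^{2} I\right)^{-1} H \Sigma .
\end{aligned}

% Denoising special case H = I, written in the eigenbasis of \Sigma with eigenvalues \lambda_i:
\begin{aligned}
\mu_{\mathrm{post},i} &= \mu_i + \frac{\lambda_i}{\lambda_i + \sigma^{2}}\,\left(y_i - \mu_i\right),
&
\operatorname{Var}_{\mathrm{post},i} &= \frac{\lambda_i\,\sigma^{2}}{\lambda_i + \sigma^{2}} .
\end{aligned}
```

In the special case each spectral component is handled independently with its own gain $\lambda_i / (\lambda_i + \sigma^2)$, which is the kind of per-component view that makes a comparison in the spectral domain possible.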

2602.06807 2026-05-19 cs.RO cs.AI cs.LG 版本更新

SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments

SuReNav:基于超像素图的约束放松用于过约束环境中的导航

Keonyoung Koh, Moonkyeong Jung, Samuel Seungsup Lee, Daehyung Park

发表机构 * School of Computing, Korea Advanced Institute of Science and Technology, Korea(韩国科学技术院计算机学院)

AI总结 本文提出SuReNav方法,通过超像素图构建区域约束,并利用图神经网络放松约束,实现安全高效的导航,适用于半静态环境中的过约束规划问题,提升导航的类人程度。

Comments Accepted by ICRA 2026. Code and videos are available at https://sure-nav.github.io/

详情
AI中文摘要

我们针对半静态环境中的过约束规划问题提出SuReNav方法:通过超像素图构建区域约束,并利用在人类示范数据上训练的图神经网络放松约束,实现安全高效的导航。框架包含三个组件:1)带有区域约束的超像素图地图生成,2)利用图神经网络进行区域约束放松,3)放松、规划和执行的交织过程。在2D语义地图和来自OpenStreetMap的3D地图上的评估中,该方法取得了最高的类人程度得分,同时保持了效率与安全之间的平衡。最后,我们使用四足机器人Spot在真实城市导航中展示了其可扩展性和泛化能力。代码和视频可在 https://sure-nav.github.io/ 获取。

英文摘要

We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalizations. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot. Code and Videos are available at https://sure-nav.github.io/.

2602.05993 2026-05-19 cs.LG cs.AI 版本更新

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

钻石映射:通过随机流映射实现高效的奖励对齐

Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出钻石映射,一种通过随机流映射实现高效奖励对齐的生成模型,能够在推理时对任意奖励进行准确对齐,提升模型适应性和性能。

详情
AI中文摘要

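流和扩散模型能生成高质量样本,但在训练后使其适应用户偏好或约束仍然成本高昂且脆弱,这一挑战通常被称为奖励对齐。本文认为,高效的奖励对齐应当是生成模型本身的属性,而非事后补救,并据此重新设计模型以增强适应性。我们提出“钻石映射”(Diamond Maps),一类随机流映射模型,能够在推理时高效而准确地对齐任意奖励。钻石映射像流映射一样,将许多模拟步骤摊销为单步采样器,同时保留最优奖励对齐所需的随机性。这一设计使价值函数能够被高效且一致地估计,从而让搜索、序贯蒙特卡洛和引导方法具备可扩展性。实验表明,钻石映射可以通过从GLASS Flows蒸馏高效学习,取得更强的奖励对齐性能,并且比现有方法具有更好的扩展性。我们的结果指向了一条实用途径:让生成模型在推理时快速适应任意偏好和约束。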

英文摘要

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, Sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

2602.05813 2026-05-19 cs.LG math.OC 版本更新

Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers

预热从何而来?面向范数约束优化器的自适应调度

Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov

发表机构 * Basic Research of Artificial Intelligence Laboratory(人工智能基础研究实验室) Federated Learning Problems Laboratory(联邦学习问题实验室) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Innopolis University(因诺波利斯大学)

AI总结 本文研究了范数约束优化器的自适应学习率调度,提出了一种广义光滑性假设,即局部曲率随次优间隙的减小而降低;在此假设下,先预热、后衰减的调度自然地从收敛证明中产生,而非启发式地人为设定。

Comments 30 pages, 8 figures, 5 tables

详情
AI中文摘要

我们研究了范数约束优化器(如Muon和Lion)的自适应学习率调度。我们引入了一种广义光滑性假设,在该假设下局部曲率随次优间隙的减小而降低,并通过实验验证了这一行为在优化轨迹上成立。在该假设下,我们在适当选择学习率的前提下建立了收敛保证,其中“先预热、后衰减”的调度自然地从证明中产生,而非启发式地人为设定。基于这一理论,我们开发了一种实用的学习率调度器,它仅依赖标准超参数,并在训练开始时自动调整预热时长。我们在采用LLaMA架构的大型语言模型预训练上评估了该方法,结果表明,在所有考虑的设置中,我们的自适应预热选择始终优于或至少持平于最佳的手动调优预热调度,且无需额外的超参数搜索。源代码可在 https://github.com/brain-lab-research/llm-baselines/tree/warmup 获取。

英文摘要

We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup

2602.02236 2026-05-19 cs.RO cs.LG cs.NE cs.SY eess.SY 版本更新

Adaptive Control in Autonomous Driving via Real-Time Recurrent RL

通过实时递归强化学习实现自动驾驶中的自适应控制

Julian Lemmel, Felix Resch, Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

发表机构 * TU Wien(维也纳技术大学) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Liquid AI

AI总结 本文研究了通过实时递归强化学习(RTRRL)对自动驾驶预训练控制策略进行在线微调,结合离线行为克隆与在线RTRRL微调,以适应部署时的分布偏移。在CarRacing模拟和1:10比例的RoboRacer平台上的实验验证了该方法的有效性。

详情
AI中文摘要

我们研究了使用实时递归强化学习(RTRRL)对自动驾驶预训练控制策略进行在线微调;RTRRL是一种内存高效的算法,能够在每个时间步更新策略参数,而无需随时间反向传播。我们扩展RTRRL以支持最近提出的非线性对角状态空间模型LrcSSM,并将离线行为克隆与在线RTRRL微调相结合,使策略适应部署时的分布偏移。我们在CarRacing模拟环境和配备事件相机的1:10比例RoboRacer平台上验证了该方法,其中预训练策略在真实世界的循线任务中进行在线微调。据我们所知,这是首次在标准(非脉冲)硬件上的闭环控制中,使用事件相机观测进行在线强化学习微调的演示。基于LrcSSM的策略在两种设置中均提升最快、最稳定。

英文摘要

We study online fine-tuning of pretrained control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient algorithm that updates policy parameters at every time step without backpropagation through time. We extend RTRRL to support LrcSSM, a recently proposed nonlinear diagonal state-space model, and combine offline behavioral cloning with online RTRRL fine-tuning to adapt policies to distribution shifts at deployment. We validate the approach in the CarRacing simulation and on a 1:10-scale RoboRacer platform equipped with an event camera, where a pretrained policy is fine-tuned online during real-world line-following. To our knowledge, this is the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware in closed-loop control. LrcSSM-based policies improve fastest and most consistently across both settings.

2602.02039 2026-05-19 cs.AI cs.CL cs.DB cs.LG 版本更新

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

主动出击而非被动等待:评估大型语言模型的深度数据研究能力

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

发表机构 * GitHub

AI总结 本文提出深度数据研究(DDR)任务和DDR-Bench基准,评估大型语言模型的探索智能,发现有效探索需要内在策略而非单纯扩展。

Comments 14 pages, 7 tables, 8 figures, accepted by ICML 2026

详情
AI中文摘要

人们对智能体式大型语言模型的能动性期望不止于正确回答,还要求其具备自主设定目标并决定探索方向的能力。我们称之为探索智能,以区别于仅完成指定任务的执行智能。数据科学提供了天然的试验场,因为现实中的分析始于原始数据而非明确的查询,但很少有基准关注这一点。为此,我们引入深度数据研究(DDR),一项让LLM自主从数据库中提取关键洞察的开放式任务,并提出DDR-Bench,一个大规模、基于清单、支持可验证评估的基准。结果表明,尽管前沿模型已显现出初步的能动性,长周期探索仍具挑战性。我们的分析强调,有效的探索智能不仅依赖智能体脚手架或单纯的规模扩展,还依赖智能体模型的内在策略。

英文摘要

The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

2602.01705 2026-05-19 cs.LG cs.AI 版本更新

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL:潜在扩散推理防止强化学习中的熵崩溃

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin

发表机构 * UC San Diego(加州大学圣地亚哥分校) Apple(苹果公司)

AI总结 本文提出LaDi-RL方法,通过潜在扩散模型生成潜在推理轨迹,解决强化学习中熵崩溃问题,提升代码生成和数学推理性能。

详情
AI中文摘要

强化学习已成为提升大语言模型推理能力的核心范式,但现有方法大多在离散token序列上优化策略。这造成了优化空间与推理结构之间的不匹配:许多重要决策是语义的、全局的、轨迹级的,而非局部的token选择。连续潜在空间RL提供了一种有前景的替代方案,允许策略探索更高层次的推理表示。然而,仅仅转向潜在空间并不够,所得策略必须对有效推理轨迹上复杂的多模态分布进行建模。为此,我们提出潜在扩散推理与强化学习(LaDi-RL),其中扩散模型通过迭代去噪生成潜在推理轨迹。这一形式支持结构化探索和富有表达力的分布建模,但也带来了根本性的信用分配挑战:策略在潜在空间中行动,而奖励只有在潜在表示被解码为文本后才能观察到。因此,朴素的rollout策略会把潜在推理质量与文本解码质量纠缠在一起,使人无法判断错误答案是源于糟糕的潜在轨迹,还是源于不完美的文本实现。为解决这一问题,我们引入分层的潜在-文本rollout:对每条潜在轨迹采样多个文本补全,并聚合其奖励,得到对解码器边缘化的潜在效用估计。这为优化扩散策略提供了更干净、方差更低的奖励信号。实验表明,LaDi-RL在pass@1指标上比token级RL在代码生成上高9.4%、在数学推理上高5.7%,甚至超过了基础模型的pass@k性能。

英文摘要

Reinforcement learning has become a central paradigm for improving LLM reasoning, but most existing methods optimize policies over discrete token sequences. This creates a mismatch between the optimization space and the structure of reasoning: many important decisions are semantic, global, and trajectory-level rather than local token choices. Continuous latent-space RL offers a promising alternative by allowing policies to explore higher-level reasoning representations. However, simply moving to latent space is not sufficient. The resulting policy must model a complex, multi-modal distribution over valid reasoning trajectories. We therefore propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), where a diffusion model generates latent reasoning trajectories through iterative denoising. This formulation enables structured exploration and expressive distribution modeling, but also introduces a fundamental credit-assignment challenge: the policy acts in latent space, while rewards are observed only after the latent is decoded into text. A naive rollout strategy therefore entangles latent reasoning quality with text decoding quality, making it unclear whether an incorrect answer results from a poor latent trajectory or from an imperfect textual realization. To address this, we introduce hierarchical latent-text rollouts. We sample multiple text completions for each latent trajectory and aggregate their rewards to obtain a decoder-marginalized estimate of latent utility. This provides a cleaner and lower-variance reward signal for optimizing the diffusion policy. Empirically, LaDi-RL outperforms token-level RL by 9.4% on code generation and 5.7% on math reasoning in pass@1, and even surpasses the base model's pass@k performance.
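The hierarchical latent-text rollout described above amounts to averaging rewards over several decodings of the same latent trajectory. The sketch below shows only that aggregation step under toy assumptions: `latent_utility`, `toy_decode`, and `toy_reward` are hypothetical stand-ins, a latent is reduced to a single "quality" number, and no diffusion policy or language model is involved.

```python
import random
from statistics import mean

def latent_utility(latent, decode, reward, num_texts, rng):
    """Decoder-marginalized utility: mean reward over several texts decoded from one latent."""
    rewards = [reward(decode(latent, rng)) for _ in range(num_texts)]
    return mean(rewards)

def toy_decode(latent, rng):
    # Hypothetical stochastic decoder: realizes the latent correctly
    # with probability latent["quality"].
    return "correct" if rng.random() < latent["quality"] else "wrong"

def toy_reward(text):
    # Hypothetical verifiable reward on the decoded text.
    return 1.0 if text == "correct" else 0.0

rng = random.Random(0)
good = latent_utility({"quality": 0.9}, toy_decode, toy_reward, 64, rng)
bad = latent_utility({"quality": 0.2}, toy_decode, toy_reward, 64, rng)
```

With a single decoding per latent, a good latent can receive reward 0 purely because of an unlucky decoding. Averaging over many decodings reduces that variance, which is the motivation the abstract gives for the hierarchical rollout.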

2601.23154 2026-05-19 cs.LG cs.AI 版本更新

On Safer Reinforcement Learning for Sedation and Analgesia in Intensive Care

面向重症监护镇静与镇痛的更安全的强化学习

Joel Romero-Hernandez, Oscar Camara

发表机构 * BCN MedTech, Complex Systems Lab Universitat Pompeu Fabra Barcelona, Spain(BCN MedTech,复杂系统实验室,庞培法布拉大学,西班牙巴塞罗那)

AI总结 本文提出一种离线深度强化学习框架,用于优化重症监护中的镇痛和镇静,通过减少疼痛或联合减少疼痛和30天出院后死亡率来提升治疗安全性。

Comments 48th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC 2026)

详情
AI中文摘要

重症监护中的疼痛管理通常涉及复杂的权衡,因为治疗不足和治疗过度都会危及患者安全。先前将强化学习用于镇静和镇痛的研究探讨了如何优化这些干预,但未考虑患者生存或部分可观测性。为探究这些设计选择的风险,我们开发了一个离线深度强化学习框架,基于循环状态表示给出每小时的药物剂量建议。利用MIMIC-IV数据库中47,144次ICU住院的回顾性数据,我们训练并评估了行为正则化的actor-critic模型,这些模型按照两个目标之一开具阿片类药物、丙泊酚、苯二氮䓬类药物和右美托咪定的连续剂量:减轻疼痛,或同时减轻疼痛并降低出院后30天死亡率。尽管所得的两种策略都与较低的疼痛相关,但临床医生与仅针对疼痛的策略的一致程度与死亡率呈正相关(ρ=0.119,p<0.0001),而与联合策略的一致程度则与死亡率呈负相关(ρ=-0.316,p<0.0001)。我们发现这种差异源于两种策略对高共病水平的不同反应。这表明,即使短期目标仍是主要目标,重视出院后结局对于学习更安全的治疗策略也可能至关重要。

英文摘要

Pain management in intensive care usually involves complex trade-offs, since both inadequate and excessive treatment can compromise patient safety. Prior work on reinforcement learning for sedation and analgesia has explored how to optimize these interventions, but has not considered patient survival or partial observability. To investigate the risks of these design choices, we developed an offline deep reinforcement learning framework that suggests hourly medication doses based on recurrent state representations. Using retrospective data from 47,144 ICU stays in the MIMIC-IV database, we trained and evaluated behavior-regularized actor-critic models that prescribe continuous doses of opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and 30-day post-discharge mortality. Although the two resulting policies were associated with lower pain, clinician agreement with the pain-only policy was positively correlated with mortality ($ρ$=0.119, p<0.0001), while agreement with the joint policy was negatively correlated ($ρ$=-0.316, p<0.0001). We found that such divergence arose from a different response to high levels of comorbidity. This suggests that valuing post-discharge outcomes could be critical for learning safer treatment policies, even if a short-term goal remains the primary objective.

2601.22678 2026-05-19 cs.LG 版本更新

Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective

全图与小批量训练:从批量大小和Fan-Out大小视角的综合分析

Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu

发表机构 * The University of Hong Kong(香港大学) Ant Group(蚂蚁集团)

AI总结 本文从批量大小和Fan-Out大小角度系统比较了全图与小批量GNN训练方法,通过实证和理论分析揭示了批量大小和Fan-Out大小对模型收敛和泛化的影响,为超参数调优提供指导。

详情
AI中文摘要

全图和小批量图神经网络(GNN)训练方法具有不同的系统设计需求,选择合适的方法至关重要。比较这两种GNN训练方法的核心挑战在于刻画其模型性能(即收敛性和泛化)和计算效率。尽管批量大小在分析深度神经网络(DNNs)行为时是一个有效的视角,但GNNs通过引入Fan-Out大小扩展了这一视角,因为全图训练可以视为批量大小和Fan-Out大小最大的小批量训练。然而,GNNs的批量和Fan-Out大小的影响仍不够深入。为此,本文通过实证和理论分析,从批量大小和Fan-Out大小的角度系统比较了GNNs的全图与小批量训练。我们的主要贡献包括:1)我们提供了一种新的泛化分析,使用Wasserstein距离研究图结构,尤其是Fan-Out大小的影响。2)我们揭示了批量大小和Fan-Out大小在GNN收敛和泛化中的各向异性影响,为在资源受限条件下调优这些超参数提供了实际指导。最后,全图训练并不总能比经过良好调优的小批量设置在模型性能或计算效率上更优。实现可在GitHub链接中找到:https://github.com/LIUMENGFAN-gif/GNN_fullgraph_minibatch_training。

英文摘要

Full-graph and mini-batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While batch size has been an effective lens for analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan-out size, as full-graph training can be viewed as mini-batch training with the largest possible batch size and fan-out size. However, the impact of the batch size and fan-out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full-graph vs. mini-batch training of GNNs through empirical and theoretical analyses from the viewpoints of batch size and fan-out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan-out size. 2) We uncover the non-isotropic effects of the batch size and the fan-out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full-graph training does not always yield better model performance or computational efficiency than well-tuned smaller mini-batch settings. The implementation is available on GitHub: https://github.com/LIUMENGFAN-gif/GNN_fullgraph_minibatch_training.

2601.21350 2026-05-19 cs.LG 版本更新

Factored Causal Representation Learning for Robust Reward Modeling in RLHF

因式分解因果表示学习用于RLHF中的鲁棒奖励建模

Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Fan Feng, Biwei Huang, Shikui Tu, Lei Xu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Alibaba Group(阿里巴巴集团) University of California San Diego(加州大学圣地亚哥分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出因式分解表示学习框架,通过分离因果因素与非因果因素提升奖励模型鲁棒性,有效缓解奖励黑客问题。

详情
AI中文摘要

可靠的奖励模型对于通过基于人类反馈的强化学习使大语言模型与人类偏好对齐至关重要。然而,标准奖励模型容易受到与人类标签没有因果关系的虚假特征的影响。这可能导致奖励黑客问题,即预测奖励很高却并未带来更好的行为。本文从因果视角出发,提出一个因式分解表示学习框架,将模型的上下文嵌入分解为(1)足以预测奖励的因果因素和(2)捕捉长度或谄媚偏差等与奖励无关属性的非因果因素。奖励头被约束为仅依赖因果部分。此外,我们引入一个对抗头,训练其从非因果因素预测奖励,同时应用梯度反转,以阻止非因果因素编码与奖励相关的信息。数学和对话任务上的实验表明,本文方法学到了更稳健的奖励模型,并在下游RLHF性能上持续优于最先进的基线。对长度和谄媚偏差的分析进一步验证了该方法在缓解奖励黑客行为方面的有效性。

英文摘要

A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model's contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.

2601.21170 2026-05-19 cs.LG stat.ML 版本更新

The Powers of Precision: Structure-Informed Detection in Complex Systems -- From Customer Churn to Seizure Onset

精度的威力:复杂系统中的结构引导检测——从客户流失到癫痫发作起始

Augusto Santos, Teresa Santos, Catarina Rodrigues, José M. F. Moura

发表机构 * Instituto de Telecomunicações(电信研究所) Cegid(Cegid公司) ECE Department at Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系)

AI总结 本文提出一种基于结构信息的机器学习方法,用于复杂系统中关键事件的早期检测,通过学习最优特征表示和分类模块,实现对隐藏因果结构的识别与利用,展示了在癫痫发作检测和客户流失预测中的有效性。

详情
AI中文摘要

涌现现象(癫痫发作的起始、突发的客户流失或大流行病暴发)往往源于复杂系统中隐藏的因果相互作用。我们提出了一种用于此类现象早期检测的机器学习方法,解决了一个核心挑战:在数据生成过程未知且仅部分可观测的情况下,揭示并利用系统潜在的因果结构。该方法从一个单参数估计器族(经验协方差矩阵或精度矩阵的幂)中学习最优特征表示,提供了一种原则性的方式来捕捉驱动关键事件出现的底层结构。随后的监督学习模块对学习到的表示进行分类。我们证明了该估计器族的结构一致性,并在癫痫发作检测和客户流失预测中展示了方法的实证有效性,在两项任务上均取得了有竞争力的结果。除了预测之外,在可解释性方面,我们发现最优协方差幂显示出良好的可识别性,同时捕捉到结构特征,从而在预测性能与可解释的统计结构之间取得平衡。

英文摘要

Emergent phenomena -- onset of epileptic seizures, sudden customer churn, or pandemic outbreaks -- often arise from hidden causal interactions in complex systems. We propose a machine learning method for their early detection that addresses a core challenge: unveiling and harnessing a system's latent causal structure despite the data-generating process being unknown and partially observed. The method learns an optimal feature representation from a one-parameter family of estimators -- powers of the empirical covariance or precision matrix -- offering a principled way to tune in to the underlying structure driving the emergence of critical events. A supervised learning module then classifies the learned representation. We prove structural consistency of the family and demonstrate the empirical soundness of our approach on seizure detection and churn prediction, attaining competitive results in both. Beyond prediction, and toward explainability, we ascertain that the optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, thus reconciling predictive performance with interpretable statistical structure.

2601.19624 2026-05-19 cs.LG cs.AI 版本更新

Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

追踪漂移:面向非平稳强化学习的变异性感知熵调度

Tongxi Wang, Zhuoyang Xia, Xinran Chen, Shan Liu

发表机构 * School of Future Technology, Southeast University, Nanjing, China(东南大学未来技术学院,南京,中国) School of Automation, Southeast University, Nanjing, China(东南大学自动化学院,南京,中国)

AI总结 本文提出AES方法,通过动态调整熵系数以应对环境漂移,减少性能下降并加快恢复速度。

Comments Accepted by ICML 2026

详情
AI中文摘要

现实中的强化学习常面临环境漂移问题,但现有方法多依赖静态熵系数/目标熵,导致稳定期过度探索和漂移后探索不足。本文证明,在标准假设下,非平稳最大熵强化学习中的熵调度可转化为跟踪漂移比较器与稳定更新之间的动态遗憾权衡,得出熵权重与在线非平稳性代理的平方根缩放规则。基于此,提出AES--自适应熵调度,通过在线训练中使用可观察的漂移代理动态调整熵系数/温度,几乎不改变结构且开销极小。在四种算法变体、十二个任务和四种漂移模式中,AES显著减少了漂移导致的性能下降比例并加速了突变后的恢复。

英文摘要

Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift, and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We show that, under standard assumptions, entropy scheduling in non-stationary maximum-entropy RL can be cast as the dynamic-regret trade-off between tracking a drifting comparator and stabilizing updates, yielding a square-root scaling rule for the entropy weight in terms of an online non-stationarity proxy. Building on this, we propose AES--Adaptive Entropy Scheduling--which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.
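The square-root scaling rule mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the drift proxy (an exponential moving average of absolute reward change), the constants, and the function names are hypothetical and are not taken from the paper.

```python
import math


def update_drift_proxy(prev_proxy, prev_reward, reward, decay=0.9):
    """Exponential moving average of |reward change| (a hypothetical drift proxy)."""
    return decay * prev_proxy + (1.0 - decay) * abs(reward - prev_reward)


def entropy_weight(proxy, scale=0.2, low=0.01, high=1.0):
    """Entropy coefficient that grows with the square root of the proxy, then is clipped."""
    return min(high, max(low, scale * math.sqrt(proxy)))
```

Under this rule, quadrupling the proxy doubles the entropy weight until the clip bounds take over, so exploration rises after a drift and decays again as the proxy settles.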

2601.16527 2026-05-19 cs.LG cs.AI cs.CL cs.CV 版本更新

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

超越表面遗忘:多模态大语言模型中幻觉的锐度感知鲁棒擦除

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) Institute for AI, Tsinghua University(清华大学人工智能研究院) Huzhou University(湖州师范学院) Institute of Dataspace, Hefei Comprehensive National Science Center(合肥综合性国家科学中心数据空间研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出SARE方法,通过目标导向的min-max优化和Targeted-SAM机制,解决多模态大语言模型中幻觉的鲁棒擦除问题,提升模型稳定性与擦除效果。

详情
AI中文摘要

多模态大语言模型虽然强大,但容易产生物体幻觉,即描述并不存在的实体,从而损害可靠性。尽管最近的遗忘方法试图缓解这一问题,我们发现了一个关键缺陷:结构脆弱性。我们通过实验表明,标准擦除仅能实现表面抑制,使模型陷入尖锐极小值,经过轻量级重新学习后幻觉会灾难性地复发。为确保几何稳定性,我们提出SARE,将遗忘建模为有针对性的min-max优化问题,并使用Targeted-SAM机制显式地平坦化幻觉概念周围的损失景观。通过在模拟的最坏情况参数扰动下抑制幻觉,我们的框架确保了对权重变化保持稳定的鲁棒去除。大量实验表明,SARE在擦除效果上显著优于基线,同时保持通用生成质量。关键的是,它在重新学习和参数更新下仍能维持持久的幻觉抑制,验证了几何稳定化的有效性。

英文摘要

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
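To make the min-max idea concrete, here is a minimal sketch of one plain sharpness-aware minimization (SAM) step on a toy quadratic loss. It is not the paper's Targeted-SAM: the abstract does not say how the perturbation is targeted at hallucinated concepts, so the loss, the step sizes, and the function names below are assumptions.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: step to a nearby worst-case point, then descend from w."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # inner maximization (first order)
    g_adv = grad_fn(w_adv)  # gradient evaluated at the perturbed weights
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]  # outer minimization


def toy_grad(w):
    """Gradient of the toy loss 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]
```

Because the descent direction is computed at the perturbed point, the update penalizes solutions whose loss rises quickly under small weight shifts, which is the sense in which the landscape is flattened.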

2601.16398 2026-05-19 cs.CY cs.CL cs.LG 版本更新

White-Box Sensitivity Auditing with Steering Vectors

基于引导向量的白盒敏感性审计

Hannah Cyberey, Yangfeng Ji, David Evans

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出白盒敏感性审计框架,通过激活引导进行更严格的模型内部评估,用于检测大语言模型中的偏见,揭示模型对保护属性的依赖。

详情
AI中文摘要

算法审计是检查系统属性的重要工具,当前对大语言模型(LLM)的审计主要依赖黑盒评估,仅通过输入输出测试。这些方法局限于输入空间中的测试,通常由启发式生成。此外,许多社会相关模型属性(如性别偏见)抽象且难以通过文本输入单独测量。为解决这些限制,我们提出了一种白盒敏感性审计框架,利用激活引导进行更严格的内部评估。我们的审计方法通过操纵关键概念进行内部敏感性测试,以评估模型的预期功能。我们展示了其在四个模拟高风险LLM决策任务中的应用。我们的方法一致表明,模型预测对保护属性存在显著依赖,即使在标准黑盒评估表明几乎没有偏见的设置中。我们的代码在https://github.com/hannahxchen/llm-steering-audit上公开可用。

英文摘要

Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black-box evaluations that assess model behavior only through input-output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently indicates substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm-steering-audit

2601.16287 2026-05-19 physics.optics cond-mat.mtrl-sci cs.LG physics.app-ph 版本更新

Active learning for photonic crystals

光子晶体的主动学习

Ryan Lopez, Charlotte Loh, Rumen Dangovski, Marin Soljačić

发表机构 * Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA(麻省理工学院物理系) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA(麻省理工学院电气工程与计算机科学系)

AI总结 本文提出基于分析的LL-BNN与不确定性驱动采样结合的方法,用于加速光子带隙预测,通过聚焦高不确定性区域减少训练数据需求,提升光子晶体拓扑优化效率。

Comments 8 pages, 7 figures, accepted to Optics Express; updated version after reviewer comments

详情
AI中文摘要

本文探讨将解析近似贝叶斯最后一层神经网络(LL-BNNs)与不确定性驱动的样本选择相结合,以加速光子带隙预测。我们采用对应于无限蒙特卡洛样本极限的解析LL-BNN公式,以获得与未标记候选结构上的真实预测误差强相关的不确定性估计。这些不确定性评分驱动主动学习策略,在训练过程中优先选择信息量最大的模拟。在预测二维two-tone(双色调)光子晶体带隙大小的任务中,与随机采样基线相比,我们的方法在保持预测准确性的同时,使所需训练数据平均最多减少2.7倍。效率提升源于将计算资源集中在设计空间的高不确定性区域,而非均匀采样。鉴于完整能带结构模拟的巨大成本,尤其是在三维情况下,这种数据效率使快速、可扩展的代理建模成为可能。我们的结果表明,基于解析LL-BNN的主动学习可以显著加速光子晶体的拓扑优化和逆向设计流程,并更广泛地为科学机器学习各领域提供一个通用的数据高效回归框架。

英文摘要

Active learning for photonic crystals explores the integration of analytic approximate Bayesian last layer neural networks (LL-BNNs) with uncertainty-driven sample selection to accelerate photonic band gap prediction. We employ an analytic LL-BNN formulation, corresponding to the infinite Monte Carlo sample limit, to obtain uncertainty estimates that are strongly correlated with the true predictive error on unlabeled candidate structures. These uncertainty scores drive an active learning strategy that prioritizes the most informative simulations during training. Applied to the task of predicting band gap sizes in two-dimensional, two-tone photonic crystals, our approach achieves up to a 2.7x reduction on average in required training data compared to a random sampling baseline while maintaining predictive accuracy. The efficiency gains arise from concentrating computational resources on high uncertainty regions of the design space rather than sampling uniformly. Given the substantial cost of full band structure simulations, especially in three dimensions, this data efficiency enables rapid and scalable surrogate modeling. Our results suggest that analytic LL-BNN based active learning can substantially accelerate topological optimization and inverse design workflows for photonic crystals, and more broadly, offers a general framework for data efficient regression across scientific machine learning domains.

2601.11895 2026-05-19 cs.LG cs.AI cs.SE 版本更新

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench:一个现实的、面向开发者的代码生成模型基准测试

Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

发表机构 * California Institute of Technology(加州理工学院) Microsoft(微软公司)

AI总结 DevBench通过真实开发者数据和生成模型合成,构建了包含六种编程语言和任务的1800个实例,评估大语言模型在代码补全任务中的表现,揭示模型在语法精度、语义推理和实用价值上的差异。

详情
AI中文摘要

DevBench是一个基于遥测数据的基准测试,旨在评估大语言模型(LLMs)在真实代码补全任务中的性能。它包含1,800个评估实例,覆盖六种编程语言和六种任务类别;这些实例源自真实的开发者遥测数据,并使用来自多个提供商系列的生成模型合成,以减轻单一来源偏差。与以往的基准测试不同,它强调生态效度,避免训练数据污染,并支持详细的诊断分析。评估结合了功能正确性、基于相似度的指标以及侧重有用性和上下文相关性的LLM评审。共评估了9个最先进的模型,其中最强的模型在Pass@1上仅达到43.5%,这证实了该基准测试仍具挑战性,并揭示了模型在语法精度、语义推理和实用价值上的差异。我们的基准测试提供了可操作的见解以指导模型选择和改进;这类细节在其他基准测试中往往缺失,但对实际部署和有针对性的模型开发都至关重要。

英文摘要

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming that the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.
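For readers unfamiliar with the Pass@1 figure, the sketch below shows one common way to compute it, using the unbiased pass@k estimator of Chen et al. (2021). Whether DevBench uses exactly this estimator, and how many samples it draws per task, is not stated in the abstract, so treat the function names and input layout as assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c pass (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Mean pass@1 over tasks; results is a list of (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With a single sample per task, pass@1 reduces to the fraction of tasks whose one completion passes its functional tests.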

2601.10705 2026-05-19 cs.LG 版本更新

Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication

有界陈旧度、部分参与和噪声通信下的分布式感知机

Keval Jain, Anant Raj, Saurav Prakash, Girish Varma

发表机构 * Indian Institute of Science(印度科学研究院)

AI总结 本文研究了在联邦和分布式部署中,通过迭代参数混合(IPM风格平均)训练的半异步客户端-服务器感知机,考虑了延迟更新、部分参与和通信噪声的影响,并提出了基于陈旧度的聚合规则。

详情
AI中文摘要

我们研究了一种通过迭代参数混合(IPM风格平均)进行训练的半异步客户端-服务器感知机:客户端运行本地感知机更新,服务器通过聚合每个通信轮次到达的更新来形成全局模型。该设置刻画了联邦和分布式部署中的三种系统效应:(i)由模型下发延迟和客户端计算结果应用延迟导致的陈旧更新(双侧版本滞后),(ii)部分参与(客户端间歇性可用),以及(iii)下行链路和上行链路上的不完美通信,建模为具有有界二阶矩的有效零均值加性噪声。我们引入了一种称为带填充的陈旧度桶聚合的服务器端聚合规则,该规则无需假设任何关于延迟或参与的随机模型,即可确定性地对更新年龄强制施加预定的陈旧度分布。在间隔可分性和有界数据半径条件下,我们证明了在给定服务器轮次数内累积加权感知机错误数的有限时域期望界:延迟的影响仅通过强制的平均陈旧度体现,而通信噪声贡献了一个额外的项,其增长量级为时域长度的平方根,并与总噪声能量有关。在无噪声情况下,我们展示了有限的期望错误预算如何在温和的新鲜参与条件下给出显式的有限轮次稳定界。

英文摘要

We study a semi-asynchronous client-server perceptron trained via iterative parameter mixing (IPM-style averaging): clients run local perceptron updates and a server forms a global model by aggregating the updates that arrive in each communication round. The setting captures three system effects in federated and distributed deployments: (i) stale updates due to delayed model delivery and delayed application of client computations (two-sided version lag), (ii) partial participation (intermittent client availability), and (iii) imperfect communication on both downlink and uplink, modeled as effective zero-mean additive noise with bounded second moment. We introduce a server-side aggregation rule called staleness-bucket aggregation with padding that deterministically enforces a prescribed staleness profile over update ages without assuming any stochastic model for delays or participation. Under margin separability and bounded data radius, we prove a finite-horizon expected bound on the cumulative weighted number of perceptron mistakes over a given number of server rounds: the impact of delay appears only through the mean enforced staleness, whereas communication noise contributes an additional term that grows on the order of the square root of the horizon with the total noise energy. In the noiseless case, we show how a finite expected mistake budget yields an explicit finite-round stabilization bound under a mild fresh-participation condition.
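The baseline that the paper builds on, iterative parameter mixing, can be sketched in a few lines. This is a simplified synchronous version with uniform averaging and full participation: it deliberately omits staleness, partial participation, communication noise, and the staleness-bucket rule with padding, whose details the abstract does not give. The toy shards and function names are illustrative assumptions.

```python
def local_perceptron(w, shard, epochs=1):
    """Run perceptron updates on one client's shard, starting from the server model w."""
    w = list(w)
    for _ in range(epochs):
        for x, y in shard:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # mistake: move w toward y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def ipm_train(shards, dim, rounds=10):
    """Synchronous iterative parameter mixing with uniform averaging."""
    w = [0.0] * dim
    for _ in range(rounds):
        client_models = [local_perceptron(w, shard) for shard in shards]
        w = [sum(m[j] for m in client_models) / len(client_models) for j in range(dim)]
    return w


def count_mistakes(w, shards):
    """Number of examples across all shards that w misclassifies."""
    return sum(
        1
        for shard in shards
        for x, y in shard
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0
    )


# Two clients holding linearly separable toy data (labels follow the sign of x0 + x1).
shards = [
    [((2.0, 1.0), 1), ((-1.0, -2.0), -1)],
    [((1.0, 2.0), 1), ((-2.0, -1.0), -1)],
]
w = ipm_train(shards, dim=2)
```

On this toy data each client makes one mistake in the first round, the server averages the two local models to `[1.5, 1.5]`, and no further mistakes occur in later rounds.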

2601.08013 2026-05-19 cs.LG 版本更新

Beyond the Next Port: A Multi-Task Transformer for Forecasting Future Voyage Segment Durations

超越下一个港口:一种用于预测未来航行段持续时间的多任务Transformer

Nairui Liu, Fang He, Xindi Tang, Yineng Wang

发表机构 * Department of Industrial Engineering, Tsinghua University, Beijing 100084, P.R. China(清华大学工业工程系) School of Management Science and Engineering, Central University of Finance and Economics, Beijing 100081, P.R. China(中央财经大学管理科学与工程学院) Department of Logistics and Maritime Studies, The Hong Kong Polytechnic University, Hong Kong, P.R. China(香港理工大学物流与海运研究部)

AI总结 本文提出一种多任务Transformer模型,用于预测未来航行段持续时间,通过整合历史航行时间、目的地港口拥堵代理和静态船舶描述符,提升港口操作的可靠性。

详情
AI中文摘要

准确预测航段级航行持续时间对提高海运班期可靠性和优化长期港口运营至关重要。然而,传统的预计到达时间(ETA)模型主要针对紧接的下一个挂靠港,并严重依赖实时船舶自动识别系统(AIS)数据,而这类数据对未来航段天然不可得。为此,本研究将未来港口ETA预测重新表述为航段级时间序列预测问题。我们开发了一种基于Transformer的架构,整合了历史航行持续时间、目的地港口拥堵代理指标和静态船舶描述符。所提出的框架采用因果掩码注意力机制以捕捉长程时间依赖,并利用多任务学习头联合预测航段航行持续时间和港口拥堵状态,通过共享潜在信号来缓解高不确定性。在2021年真实世界全球数据集上的评估表明,所提模型始终优于一整套有竞争力的基线模型。与序列深度学习模型相比,平均绝对误差(MAE)相对降低4.70%,平均绝对百分比误差(MAPE)相对降低4.95%,均方根误差(RMSE)相对降低2.59%。与梯度提升机相比,MAE相对降低7.03%,MAPE相对降低39.49%,RMSE相对降低4.37%。对一个主要目的地港口的案例研究进一步展示了模型更高的精度。

英文摘要

Accurate forecasts of segment-level sailing durations are fundamental to enhancing maritime schedule reliability and optimizing long-term port operations. However, conventional estimated time of arrival (ETA) models are primarily designed for the immediate next port of call and rely heavily on real-time automatic identification system (AIS) data, which is inherently unavailable for future voyage segments. To address this gap, the study reformulates future-port ETA prediction as a segment-level time-series forecasting problem. We develop a transformer-based architecture that integrates historical sailing durations, destination port congestion proxies, and static vessel descriptors. The proposed framework employs a causally masked attention mechanism to capture long-range temporal dependencies and a multi-task learning head to jointly predict segment sailing durations and port congestion states, leveraging shared latent signals to mitigate high uncertainty. Evaluation on a real-world global dataset from 2021 demonstrates that the proposed model consistently outperforms a comprehensive suite of competitive baselines. The results show relative reductions of 4.70% in mean absolute error (MAE), 4.95% in mean absolute percentage error (MAPE), and 2.59% in root mean squared error (RMSE) compared with sequential deep learning models. The relative reductions compared with gradient boosting machines are 7.03% in MAE, 39.49% in MAPE, and 4.37% in RMSE. The case study conducted on one major destination port further illustrates the model's superior accuracy.
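The three error metrics and the "relative reduction" figures quoted above follow standard definitions, sketched below. The helper names are hypothetical, and the abstract does not say whether the paper averages per segment or per voyage, so this only shows the arithmetic.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

Because MAPE divides by the true duration, short segments weigh heavily in it, which is one plausible reason a model can show a much larger relative gain in MAPE than in MAE or RMSE.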

2601.06633 2026-05-19 cs.LG cs.AI cs.CL cs.CY 版本更新

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER:面向开放性编程任务的知识对齐学生错误模拟器

Zhangqi Duan, Nigel Fernandez, Andrew Lan

发表机构 * University of Massachusetts(马萨诸塞大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 KASER通过强化学习方法,结合代码相似性、错误匹配和预测多样性,提升大语言模型对学生错误的模拟与预测能力,实验表明其在代码和错误预测及错误覆盖方面优于基线方法。

Comments Published in ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

开放式任务,例如计算机科学教育中常见的编程问题,能够提供关于学生知识的细致洞察。然而,训练大语言模型(LLMs)来模拟和预测学生在作答这些问题时可能出现的错误颇具挑战:模型常出现模式崩溃,无法充分捕捉学生作答在语法、风格和解题思路上的多样性。在本文中,我们提出了KASER(知识对齐学生错误模拟器),一种将错误与学生知识对齐的新方法。我们提出了一种基于强化学习的训练方法,其混合奖励反映学生代码预测的三个方面:i)与真实代码(ground truth)的相似性,ii)错误匹配,以及iii)代码预测的多样性。在两个真实世界数据集上,我们进行了两个层面的评估,结果表明:在单个学生-问题对层面,我们的方法在代码和错误预测上优于基线;在单个问题层面,我们的方法在错误覆盖率和模拟代码多样性上优于基线。

英文摘要

Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity.

2601.06009 2026-05-19 stat.ML cs.LG eess.SP math.PR stat.AP 版本更新

Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem

通过非参数逃逸定理检测离散信号中的随机性

Sunia Tanweer, Firas A. Khasawneh

发表机构 * Dept. of Mechanical Engineering, Michigan State University(密歇根州立大学机械工程系) Dept. of Computational Mathematics, Science and Engineering, Michigan State University(密歇根州立大学计算数学、科学与工程系)

AI总结 本文提出一种基于连续半鞅逃逸和穿越定理的非参数方法,通过比较实测逃逸次数与理论期望比值,区分扩散过程与确定性信号,不依赖参数模型。

详情
AI中文摘要

我们开发了一个实用框架,仅使用单个离散时间序列区分扩散随机过程与确定性信号。该方法基于连续半鞅的经典逃逸和穿越定理,将逃逸次数$N_\varepsilon$与过程的二次变分$[X]_T$相关联。该标度定律适用于所有具有有限二次变分的连续半鞅,包括具有非线性或状态依赖波动率的一般伊藤扩散过程,但对确定性系统失效,从而提供了一种理论认证的方法来区分这些动态,而非基于主观熵或复发的最新方法。我们构建了一个稳健的数据驱动扩散测试,该方法将实测逃逸次数与理论期望进行比较。所得比值$K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$通过log-log斜率偏差总结,测量$\varepsilon^{-2}$定律,从而分类为扩散样或非扩散样。我们在经典随机系统、某些周期性和混沌映射及加性白噪声系统,以及随机杜芬系统上展示了该方法。该方法是非参数、无模型的,仅依赖于连续半鞅的小尺度结构。

英文摘要

We develop a practical framework for distinguishing diffusive stochastic processes from deterministic signals using only a single discrete time series. Our approach is based on classical excursion and crossing theorems for continuous semimartingales, which relate the number $N_\varepsilon$ of excursions of magnitude at least $\varepsilon$ to the quadratic variation $[X]_T$ of the process. The scaling law holds universally for all continuous semimartingales with finite quadratic variation, including general Itô diffusions with nonlinear or state-dependent volatility, but fails sharply for deterministic systems, thereby providing a theoretically certified method of distinguishing between these dynamics, as opposed to the subjective entropy- or recurrence-based state-of-the-art methods. We construct a robust data-driven diffusion test that compares the empirical excursion counts against the theoretical expectation. The resulting ratio $K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$ is then summarized by a log-log slope deviation measuring the $\varepsilon^{-2}$ law, which yields a classification as diffusion-like or not. We demonstrate the method on canonical stochastic systems, some periodic and chaotic maps and systems with additive white noise, as well as the stochastic Duffing system. The approach is nonparametric, model-free, and relies only on the universal small-scale structure of continuous semimartingales.
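The $\varepsilon^{-2}$ scaling can be checked numerically with a small sketch. The counting convention here (successive displacements of at least $\varepsilon$ from the last reference point) is an assumption chosen for simplicity; the paper's exact excursion definition and its test statistic may differ. For Brownian motion the expected time to move a distance $\varepsilon$ is $\varepsilon^2/\sigma^2$, so the count scales like $\varepsilon^{-2}$, whereas a smooth signal gives a count proportional to $\varepsilon^{-1}$.

```python
import math
import random


def count_moves(series, eps):
    """Count successive displacements of size >= eps from the last reference point."""
    count, ref = 0, series[0]
    for x in series[1:]:
        if abs(x - ref) >= eps:
            count += 1
            ref = x
    return count


def loglog_slope(series, eps_values):
    """Least-squares slope of log N_eps against log eps."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(count_moves(series, e)) for e in eps_values]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def random_walk(n, step_std, seed=0):
    """Discretized Brownian-like path with Gaussian increments."""
    rng = random.Random(seed)
    path, x = [0.0], 0.0
    for _ in range(n):
        x += rng.gauss(0.0, step_std)
        path.append(x)
    return path


def sine_wave(n, periods):
    """Deterministic smooth signal with the given number of periods."""
    return [math.sin(2.0 * math.pi * periods * i / n) for i in range(n + 1)]


eps_values = [0.1, 0.2, 0.4]
slope_walk = loglog_slope(random_walk(200_000, 0.01), eps_values)
slope_sine = loglog_slope(sine_wave(200_000, 50), eps_values)
```

The random walk should give a slope near -2 and the sine wave a slope near -1, so the gap between the two slopes is what separates diffusion-like from deterministic behaviour in this toy setting.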

2601.05679 2026-05-19 cs.LG 版本更新

Do Sparse Autoencoders Identify Reasoning Features in Language Models?

稀疏自编码器是否在语言模型中识别推理特征?

George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi

发表机构 * UC Berkeley(加州大学伯克利分校) UCSF(加州大学旧金山分校)

AI总结 研究稀疏自编码器在大语言模型中识别推理相关内部特征的可靠性,提出基于因果token注入的评估框架,发现许多候选特征对token级干预敏感,需通过反证法验证其推理相关性。

Comments In Forty-Third International Conference on Machine Learning (2026)

详情
AI中文摘要

我们研究稀疏自编码器(SAEs)如何可靠地支持关于大语言模型中推理相关内部特征的主张。我们首先进行简化分析,表明稀疏正则化解码可以优先保留稳定的低维相关性,同时抑制高维行为内变异性,这促使我们考虑对比选择的'推理'特征可能在推理痕迹耦合时集中在提示结构上。基于此视角,我们提出一种基于反证的评估框架,结合因果token注入与LLM引导的反例构造。在22种配置中,涵盖多个模型家族、层和推理数据集,我们发现许多对比选择的候选特征对token级干预高度敏感,45%-90%在注入少量相关token到非推理文本后激活。对于剩余的上下文依赖候选特征,LLM引导的反证会产生触发激活的非推理输入,并生成保留意义的改写,以抑制激活。小规模引导研究在评估基准上产生最小变化。总体而言,我们的结果表明,在我们研究的设置中,稀疏分解可能倾向于与推理共现的低维相关性,强调在将高层行为归因于个别SAE特征时需要反证。代码可在https://github.com/GeorgeMLP/reasoning-probing获取。

英文摘要

We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensional correlates while suppressing high-dimensional within-behavior variation, motivating the possibility that contrastively selected "reasoning" features may concentrate on cue-like structure when such cues are coupled with reasoning traces. Building on this perspective, we propose a falsification-based evaluation framework that combines causal token injection with LLM-guided counterexample construction. Across 22 configurations spanning multiple model families, layers, and reasoning datasets, we find that many contrastively selected candidates are highly sensitive to token-level interventions, with 45%-90% activating after injecting only a few associated tokens into non-reasoning text. For the remaining context-dependent candidates, LLM-guided falsification produces targeted non-reasoning inputs that trigger activation and meaning-preserving paraphrases of top-activating reasoning traces that suppress it. A small steering study yields minimal changes on the evaluated benchmarks. Overall, our results suggest that, in the settings we study, sparse decompositions can favor low-dimensional correlates that co-occur with reasoning, underscoring the need for falsification when attributing high-level behaviors to individual SAE features. Code is available at https://github.com/GeorgeMLP/reasoning-probing.

2601.05527 2026-05-19 cs.LG cs.AI 版本更新

DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis

DeMa:双路径延迟感知Mamba用于高效多变量时间序列分析

Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 DeMa通过双路径架构改进Mamba,解决多变量时间序列分析中的延迟建模、跨变量依赖和时间动态分离问题,实现高效且准确的分析。

Comments The article has been accepted by Frontiers of Computer Science (FCS), DOI: 10.1007/s11704-026-52221-6

详情
AI中文摘要

准确且高效的多变量时间序列(MTS)分析对广泛智能应用越来越关键。在这一领域,Transformer因其强大的捕捉成对依赖能力而成为主导架构。然而,基于Transformer的模型存在二次计算复杂度和高内存开销,限制了其在长期和大规模MTS建模中的可扩展性和实用性。最近,Mamba作为一种线性时间替代方案出现,具有高表达能力。然而,直接应用原始Mamba到MTS仍不理想,因为存在三个关键限制:(i)缺乏显式的跨变量建模,(ii)难以分离纠缠的系列内时间动态和系列间交互,(iii)对潜在时间滞后交互效应的建模不足。这些问题限制了其在多样MTS任务中的有效性。为了解决这些挑战,我们提出了DeMa,一种双路径延迟感知Mamba骨干网络。DeMa保留了Mamba的线性复杂度优势,同时显著提高了其在MTS设置中的适用性。具体而言,DeMa引入了三个关键创新:(i)它将MTS分解为系列内时间动态和系列间交互;(ii)它开发了一个时间路径,包含Mamba-SSD模块,以捕捉每个单独系列内的长程动态,实现系列无关的并行计算;(iii)它设计了一个变量路径,包含Mamba-DALA模块,通过延迟感知线性注意力模块来建模跨变量依赖。在五个代表性任务(长期和短期预测、数据插补、异常检测和系列分类)上的广泛实验表明,DeMa在达到最先进性能的同时,还实现了显著的计算效率。

英文摘要

Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.

2601.02353 2026-05-19 cs.CV cs.LG 版本更新

Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices

元学习引导的剪枝用于边缘设备上的少样本植物病理学

Mohammed Mudassir Uddin, Shahnawaz Alam, Mohammed Kaif Pasha, Dr Tasneem Bano Rehman, Dr Fahmina Taranum, Afroze Begum

发表机构 * Department of CSE, Muffakham Jah College of Engineering and Technology (MJCET)(计算机科学与工程系,穆法卡姆·贾赫工程与技术学院(MJCET))

AI总结 本文提出DACIS方法,结合神经网络剪枝与少样本学习,实现边缘设备上高效植物疾病识别,实验表明模型大小减小78%且保持92.3%的精度。

详情
AI中文摘要

远程地区农民需要快速可靠的植物疾病识别方法,但通常缺乏实验室或高性能计算资源。深度学习模型可通过叶片图像检测疾病,但模型通常过大且计算成本高,难以在低成本边缘设备如Raspberry Pi上运行。此外,收集数千张标记的疾病图像进行训练既昂贵又耗时。本文通过结合神经网络剪枝和少样本学习解决这两个挑战。本文提出Disease-Aware Channel Importance Scoring (DACIS),一种识别神经网络中区分不同植物疾病关键部分的方法,集成到三阶段Prune-then-Meta-Learn-then-Prune (PMP)流程中。在PlantVillage和PlantDoc数据集上的实验表明,所提出的方法将模型大小减少78%,同时保持92.3%的原始精度,压缩后的模型在Raspberry Pi 4上以每秒7帧的速度运行,使小农户农民的实时田间诊断成为可能。

英文摘要

Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning, removing unnecessary parts of the model, with few-shot learning, which enables the model to learn from limited examples. This paper proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.

2512.23978 2026-05-19 cs.LG math.OC stat.ML 版本更新

Assured autonomy: How operations research powers and orchestrates generative AI systems

保障自主性:如何用运筹学赋能和协调生成式AI系统

Tinglong Dai, David Simchi-Levi, Michelle Xiao Wu, Yao Xie

发表机构 * Carey Business School, Johns Hopkins University(约翰霍普金斯大学卡里商学院) Data Science and AI Institute, Johns Hopkins University(约翰霍普金斯大学数据科学与人工智能研究院) Institute for Data, Systems and Society, Operations Research Center, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology(麻省理工学院数据、系统与社会研究所,运筹学中心,土木与环境工程系) Purdue University(普渡大学) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院H.米尔顿·斯图尔特工业与系统工程学院)

AI总结 本文探讨生成式AI在向自主决策系统转变过程中,如何通过运筹学方法提升系统的可行性、鲁棒性和风险控制能力。

Comments Authors are listed alphabetically; Production and Operations Management (POM), 2026

详情
AI中文摘要

生成式人工智能(GenAI)正从对话助手转向代理系统——能够在操作流程中感知、决策和行动的自主决策系统。这种转变带来了自主性悖论:随着GenAI系统获得更大的操作自主权,它们应通过设计体现更正式的结构、更明确的约束和更强的风险控制。我们论证,除非生成模型与提供可验证可行性、对抗鲁棒性和高后果场景下的压力测试机制相结合,否则随机生成模型在操作领域可能脆弱。为此,我们开发了一个以运筹学(OR)为基础的保障自主性框架,基于两种互补方法。首先,基于流的生成模型将生成过程框架为确定性传输,由常微分方程描述,从而实现可审计性、约束感知生成以及与最优传输、鲁棒优化和顺序决策控制的联系。其次,通过对抗鲁棒性视角制定操作安全性:决策规则在不确定性或模糊集内评估最坏扰动,使未建模风险成为设计的一部分。该框架阐明了增加自主性如何使OR的角色从求解器转变为护栏到系统架构师,负责控制逻辑、激励协议、监控制度和安全边界。这些元素定义了在安全关键、可靠性敏感的操作领域中保障自主性的研究议程。

英文摘要

Generative artificial intelligence (GenAI) is shifting from conversational assistants toward agentic systems -- autonomous decision-making systems that sense, decide, and act within operational workflows. This shift creates an autonomy paradox: as GenAI systems are granted greater operational autonomy, they should, by design, embody more formal structure, more explicit constraints, and stronger tail-risk discipline. We argue that stochastic generative models can be fragile in operational domains unless paired with mechanisms that provide verifiable feasibility, robustness to distribution shift, and stress testing under high-consequence scenarios. To address this challenge, we develop a conceptual framework for assured autonomy grounded in operations research (OR), built on two complementary approaches. First, flow-based generative models frame generation as deterministic transport characterized by an ordinary differential equation, enabling auditability, constraint-aware generation, and connections to optimal transport, robust optimization, and sequential decision control. Second, operational safety is formulated through an adversarial robustness lens: decision rules are evaluated against worst-case perturbations within uncertainty or ambiguity sets, making unmodeled risks part of the design. This framework clarifies how increasing autonomy shifts OR's role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.

2512.23752 2026-05-19 cs.LG cs.AI updated version

Geometric Scaling of Bayesian Inference in LLMs

贝叶斯推断在大语言模型中的几何特性

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Institutions * Columbia University; Columbia University School of Professional Studies; Department of Statistics; Columbia University Department of Computer Science

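AI summary The study finds that production-grade large language models retain a geometric structure that encodes posterior structure; intervention experiments indicate that this geometry is a privileged readout of uncertainty rather than a single computational bottleneck.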

Comments v2: Extend cross-architecture analysis with Qwen2.5 and DeepSeek (MLA) families; add SULA and RoPE-channel results; document MLA boundary case (DeepSeek-V2-Lite: substrate preserved, dynamic routing absent); add dual-entropy framework at scale; fix duplicate bibliography entries

Details
AI Chinese abstract

近期研究表明,经过受控

English abstract

Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

2512.23070 2026-05-19 cs.LG updated version

FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment for Edge Computing

FLEX-MoE:用于边缘计算的联邦混合专家模型,具有负载平衡的专家分配

Boyang Zhang, Xiaobing Chen, Songyang Zhang, Shuai Zhang, Xiangwei Zhou, Jian Zhang, Mingxuan Sun

Institutions * Louisiana State University; University of Louisiana at Lafayette; New Jersey Institute of Technology

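AI summary This paper proposes FLEX-MoE, a federated Mixture-of-Experts framework for edge computing that jointly optimizes expert assignment and load balancing under limited client capacity.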

Details
AI Chinese abstract

混合专家(MoE)模型通过条件计算实现可扩展的神经网络,为下一代无线通信提供增强的有效性和效率。然而,在无线和物联网边缘网络上使用联邦学习(FL)部署MoE面临两个关键挑战:1)资源受限的客户端无法存储具有完整专家集的大型AI模型;2)非独立同分布(non-IID)数据分布导致严重的专家负载不平衡,从而降低模型性能。为此,我们提出了FLEX-MoE,一种联邦MoE框架,该框架在有限的客户端容量下联合优化专家分配和负载平衡。具体而言,我们的方法引入了客户端-专家适应度分数,通过训练反馈量化专家对本地数据集的适用性,并采用基于优化的算法来最大化客户端-专家专业化,同时确保系统范围内的负载平衡。与仅关注个性化而忽略负载不平衡的贪心方法不同,FLEX-MoE解决了专家利用偏斜问题,这在异构边缘FL中尤为严重。我们的实验结果表明,在各种资源受限场景下,FLEX-MoE在准确性和一致的负载平衡方面均表现优异。

English abstract

Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation, offering enhanced effectiveness and efficiency for next-generation wireless communications. However, deploying MoE with federated learning (FL) over wireless and IoT edge networks faces two critical challenges: 1) resource-constrained clients cannot store large AI models with full expert sets, and 2) non-IID data distributions cause severe expert load imbalance that degrades model performance. To this end, we propose FLEX-MoE, a federated MoE framework that jointly optimizes expert assignment and load balancing under limited client capacity. Specifically, our approach introduces client-expert fitness scores that quantify expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide. Unlike greedy methods that focus solely on personalization while ignoring load imbalance, FLEX-MoE addresses expert utilization skew, which is particularly severe in heterogeneous edge FL. Our experimental results demonstrate superior accuracy and consistently balanced expert utilization across diverse resource-constrained scenarios for edge computing.
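
The tension between specialization and load balance described in this abstract can be illustrated with a small sketch. It is not the optimization-based algorithm of the paper, which the abstract does not specify. It is a simple greedy heuristic, and the names `fitness`, `capacity`, and `max_load` are assumptions chosen for illustration: high-fitness client-expert pairs are picked first, while a cap limits how many clients may hold any single expert.

```python
import numpy as np

def balanced_assignment(fitness, capacity, max_load):
    """Greedy client-expert assignment with a per-expert load cap.

    fitness[c, e] -- hypothetical fitness score of expert e for client c
    capacity      -- number of experts each client can store
    max_load      -- maximum number of clients that may hold one expert
    """
    n_clients, n_experts = fitness.shape
    assigned = [set() for _ in range(n_clients)]
    load = np.zeros(n_experts, dtype=int)
    # Visit (client, expert) pairs from highest to lowest fitness.
    order = np.argsort(-fitness, axis=None)
    for flat in order:
        c, e = np.unravel_index(flat, fitness.shape)
        if len(assigned[c]) < capacity and load[e] < max_load:
            assigned[c].add(int(e))
            load[e] += 1
    return assigned, load

rng = np.random.default_rng(0)
fitness = rng.random((4, 4))          # illustrative scores, not real data
assigned, load = balanced_assignment(fitness, capacity=2, max_load=3)
```

Setting `max_load` to the number of clients removes the cap and recovers a purely personalization-driven choice, which is the kind of greedy baseline the abstract contrasts against.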

2512.22473 2026-05-19 stat.ML cs.AI cs.LG updated version

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

注意力的梯度动力学:交叉熵如何塑造贝叶斯流形

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Institutions * Columbia University; Columbia University School of Professional Studies; Department of Statistics; Columbia University Department of Computer Science

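AI summary By analyzing how cross-entropy training reshapes attention scores and value vectors in a transformer, the paper derives an advantage-based routing law for attention scores and a responsibility-weighted update for values, and shows how gradient dynamics sculpt Bayesian manifolds that support probabilistic reasoning.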

Comments v2: Add dual-entropy connection - advantage signal drives ρ down; fix duplicate bibliography entries (synced from Paper I)

Details
AI Chinese abstract

Transformer在精心构建的『贝叶斯风洞』和大规模语言模型中表现出精确的概率推理能力,但梯度学习如何创建所需的内部几何仍不清楚。本文提供了一种完整的首次级分析,揭示了交叉熵训练如何重塑Transformer注意力头中的注意力评分和值向量。核心结果是注意力评分的『优势路由定律』,以及值的『职责加权更新』。这些方程诱导出正反馈循环,使路由和内容共同专业化:查询更强烈地路由到误差信号高于平均的值,而这些值被拉向使用它们的查询。本文展示了这种耦合专业化行为类似于两时间尺度EM过程:注意力权重实现E步(软责任),而值实现M步(责任加权原型更新),查询和键调整假设框架。通过受控模拟,包括一个粘性马尔可夫链任务,比较了闭合形式EM式更新与标准SGD,证明了相同的梯度动力学在最小化交叉熵的同时,塑造了本文配套工作所识别的低维流形,这些流形实现了贝叶斯推理。这给出了一个统一的画面:优化(梯度流)导致几何(贝叶斯流形),后者又支持功能(上下文概率推理)。

English abstract

Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = α_{ij}\bigl(b_{ij}-\mathbb{E}_{α_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a responsibility-weighted update for values, \[ Δv_j = -η\sum_i α_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $α_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
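
The two formulas in this abstract can be checked numerically. The sketch below assumes a single attention head without masking, with outputs o_i = sum_j α_ij v_j and a linear surrogate loss whose gradient with respect to o_i is exactly u_i. These modelling choices and the function names are assumptions made for the check, not details taken from the paper. It compares the analytic score gradient against central finite differences.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def score_gradient(S, V, U):
    """Analytic dL/dS for outputs O = softmax(S) @ V with upstream grads U = dL/dO."""
    A = softmax(S)                                   # alpha_ij
    B = U @ V.T                                      # b_ij = u_i^T v_j
    baseline = (A * B).sum(axis=1, keepdims=True)    # E_{alpha_i}[b]
    return A * (B - baseline)

def value_update(S, U, lr):
    """Delta v_j = -lr * sum_i alpha_ij u_i (one gradient step on the values)."""
    return -lr * softmax(S).T @ U

rng = np.random.default_rng(0)
n, d = 5, 3
S = rng.normal(size=(n, n))
V = rng.normal(size=(n, d))
U = rng.normal(size=(n, d))

def loss(S_):
    # Linear surrogate loss whose gradient w.r.t. the output is exactly U.
    return float((U * (softmax(S_) @ V)).sum())

# Central finite differences as an independent check of the routing law.
eps = 1e-6
numeric = np.zeros_like(S)
for i in range(n):
    for j in range(n):
        step = np.zeros_like(S)
        step[i, j] = eps
        numeric[i, j] = (loss(S + step) - loss(S - step)) / (2 * eps)

analytic = score_gradient(S, V, U)
delta_V = value_update(S, U, lr=0.1)
```

Each row of `analytic` sums to zero, which reflects the baseline subtraction: the score gradient only redistributes attention mass within a query according to whether b_ij lies above or below its attention-weighted mean.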

2512.22471 2026-05-19 cs.LG cs.AI stat.ML updated version

The Bayesian Geometry of Transformer Attention

Transformer 注意力的贝叶斯几何

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Institutions * Columbia University; Columbia University School of Professional Studies; Department of Statistics; Columbia University Department of Computer Science

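AI summary By constructing Bayesian wind tunnels, the paper verifies that transformers perform Bayesian inference in context, finds that they implement posterior updates and routing through a geometric mechanism, and explains both the necessity of attention and the shortcomings of flat architectures.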

Comments v2: Add dual-entropy measurement framework (H_I, H_P, ρ = H_P/H_I); incorporate Overleaf revisions; fix duplicate bibliography entries (akyurek mashup; openai title; legacy aliases removed)

Details
AI Chinese abstract

Transformer 似乎在上下文中表现出贝叶斯推理,但严格验证一直困难:自然数据缺乏解析后验,大模型将推理与记忆混淆。我们通过构建贝叶斯风道——可控环境,其中真实后验以闭合形式给出,记忆可证明不可能。在这些设置中,小型Transformer以10^-3-10^-4 bit精度再现贝叶斯后验,而容量匹配的MLP则相差多个数量级,确立了明确的架构分离。在两个任务——双射消除和隐马尔可夫模型(HMM)状态跟踪中,发现Transformer通过一致的几何机制实现贝叶斯推理:残差流作为信念基质,前馈网络执行后验更新,注意力提供内容可寻址路由。几何诊断揭示正交键基、渐进查询-键对齐和由后验熵参数化的低维值流形。训练期间该流形展开而注意力模式保持稳定,这与最近的梯度分析预测的帧精度解离一致。这些结果表明,分层注意力通过几何设计实现贝叶斯推理,解释了注意力的必要性及扁平架构的失败。贝叶斯风道为机械连接小型可验证系统与大语言模型中推理现象提供了基础。

English abstract

Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing Bayesian wind tunnels -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a frame-precision dissociation predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
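
For the HMM state-tracking task, a closed-form posterior is available through the standard forward (filtering) recursion, and this is the kind of ground truth against which a model can be scored. The sketch below is a generic forward filter with an entropy helper in bits. The two-state transition and emission matrices are hypothetical toy values, not the configuration used in the paper.

```python
import numpy as np

def hmm_filter(T, E, prior, observations):
    """Exact forward filtering: posterior over hidden states after each observation.

    T[i, j] = P(z_t = j | z_{t-1} = i), E[i, o] = P(x_t = o | z_t = i).
    """
    belief = prior.copy()
    posteriors = []
    for t, o in enumerate(observations):
        if t > 0:
            belief = belief @ T          # predict step
        belief = belief * E[:, o]        # condition on the observation
        belief = belief / belief.sum()   # normalize
        posteriors.append(belief)
    return np.array(posteriors)

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability states."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical toy model with two hidden states and two symbols.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
prior = np.array([0.5, 0.5])
post = hmm_filter(T, E, prior, [0, 0, 1])
entropies = [entropy_bits(p) for p in post]
```

A model's predictive distribution can then be compared with `post` position by position, for example by the gap in entropy or by a KL divergence measured in bits.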

2512.15997 2026-05-19 cs.LG updated version

Higher-Order LaSDI: Reduced Order Modeling with Multiple Time Derivatives

高阶LaSDI:多时间导数的降阶建模

Robert Stephany, William Michael Anderson, Youngsoo Choi

Institutions * Center for Applied Scientific Computing; Lawrence Livermore National Laboratory

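AI summary This paper proposes Higher-Order LaSDI, which introduces a flexible high-order finite-difference scheme and a Rollout loss to improve the long-horizon predictive ability of reduced-order models, demonstrated on the 2D Burgers equation.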

Comments 38 pages, 14 figures

Details
Journal ref
Computer Methods in Applied Mechanics and Engineering 455 (2026) 118890
AI Chinese abstract

求解复杂偏微分方程在物理科学中至关重要,但通常需要计算成本高昂的数值方法。降阶模型(ROMs)通过利用降维来创建快速近似来解决这一问题。尽管现代ROMs能够求解参数化PDE家族,但其在长时间尺度上的预测能力会下降。我们通过(1)引入一种灵活、高阶但成本低廉的有限差分方案和(2)提出Rollout损失函数,训练ROMs在任意时间尺度上做出准确预测。我们在二维Burgers方程上展示了我们的方法。

English abstract

Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
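
To see why a higher-order finite-difference scheme matters when estimating time derivatives from sampled trajectories, the sketch below compares the standard second-order and fourth-order central stencils on a known signal. These are textbook stencils used for illustration. The abstract does not give the paper's actual scheme, and `sin(t)` merely stands in for one latent coordinate.

```python
import numpy as np

def central_diff_2(z, dt):
    """Second-order central difference on interior points of a trajectory z(t)."""
    return (z[2:] - z[:-2]) / (2 * dt)

def central_diff_4(z, dt):
    """Fourth-order central difference on interior points."""
    return (-z[4:] + 8 * z[3:-1] - 8 * z[1:-3] + z[:-4]) / (12 * dt)

dt = 0.01
t = np.arange(0.0, 1.0 + dt / 2, dt)
z = np.sin(t)                                   # stand-in for one latent coordinate
# Maximum error against the exact derivative cos(t) on the matching interior points.
err2 = np.abs(central_diff_2(z, dt) - np.cos(t[1:-1])).max()
err4 = np.abs(central_diff_4(z, dt) - np.cos(t[2:-2])).max()
```

With the same samples, the fourth-order stencil's error scales like dt^4 instead of dt^2, at the cost of two extra points per derivative estimate.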

2512.09269 2026-05-19 cs.LG cs.IR updated version

Goal inference with Rao-Blackwellized Particle Filters

基于 Rao-Blackwellized 粒子滤波器的目标推断

Yixuan Wang, Dan P. Guralnik, Warren E. Dixon

Institutions * Mechanical & Aerospace Engineering Department, University of Florida; Mathematics Department, Ohio University

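AI summary This paper uses a variant of the Rao-Blackwellized Particle Filter to infer the goal of a mobile agent, exploits an assumption on the agent's closed-loop behavior to improve sample efficiency, and introduces two estimators to assess how well an adversary can recover the agent's intent.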

Comments 6 pages, 3 figures. Accepted for presentation at the 23rd IFAC World Congress 2026, Busan, Republic of Korea, August 23-28, 2026. To appear in IFAC-PapersOnLine

Details
AI Chinese abstract

从噪声观测中推断移动智能体最终目标是基本的估计问题。本文首次使用 Rao-Blackwellized 粒子滤波器(RBPF)变体研究意图推断,假设智能体意图通过具有已证明实用稳定性性质的闭环行为表现。利用假设的闭式智能体动力学,RBPF 分析性地边缘化线性高斯子结构,并仅更新粒子权重,提升样本效率。引入两种差分估计器:基于 RBPF 权重的高斯混合模型和限制混合到有效样本的简化版本。通过信息论泄漏度量量化对抗者恢复意图的能力,并通过高斯混合 KL 界提供可计算的 KL 散度下界。还提供两种估计器性能差异的界,表明简化估计器几乎与完整估计器一样有效。实验展示了对合规智能体的快速准确意图恢复,激励未来设计意图混淆控制器的研究。

English abstract

Inferring the eventual goal of a mobile agent from noisy observations of its trajectory is a fundamental estimation problem. We initiate the study of such intent inference using a variant of a Rao-Blackwellized Particle Filter (RBPF), subject to the assumption that the agent's intent manifests through closed-loop behavior with a state-of-the-art provable practical stability property. Leveraging the assumed closed-form agent dynamics, the RBPF analytically marginalizes the linear-Gaussian substructure and updates particle weights only, improving sample efficiency over a standard particle filter. Two difference estimators are introduced: a Gaussian mixture model using the RBPF weights and a reduced version confining the mixture to the effective sample. We quantify how well the adversary can recover the agent's intent using information-theoretic leakage metrics and provide computable lower bounds on the Kullback-Leibler (KL) divergence between the true intent distribution and RBPF estimates via Gaussian-mixture KL bounds. We also provide a bound on the difference in performance between the two estimators, highlighting the fact that the reduced estimator performs almost as well as the complete one. Experiments illustrate fast and accurate intent recovery for compliant agents, motivating future work on designing intent-obfuscating controllers.
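
The idea of updating particle weights only, with the dynamics handled in closed form, can be sketched in a much simplified setting. The code below is not the RBPF variant of the paper. It is a plain Bayesian filter over a fixed, hypothetical set of candidate goals, assuming a proportional closed-loop model x_next = x + gain * (goal - x) + Gaussian noise. All numerical values are illustrative.

```python
import numpy as np

def update_log_weights(log_w, x_prev, x_next, goals, gain, sigma):
    """Weights-only update: each particle is a goal hypothesis g, and the
    assumed closed-loop model is x_next = x_prev + gain * (g - x_prev) + noise."""
    pred = x_prev + gain * (goals - x_prev)             # one prediction per goal
    sq_err = ((x_next - pred) ** 2).sum(axis=1)
    log_w = log_w - sq_err / (2 * sigma ** 2)           # Gaussian log-likelihood
    log_w = log_w - np.logaddexp.reduce(log_w)          # normalize in log space
    return log_w

rng = np.random.default_rng(0)
goals = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])  # hypothetical candidates
true_goal, gain, sigma = goals[1], 0.2, 0.05
x = np.array([2.0, -1.0])
log_w = np.full(len(goals), -np.log(len(goals)))         # uniform prior

for _ in range(30):
    x_next = x + gain * (true_goal - x) + sigma * rng.normal(size=2)
    log_w = update_log_weights(log_w, x, x_next, goals, gain, sigma)
    x = x_next

weights = np.exp(log_w)
ess = 1.0 / (weights ** 2).sum()                         # effective sample size
```

The effective sample size `ess` indicates how many hypotheses still carry meaningful weight. A reduced estimator in the spirit of the abstract would restrict the mixture to those hypotheses.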

2511.08154 2026-05-19 hep-ph cs.LG hep-th updated version

Good flavor search in SU(5): a machine learning approach

在SU(5)中寻找良好风味:一种机器学习方法

Fayez Abu-Ajamieh, Shinsuke Kawai, Nobuchika Okada

Institutions * Formerly Center for High Energy Physics, Indian Institute of Science, Bangalore 560012, Karnataka, India; Faculty of Science, Yamagata University, 1-4-12 Kojirakawa-machi, Yamagata, 990-8560 Japan; Department of Physics and Astronomy, University of Alabama, Tuscaloosa, Alabama, AL35487 USA

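AI summary This paper revisits the fermion mass problem in the SU(5) grand unified theory using machine learning; comparing the "beauty" of different remedies, defined as proximity to the original Georgi-Glashow model, it finds that the model with the 24-dimensional field interaction is closer to the original.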

Comments 15 pages, 9 figures, version to be published

Details
AI Chinese abstract

我们重新审视SU(5)大统一理论中的费米子质量问题,使用机器学习技术。原始SU(5)模型由Georgi和Glashow提出,与观测到的费米子质量谱不兼容。已知有两种解决办法:一种是通过引入45维场的新相互作用,另一种是通过24维场。我们研究哪种修正更“美丽”,将美丽定义为接近原始Georgi-Glashow SU(5)模型的程度。分析显示,在超对称和非超对称情况下,包含24维场相互作用的模型在这一标准下更美丽。我们通过引入连续参数y,将这些模型一般化,其中y=3对应45维场,y=1.5对应24维场。数值优化显示,y≈0.8能最接近原始SU(5)模型,表明此值对应根据我们定义的最美丽模型。

English abstract

We revisit the fermion mass problem of the $SU(5)$ grand unified theory using machine learning techniques. The original $SU(5)$ model proposed by Georgi and Glashow is incompatible with the observed fermion mass spectrum. Two remedies are known to resolve this discrepancy, one is through introducing a new interaction via a 45-dimensional field, and the other via a 24-dimensional field. We investigate which modification is more beautiful, defining the beauty as proximity to the original Georgi-Glashow $SU(5)$ model. Our analysis shows that, in both supersymmetric and non-supersymmetric scenarios, the model incorporating the interaction with the 24-dimensional field is more beautiful under this criterion. We then generalise these models by introducing a continuous parameter $y$, which takes the value 3 for the 45-dimensional field and 1.5 for the 24-dimensional field. Numerical optimisation reveals that $y \approx 0.8$ yields the closest match to the original $SU(5)$ model, indicating that this value corresponds to the most beautiful model according to our definition.

2511.07329 2026-05-19 cs.LG cs.CV updated version

Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis

基于分形的计算架构制备用于高级大语言模型分析

Yash Mittal, Dmitry Ignatov, Radu Timofte

Institutions * Computer Vision Lab, CAIDAS, University of Würzburg

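AI summary This paper proposes the FractalNet framework, which automatically generates and evaluates convolutional neural network architectures from recursive template patterns for efficient and stable architecture exploration; the fractal architectures reach a peak accuracy of 80.18% after five training epochs.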

Details
AI Chinese abstract

本文提出FractalNet,一种基于分形设计原理的框架,通过递归模板模式自动生成并评估卷积神经网络(CNN)架构。该框架通过递归分形模板系统地变化关键参数如分形深度、列宽和层配置,而非依赖计算成本高的神经架构搜索(NAS)方法。框架包含生成器、分形模板模块和运行器模块,生成1200多个CNN架构在CIFAR-10数据集上进行测试。使用PyTorch进行训练,采用随机梯度下降和自动混合精度及梯度检查点技术降低计算开销。实验结果显示分形架构具有稳定的训练动态和竞争性性能,五轮训练后验证准确率为60-70%,峰值准确率为80.18%。这些发现表明递归分形结构在平衡网络深度和宽度方面有效,并支持大规模自动化架构探索。

English abstract

This paper proposes FractalNet, a framework based on fractal design principles that automatically generates and evaluates convolutional neural network (CNN) architectures using recursive template patterns. Rather than relying on computationally expensive Neural Architecture Search (NAS) methods, the framework explores a structured architecture space defined by recursive fractal templates that systematically vary key parameters such as fractal depth, column width, and layer configurations. The framework consists of three core components: a generator that produces candidate architectures via controlled permutations of convolutional, normalization, activation, and dropout layers; a fractal template module that enforces recursive multi-path structural patterns; and a runner module that manages model training, evaluation, and logging. Using this system, over 1,200 distinct CNN architectures were automatically generated and evaluated on the CIFAR-10 image classification benchmark. Training was performed in PyTorch using stochastic gradient descent with Automatic Mixed Precision (AMP) and gradient checkpointing to reduce computational overhead. Experimental results demonstrate that fractal-based architectures exhibit stable training dynamics and achieve competitive performance, with an average validation accuracy of 60-70% and a peak accuracy of 80.18% after only five training epochs. These findings suggest that recursive fractal structures provide an effective means of balancing network depth and width while supporting large-scale automated architecture exploration. The proposed framework offers a resource-efficient and interpretable approach to systematic neural architecture experimentation.
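
How fractal depth drives layer count and path length can be shown with a small recursive sketch. It assumes the classic fractal expansion rule (a block with C+1 columns joins two stacked C-column blocks with a single convolution). The abstract does not confirm that this is the exact template used by the framework, and the tuple encoding below is purely illustrative.

```python
def fractal(columns):
    """Nested description of a fractal block with the given number of columns.

    Assumed expansion rule (the classic fractal one):
        f_1 = conv
        f_{C+1} = join( f_C followed by f_C , conv )
    """
    if columns == 1:
        return "conv"
    sub = fractal(columns - 1)
    return ("join", ("seq", sub, sub), "conv")

def count_convs(block):
    """Total number of convolution layers in the block."""
    if block == "conv":
        return 1
    return sum(count_convs(child) for child in block[1:])

def depth(block):
    """Number of conv layers on the longest path through the block."""
    if block == "conv":
        return 1
    kind, *children = block
    child_depths = [depth(c) for c in children]
    return sum(child_depths) if kind == "seq" else max(child_depths)

for c in range(1, 5):
    print(c, count_convs(fractal(c)), depth(fractal(c)))
```

Under this rule a block with C columns has 2^C - 1 convolutions and a longest path of 2^(C-1) layers, so the fractal depth parameter trades off width of the search space against training cost very quickly.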

2510.23641 2026-05-19 cs.LG cs.AI hep-ex physics.ins-det updated version

Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

具有空间意识的线性变换器(SAL-T)用于粒子喷注标记

Aaron Wang, Zihan Zhao, Subash Katel, Vivekanand Gyanchand Sahu, Elham E Khoda, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte

Institutions * University of Illinois Chicago; University of California San Diego; Fermi National Accelerator Laboratory

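AI summary SAL-T combines spatially aware partitioning with convolutional layers, outperforming the standard linformer in jet classification and matching full-attention transformers while using fewer resources and achieving lower inference latency.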

Details
AI Chinese abstract

Transformers在高能粒子碰撞中能有效捕捉全局和局部相关性,但在高数据吞吐环境如CERN LHC中部署存在挑战。由于transformer模型的二次复杂性,需要大量资源且推理延迟高。为此,我们引入了物理启发的线性变换器增强架构SAL-T,保持线性注意力。我们的方法基于动量学特征对粒子进行空间感知分区,从而计算具有物理意义区域之间的注意力。此外,我们采用卷积层捕捉局部相关性,受喷注物理启发。除了在喷注分类任务中优于标准linformer外,SAL-T在推理时使用更少的资源且延迟更低,其结果与全注意力transformer相当。在通用点云分类数据集(ModelNet10)上的实验进一步证实了这一趋势。我们的代码可在https://github.com/aaronw5/SAL-T4HEP获得。

English abstract

Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.
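
The linear-attention baseline that SAL-T builds on can be sketched as follows. This is a generic linformer-style attention in NumPy, where keys and values are projected along the sequence axis from n to k before the softmax. Sorting particles by a single kinematic feature (called `pt` here) is an assumption made to hint at spatially aware ordering; it is not the partitioning scheme of the paper, and the random projections stand in for learned ones.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: project the sequence axis of K and V from n to k."""
    K_proj = E @ K                                   # (k, d)
    V_proj = F @ V                                   # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj                  # (n, d)

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8
X = rng.normal(size=(n, d))                 # stand-in particle features
pt = rng.random(n)                          # hypothetical kinematic feature
order = np.argsort(-pt)                     # fixed ordering before projection
X = X[order]
E = rng.normal(size=(k, n)) / np.sqrt(n)    # random stand-ins for learned projections
F = rng.normal(size=(k, n)) / np.sqrt(n)
out = linformer_attention(X, X, X, E, F)
```

The score matrix has shape (n, k), so cost grows linearly in the number of particles for fixed k. Because the projections act on sequence positions, the order in which particles are fed in matters, which is why a physically meaningful ordering or partitioning is a natural design lever.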

2510.18941 2026-05-19 cs.CL cs.AI cs.LG updated version

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench:需要专业知识回答和评判的多领域评分标准

Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

Institutions * NVIDIA

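AI summary ProfBench uses over 7,000 response-criterion pairs evaluated by domain experts to assess how well large language models process professional documents, synthesize information, and generate comprehensive reports, and shows that even top models struggle on professional tasks.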

Comments Published at ICLR 2026, 30 pages

Details
AI Chinese abstract

评估大语言模型(LLMs)的进步通常受限于验证响应的挑战,限制了评估任务仅限于数学、编程和简短问答。然而,许多现实应用需要评估LLMs在处理专业文档、整合信息和生成综合报告方面的能力。我们介绍了ProfBench:一个包含超过7000个响应-评分对的集合,由具有物理学博士、化学博士、金融MBA和咨询MBA专业知识的人类专家评估。我们构建了稳健且经济的LLM-Judges来评估ProfBench评分标准,通过减轻自我增强偏差并减少评估成本2-3个数量级,使其公平且对更广泛社区可及。我们的发现表明,即使对于最先进的LLM,ProfBench也提出了重大挑战,顶级模型如GPT-5-high仅达到65.9%的整体性能。此外,我们识别了专有模型与开源模型之间显著的性能差异,并提供了关于扩展思考在解决复杂专业领域任务中的作用的见解。数据:https://huggingface.co/datasets/nvidia/ProfBench 和代码:https://github.com/NVlabs/ProfBench 和排行榜:https://huggingface.co/spaces/nvidia/ProfBench

English abstract

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

2510.15221 2026-05-19 cs.AI cs.CY cs.LG updated version

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD:首个自然长期小团队工作场所情感数据集用于无处不在的情感计算

Xiao Sun

Institutions * AnHui Province Key Laboratory of Affective Computing and Advanced Intelligent Machines, School of Computer Science and Information Engineering, Hefei University of Technology

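AI summary WELD is the first naturalistic, long-period, small-team workplace emotion dataset; it contains 733,780 per-frame seven-class facial-expression probability vectors to support longitudinal affective computing research, replicates three established phenomena, and reports four novel findings.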

Comments v2: Major revision. 30-month report with full ethics framework, 4-tier access model, variance decomposition, HMM regime discovery, AUC=0.79 vs C-index=0.52 turnover-prediction methodology audit, and Asian-neutral-face FER bias finding. Companion: arXiv:2510.16046. 49 employees, 733,780 records, 17 pages. Submitted to IEEE TAFFC

Details
AI Chinese abstract

情感计算在实验室环境中迅速成熟,但此前没有数据集同时满足(i)数月到数年的持续时间,(ii)自然工作场所环境,(iii)稳定的小组社交结构,以及(iv)完全被动传感协议并通过机构审查。我们介绍了WELD,首个满足所有四个条件的数据集。WELD包含来自中国软件公司49名员工30.1个月(2021年11月-2024年5月)的733780个每帧七类面部表情概率向量——最长的自然情感语料库和唯一支持多年纵向分析和小组关系分析的数据集。数据以四级访问模型发布,只有聚合概率可公开下载。我们通过复制三个已知现象(+43.1%周末情感提升;13:00低谷日周期;上海2022封锁效应d=-0.40)验证了数据集,并报告了四个新发现:(1)方差分解将每日情感变异性中19.3%归因于人与人差异,29.8%归因于月季节性——对未来预测模型的定量上限;(2)隐藏马尔可夫分解揭示了六个情感状态,具有不对称的负面状态停留时间(16-18天 vs 3天);(3)留一人出离职预测达到AUC=0.79,但Cox一致性指数仅为0.52,暴露了在不考虑生存意识基线时报告AUC的度量陷阱;(4)数据集揭示了基于现成FER模型对中性亚洲面孔预测“愤怒”存在系统性过预测(0.194 vs ~0.05西方先验),使WELD成为FER公平审计的重要资源。对数据集的复杂系统分析作为配套预印本(arXiv:2510.16046)发表。

English abstract

Affective computing has matured rapidly in laboratory settings, yet no prior dataset combines (i) months-to-years of duration, (ii) a naturalistic workplace context, (iii) a stable small-team social structure, and (iv) a fully passive sensing protocol that survives institutional review. We introduce WELD, the first dataset to satisfy all four. WELD comprises 733,780 per-frame seven-class facial-expression probability vectors from 49 employees of a Chinese software company over 30.1 months (Nov 2021 - May 2024) -- the longest naturalistic in-the-wild emotion corpus and the only multi-year corpus supporting both within-individual longitudinal and within-team relational analyses on the same subjects. Data are released under a four-tier access model with only aggregated probabilities publicly downloadable. We validate the corpus by replicating three established phenomena (+43.1% weekend valence boost; 13:00-trough diurnal cycle; Shanghai 2022 lockdown effect d=-0.40), and report four novel findings: (1) variance decomposition attributes 19.3% of daily-valence variance to between-person differences and 29.8% to month seasonality -- a quantitative ceiling for future predictive models; (2) Hidden Markov decomposition reveals six emotional regimes with asymmetric negative-state dwell times (16-18 d vs 3 d); (3) leave-one-person-out turnover prediction reaches AUC=0.79 yet a Cox concordance index of only 0.52, exposing a metric-trap when AUC is reported without survival-aware baselines; (4) the corpus reveals systematic over-prediction of "angry" by an off-the-shelf FER model on neutral Asian faces (0.194 vs ~0.05 Western priors), making WELD valuable for FER fairness audits. A complex-systems analysis of the corpus appears as a companion preprint (arXiv:2510.16046).
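
The metric trap mentioned in finding (3) is easier to see with both metrics side by side. The sketch below implements a rank-based AUC and Harrell's concordance index in plain Python and applies them to hypothetical toy data, not WELD records: the risk scores separate leavers from stayers perfectly but order the leavers' departure times backwards.

```python
from itertools import combinations

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def c_index(risk, time, event):
    """Harrell's concordance index for right-censored data."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(time)), 2):
        if time[i] == time[j]:
            continue
        first, later = (i, j) if time[i] < time[j] else (j, i)
        if not event[first]:
            continue            # earlier time is censored: pair not comparable
        comparable += 1
        if risk[first] > risk[later]:
            concordant += 1.0
        elif risk[first] == risk[later]:
            concordant += 0.5
    return concordant / comparable

# Hypothetical toy data, chosen only to contrast the two metrics.
risk  = [0.9, 0.8, 0.7, 0.2, 0.1]
time  = [30, 20, 10, 40, 40]      # months until departure or end of study
event = [1, 1, 1, 0, 0]           # 1 = left, 0 = censored (still employed)

toy_auc = auc(risk, event)
toy_c = c_index(risk, time, event)
```

On this toy data the AUC is perfect while the concordance index is only 6/9, because the classifier view ignores when each departure happens. That gap is the reason a survival-aware baseline is worth reporting alongside AUC.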

2510.14102 2026-05-19 astro-ph.IM cs.AI cs.LG updated version

Extracting latent representations from X-ray spectra. Classification, regression, and accretion signatures of Chandra sources

从X射线光谱中提取潜在表示。Chandra源的分类、回归和吸积特征

Nicolò Oreste Pinciroli Vago, Juan Rafael Martínez-Galarza, Roberta Amato

Institutions * Department of Electronics, Information Center for Astrophysics Harvard & Smithsonian, 60 Garden Street, Cambridge, MA 02138, USA

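AI summary This paper uses deep learning to extract a compact, physically meaningful representation of Chandra X-ray spectra, evaluates it through classification, regression, and interpretability analyses, and measures the mutual information between spectral and time-domain properties to aid future identification of transient events.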

Comments 21 pages, 17 figures; accepted in A&A

Details
AI Chinese abstract

光谱特征在大规模X射线巡天时代至关重要。自动机器学习方法在此方面已被证明有效,但迄今为止尚未应用于大规模光谱数据集,如Chandra源目录(CSC)。本工作旨在利用深度学习开发一种紧凑且具有物理意义的Chandra X射线光谱表示。为验证所学表示是否捕捉到相关信息,我们通过分类、回归和可解释性分析进行评估,并测量这些源的光谱与时间域属性间的互信息,以帮助未来识别暂现事件。我们使用基于变换器的自编码器将X射线光谱压缩到一个8维的潜在空间中。天体物理源类型和物理汇总统计信息来自外部目录。我们从光谱重建精度、8种已知天体物理源类的聚类性能以及与硬度比和氢柱密度(N_H)等物理量的相关性来评估所学表示。重建后,潜在空间中的聚类在8种源类上实现了约40%的平衡分类精度,当仅限于类星体和恒星级致密天体时,该精度提高至约69%。此外,潜在特征与光谱和时间属性相关,表明压缩的表示捕捉到了物理相关信息。直接从X射线光谱中学习的特征在捕捉相关物理信息方面与需要额外计算的人工提取特征同样有效。它们可用于大规模巡天中的分类和回归,并且与时间域属性共享互信息。该方法可以适应现有和即将来临的X射线目录。

English abstract

Spectral signatures are crucial in the era of large X-ray surveys. Automatic machine learning methods have proven useful in this respect, but so far they have not been applied to large spectral datasets, such as the Chandra Source Catalog (CSC). This work aims to develop a compact and physically meaningful representation of Chandra X-ray spectra using deep learning. To verify that the learned representation captures relevant information, we evaluate it through classification, regression, and interpretability analyses, and measure the mutual information between spectral and time-domain properties of these sources, aiding in the future identification of transient events. We use a transformer-based autoencoder to compress X-ray spectra into representations in an 8-dimensional latent space. Astrophysical source types and physical summary statistics are compiled from external catalogs. We evaluate the learned representation in terms of spectral reconstruction accuracy, clustering performance on 8 known astrophysical source classes, and correlation with physical quantities such as hardness ratios and hydrogen column densities ($N_H$). Upon reconstruction, clustering in the latent space yields a balanced classification accuracy of $\sim$40% across the 8 source classes, increasing to $\sim$69% when restricted to AGNs and stellar-mass compact objects exclusively. Moreover, latent features correlate with spectral and temporal properties, suggesting that the compressed representation captures physically relevant information. Features learned directly from X-ray spectra capture relevant physical information as effectively as human-extracted features that require additional computations. They can be used for both classification and regression in large surveys, and also share mutual information with time-domain properties. The method can be adapted to existing and upcoming X-ray catalogs.

2510.10777 2026-05-19 cs.LG math.OC updated version

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

预条件范数:一种统一框架用于最速下降、拟牛顿和自适应方法

Andrey Veprikov, Arman Bolatov, Aleksandr Bogdanov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomir Hanzely

Institutions * Basic Research of Artificial Intelligence Laboratory (BRAIn Lab); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Innopolis University

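AI summary This paper proposes a unified framework that generalizes steepest descent, quasi-Newton, and adaptive methods through preconditioned matrix norms, systematically studies affine and scale invariance, and introduces two new methods, MuAdam and MuAdam-SANIA, which experiments show outperform existing methods in some cases.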

Comments 22 pages, 2 figures, 8 tables

Details
AI Chinese abstract

优化是现代深度学习的核心,但现有方法常面临适应问题几何与利用曲率之间的根本权衡。最速下降算法通过范数选择适应不同几何,但仍是严格一阶的;拟牛顿和自适应优化器结合曲率信息,但受限于Frobenius几何,限制了其在不同架构中的应用。本文提出一种统一框架,通过预条件矩阵范数概括了最速下降、拟牛顿方法和自适应方法。这一抽象揭示了广泛使用的优化器如SGD和Adam,以及更先进的方法如Muon和KL-Shampoo,以及最近的混合方法如SOAP和SPlus,都是同一原理的特例。在该框架中,我们首次系统地研究了矩阵参数化设置中的仿射和尺度不变性,建立了在广义范数下的必要和充分条件。基于此基础,我们引入了两种新方法,MuAdam和MuAdam-SANIA,它们结合了Muon的谱几何与Adam风格的预条件化。我们的实验表明,这些优化器在某些情况下优于现有最先进方法。我们的代码可在https://github.com/brain-lab-research/LIB/tree/quasi_descent获取。

English abstract

Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and exploiting curvature information. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, MuAdam and MuAdam-SANIA, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
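
The statement that steepest descent adapts to geometry through the choice of norm can be made concrete. The sketch below computes the unit-norm direction that maximizes the inner product with a gradient matrix under three norms. It covers only the unpreconditioned case; the preconditioned norms introduced in the paper are not implemented here, and the function name and random gradient are illustrative assumptions.

```python
import numpy as np

def steepest_direction(G, norm):
    """Unit-norm direction D maximizing <G, D> under the chosen norm."""
    if norm == "frobenius":                 # plain normalized gradient
        return G / np.linalg.norm(G)
    if norm == "max":                       # entrywise l_inf ball -> sign update
        return np.sign(G)
    if norm == "spectral":                  # operator-norm ball -> U V^T (Muon-style)
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return U @ Vt
    raise ValueError(norm)

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))                 # stand-in gradient of a weight matrix
D = steepest_direction(G, "spectral")
gain = float((G * D).sum())                 # inner product <G, D>
nuclear = float(np.linalg.svd(G, compute_uv=False).sum())
```

Under the spectral norm the optimal direction has all singular values equal to one, and the achieved inner product equals the nuclear norm of the gradient, which is the dual norm. The Frobenius and entrywise-max cases recover normalized gradient descent and sign descent respectively.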

2510.07195 2026-05-19 quant-ph cs.LG updated version

Accelerating Inference for Multilayer Neural Networks with Quantum Computers

利用量子计算机加速多层神经网络推理

Arthur G. Rattew, Po-Wei Huang, Naixu Guo, Lirandë Pira, Patrick Rebentrost

Institutions * Department of Materials, University of Oxford; Mathematical Institute, University of Oxford; Quantum Motion; Centre for Quantum Technologies, National University of Singapore; Department of Computer Science, National University of Singapore

AI总结 本文首次提出全相干的多层神经网络量子实现,采用残差块、多滤波2D卷积、Sigmoid激活等结构,分析了不同量子数据访问模式下的推理复杂度,证明了在不同条件下可实现二次到四次方的加速效果。

Comments Published at the International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

容错量子处理单元(QPUs)有望在特定计算任务中提供指数级加速,但其在现代深度学习流水线中的整合仍不明确。本文通过提出首个全相干的多层神经网络量子实现,填补了这一空白。该实现基于ResNet架构,包含残差块、多滤波2D卷积、Sigmoid激活、跳跃连接和层归一化。我们分析了三种量子数据访问模式下的推理复杂度。在无假设的情况下,我们证明了浅层双线性网络相比经典方法有二次加速。在高效量子访问权重的情况下,我们获得了四次方加速。在高效量子访问输入和权重的情况下,我们证明了一个具有N维向量输入、k个残差块层和最终残差-线性-池化层的网络可以以误差ε实现,推理成本为O(polylog(N/ε)^k)。

英文摘要

Fault-tolerant Quantum Processing Units (QPUs) promise to deliver exponential speed-ups in select computational tasks, yet their integration into modern deep learning pipelines remains unclear. In this work, we take a step towards bridging this gap by presenting the first fully-coherent quantum implementation of a multilayer neural network with non-linear activation functions. Our constructions mirror widely used deep learning architectures based on ResNet, and consist of residual blocks with multi-filter 2D convolutions, sigmoid activations, skip-connections, and layer normalizations. We analyse the complexity of inference for networks under three quantum data access regimes. Without any assumptions, we establish a quadratic speedup over classical methods for shallow bilinear-style networks. With efficient quantum access to the weights, we obtain a quartic speedup over classical methods. With efficient quantum access to both the inputs and the network weights, we prove that a network with an $N$-dimensional vectorized input, $k$ residual block layers, and a final residual-linear-pooling layer can be implemented with an error of $ε$ with $O(\text{polylog}(N/ε)^k)$ inference cost.

2510.04309 2026-05-19 cs.LG 版本更新

Activation Steering with a Feedback Controller

通过反馈控制器激活控制

Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen

发表机构 * Department of Mathematics(数学系) Center for AI Research(人工智能研究中心) National University of Singapore(新加坡国立大学) VinUniversity(文大学) Torilab(Torilab实验室)

AI总结 本文提出PID激活控制方法,基于控制理论构建激活控制框架,通过P、I、D项实现激活对齐、误差累积和抑制超调,提升大语言模型行为控制的鲁棒性和可靠性。

Comments 10 pages in the main text. ICLR2026 Poster

详情
AI中文摘要

控制大语言模型(LLM)的行为对于其安全对齐和可靠部署至关重要。然而,现有控制方法主要依赖经验洞察,缺乏理论性能保证。本文通过证明流行控制方法对应于比例(P)控制器,以偏置向量作为反馈信号,建立了激活控制的控制理论基础。在此基础上,提出PID激活控制方法,利用完整的PID控制器进行LLM激活控制。P项对齐激活与目标语义方向,I项累积误差以跨层强制持续修正,D项通过抵消快速激活变化来抑制超调。闭环设计产生可解释的误差动态,并将激活控制连接到经典稳定性保证中。此外,PID激活控制轻量、模块化且易于与最新控制方法集成。在多个LLM家族和基准上的广泛实验表明,PID激活控制一致优于现有方法,实现更稳健和可靠的行为控制。代码已公开在:https://github.com/dungnvnus/pid-steering

英文摘要

Controlling the behavior of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control. The code is publicly available at: https://github.com/dungnvnus/pid-steering
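
The P, I, and D roles described in the abstract can be illustrated with a generic discrete PID update applied layer by layer. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-layer activation is reduced to a single scalar (for example, its projection onto a steering direction), and the names `PIDController`, `steer`, and the gains `kp`, `ki`, `kd` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (hypothetical helper, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float) -> float:
        # I term: accumulate the error seen so far (persistent correction).
        self.integral += error
        # D term: react to how fast the error is changing (damps overshoot).
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def steer(projections, target, pid):
    """Shift each layer's scalar projection toward `target`.

    `projections` is an assumed list of per-layer scalars; a real model
    would work with full activation vectors.
    """
    steered = []
    for value in projections:
        error = target - value
        steered.append(value + pid.step(error))
    return steered


if __name__ == "__main__":
    print(steer([0.0, 0.5], 1.0, PIDController(kp=0.5, ki=0.1, kd=0.0)))
```

Setting `ki = kd = 0` reduces the update to a pure proportional correction, which is the special case the abstract attributes to existing steering methods.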

2510.04006 2026-05-19 cs.LG nlin.CD physics.ao-ph 版本更新

Learning more physically realistic dynamics in machine-learning based weather forecasting with latent-space constraints

在基于机器学习的天气预报中通过隐空间约束学习更符合物理现实的动力学

Hang Fan, Yi Xiao, Yongquan Qu, Juan Nathaniel, Fenghua Ling, Ben Fei, Lei Bai, Pierre Gentine

发表机构 * Department of Earth and Environmental Engineering, Columbia University(哥伦比亚大学地球与环境工程系) Learning the Earth with Artificial Intelligence and Physics (LEAP) Center, Columbia University(人工智能与物理联合地球学习中心(LEAP), 哥伦比亚大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Chinese University of Hong Kong(香港中文大学) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 本文提出通过隐空间约束改进天气预报模型,以捕捉多变量依赖关系,提升长期预报能力并保持物理真实性。

详情
AI中文摘要

数据驱动的机器学习(ML)模型正在重塑天气预报,并展示了加速和超越传统物理方法的潜力,从而引发该领域第二次革命。然而,大多数ML预报模型在训练时使用加权变量损失,忽视了由物理耦合引起的跨变量和空间误差协方差,通常导致过于平滑且物理不真实的长期预报。为此,我们将模型训练重新表述为四维变分数据同化(4DVar)问题,将再分析数据视为不完美的观测。这使损失函数能够纳入跨变量误差协方差结构,捕捉多变量依赖及其相关误差。在实践中,我们通过计算自动编码器学习的全球大气状态隐空间中的损失来近似此目标。通过编码大气变量间的复杂非线性耦合,这种表示允许高维、复杂误差协方差矩阵在模型空间中近似为隐空间中的对角矩阵,从而大大简化了实现。我们证明了在隐空间约束下的滚动训练能提高长期预报能力,同时比广泛使用的模型空间损失更好地保持细尺度结构和物理真实性。最后,我们扩展了这一框架以适应异质数据源,使预报模型能够在统一的理论框架内联合训练再分析和多源观测。

英文摘要

Data-driven machine learning (ML) models are reshaping weather forecasting and have shown the potential to accelerate and surpass traditional physics-based approaches, leading to a second revolution in the field after data assimilation. However, most ML forecast models are trained with weighted variable-wise losses on rollout forecasts that neglect cross-variable and spatial error covariance induced by physical coupling, often yielding overly smooth and physically unrealistic long-range forecasts. To address this, we reformulate model training as a four-dimensional variational data assimilation (4DVar) problem that treats reanalysis data as imperfect observations. This enables the loss function to incorporate cross-variable error covariance structures that capture multivariate dependencies and their associated errors. In practice, we approximate this objective by computing the loss in an autoencoder-learned latent space of global atmospheric states. By encoding complex nonlinear couplings among atmospheric variables, this representation allows the high-dimensional, complex error covariance matrix in model space to be approximated as nearly diagonal in latent space, substantially simplifying implementation. We show that rollout training with latent-space constraints improves long-term forecast skill, while better preserving fine-scale structures and physical realism than the widely used model-space loss. Finally, we extend this framework to accommodate heterogeneous data sources, enabling the forecast model to be trained jointly on reanalysis and multi-source observations within a unified theoretical formulation.

2509.26037 2026-05-19 cs.AI cs.CV cs.LG 版本更新

CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search

CoLLM-NAS:协作大型语言模型用于高效知识引导的神经架构搜索

Zhe Li, Zhiwei Lin, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学计算机科学技术研究院)

AI总结 本文提出CoLLM-NAS,一种基于协作大型语言模型的两阶段神经架构搜索框架,通过导航和生成两个LLM及协调模块,有效指导搜索过程,提升效率并取得新的最先进结果。

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

详情
AI中文摘要

将大型语言模型(LLMs)与神经架构搜索(NAS)结合,为自动设计神经架构提供了新可能。然而,现有方法大多面临架构无效、计算低效和性能劣于传统NAS的限制。本文提出协作LLM-based NAS(CoLLM-NAS),一种两阶段NAS框架,通过两个互补的LLM驱动知识引导搜索。具体而言,提出有状态的导航LLM指导搜索方向,无状态的生成LLM合成高质量候选,以及协调模块协调LLM间通信并管理评估过程。CoLLM-NAS通过结合LLM对结构化神经架构的内在知识与来自迭代反馈和历史轨迹的渐进知识,高效指导搜索过程。在ImageNet和NAS-Bench-201上的实验结果表明,CoLLM-NAS超越现有NAS方法和传统搜索算法,取得新的最先进结果,同时将搜索成本显著降低4-10倍。此外,CoLLM-NAS在多种搜索空间(如MobileNet、ShuffleNet和AutoFormer)中一致提升各种两阶段NAS方法(如OFA、SPOS和AutoFormer)的性能和效率,展示其优秀的泛化能力。

英文摘要

The integration of Large Language Models (LLMs) with Neural Architecture Search (NAS) has introduced new possibilities for automating the design of neural architectures. However, most existing methods face critical limitations, including architectural invalidity, computational inefficiency, and inferior performance compared to traditional NAS. In this work, we present Collaborative LLM-based NAS (CoLLM-NAS), a two-stage NAS framework with knowledge-guided search driven by two complementary LLMs. Specifically, we propose a stateful Navigator LLM to guide search direction, a stateless Generator LLM to synthesize high-quality candidates, and a Coordinator module to orchestrate inter-LLM communication and manage evaluation processes. CoLLM-NAS efficiently guides the search process by combining LLMs' inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback and historical trajectory. Experimental results on ImageNet and NAS-Bench-201 show that CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, achieving new state-of-the-art results while significantly reducing search costs by 4--10$\times$. Furthermore, CoLLM-NAS consistently enhances the performance and efficiency of various two-stage NAS methods (e.g., OFA, SPOS, and AutoFormer) across diverse search spaces (e.g., MobileNet, ShuffleNet, and AutoFormer), demonstrating its excellent generalization.

2509.21319 2026-05-19 cs.CL cs.AI cs.LG 版本更新

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF:以二元灵活反馈连接人类反馈与可验证奖励

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

发表机构 * NVIDIA

AI总结 RLBFF结合人类偏好与规则验证,提升奖励模型对响应质量的精准捕捉,优于Bradley-Terry模型,在RM-Bench和JudgeBench上取得优异成绩,且支持用户自定义反馈原则。

Comments Published at ICLR 2026, 21 pages

详情
AI中文摘要

Reinforcement Learning with Human Feedback (RLHF) 和 Reinforcement Learning with Verifiable Rewards (RLVR) 是LLM后训练的主要RL范式,各有优势。然而,RLHF在可解释性和奖励黑客问题上存在困难,因为它依赖于通常缺乏明确标准的人类判断,而RLVR则因专注于基于正确性的验证器而适用范围受限。我们提出Reinforcement Learning with Binary Flexible Feedback (RLBFF),结合人类驱动的偏好灵活性与基于规则的验证的精确性,使奖励模型能够捕捉响应质量的细微方面,超越单纯的正确性。RLBFF从自然语言反馈中提取可以用二元方式回答的原则(例如信息准确性:是,或代码可读性:否)。这些原则随后可用于将奖励模型训练构建为蕴含任务(响应满足或不满足任意原则)。我们展示以这种方式训练的奖励模型在数据量相同的条件下可以优于Bradley-Terry模型,并在RM-Bench(86.2%)和JudgeBench(81.4%,截至2025年9月24日排行榜第一)上取得最佳成绩。此外,与Bradley-Terry模型不同,用户可以在推理时指定感兴趣的原则,以自定义我们奖励模型的关注点。最后,我们提供了一个完全开源的方案(包括数据),使用RLBFF和我们的奖励模型对Qwen3-32B进行对齐,使其在MT-Bench、WildBench和Arena Hard v2等通用对齐基准上达到或超过o3-mini和DeepSeek R1的性能(推理成本低于5%)。模型:https://huggingface.co/collections/nvidia/reward-models-10-2025

英文摘要

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025

2509.19590 2026-05-19 cs.AI cs.CY cs.LG 版本更新

Position: AI Evaluations Should be Grounded on a Theory of Capability

立场:AI评估应基于能力理论

Nathanael Jo, Ashia Wilson

发表机构 * MIT EECS, Cambridge, USA(麻省理工学院电子工程与计算机科学系,剑桥,美国)

AI总结 本文提出AI评估应基于明确的能力理论,通过实验证明评估结果受建模假设影响显著,提出Evaluation Card促进透明化评估实践。

Comments ICML 2026 Position Paper Track

详情
AI中文摘要

生成模型的评估如今普遍存在,其结果深刻影响公众和科学界对AI能力的看法。然而,对其可靠性的怀疑持续增长。如何确定报告的准确率真实反映模型的底层性能?尽管基准结果常被视为能力的直接测量,但实际上它们是推断:将分数视为能力证据已预设了能力定义的理论。我们主张AI评估应作为基于明确能力理论的推断任务。虽然这一观点在心理学测量学等学科中是标准做法,但在AI评估中仍不完善,核心假设常被隐含。作为概念验证,我们实证显示报告性能可能强烈依赖评估者的建模假设,凸显透明、理论驱动的评估实践的必要性。最后,我们提出Evaluation Card帮助研究人员记录、论证和审查AI评估背后的建模决策。

英文摘要

Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.

2509.18103 2026-05-19 cs.LG math.NT 版本更新

Machine Learnability as a Measure of Order in Aperiodic Sequences

机器可学性作为非周期序列中的秩序度量

Jennifer Dodgson, Michael Joedhitya, Adith Ramdas, Surender Suresh Kumar, Adarsh Singh Chauhan, Akira Rafhael, Wang Mingshu, Nordine Lotfi

发表机构 * ImageNet

AI总结 本文通过图像聚焦的机器学习模型,研究素数分布区域的规律性,发现更高区域的素数分布更易被学习,揭示了机器学习在数论研究中的潜力。

详情
AI中文摘要

对素数分布的研究揭示了其双重特性:定义确定性但表现出类似随机过程的统计行为。本文展示了一个图像聚焦的机器学习模型可用于测量特定区域的Ulam螺旋中素数场的相对规律性。具体而言,模型在训练块提取自500m区域时,比训练块提取自低于25m区域的模型在纯准确率上更优。这表明前者区域存在更易学习的秩序。此外,精确度和召回率的详细分析似乎表明,模型在螺旋的不同区域采用不同的分类方法,更关注于识别低数的素数模式,而在高数区域更注重消除合数。这与数论猜想一致,即在更高数量级时,素数分布的噪声应减少,平均值(密度、AP等分布)将主导,而局部随机性在按log x缩放后将趋于规律化。这些发现表明,机器学习可以成为数论研究的新实验工具。值得注意的是,该方法在研究强素数和弱素数的模式以用于密码学目的方面显示出潜力。

英文摘要

Research on the distribution of prime numbers has revealed a dual character: deterministic in definition yet exhibiting statistical behavior reminiscent of random processes. In this paper we show that it is possible to use an image-focused machine learning model to measure the comparative regularity of prime number fields at specific regions of an Ulam spiral. Specifically, we demonstrate that in pure accuracy terms, models trained on blocks extracted from regions of the spiral in the vicinity of 500m outperform models trained on blocks extracted from the region representing integers lower than 25m. This implies the existence of more easily learnable order in the former region than in the latter. Moreover, a detailed breakdown of precision and recall scores seems to imply that the model is favouring a different approach to classification in different regions of the spiral, focusing more on identifying prime patterns at lower numbers and more on eliminating composites at higher numbers. This aligns with number theory conjectures suggesting that at higher orders of magnitude we should see diminishing noise in prime number distributions, with averages (density, AP equidistribution) coming to dominate, while local randomness regularises after scaling by log x. Taken together, these findings point toward an interesting possibility: that machine learning can serve as a new experimental instrument for number theory. Notably, the method shows potential for investigating the patterns in strong and weak primes for cryptographic purposes.

2509.03403 2026-05-19 cs.LG cs.AI 版本更新

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

超越正确性:通过RL训练协调过程奖励与结果奖励

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal

发表机构 * Amazon(亚马逊公司) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出PROF方法,通过过程一致性过滤提升推理质量和最终答案准确性,减少对强PRM的依赖。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)提升了推理任务的最终答案准确性,但未能可靠提升推理质量。由于结果奖励仅评估最终答案,它也会奖励虚假成功:错误推理仍可能因偶然得到正确结果而获得最大奖励。这种结果奖励黑客行为会创建有偏的梯度,使当前RLVR不足以学习忠实的推理。过程奖励模型(PRMs)提供逐步监督,但直接优化PRMs或简单地将它们与结果奖励结合在RL训练过程中分布偏移时不稳定。我们引入了过程一致性过滤(PROF),一种数据整理方法,利用PRM-ORM一致性进行样本选择,而不是直接奖励优化。PROF保留具有强过程支持的正确响应和具有弱过程支持的错误响应,同时保持训练比例的平衡。实验表明,PROF在强基线之上一致地提高了最终答案准确性和中间推理质量,对强PRMs的依赖较少。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under distribution shift during RL training. We introduce PRocess cOnsistency Filter (PROF), a data curation method that uses PRM--ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs.
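
The selection rule stated in the abstract (keep correct responses with strong process support and incorrect responses with weak process support, in balanced proportion) can be sketched as a simple filter. This is an illustrative reading only: the field names `correct` and `prm_score`, the function `prof_filter`, and the choice to keep an equal number from each group are assumptions, and the paper's actual ranking and balancing rules may differ.

```python
def prof_filter(samples, keep_per_group):
    """Keep samples whose process score agrees with their outcome label.

    Each sample is an assumed dict with:
      - "correct": bool, the outcome (final-answer) verdict
      - "prm_score": float, a process reward model score (higher = better reasoning)
    """
    # Correct answers: prefer those with the strongest process support.
    correct = sorted(
        (s for s in samples if s["correct"]),
        key=lambda s: s["prm_score"],
        reverse=True,
    )
    # Incorrect answers: prefer those with the weakest process support.
    incorrect = sorted(
        (s for s in samples if not s["correct"]),
        key=lambda s: s["prm_score"],
    )
    # Assumed balancing rule: the same number from each group.
    return correct[:keep_per_group] + incorrect[:keep_per_group]


if __name__ == "__main__":
    rollouts = [
        {"id": "a", "correct": True, "prm_score": 0.9},
        {"id": "b", "correct": True, "prm_score": 0.2},   # right answer, weak reasoning
        {"id": "c", "correct": False, "prm_score": 0.8},  # wrong answer, plausible steps
        {"id": "d", "correct": False, "prm_score": 0.1},
    ]
    print([s["id"] for s in prof_filter(rollouts, keep_per_group=1)])
```

In this toy data the filter drops sample `b`, the kind of spurious success the abstract describes, where a correct final answer is reached through poorly supported reasoning.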

2509.01629 2026-05-19 stat.ML cs.LG cs.NA math.NA 版本更新

Lipschitz-Guided Design of Interpolation Schedules in Generative Models

基于Lipschitz性的生成模型插值调度设计

Yifan Chen, Eric Vanden-Eijnden, Jiawei Xu

发表机构 * Department of Mathematics, University of California, Los Angeles, CA, USA(加州大学洛杉矶分校数学系) Machine Learning Lab, Capital Fund Management, Paris, France(Capital Fund Management机器学习实验室) Courant Institute, New York University, NY, USA(纽约大学Courant研究所) Now at University of Maryland, College Park, MD, USA(现在位于马里兰大学 College Park 分校)

AI总结 本文研究了生成模型中插值调度的设计,从统计和数值角度出发,提出通过最小化漂移场的平均平方Lipschitz性来设计调度,以提升生成模型的稳定性与准确性。

详情
AI中文摘要

我们从统计和数值角度研究了流和扩散生成模型中插值调度的设计。在随机插值框架下,我们证明在最优后验调优扩散系数后,标量插值调度在路径空间的Kullback-Leibler散度下是统计等价的。这一等价性促使我们关注漂移场的数值特性而非纯统计标准。我们提出最小化漂移的平均平方Lipschitz性作为调度设计的原理性标准,与最优传输中的动能最小化形成对比。一个简单的转换公式将一个调度的漂移表示为另一个调度的漂移,允许在不同(如线性)调度训练的模型上进行推断而不需重新训练。我们为高斯和高斯混合目标分析了最优调度:对于高斯分布,我们获得比线性调度在Lipschitz常数上指数级改进的调度;对于高斯混合,我们获得在少量步采样中缓解模式崩溃的调度。我们随后在高维不变测度的随机Allen-Cahn和Navier-Stokes方程中验证了该方法,其中设计的调度在固定积分器预算下显著提高了细粒度统计的准确性。

英文摘要

We study the design of interpolation schedules in flow and diffusion-based generative models from both statistical and numerical perspectives. Within the stochastic interpolants framework, we first show that scalar interpolation schedules are statistically equivalent under the Kullback--Leibler divergence in path space, after optimal a posteriori tuning of the diffusion coefficient. This equivalence motivates focusing on numerical properties of the drift field rather than purely statistical criteria. We propose minimizing the averaged squared Lipschitzness of the drift as a principled criterion for schedule design, in contrast with kinetic-energy minimization in optimal transport. A simple transfer formula expresses the drift of one schedule in terms of the drift of another, allowing the designed schedule to be used at inference time with a model trained under a different (e.g., linear) schedule, without retraining. We work out the optimal schedules analytically for Gaussian and Gaussian-mixture targets: for Gaussians, we obtain exponential improvements in the Lipschitz constant over linear schedules; for Gaussian mixtures, we obtain schedules that mitigate mode collapse in few-step sampling. We then validate the approach on high-dimensional invariant measures of stochastic Allen--Cahn and Navier--Stokes equations, where the designed schedule yields markedly more accurate fine-scale statistics at fixed integrator budget.

2508.15100 2026-05-19 cs.CR cs.LG 版本更新

Shift Detection and Adaptation for Network Intrusion Detection

网络入侵检测中的分布偏移检测与适应

Ehssan Mousavipour, Andrey Dimanchev, Majid Ghaderi

发表机构 * University of Calgary(卡尔加里大学)

AI总结 本文提出NetSight框架,通过在线持续检测和适应分布偏移,提升网络入侵检测的鲁棒性,实验表明其在F1-score上优于依赖人工标注的现有方法。

详情
AI中文摘要

分布偏移,即数据统计特性随时间变化,对深度学习异常检测系统构成重大挑战。现有异常检测系统难以适应这些偏移。监督学习系统需昂贵的人工标注,而无监督学习系统依赖干净数据进行偏移适应,但干净数据难以获取。本文引入NetSight框架,通过新颖的伪标注技术消除人工干预,并利用知识蒸馏策略防止灾难性遗忘。在三个长期网络数据集上评估,NetSight在F1-score上优于依赖人工标注的现有方法,最高提升达11.72%,证明其在动态网络中的鲁棒性和有效性。

英文摘要

Distribution shift, a change in the statistical properties of data over time, poses a critical challenge for deep learning anomaly detection systems. Existing anomaly detection systems often struggle to adapt to these shifts. Specifically, systems based on supervised learning require costly manual labeling, while those based on unsupervised learning rely on clean data, which is difficult to obtain, for shift adaptation. Both of these requirements are challenging to meet in practice. In this paper, we introduce NetSight, a framework for supervised anomaly detection in network data that continually detects and adapts to distribution shifts in an online manner. NetSight eliminates manual intervention through a novel pseudo-labeling technique and uses a knowledge distillation-based adaptation strategy to prevent catastrophic forgetting. Evaluated on three long-term network datasets, NetSight demonstrates superior adaptation performance compared to state-of-the-art methods that rely on manual labeling, achieving F1-score improvements of up to 11.72%. This proves its robustness and effectiveness in dynamic networks that experience distribution shifts over time.

2508.04227 2026-05-19 cs.CV cs.LG 版本更新

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

视觉语言模型的持续学习:超越遗忘的综述与分类

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian

AI总结 本文综述了视觉语言模型的持续学习挑战,提出四种核心范式以解决跨模态特征漂移和灾难性遗忘问题,强调零样本学习和智能体生态系统的发展。

详情
AI中文摘要

视觉语言模型(VLMs)和近期多模态大语言模型(MLLMs)通过前所未有的跨模态对齐和零样本泛化革新了人工智能。然而,使它们能够从非平稳数据中持续学习仍是一个重大挑战,因为它们的跨模态对齐和泛化能力特别容易受到灾难性遗忘的影响。不同于传统单模态持续学习(CL),VLMs面临独特的挑战,如跨模态特征漂移、由于共享架构导致的参数干扰以及零样本能力侵蚀。此外,生成式MLLMs表现出一种独特的“对齐税”,其中灾难性遗忘不仅表现为事实性遗忘,还表现为深度链式思维(CoT)推理的系统性崩溃。本文首次全面、诊断性地回顾了预测VLMs和生成式MLLMs的持续学习。我们系统地分解了上述失败模式,并提出了一个以挑战为导向的分类,包括四个核心范式:(1)多模态重播策略解决显式和隐式记忆漂移;(2)跨模态正则化强制拓扑和几何对齐;(3)参数高效适应利用动态路由和子空间投影;以及新兴的(4)模型融合与解耦范式。我们批判性地分析了评估协议的演变,强调了向双轨基准(领域 vs. 能力 CL)和微诊断 CoT 评估的转变。最后,我们绘制了未来研究的路线图,强调组合式零样本学习、具身AI与传感器融合以及自主智能体生态系统。所有资源均可在:https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models 上找到。

英文摘要

Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

2508.04149 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

基于难度的偏好数据选择:通过DPO隐式奖励差距

Xuan Qi, Rongwu Xu, Zhijing Jin

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学计算机科学与工程保罗·G·艾伦学校) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所) Jinesis Lab, University of Toronto & Vector Institute(多伦多大学Jinesis实验室及向量研究所)

AI总结 本文提出基于难度的偏好数据选择方法,利用DPO隐式奖励机制选择奖励差距小的样本,提升数据效率和模型对齐性能,在多个数据集和对齐任务中优于五个基线方法。

Comments Our code and data are available at https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select

详情
AI中文摘要

对齐大语言模型(LLMs)与人类偏好是AI研究中的关键挑战。尽管基于人类反馈的强化学习(RLHF)和直接偏好优化(DPO)等方法被广泛使用,但它们通常依赖于大规模、成本高的偏好数据集。现有工作缺少专门针对偏好数据的高质量数据选择方法。在本文中,我们引入了一种基于难度的偏好数据选择策略,该策略基于DPO隐式奖励机制。通过选择DPO隐式奖励差距较小的偏好数据示例(这些示例代表更具挑战性的案例),我们提高了数据效率和模型对齐效果。我们的方法在多个数据集和对齐任务中一致优于五个强大的基线方法,仅使用原始数据的10%即可实现优越性能。这种有原则且高效的选择方法为在有限资源下扩展LLM对齐提供了有前景的解决方案。

英文摘要

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. Existing work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
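
For readers unfamiliar with the quantity involved: in DPO the implicit reward of a response is $\beta\,(\log \pi_\theta(y\mid x) - \log \pi_{\text{ref}}(y\mid x))$, and the gap is the chosen response's reward minus the rejected one's. The sketch below ranks preference pairs by that gap and keeps the smallest ones. It assumes sequence log-probabilities are already computed, and the dictionary keys, the function names, the use of the signed (rather than absolute) gap, and the default `fraction=0.1` are illustrative assumptions rather than the paper's exact procedure.

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def reward_gap(example, beta=0.1):
    """Implicit reward of the chosen response minus that of the rejected one."""
    r_chosen = implicit_reward(
        example["logp_chosen"], example["ref_logp_chosen"], beta
    )
    r_rejected = implicit_reward(
        example["logp_rejected"], example["ref_logp_rejected"], beta
    )
    return r_chosen - r_rejected


def select_hard_examples(examples, fraction=0.1, beta=0.1):
    """Keep the `fraction` of pairs with the smallest gap (assumed: signed gap)."""
    k = max(1, int(len(examples) * fraction))
    return sorted(examples, key=lambda ex: reward_gap(ex, beta))[:k]


if __name__ == "__main__":
    # Hypothetical log-probabilities for ten preference pairs.
    data = [
        {
            "id": i,
            "logp_chosen": -10.0 + i,
            "ref_logp_chosen": -10.0,
            "logp_rejected": -12.0,
            "ref_logp_rejected": -12.0,
        }
        for i in range(10)
    ]
    print([ex["id"] for ex in select_hard_examples(data)])
```

A small gap means the policy barely separates the chosen response from the rejected one, which is the sense in which the abstract calls such pairs more challenging.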

2508.02383 2026-05-19 cs.LG cs.IR 版本更新

Graph Embedding in the Graph Fractional Fourier Transform Domain

图在图分数傅里叶变换域中的嵌入

Changjie Sheng, Zhichao Zhang, Yangfan He

发表机构 * School of Mathematics and Statistics, Nanjing University of Information Science and Technology(南京信息工程大学数学与统计学院) Hubei Key Laboratory of Applied Mathematics, Hubei University(湖北大学应用数学湖北省重点实验室) Key Laboratory of System Control and Information Processing, Ministry of Education, Shanghai Jiao Tong University(上海交通大学系统控制与信息处理教育部重点实验室) Nanjing Institute of Technology(南京工程学院) Jiangsu Province Engineering Research Center of IntelliSense Technology and System, Nanjing(江苏省智能感知技术与系统工程研究中心,南京)

AI总结 本文提出GEFRFE方法,通过引入图分数傅里叶变换扩展通用频率过滤嵌入,提升嵌入信息量,实验表明其能捕捉更丰富的结构特征并提升分类性能。

详情
AI中文摘要

谱图嵌入在图表示学习中通过从图谱信息生成低维向量表示起关键作用。然而,传统谱嵌入方法的嵌入空间往往表达能力有限,无法充分捕捉不同变换域下的潜在结构特征。为解决此问题,本文使用图分数傅里叶变换将现有的最先进通用频率过滤嵌入(GEFFE)扩展到分数域,提出通用分数过滤嵌入(GEFRFE)。GEFRFE通过图分数域过滤和从分数化图拉普拉斯矩阵导出的非线性特征向量成分组合来增强嵌入信息量。为动态确定分数阶数,本文引入了两种并行策略:基于搜索的优化和基于ResNet18的自适应学习。在五个基准数据集上的广泛实验表明,GEFRFE能够捕捉更丰富的结构特征并显著提升分类性能。GEFRFE为图嵌入从“固定域”到“通用域”的发展提供了新范式。结果表明,将GFRFT引入图嵌入领域是正确且有效的研究路径。值得注意的是,所提出的方法保持了与GEFFE方法相当的计算复杂度。

英文摘要

Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibits limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform (GFRFT) to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, yielding the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and ResNet18-based adaptive learning. Extensive experiments on five benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhances classification performance. The GEFRFE provides a new paradigm for the development of graph embedding from the "fixed domain" to the "generalized domain". The results indicate that introducing the GFRFT into the graph embedding domain is a correct and effective research path. Notably, the proposed method retains computational complexity comparable to GEFFE approaches.

2508.00712 2026-05-19 cs.LG cs.AI 版本更新

JSON-Bag: A generic game trajectory representation

JSON-Bag:一种通用的游戏轨迹表示方法

Dien Nguyen, Diego Perez-Liebana, Simon Lucas

发表机构 * GitHub

AI总结 本文提出JSON-Bag模型,通过分词JSON描述并使用Jensen-Shannon距离衡量游戏轨迹,验证了其在六个桌面游戏中对玩家、参数和种子分类的有效性,优于基线方法并提升了准确性。

Comments 8 pages, 3 figures, 6 tables, published in IEEE Conference on Games 2025

详情
AI中文摘要

本文提出JSON-Bag模型,通过分词JSON描述并使用Jensen-Shannon距离衡量游戏轨迹,验证了其在六个桌面游戏中对玩家、参数和种子分类的有效性,优于基线方法并提升了准确性。

英文摘要

We introduce the JSON Bag-of-Tokens model (JSON-Bag) as a method to generically represent game trajectories by tokenizing their JSON descriptions, and we apply Jensen-Shannon distance (JSD) as the distance metric between them. Using a prototype-based nearest-neighbor search (P-NNS), we evaluate the validity of JSON-Bag with JSD on six tabletop games: 7 Wonders, Dominion, Sea Salt and Paper, Can't Stop, Connect4, and Dots and Boxes; each over three game trajectory classification tasks: classifying the playing agents, game parameters, or game seeds that were used to generate the trajectories. Our approach outperforms a baseline using hand-crafted features in the majority of tasks. Evaluating on N-shot classification suggests that using JSON-Bag prototypes to represent game trajectory classes is also sample efficient. Additionally, we demonstrate JSON-Bag's ability for automatic feature extraction by treating tokens as individual features to be used in a Random Forest to solve the tasks above, which significantly improves accuracy on underperforming tasks. Finally, we show that, across all six games, the JSD between JSON-Bag prototypes of agent classes highly correlates with the distances between agents' policies.
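
The core idea, a bag of tokens built from a JSON description and compared with Jensen-Shannon distance, can be sketched in a few lines. This is a minimal illustration, not the paper's tokenizer: the abstract does not specify how JSON is tokenized, so the `path=value` token scheme and the names `tokenize`, `bag`, and `js_distance` are assumptions. The distance itself is the standard Jensen-Shannon distance (square root of the base-2 Jensen-Shannon divergence), which lies in [0, 1].

```python
import json
import math
from collections import Counter


def tokenize(obj, prefix=""):
    """Flatten a parsed JSON value into 'path=value' tokens (assumed scheme)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from tokenize(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for value in obj:
            yield from tokenize(value, prefix + "[]")
    else:
        yield f"{prefix}={obj}"


def bag(json_text):
    """Normalized token frequencies for one JSON trajectory description."""
    counts = Counter(tokenize(json.loads(json_text)))
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def js_distance(p, q):
    """Jensen-Shannon distance between two token distributions (log base 2)."""
    mixture = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl(a):
        # KL divergence from `a` to the mixture; the mixture is positive
        # wherever `a` is, so the ratio is always defined.
        return sum(a[t] * math.log2(a[t] / mixture[t]) for t in a if a[t] > 0)

    divergence = 0.5 * kl(p) + 0.5 * kl(q)
    return math.sqrt(max(divergence, 0.0))


if __name__ == "__main__":
    a = bag('{"player": 1, "cards": ["x", "y"]}')
    b = bag('{"player": 2, "cards": ["z"]}')
    print(js_distance(a, a), js_distance(a, b))
```

A class prototype in the sense of the abstract could then be formed by pooling the token counts of all trajectories in a class before normalizing, with a new trajectory assigned to the class whose prototype is nearest; that pooling step is likewise an assumption about the details.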

2507.21334 2026-05-19 stat.ML cs.LG 版本更新

Graph neural networks for residential location choice: connection to classical logit models

图神经网络在住宅选址选择中的应用:与经典logit模型的联系

Zhanhong Cheng, Lingqian Hu, Yuheng Bu, Yuqi Zhou, Shenhao Wang

发表机构 * Department of Urban and Regional Planning, University of Florida(佛罗里达大学城市与区域规划系) Department of Landscape Architecture and Urban Planning, Texas A&M University(德克萨斯农工大学景观建筑与城市规划系) Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 本文提出基于图神经网络的住宅选址选择模型,通过捕捉空间替代关系,优于传统模型,展现深度学习与离散选择模型结合的潜力。

详情
AI中文摘要

研究人员已将深度学习用于经典离散选择分析,因为它能捕捉复杂的特征关系并取得更高的预测性能。然而,现有深度学习方法无法显式刻画备选项之间的关系,而这一直是经典离散选择模型长期关注的重点。为弥补这一不足,本文引入图神经网络(GNN)作为分析住宅选址选择的新框架。基于GNN的离散选择模型(GNN-DCMs)提供了一种结构化方法,使神经网络能够捕捉空间备选项之间的依赖关系,同时保持与经典随机效用理论的明确联系。理论上,我们证明GNN-DCMs将嵌套logit(NL)模型和空间相关logit(SCL)模型作为两个特例包含在内,并通过备选项效用之间的消息传递给出新的算法解释。实证上,在预测芝加哥77个社区的住宅选址选择时,GNN-DCMs优于基准MNL、SCL和前馈神经网络。在模型解释方面,GNN-DCMs能够捕捉个体异质性,并呈现具有空间感知的替代模式。总体而言,这些结果凸显了GNN-DCMs作为统一且表达能力强的框架,在复杂空间选择情境中融合离散选择建模与深度学习的潜力。

英文摘要

Researchers have adopted deep learning for classical discrete choice analysis as it can capture complex feature relationships and achieve higher predictive performance. However, existing deep learning approaches cannot explicitly capture the relationship among choice alternatives, which has been a long-standing focus in classical discrete choice models. To address this gap, this paper introduces the Graph Neural Network (GNN) as a novel framework to analyze residential location choice. The GNN-based discrete choice models (GNN-DCMs) offer a structured approach for neural networks to capture dependence among spatial alternatives, while maintaining clear connections to classical random utility theory. Theoretically, we demonstrate that the GNN-DCMs incorporate the nested logit (NL) model and the spatially correlated logit (SCL) model as two specific cases, yielding a novel algorithmic interpretation through message passing among alternatives' utilities. Empirically, the GNN-DCMs outperform benchmark MNL, SCL, and feedforward neural networks in predicting residential location choices among Chicago's 77 community areas. Regarding model interpretation, the GNN-DCMs can capture individual heterogeneity and exhibit spatially-aware substitution patterns. Overall, these results highlight the potential of GNN-DCMs as a unified and expressive framework for synergizing discrete choice modeling and deep learning in complex spatial choice contexts.
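
To make the connection to classical models concrete, the sketch below computes choice probabilities under the textbook multinomial logit (MNL) and nested logit (NL) models. It shows only the classical NL formula, in which the utilities within a nest are aggregated into a log-sum before a nest is chosen; it is not the paper's GNN-DCM. The alternative names, utilities, and nest scales are made-up values for illustration.

```python
import math


def mnl_probs(utilities):
    """Multinomial logit: softmax over systematic utilities."""
    m = max(utilities.values())
    expv = {alt: math.exp(v - m) for alt, v in utilities.items()}
    total = sum(expv.values())
    return {alt: e / total for alt, e in expv.items()}


def nested_logit_probs(utilities, nests, scale):
    """Nested logit: P(i) = P(nest) * P(i | nest).

    utilities: {alternative: V_i}
    nests:     {nest name: [alternatives]}, each alternative in exactly one nest
    scale:     {nest name: lambda_m in (0, 1]}
    """
    # Step 1: aggregate member utilities into an inclusive value (log-sum) per nest.
    inclusive = {
        m: math.log(sum(math.exp(utilities[a] / scale[m]) for a in members))
        for m, members in nests.items()
    }
    # Step 2: a nest is chosen with probability proportional to exp(lambda_m * IV_m).
    nest_weight = {m: math.exp(scale[m] * inclusive[m]) for m in nests}
    denom = sum(nest_weight.values())
    probs = {}
    for m, members in nests.items():
        within_total = sum(math.exp(utilities[a] / scale[m]) for a in members)
        for a in members:
            # Step 3: conditional choice inside the nest.
            within = math.exp(utilities[a] / scale[m]) / within_total
            probs[a] = nest_weight[m] / denom * within
    return probs


# Hypothetical community areas grouped into two spatial nests.
V = {"north": 1.0, "south": 0.5, "west": 0.2}
nests = {"A": ["north", "south"], "B": ["west"]}
print(nested_logit_probs(V, nests, {"A": 0.5, "B": 1.0}))
```

Setting every nest scale to 1 collapses the nested model to MNL, which is a quick sanity check for any implementation.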

2507.12969 2026-05-19 cs.LG cs.CV 版本更新

WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring

用于车载振动式基础设施健康监测的WaveletInception网络

Reza Riahi Samani, Alfredo Nunez, Bart De Schutter

发表机构 * Delft Center for Systems and Control (DCSC), Delft University of Technology(代尔夫特理工大学系统与控制中心) Section of Railway Engineering, Department of Engineering Structures, Delft University of Technology(工程结构系铁路工程部)

AI总结 本文提出WaveletInception-BiGRU网络,通过可学习小波包变换提取频谱特征,结合Inception-残差网络进行多尺度特征学习,并利用BiGRU模块整合时间依赖性,实现无需预处理的振动信号分析,提升车载基础设施健康监测的准确性和自动化水平。

Comments Under review for the Journal of Engineering Application of Artificial Intelligence

详情
AI中文摘要

本文提出了一种深度学习框架,用于分析车载振动响应信号以进行基础设施健康监测。所提出的WaveletInception-BiGRU网络采用可学习的小波包变换(LWPT)进行早期频谱特征提取,随后通过一维Inception-残差网络(1D Inception-ResNet)模块进行多尺度、高层次特征学习。双向门控循环单元(BiGRU)模块则整合时间依赖性,并纳入测量速度等运行条件。该方法能够有效分析在不同速度下记录的振动信号,无需显式的信号预处理。序列估计头进一步利用双向时间信息,对基础设施健康状况给出准确的局部化评估。最终,该框架生成与基础设施物理布局在空间上对应的高分辨率健康状况剖面。基于真实测量数据的轨道刚度回归和过渡区分类案例研究表明,所提出的框架显著优于现有最先进方法,证明了其在准确、局部化和自动化的车载基础设施健康监测中的潜力。

英文摘要

This paper presents a deep learning framework for analyzing on-board vibration response signals in infrastructure health monitoring. The proposed WaveletInception-BiGRU network uses a Learnable Wavelet Packet Transform (LWPT) for early spectral feature extraction, followed by one-dimensional Inception-Residual Network (1D Inception-ResNet) modules for multi-scale, high-level feature learning. Bidirectional Gated Recurrent Unit (BiGRU) modules then integrate temporal dependencies and incorporate operational conditions, such as the measurement speed. This approach enables effective analysis of vibration signals recorded at varying speeds, eliminating the need for explicit signal preprocessing. The sequential estimation head further leverages bidirectional temporal information to produce an accurate, localized assessment of infrastructure health. Ultimately, the framework generates high-resolution health profiles spatially mapped to the physical layout of the infrastructure. Case studies involving track stiffness regression and transition zone classification using real-world measurements demonstrate that the proposed framework significantly outperforms state-of-the-art methods, underscoring its potential for accurate, localized, and automated on-board infrastructure health monitoring.

2507.09148 2026-05-19 stat.ML cs.LG math.OC 版本更新

A Randomized Algorithm for Sparse PCA based on the Basic SDP Relaxation

基于基本SDP松弛的稀疏PCA随机算法

Alberto Del Pia, Dekun Zhou

发表机构 * Department of Industrial and Systems Engineering & Wisconsin Institute for Discovery, University of Wisconsin-Madison(威斯康星大学麦迪逊分校工业与系统工程系及威斯康星发现研究所)

AI总结 本文提出基于基本SDP松弛的稀疏PCA随机近似算法,通过构造确定性和随机性解并输出最优解,实现高概率下的稀疏性常数近似比,并在特定条件下保证近似比受对数约束。

Comments 29 pages, 2 figures

详情
AI中文摘要

稀疏主成分分析(SPCA)是一种用于降维的基本技术,属于NP难问题。本文介绍了一种基于基本SDP松弛的随机近似算法,该算法通过构造确定性稀疏解和多个随机解,并输出最优解。该算法在足够多次调用时,近似比最多为稀疏常数。在技术假设下,平均近似比受O(log d)约束,其中d为特征数。我们证明若SDP解低秩或具有指数衰减特征值,则该技术假设成立。我们还展示了两类实例满足该假设,并在协方差模型中证明确定性解可达到近优近似比。通过在真实数据集上的数值测试验证了算法的有效性。

英文摘要

Sparse Principal Component Analysis (SPCA) is a fundamental technique for dimensionality reduction, and is NP-hard. In this paper, we introduce a randomized approximation algorithm for SPCA, which is based on the basic SDP relaxation. Our algorithm takes an (approximate) SDP solution, constructs one deterministic sparse solution and several randomized solutions, and outputs the best among them. Our algorithm has an approximation ratio of at most the sparsity constant with high probability, if called enough times. Under a technical assumption, which is consistently satisfied in our numerical tests, the average approximation ratio is also bounded by $\mathcal{O}(\log{d})$, where $d$ is the number of features. We show that this technical assumption is satisfied if the SDP solution is low-rank, or has exponentially decaying eigenvalues. We then present two classes of instances for which this technical assumption holds. We also demonstrate that in a covariance model, which generalizes the spiked Wishart model, the deterministic solution in our algorithm achieves a near-optimal approximation ratio. We demonstrate the efficacy of our algorithm through numerical tests on real-world datasets.

2506.23978 2026-05-19 cs.LG cs.CL cs.CY cs.SI 版本更新

LLM Agents Are the Antidote to Walled Gardens

大语言模型代理是封闭生态系统的解药

Samuele Marro, Philip Torr

发表机构 * Department of Engineering Science, University of Oxford(牛津大学工程科学系) Institute for Decentralized AI(去中心化人工智能研究所)

AI总结 本文提出通过大语言模型代理实现通用互操作性,打破封闭平台垄断,促进数据可移植性,同时探讨其带来的安全与法律挑战。

Comments Published at the ICML 2026 Position Paper track

详情
AI中文摘要

尽管互联网的核心基础设施最初被设计为开放和通用的,但当今的应用层却由封闭的专有平台主导。开放且可互操作的API需要大量投资,而市场领导者缺乏动力去支持可能削弱用户锁定的数据交换。我们认为,基于大语言模型的代理从根本上打破了这一现状。代理可以自动在数据格式之间进行转换,并与为人类设计的界面交互:这使互操作性的成本大幅降低,并且实际上不可避免。我们将这一转变称为通用互操作性:任意两个数字服务都能借助AI中介的适配器无缝交换数据的能力。通用互操作性削弱了垄断行为,促进了数据可移植性。然而,它也可能带来新的安全风险、技术债务和法律摩擦。我们的立场是,机器学习社区应当拥抱这一发展,同时构建适当的框架来减轻其负面影响。通过现在就采取行动,我们可以利用AI恢复用户自由和竞争性市场,而不牺牲安全。

英文摘要

While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks, technical debt, and legal frictions. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.

2506.22901 2026-05-19 cs.LG cs.AI q-bio.BM q-bio.GN 版本更新

Missing-Modality-Aware Graph Neural Network for Cancer Classification

面向缺失模态的图神经网络用于癌症分类

Sina Tabakhi, Chen Chen, Haiping Lu

发表机构 * School of Computer Science, University of Sheffield(谢菲尔德大学计算机科学学院)

AI总结 本文提出MAGNET模型,通过动态患者-模态多头注意力机制融合低维模态嵌入,以提升部分模态下的多模态预测性能,实验表明其在癌症分类任务中优于现有方法。

Comments 27 pages, 22 figures

详情
AI中文摘要

在学习多模态生物数据时,缺失模态是一个关键挑战,其中某些患者的数据缺失一个或多个模态。现有方法要么排除缺失模态的患者,要么填补缺失模态,或直接使用部分模态进行预测。然而,这些方法大多依赖于不灵活的、患者无关的融合策略,且无法扩展到随着模态数量增加而指数级增长的缺失模态模式。为解决这些限制,我们提出MAGNET(Missing-modality-Aware Graph neural NETwork)以增强部分模态下的多模态预测,其特征是动态患者-模态多头注意力机制,根据贡献和缺失性融合低维模态嵌入。MAGNET融合的复杂性随着模态数量线性增加,同时适应缺失模式的变异性。为了生成预测,MAGNET进一步构建一个患者图,其中融合的多模态嵌入作为节点特征,连接性由模态缺失性决定,随后通过图神经网络进行处理。在三个公共多组学数据集上进行的实验表明,MAGNET在癌症分类任务中优于现有最先进的融合方法。数据和代码可在https://github.com/SinaTabakhi/MAGNET获取。

英文摘要

A key challenge in learning from multimodal biological data is missing modalities, where data from one or more modalities are absent for some patients. Existing approaches either exclude patients with missing modalities, impute missing modalities, or make predictions directly with partial modalities. However, most of these methods rely on inflexible, patient-agnostic fusion strategies and do not scale computationally to the combinatorial growth of missing-modality patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) to enhance multimodal prediction with partial modalities, featuring a dynamic patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their contribution and missingness. MAGNET fusion's complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and connectivity determined by the modality missingness, followed by a graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world missingness, show that MAGNET outperforms state-of-the-art fusion methods. The data and code are available at https://github.com/SinaTabakhi/MAGNET.

2506.11925 2026-05-19 cs.AR cs.AI cs.CV cs.LG 版本更新

Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference

基于知识图谱嵌入和贝叶斯推断的车道变换预测架构的现实世界部署

M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo

发表机构 * Department of Computer Engineering, University of Alcalá(阿尔卡拉大学计算机工程系) Department of Computer Science, German University in Cairo(开罗德国大学计算机科学系) Department of Mechatronics, German University in Cairo(开罗德国大学机电系)

AI总结 本文提出基于知识图谱嵌入和贝叶斯推断的车道变换预测系统,通过现实硬件验证,实现了算法与道路部署的结合,提前3-4秒预测目标车辆车道变换,确保安全。

详情
Journal ref
2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)
AI中文摘要

近年来,车道变换预测研究取得显著进展,但大多数研究局限于仿真或数据集结果,未能实现算法与道路部署的结合。本文通过现实硬件展示了基于知识图谱嵌入(KGEs)和贝叶斯推断的车道变换预测系统。该系统包含感知模块和预测模块:感知模块感知环境,提取数值特征并转换为语言类别,与预测模块通信;预测模块执行KGE和贝叶斯推断模型,预测目标车辆的行驶动作并转换为纵向制动动作。现实硬件实验验证表明,该预测系统能提前3-4秒预测目标车辆的车道变换,为自动驾驶车辆提供充足反应时间,确保车道变换安全。

英文摘要

Research on lane change prediction has gained a lot of momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego-vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives input numerical features, and converts them into linguistic categories; and communicates them to the prediction module; (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.

2506.10959 2026-05-19 cs.LG cs.AI math.ST stat.TH 版本更新

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

在结构流形上理解上下文学习:连接注意力机制与核方法

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

发表机构 * School of Mathematics, Georgia Institute of Technology(佐治亚理工学院数学系) Department of Mathematics, Purdue University(普渡大学数学系)

AI总结 本文研究了在结构几何数据上上下文学习的理论,通过将注意力机制与核方法联系,揭示了transformers在流形上进行核预测的机制,并推导了泛化误差界。

详情
AI中文摘要

尽管上下文学习(ICL)在自然语言和视觉领域取得了显著成功,但其在结构几何数据中的理论理解仍不明确。本文首次对ICL在流形上回归Hölder函数的理论进行了研究。我们建立了注意力机制与经典核方法之间的新联系,证明transformers通过与提示的交互在新查询上进行基于核的预测。这一联系通过数值实验得到验证,显示学习的查询-提示分数与高斯核高度相关。基于此见解,我们推导了泛化误差界,以提示长度和训练任务数量为变量。当观察到足够多的训练任务时,transformers在流形上实现Hölder函数的最小最大回归率,该速率与提示长度呈指数关系,指数取决于流形的内在维度,而非外蕴空间维度。我们的结果还描述了泛化误差随训练任务数量的变化,揭示了transformers作为上下文核算法学习器的复杂性。我们的发现为理解几何在ICL中的作用提供了基础见解,并为研究非线性模型的ICL提供了新工具。

英文摘要

While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with respect to the prompt length with the exponent depending on the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
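
The link between attention and kernel methods can be illustrated with a small numerical check. The sketch below is an assumption-laden toy, not the paper's construction: it uses a single softmax attention head whose scores are plain dot products divided by a hypothetical temperature `h**2`, with prompt points on the unit circle. Because all keys and the query have unit norm, `q·k / h²` differs from `-||q - k||² / (2h²)` only by a constant, so the softmax weights coincide with normalized Gaussian-kernel weights and the attention output equals a Nadaraya-Watson estimate.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_predict(query, keys, values, h):
    """Dot-product softmax attention with temperature h**2."""
    logits = [sum(q * k for q, k in zip(query, key)) / h**2 for key in keys]
    return sum(w * v for w, v in zip(softmax(logits), values))


def gaussian_kernel_predict(query, keys, values, h):
    """Nadaraya-Watson regression with a Gaussian kernel of bandwidth h."""
    weights = [
        math.exp(-sum((q - k) ** 2 for q, k in zip(query, key)) / (2 * h**2))
        for key in keys
    ]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Prompt: samples (x_i, f(x_i)) on the unit circle, a 1-D manifold in R^2.
angles = [2 * math.pi * i / 16 for i in range(16)]
keys = [(math.cos(t), math.sin(t)) for t in angles]
values = [math.sin(2 * t) for t in angles]
query = (math.cos(0.4), math.sin(0.4))

att = attention_predict(query, keys, values, h=0.3)
ker = gaussian_kernel_predict(query, keys, values, h=0.3)
print(att, ker)  # the two estimates agree up to floating-point error
```

The equal-norm condition is what makes the identity exact here; for data that are not norm-constrained, the correspondence holds only approximately or requires additional terms in the scores.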

2505.19590 2026-05-19 cs.LG cs.CL 版本更新

Learning to Reason without External Rewards

无需外部奖励的学习推理

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

发表机构 * UC Berkeley(加州大学伯克利分校) Yale University(耶鲁大学)

AI总结 本文提出Intuitor方法,通过内在反馈实现无需外部奖励的自主学习,实验表明其在数学基准和代码生成等任务中表现优异,为无监督学习提供了新途径。

Comments ICLR 2026

详情
AI中文摘要

通过可验证奖励的强化学习(RLVR)训练大型语言模型(LLMs)进行复杂推理虽然有效,但受限于对昂贵的领域特定监督的依赖。我们探索基于内部反馈的强化学习(RLIF),这一框架使LLMs能够从内在信号中学习,而无需外部奖励或标注数据。我们提出Intuitor,一种以模型自身置信度(称为自确定性,self-certainty)作为唯一奖励信号的RLIF方法。Intuitor在组相对策略优化(GRPO)中用自确定性分数替代外部奖励,实现完全无监督的学习。实验表明,Intuitor在数学基准上与GRPO性能相当,同时在代码生成等领域外任务上具有更好的泛化能力,且无需标准答案或测试用例。我们的发现表明,模型的内在信号能够驱动跨领域的有效学习,为无法获得可验证奖励的自主AI系统提供了可替代RLVR的可扩展方案。代码可在https://github.com/sunblaze-ucb/Intuitor获取。

英文摘要

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
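
A rough sketch of how a confidence score could stand in for an external reward in a GRPO-style update follows. The abstract does not define self-certainty, so the sketch assumes one plausible definition: the KL divergence from the uniform distribution to the model's next-token distribution, averaged over generated tokens. The group-relative advantage (rewards standardized within a group of samples for the same prompt) is the usual GRPO normalization. The function names and toy probabilities are hypothetical.

```python
import math


def self_certainty(token_distributions):
    """Average KL(U || p) over generated tokens (assumed definition).

    token_distributions: one next-token probability list per generated token;
    every probability must be strictly positive.
    """
    total = 0.0
    for probs in token_distributions:
        vocab = len(probs)
        # KL(U || p) = -(1/V) * sum_j log(V * p_j)
        total += -sum(math.log(vocab * p) for p in probs) / vocab
    return total / len(token_distributions)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled responses to one prompt, over a toy 4-token vocabulary.
group = [
    [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]],  # confident
    [[0.40, 0.30, 0.20, 0.10], [0.50, 0.20, 0.20, 0.10]],  # less confident
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # uniform
]
scores = [self_certainty(sample) for sample in group]
print(group_relative_advantages(scores))
```

Under this definition a uniform next-token distribution scores zero and sharper distributions score higher, so the most confident sample in a group receives the largest advantage.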

2505.16831 2026-05-19 cs.CL cs.AI cs.CR cs.LG 版本更新

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

反学习不是删除:调查机器反学习在大语言模型中的可逆性

Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Carnegie Mellon University(卡内基梅隆大学) University of California, Santa Cruz(加州大学圣克鲁兹分校) Huawei Technologies(华为技术有限公司) Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU(未来智能系统中的隐私与安全技术研究中心,PolyU)

AI总结 研究揭示大语言模型反学习的可逆性问题,提出表示层面分析框架,通过PCA相似度、CKA和Fisher信息等指标评估表示漂移,发现四种遗忘模式,指出数据来源影响重学效率,揭示不可逆遗忘的挑战。

Comments ICML 2026, accepted to appear

详情
AI中文摘要

在大语言模型(LLMs)中,反学习旨在移除指定数据,但其效果通常通过任务级指标如准确率和困惑度评估。我们证明这些指标可能误导,因为模型似乎遗忘,但通过最小微调即可恢复原始行为。这种可逆性表明信息被抑制而非真正删除。为填补这一评估空白,我们引入表示层面分析框架。我们的工具包包括PCA相似度和位移、中心核对齐(CKA)和Fisher信息,辅以均值PCA距离作为总结指标,用于衡量表示漂移。在多种反学习方法、数据领域和LLMs上应用此框架,我们识别出四种基于可逆性和灾难性程度的遗忘模式。我们比较了恢复策略,发现重学效率依赖于数据来源。我们还发现不可逆、非灾难性遗忘异常困难。通过探测反学习极限,我们识别出一个看似不可逆的目标遗忘案例,为更稳健的擦除算法提供见解。总体而言,我们的发现揭示了当前评估的差距,并建立了可信反学习的表示层面基础。

英文摘要

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their reversibility and catastrophicity. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.
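
One of the representation-level tools named above, centered kernel alignment, has a compact linear form that can be sketched without external libraries. The example below is an illustration only: it implements the standard linear CKA between two activation matrices (rows are examples, columns are features) and does not reproduce the paper's toolkit or its layer-wise protocol. The toy matrices are invented.

```python
def center_columns(X):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[v - means[j] for j, v in enumerate(row)] for row in X]


def cross_frobenius_sq(A, B):
    """Squared Frobenius norm of A^T B for row-major A (n x p) and B (n x q)."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            total += sum(a[i] * b[j] for a, b in zip(A, B)) ** 2
    return total


def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows."""
    Xc, Yc = center_columns(X), center_columns(Y)
    return cross_frobenius_sq(Xc, Yc) / (
        cross_frobenius_sq(Xc, Xc) ** 0.5 * cross_frobenius_sq(Yc, Yc) ** 0.5
    )


# Activations of a hypothetical layer before and after some model edit.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[-b, a] for a, b in before]          # same geometry, rotated 90 degrees
changed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(linear_cka(before, rotated), linear_cka(before, changed))
```

Linear CKA is invariant to rotations and isotropic rescaling of the features, so a score near 1 indicates that the geometry of the representation is preserved even if individual coordinates have moved.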

2505.16786 2026-05-19 cs.LG 版本更新

FlowMixer: A Depth-Agnostic Neural Architecture for Interpretable Spatiotemporal Forecasting

FlowMixer:一种不依赖深度的神经架构用于可解释的时空预测

Fares B. Mehouachi, Saif Eddin Jabari

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Brooklyn, USA(布鲁克林,美国)

AI总结 FlowMixer通过约束矩阵运算建模结构化时空模式,结合可逆映射框架实现可解释的时空预测,通过Kronecker-Koopman特征模式直接操控预测时间跨度,无需重新训练。

Comments Accepted (main track) at NeurIPS 2025. 44 pages, 17 figures, 22 tables. Published in Advances in Neural Information Processing Systems, vol. 38

详情
Journal ref
Advances in Neural Information Processing Systems, vol. 38, pp. 88811-88861, 2025
AI中文摘要

我们介绍了FlowMixer,一种单层神经架构,利用约束矩阵运算来建模结构化时空模式,提升可解释性。FlowMixer在可逆映射框架中整合非负矩阵混合层,通过先应用变换再应用逆变换的方式实现形状保持设计。这种设计使得Kronecker-Koopman特征模式框架能够连接统计学习与动力系统理论,提供可解释的时空模式,并允许直接进行代数操作,无需重新训练。该架构的半群性质使单层能够通过组合数学上表示任意深度,从而完全消除深度搜索。在多样本域的广泛实验中,FlowMixer展示了长预测时间跨度的能力,同时有效建模物理现象如混沌吸引子和湍流。我们的结果在性能上与最先进的方法相匹配,同时通过可直接提取的特征模式提供更优越的可解释性。这项工作表明,架构约束可以同时保持竞争性的性能并增强神经预测系统的数学可解释性。

英文摘要

We introduce FlowMixer, a single-layer neural architecture that leverages constrained matrix operations to model structured spatiotemporal patterns with enhanced interpretability. FlowMixer incorporates non-negative matrix mixing layers within a reversible mapping framework, applying transforms before mixing and their inverses afterward. This shape-preserving design enables a Kronecker-Koopman eigenmodes framework that bridges statistical learning with dynamical systems theory, providing interpretable spatiotemporal patterns and facilitating direct algebraic manipulation of prediction horizons without retraining. The architecture's semi-group property enables this single layer to mathematically represent any depth through composition, eliminating depth search entirely. Extensive experiments across diverse domains demonstrate FlowMixer's long-horizon forecasting capabilities while effectively modeling physical phenomena such as chaotic attractors and turbulent flows. Our results match the performance of state-of-the-art methods while offering superior interpretability through directly extractable eigenmodes. This work suggests that architectural constraints can simultaneously maintain competitive performance and enhance mathematical interpretability in neural forecasting systems.

2505.03205 2026-05-19 cs.LG cs.NA math.NA math.ST stat.TH 版本更新

Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

用于噪声和任务级流形学习的Transformer:近似和泛化见解

Zhaiming Shen, Alex Havrilla, Rongjie Lai, Alexander Cloninger, Wenjing Liao

发表机构 * School of Mathematics, Georgia Institute of Technology(佐治亚理工学院数学系) Department of Mathematics, Purdue University(普渡大学数学系) Department of Mathematics and Halicioğlu Data Science Institute, University of California, San Diego(加州大学圣地亚哥分校数学系和Halicioğlu数据科学研究所)

AI总结 本文研究了Transformer在噪声和任务级流形上的学习性能,证明了其在低维结构中泛化能力与任务级流形的内在维度密切相关。

详情
AI中文摘要

Transformers作为大语言和视频生成模型的基础架构,如GPT、BERT、SORA及其后续模型。实证研究表明,现实数据和学习任务具有低维结构,伴有噪声或测量误差。Transformer的性能依赖于数据/任务的内在维度,但理论理解仍待探索。本文通过分析回归任务中接近流形的噪声输入数据,建立了Transformer的理论基础。具体而言,输入数据位于流形的管状邻域中,而真实函数依赖于噪声数据在该流形上的投影,称为任务级流形。我们证明了近似和泛化误差,其关键依赖于任务级流形的内在维度。结果表明,即使输入数据受高维噪声扰动,Transformer仍能利用低复杂度结构进行学习。我们的新证明技术通过Transformer构建基本算术运算的表示,可能具有独立兴趣。

英文摘要

Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though a theoretical understanding of this dependence remains largely unexplored. This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data near a manifold. Specifically, the input data are in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto this manifold, referred to as the task-level manifold. We prove approximation and generalization error bounds which crucially depend on the intrinsic dimension of the task-level manifold. Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise. Our novel proof technique constructs representations of basic arithmetic operations by transformers, which may hold independent interest.

2505.02360 2026-05-19 cs.LG cs.AI 版本更新

Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training

灾难性过拟合、熵差与参与比:一种无噪声的 $l^p$ 范数解决方案用于快速对抗训练

Fares B. Mehouachi, Saif Eddin Jabari

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Department of Civil and Urban Engineering(土木与城市工程系) NYU Tandon School of Engineering(纽约大学坦顿工程学院)

AI总结 本文提出基于 $l^p$ 范数的无噪声方法,通过量化梯度集中度和熵测度,自动调整训练范数以缓解灾难性过拟合问题,无需额外正则化或噪声注入。

Comments 26 pages, 13 figures, 5 tables. Preliminary version at NeurIPS 2025 Reliable and Responsible AI Workshop. Code: https://github.com/FaresBMehouachi/lpfgsm

详情
AI中文摘要

对抗训练是稳健深度学习的基石,但快速梯度符号法(FGSM)等快速方法常遭遇灾难性过拟合(CO),即模型对单步攻击鲁棒,却在多步变体面前失效。现有解决方案依赖噪声注入、正则化或梯度裁剪,而本文提出一种仅通过控制 $l^p$ 训练范数来缓解 CO 的新方法。我们的研究源于一项实证观察:CO 在 $l^{\infty}$ 范数下比在 $l^2$ 范数下更普遍。基于这一洞察,我们将广义 $l^p$ 攻击构建为不动点问题,并设计 $l^p$-FGSM 攻击,以理解从 $l^2$ 到 $l^{\infty}$ 的过渡机制。由此得到我们的核心洞察:当高度集中的梯度(信息集中在少数维度上)与激进的范数约束相互作用时,CO 就会出现。通过参与比和熵测度量化梯度集中度,我们开发了自适应 $l^p$-FGSM,它根据梯度信息自动调整训练范数。大量实验表明,该方法无需额外的正则化或噪声注入即可实现强鲁棒性,为缓解 CO 问题提供了一条新颖且有理论依据的途径。

英文摘要

Adversarial training is a cornerstone of robust deep learning, but fast methods like the Fast Gradient Sign Method (FGSM) often suffer from Catastrophic Overfitting (CO), where models become robust to single-step attacks but fail against multi-step variants. While existing solutions rely on noise injection, regularization, or gradient clipping, we propose a novel solution that purely controls the $l^p$ training norm to mitigate CO. Our study is motivated by the empirical observation that CO is more prevalent under the $l^{\infty}$ norm than the $l^2$ norm. Leveraging this insight, we develop a framework for generalized $l^p$ attacks as a fixed-point problem and craft $l^p$-FGSM attacks to understand the transition mechanics from $l^2$ to $l^{\infty}$. This leads to our core insight: CO emerges when highly concentrated gradients, where information localizes in a few dimensions, interact with aggressive norm constraints. By quantifying gradient concentration through Participation Ratio and entropy measures, we develop an adaptive $l^p$-FGSM that automatically tunes the training norm based on gradient information. Extensive experiments demonstrate that this approach achieves strong robustness without requiring additional regularization or noise injection, providing a novel and theoretically principled pathway to mitigate the CO problem.
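
The two ingredients mentioned above, a gradient-concentration measure and an $l^p$-constrained single step, can be sketched as follows. Both formulas are textbook choices assumed for illustration and may differ from the paper's exact definitions: the participation ratio is taken as (Σ g_i²)² / Σ g_i⁴, and the step is the maximizer of the inner product with the gradient under an $l^p$ budget, obtained from Hölder duality. How the paper maps concentration to a value of p is not stated in the abstract and is not modeled here.

```python
import math


def participation_ratio(grad):
    """(sum g_i^2)^2 / sum g_i^4: 1 for a one-hot gradient, d for a flat one."""
    s2 = sum(g * g for g in grad)
    s4 = sum(g ** 4 for g in grad)
    return s2 * s2 / s4


def lp_step(grad, eps, p):
    """Perturbation maximising <grad, delta> subject to ||delta||_p <= eps (p > 1)."""
    q = p / (p - 1)  # Hoelder conjugate exponent, 1/p + 1/q = 1
    norm_q = sum(abs(g) ** q for g in grad) ** (1 / q)
    return [
        eps * math.copysign(1.0, g) * (abs(g) / norm_q) ** (q - 1) for g in grad
    ]


grad = [3.0, -4.0, 0.0]
print(participation_ratio(grad))
print(lp_step(grad, eps=0.1, p=2))    # proportional to the gradient
print(lp_step(grad, eps=0.1, p=200))  # close to eps * sign(grad), as in FGSM
```

For p = 2 the step is the normalized gradient, and as p grows the nonzero coordinates approach ±eps, which recovers the sign-based FGSM step in the limit.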

2503.19950 2026-05-19 cs.LG cs.AI cs.CL 版本更新

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

LogQuant: 一种基于对数分布的2位KV缓存量化技术,具有更优异的精度保持性能

Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen

发表机构 * 4Paradigm(第四范式)

AI总结 LogQuant通过基于对数的过滤机制实现KV缓存的2位量化,减少内存占用的同时保持高性能,实验表明其在吞吐量、批处理大小和准确性上均优于现有方法。

Comments Accepted by ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

详情
AI中文摘要

我们介绍了LogQuant,一种突破性的2位量化技术,用于大型语言模型(LLM)推理中的KV缓存,实现显著的内存节省同时保持优越的性能。先前的方法要么假设后续token更重要,要么基于早期注意力模式预测重要token,但两者都可能导致性能瓶颈或频繁的误预测。LogQuant采取了不同的方法。通过应用基于对数的过滤机制,它在整个上下文中选择性地压缩KV缓存,与现有方法相比,实现更好的性能,甚至减少内存占用。在基准测试中,它提高了25%的吞吐量,提升了60%的批处理大小,而无需增加内存消耗。对于Math和Code Completion等具有挑战性的任务,LogQuant在相同压缩比下将准确性提高了40%至200%,优于其他方法。LogQuant可以轻松集成到流行的推理框架中,如Python的transformers库。实现可在https://github.com/Concyclics/LogQuantKV上获得。

英文摘要

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.

2502.18632 2026-05-19 cs.AI cs.CL cs.CY cs.LG cs.SE 版本更新

Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems

面向编码问题可解释知识追踪的自动化知识组件生成

Zhangqi Duan, Nigel Fernandez, Arun Balajiee Lekshmi Narayanan, Mohammad Hassany, Rafaella Sampaio de Alencar, Peter Brusilovsky, Bita Akram, Andrew Lan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of Pittsburgh(匹兹堡大学) North Carolina State University(北卡罗来纳州立大学)

AI总结 本文提出基于LLM的知识组件生成与标注自动化流程,开发KCGen-KT框架,在不同编程语言的实测数据中验证其优于传统方法和人工编写的知识组件。

Comments Findings of ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

知识组件(KCs)映射到问题有助于建模学生学习,跟踪他们在细粒度技能上的掌握水平,从而在在线学习平台中实现个性化学习和反馈。然而,传统上由人类领域专家进行的知识组件编制和标注工作非常劳动密集。本文提出一个基于LLM的自动化流程,用于开放性编程问题的知识组件生成和标注。我们还开发了一个基于LLM的知识追踪(KT)框架,利用这些LLM生成的知识组件,称为KCGen-KT。我们在两个真实世界的学生代码提交数据集中进行了广泛的定量和定性评估。我们发现KCGen-KT在预测学生未来响应方面优于现有KT方法和人工编写的KCs。我们研究了生成KCs的学习曲线,并显示在认知模型下,LLM生成的KCs比人工编写的KCs拟合更好。我们还进行了与课程讲师的人类评估,以展示我们的流程生成合理的问题-KC映射。

英文摘要

Knowledge components (KCs) mapped to problems help model student learning by tracking their mastery levels on fine-grained skills, thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor intensive. We present an automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on two real-world student code submission datasets in different programming languages. We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human-written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings.

2501.06933 2026-05-19 cs.LG physics.comp-ph physics.flu-dyn 版本更新

Neural equilibria for long-term prediction of nonlinear conservation laws

神经均衡用于非线性守恒律的长期预测

J. Antonio Lara Benitez, Kareem Hegazy, Junyi Guo, Ivan Dokmanić, Michael W. Mahoney, Maarten V. de Hoop

发表机构 * Rice University(莱斯大学) ICSI and University of California at Berkeley(ICSI和加州大学伯克利分校) University of Basel(巴塞尔大学) ICSI, LBNL, and University of California at Berkeley(ICSI、劳伦斯伯克利国家实验室和加州大学伯克利分校)

AI总结 本文提出NeurDE方法,通过结合守恒律与神经网络,实现对非线性守恒律系统更精确的长期预测,优于现有SciML方法。

详情
AI中文摘要

非线性守恒律支配着科学和工业中一大类重要的物理系统,也是科学机器学习(SciML)的核心。大型通用模型速度快,但替换掉求解器的数值和物理结构往往会损害稳定性、准确性和物理保真度。本文旨在通过一种具有守恒意识的SciML骨干模型,即Neural Discrete Equilibrium(NeurDE),在守恒这一通用归纳偏置与神经网络的灵活性和速度之间取得平衡。NeurDE通过学习Boltzmann形式的局部平衡闭合,将机器学习置于动理学求解器内部。动理学求解器仍然执行输运、松弛、矩恢复和守恒;神经网络仅提供非线性平衡目标。我们在6个守恒系统上测试NeurDE,其中包括三个极具挑战性的亚声速、跨声速和超声速激波系统。NeurDE优于现有最先进的SciML方法,包括规模分别大10^4倍和10^6倍的神经算子和预训练SciML基础模型。最值得注意的是,NeurDE改进了其所源自的数值方法。因此,NeurDE为守恒模拟中的科学机器学习提供了一个紧凑的学习目标:学习系统松弛所趋向的平衡律,而非演化律本身。

英文摘要

Nonlinear conservation laws govern a broad class of important physical systems in science and industry and are central to scientific machine learning (SciML). Large general-purpose models offer speed, but replacing the numerical and physical structure of solvers often compromises stability, accuracy, and physical faithfulness. Here, we aim to balance the general inductive bias of conservation with the flexibility and speed of neural networks through a conservation-aware SciML backbone, which we call Neural Discrete Equilibrium (NeurDE). NeurDE places machine learning inside a kinetic solver by learning the local equilibrium closure of a Boltzmann formulation. The kinetic solver still performs transport, relaxation, moment recovery, and conservation; the neural network provides only the nonlinear equilibrium target. We test NeurDE on $6$ conserved systems, including three very challenging subsonic, transonic, and supersonic shock systems. NeurDE outperforms state-of-the-art SciML methods, including neural operators and pretrained SciML foundation models that are $10^4$ and $10^6$ times larger, respectively. Most notably, NeurDE improves upon the numerical method from which it is derived. NeurDE therefore provides a compact target for scientific machine learning in conservative simulation: learn the equilibrium law toward which the system relaxes, not the evolution law itself.

2411.18234 2026-05-19 cs.LG cs.AI cs.PF stat.CO 版本更新

Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification

用于心血管疾病分类的高效混合超参数调优方法

Abhay Kumar Pathak, Mrityunjay Chaubey, Manjari Gupta

发表机构 * Department of Computer Science, Institute of Science, Banaras Hindu University(贝拿勒斯印度教大学理学院计算机科学系) School of Computer Science, University of Petroleum and Energy Studies(石油与能源研究大学计算机科学学院)

AI总结 本文提出一种结合随机搜索和网格搜索的混合超参数调优方法,提升心血管疾病分类模型的准确性和效率,实验表明该方法在性能和计算时间上均优于传统方法。

详情
AI中文摘要

心血管疾病(CVDs)是严重的心脏疾病,需要准确诊断以防止致命后果。超参数调优通过选择最合适的参数配置来提高准确性、泛化性和可靠性,在优化机器学习模型性能中起关键作用。网格搜索系统地评估预定义的超参数组合,而随机搜索则从搜索空间中随机采样配置,以更低的计算成本实现更广泛的探索。因此,在开发时间与预测能力同样关键的分类模型时,高效的调优策略至关重要。本文提出了一种新的超参数调优方法,用于调优CVD分类的机器学习模型。所提出的随机网格搜索结合了随机搜索探索全局空间的能力与网格搜索在最有前景区域内集中而彻底的搜索。这种混合方法在探索与利用之间取得平衡,得到稳健且省时的机器学习模型。在最先进模型上的实验结果表明,随机网格搜索优于传统超参数调优方法。除模型性能提升外,大多数模型训练所需的计算时间也显著减少。研究结果突出了所提出的随机网格搜索方法在缩短训练时间和提高计算效率方面的作用。该技术在医疗保健领域的机器学习应用中具有重大潜力,能够提供及时且准确的CVD诊断。

英文摘要

Cardiovascular diseases (CVDs) are serious illnesses of the heart that require accurate diagnosis to prevent fatal consequences. Hyperparameter tuning plays a critical role in optimizing machine learning model performance by selecting the most suitable parameter configurations for improved accuracy, generalization, and reliability. Grid search systematically evaluates predefined hyperparameter combinations, whereas random search samples configurations randomly from the search space, enabling broader exploration at reduced computational cost. An efficient tuning strategy is therefore essential when developing classification models in which time plays a crucial role alongside predictive capability. In this work, we propose a new hyperparameter tuning approach for ML models for CVD classification. The proposed randomized grid search combines the power of random search to explore the global space with the focused, exhaustive search of grid search in the most promising areas. This hybrid approach balances exploration and exploitation and yields a robust and time-efficient ML model for classification settings. Experimental results on state-of-the-art models demonstrate that randomized grid search performs better than traditional hyperparameter tuning methods. In addition to the observed improvement in model performance, the computational time required for training was substantially reduced for most of the models. The results emphasize the reduction in training time and the computational efficiency of the proposed randomized grid search method. The proposed technique has significant potential to advance ML applications in healthcare by providing timely and accurate CVD diagnosis.
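The two-stage idea described in this abstract (random exploration, then an exhaustive grid in the most promising region) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_then_grid_search`, the `shrink` factor that sizes the local grid, and the use of a single best sample as the "promising area" are all assumptions made for the example.

```python
import itertools
import random


def random_then_grid_search(objective, bounds, n_random=20, grid_points=5,
                            shrink=0.2, seed=0):
    """Maximise `objective` over a box `bounds` = {name: (low, high)}.

    Stage 1 (exploration): sample `n_random` configurations uniformly.
    Stage 2 (exploitation): evaluate a full grid inside a smaller box
    centred on the best random sample (box width = shrink * full width).
    """
    if grid_points < 2:
        raise ValueError("grid_points must be at least 2")
    rng = random.Random(seed)
    names = list(bounds)

    # Stage 1: random search over the whole space.
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_random):
        cfg = {n: rng.uniform(*bounds[n]) for n in names}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score

    # Stage 2: grid search around the best sample, clipped to the bounds.
    axes = []
    for n in names:
        low, high = bounds[n]
        half = shrink * (high - low) / 2
        lo, hi = max(low, best_cfg[n] - half), min(high, best_cfg[n] + half)
        step = (hi - lo) / (grid_points - 1)
        axes.append([min(hi, lo + i * step) for i in range(grid_points)])
    for values in itertools.product(*axes):
        cfg = dict(zip(names, values))
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice `objective` would be a cross-validated model score; the grid stage costs `grid_points ** len(bounds)` evaluations, so the local box is only cheap when few hyperparameters are tuned.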

2411.10636 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

缓解孟加拉语分类任务中的外在性别偏见

Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar

发表机构 * Ahsanullah University of Science and Technology(阿沙努拉科学与技术大学) University of California, Riverside(加州大学河滨分校)

AI总结 本文研究了孟加拉语预训练语言模型中的外在性别偏见,构建了四个任务特定的基准数据集,并提出RandSymKL方法以缓解偏见,实验表明其能有效减少偏见并保持高准确率。

详情
AI中文摘要

在本研究中,我们探讨了孟加拉语预训练语言模型中的外在性别偏见,这是一个在低资源语言中鲜有研究的领域。为了评估这种偏见,我们构建了四个人工标注的任务特定基准数据集,用于情感分析、毒性检测、仇恨言论检测和讽刺检测。每个数据集都通过细致的性别扰动进行了增强,通过系统地交换性别化名称和术语并保持语义内容,实现了对性别驱动预测变化的最小配对评估。然后,我们提出RandSymKL,一种整合对称KL散度和交叉熵损失的随机去偏策略,以在任务特定的预训练模型中缓解偏见。RandSymKL是一种精炼的训练方法,以统一的方式整合这些元素,专注于分类任务的外在性别偏见缓解。我们的方法在现有偏见缓解方法上进行了评估,结果表明,我们的技术不仅有效减少了偏见,还与其他基线方法相比保持了竞争性的准确性。为了促进进一步研究,我们已公开了我们的实现和数据集:https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

英文摘要

In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias
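A loss that combines cross-entropy with a symmetric KL term between predictions on an original sentence and its gender-swapped counterpart can be sketched as below. This is only an illustration of that combination under stated assumptions: the weighting `lam`, the function names, and the per-example form are hypothetical, and the randomized component of RandSymKL is not modelled because the abstract does not specify it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def debias_loss(logits_orig, logits_swapped, label, lam=1.0):
    """Cross-entropy on the original input plus a symmetric-KL penalty
    on the gap between original and gender-swapped predictions."""
    p = softmax(logits_orig)
    q = softmax(logits_swapped)
    cross_entropy = -math.log(p[label] + 1e-12)
    return cross_entropy + lam * symmetric_kl(p, q)
```

When the model predicts the same distribution for both members of a minimal pair, the penalty vanishes and the loss reduces to plain cross-entropy.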

2410.07191 2026-05-19 cs.RO cs.LG stat.ME 版本更新

Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving

抑制注意力:因果注意力门控用于自动驾驶中的鲁棒轨迹预测

Ehsan Ahmadi, Ray Mercurius, Soheil Alizadeh, Kasra Rezaee, Amir Rasouli

发表机构 * University of Alberta(阿尔伯塔大学) Noah’s Ark Laboratory, Huawei Technologies Canada(华为加拿大诺亚实验室) Cornell University(康奈尔大学)

AI总结 本文提出CRiTIC模型,通过因果发现网络识别agent间因果关系,并引入因果注意力门控机制提升轨迹预测的鲁棒性和泛化能力,实验表明模型在对抗非因果扰动时鲁棒性提升54%。

Comments Accepted ICRA 2025

详情
AI中文摘要

自动驾驶中的轨迹预测模型易受非因果代理的扰动影响,此类扰动可能导致其他代理轨迹预测错误,进而影响自动驾驶决策的安全性和效率。本文提出CRiTIC模型,利用因果发现网络识别过去时间窗口内代理间的因果关系,并引入因果注意力门控机制,以选择性过滤Transformer架构中的信息。在两个自动驾驶基准数据集上进行了大量实验,评估了模型在对抗非因果扰动和泛化能力方面的鲁棒性。实验结果表明,预测鲁棒性可提升54%而对预测准确性影响不大。此外,本文展示了所提模型在跨域性能上的优越泛化能力,达到29%的改进。进一步细节请参见项目页面:https://ehsan-ami.github.io/critic。

英文摘要

Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose Causal tRajecTory predICtion (CRiTIC), a novel model that utilizes a Causal Discovery Network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel Causal Attention Gating mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to 54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to 29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://ehsan-ami.github.io/critic.

2409.07014 2026-05-19 stat.ML cs.DB cs.LG 版本更新

A Practical Theory of Generalization in Selectivity Learning

选择性学习中的实用泛化理论

Peizhi Wu, Haoshu Xu, Ryan Marcus, Zachary G. Ives

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文从理论与实践角度探讨选择性学习的泛化能力,证明由有符号测度诱导的选择性预测器是可学习的,并改进分布外(OOD)泛化性能。

Comments 15 pages. Technical Report (Extended Version)

详情
AI中文摘要

查询驱动的机器学习模型已作为一种有前途的查询选择性估计技术出现。然而,从理论角度看,这些技术的有效性仍知之甚少,因为实际解决方案与基于Probably Approximately Correct (PAC) 学习框架的最先进理论之间存在显著差距。本文旨在弥合理论与实践之间的差距。首先,我们证明由符号测度诱导的选择性预测器是可学习的,这放松了PAC理论对概率测度的依赖。更重要的是,在此基础上,我们建立了在温和假设下,此类选择性预测器在分布外(OOD)泛化误差界上的有利表现。这些理论进步为我们提供了对查询驱动选择性学习的分布内和分布外泛化能力的更好理解,并促进了两种改进分布外泛化的通用策略的设计。我们实证验证了我们的技术在预测准确性和查询延迟性能方面显著帮助查询驱动选择性模型泛化到分布外查询,同时保持其优越的分布内泛化性能。

英文摘要

Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.

2409.02428 2026-05-19 cs.LG cs.AI cs.CL cs.SY eess.SY 版本更新

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, China(清华大学深圳国际研究生院,清华大学,中国) Department of Computer Science, University of Oxford, United Kingdom(英国牛津大学计算机科学系) Department of Data Science, New Jersey Institute of Technology, USA(美国新泽西理工学院数据科学系)

AI总结 本文提出ERFSL,利用语言模型高效搜索奖励函数,通过生成奖励组件和使用奖励批评者修正代码,实现多目标强化学习任务中零样本学习的高效奖励函数设计。

详情
AI中文摘要

在强化学习任务中,设计和改进复杂定制环境和多重需求的奖励函数具有挑战性。本文提出ERFSL,一种利用大型语言模型(LLMs)的高效奖励函数搜索器,使LLMs成为有效的白盒搜索器,并突出其先进的语义理解能力。具体而言,我们为每个数值明确的用户需求生成奖励组件,并使用奖励批评者识别正确的代码形式。然后,LLMs为奖励组件分配权重以平衡其值,并通过灵活采用方向突变和交叉策略迭代调整权重,类似于遗传算法,基于训练日志分析器提供的上下文。我们将其应用于无直接人类反馈或奖励示例的定制数据收集RL任务(零样本学习)。奖励批评者仅需每个需求一个反馈实例即可有效纠正奖励代码,防止不可纠正的错误。权重初始化使在帕累托解集内获取不同奖励函数而无需权重搜索。即使权重偏差达500倍,平均仅需5.2次迭代即可满足用户需求。ERFSL也适用于大多数使用GPT-4o mini的提示,因为我们分解了权重搜索过程,以降低对数值和长上下文理解能力的要求。

英文摘要

Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to a customized data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.

2406.15797 2026-05-19 cs.LG cs.AI 版本更新

$\texttt{SynC}$: Synergistic Boosting of Structure and Representation for Deep Graph Clustering

$\texttt{SynC}$:深度图聚类的结构与表示协同提升

Shifei Ding, Benyu Wu, Xiao Xu, Ling Ding, Xindong Wu

发表机构 * School of Computer Science and Technology/the School of Artificial Intelligence, China University of Mining and Technology(计算机科学与技术学院/人工智能学院,中国矿业大学) Mine Digitization Engineering Research Center of Ministry of Education(教育部矿山数字化工程研究中心) College of Intelligence and Computing, Tianjin University(智能与计算学院,天津大学) Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology(大数据知识工程重点实验室(教育部),合肥工业大学)

AI总结 SynC通过协同提升结构与表示学习,改进深度图聚类,减少参数并提升在低同配性图上的泛化能力。

详情
AI中文摘要

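将图神经网络(GNNs)用于图聚类已在深度图聚类中取得了有前景的结果。然而,现有方法忽视了表示学习与结构增强之间的相互促进关系:图越同质,节点表示越内聚;节点表示越内聚,结构增强就越可靠。此外,现有基于GNN的模型在低同配性图上的泛化能力较差。为此,我们提出了名为协同深度图聚类网络(SynC)的图聚类框架。SynC采用变换输入图自编码器(TIGAE),通过缓解GAE的表示坍塌问题获得高质量嵌入,用于指导结构增强。随后,我们在细化后的图上重新捕获邻域表示,以获得有利于聚类的嵌入,并进行自监督聚类。值得注意的是,这两个阶段共享权重,在实现协同提升的同时显著减少了模型参数量。此外,我们引入了结构微调策略,以提升模型在低同配性图上的泛化能力。在基准数据集上的大量实验证明了SynC的优越性。代码已在GitHub上发布。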

英文摘要

Employing graph neural networks (GNNs) for graph clustering has shown promising results in deep graph clustering. However, existing methods disregard the reciprocal relationship between representation learning and structure augmentation: the more homogeneous the graph, the more cohesive the node representations; the more cohesive the node representations, the more reliable the structure augmentation becomes. Moreover, the generalization ability of existing GNN-based models on the low homophily graph is relatively poor. To this end, we propose a graph clustering framework named Synergistic Deep Graph Clustering Network (SynC). SynC employs a Transform Input Graph Auto-Encoder (TIGAE) to obtain high-quality embeddings via mitigating the representations collapse issue of GAE for guiding structure augmentation. Then, we re-capture neighborhood representations on the refined graph to obtain clustering-friendly embeddings and conduct self-supervised clustering. Notably, these two stages share weights, resulting in synergistic boosting while significantly reducing the number of model parameters. Additionally, we introduce a structure fine-tuning strategy to improve the model's generalization on the low homophily graph. Extensive experiments on benchmark datasets demonstrate the superiority of SynC. The code is released at GitHub.

2406.13187 2026-05-19 cs.LG 版本更新

Decouple then Converge: Handling Unknown Unlabeled Distributions in Long-Tailed Semi-Supervised Learning

解耦然后收敛:处理长尾半监督学习中未知的未标记分布

Kai Gan, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China(教育部计算机网络与信息集成重点实验室(东南大学))

AI总结 本文提出DeCon方法,通过解耦学习分支处理长尾半监督学习中未标记数据分布未知的问题,通过两个分支互补提升整体性能。

Comments TPAMI Accepted

详情
AI中文摘要

尽管长尾半监督学习(LTSSL)在许多实际分类任务中受到越来越多的关注,但现有LTSSL算法通常假设标记和未标记数据具有几乎相同的类别分布。当这一假设被违反时,这些方法表现不佳,因为它们依赖于偏见的模型生成伪标签。为了解决这个问题,我们提出了一种简单而有效的DeCon方法用于LTSSL。具体来说,DeCon将学习分解为两个专门的分支:一个标准分支专注于头部类别,一个平衡分支专注于尾部类别。在训练过程中,两个分支相互作用并逐渐收敛,使它们能够互补并最终在所有类别上实现强大的性能。尽管其简单性,我们证明DeCon在多种标准LTSSL基准上实现了最先进的性能,例如当标记和未标记数据的类别分布不匹配时,测试准确率平均增加2.7%。即使当类别分布相同,DeCon也始终优于许多复杂的LTSSL算法。此外,我们进行了广泛的消融分析,以分离对DeCon成功最重要的因素。源代码可在https://github.com/Gank0078/DeCon上获得。

英文摘要

While long-tailed semi-supervised learning (LTSSL) has attracted growing attention in many real-world classification tasks, existing LTSSL algorithms typically assume that labeled and unlabeled data share nearly identical class distributions. When this assumption is violated, these methods can perform poorly because they rely on biased model-generated pseudo-labels. To address this issue, we propose a simple yet effective approach called DeCon for LTSSL with unknown unlabeled class distributions. Specifically, DeCon decouples learning into two specialized branches: a standard branch that focuses on head classes and a balanced branch that focuses on tail classes. During training, the two branches interact and gradually converge, allowing them to complement each other and ultimately achieve strong performance across all classes. Despite its simplicity, we show that DeCon achieves state-of-the-art performance on a variety of standard LTSSL benchmarks, e.g., an averaged 2.7% absolute increase in test accuracy against existing algorithms when the class distributions of labeled and unlabeled data are mismatched. Even when the class distributions are identical, DeCon consistently outperforms many sophisticated LTSSL algorithms. Furthermore, we conduct extensive ablation analyses to tease apart the factors that are the most important to the success of DeCon. The source code is available at https://github.com/Gank0078/DeCon.

2405.14657 2026-05-19 cs.LG stat.ML 版本更新

Anchor-Based Heteroscedastic Noise for Preferential Bayesian Optimization

基于锚点的异方差噪声用于偏好贝叶斯优化

Marshal Arijona Sinaga, Julien Martinelli, Samuel Kaski

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Aalto University(阿尔托大学) University of Manchester(曼彻斯特大学)

AI总结 本文提出一种异方差噪声模型用于偏好贝叶斯优化,通过用户提供的可靠示例(锚点)和核密度估计生成用户不确定性图,并推导出风险规避的获取函数,提升风险调整性能。

Comments Camera-ready version (ProbML 2026)

详情
AI中文摘要

偏好贝叶斯优化(PBO)通过成对比较学习潜在效用,但现有方法大多假设比较噪声是同方差的。这在人在回路的场景中并不充分,因为用户可能对某些设计能可靠地比较,而对另一些设计则犹豫不决。本文提出一种用于PBO的异方差噪声模型:在优化之前,用户提供少量可靠示例(称为锚点),核密度估计器(KDE)将这些锚点转化为依赖于输入的用户不确定性图。我们将该图整合到偏好高斯过程(GP)代理模型中,并推导出在效用与比较难易程度之间权衡的风险规避采集函数。我们进一步证明,常用的最优选项期望效用(EUBO)的风险调整变体在相差一个加性常数的意义下保持一步贝叶斯最优性保证,并且在理想化的独立同分布锚点模型下,KDE估计器具有标准的一致性和集中速率。在合成问题和人类偏好数据集上的实验显示了风险调整性能的提升,并阐明了锚点放置对方法的影响。

英文摘要

Preferential Bayesian optimization (PBO) learns latent utilities from pairwise comparisons, but most existing methods assume homoscedastic comparison noise. This is inadequate in human-in-the-loop settings, where a user may compare some designs reliably and others only hesitantly. We propose a heteroscedastic noise model for PBO: before optimization, the user provides a small set of reliable examples, called anchors, and a kernel density estimator (KDE) turns these anchors into an input-dependent map of user uncertainty. We incorporate this map into preferential GP surrogates and derive risk-averse acquisition functions that trade off utility and ease of comparison. We further show that a risk-adjusted variant of the popular expected utility of the best option (EUBO) preserves the one-step Bayes-optimality guarantee up to an additive constant, and that under an idealized i.i.d. anchor model the KDE estimator enjoys standard consistency and concentration rates. Experiments on synthetic problems and human-preference datasets show improved risk-adjusted performance and clarify how anchor placement affects the method.

2404.00470 2026-05-19 cs.SD cs.LG eess.AS 版本更新

Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network

基于Transformer卷积神经网络的短片段儿童心音分类

Md Hassanuzzaman, Nurul Akhtar Hasan, Mohammad Abdullah Al Mamun, Khawza I Ahmed, Ahsan H Khandoker, Raqibul Mostafa

AI总结 本文研究了用于自动分类心音的最短信号持续时间,采用基于MFCC特征的Transformer残差一维卷积神经网络,发现5秒信号能获得93.69%的准确率,而3秒信号信息不足,15秒信号噪声较多。

Comments 16 pages,11 Figures

详情
Journal ref
IEEE Access, vol. 13, pp. 93852-93868, 2025
AI中文摘要

先天性心脏病(CHDs)是由于心脏和大血管结构缺陷导致的先天异常。心音图(PCG)能提供关于心脏机械传导系统的重要信息,并指出与不同CHD类型相关的特定模式。本研究旨在调查自动分类心音所需的最短信号持续时间。此外,研究还探讨了信号质量评估指标RMSSD(逐次差值均方根)和ZCR(过零率)的最佳取值。研究以MFCC特征为输入,构建了基于Transformer的残差一维卷积神经网络,用于分类心音。研究显示,0.4是RMSSD和ZCR指标获得合适信号的理想阈值。此外,5秒是有效心音分类所需的最小信号长度。研究还表明,较短的信号(3秒心音)信息不足,无法准确分类,而较长的信号(15秒心音)可能包含更多噪声。5秒信号在区分心音方面获得了最佳准确率93.69%。

英文摘要

Congenital anomalies arising from defects in the structure of the heart and great vessels are known as congenital heart diseases (CHDs). A phonocardiogram (PCG) can provide essential details about the mechanical conduction system of the heart and point out specific patterns linked to different kinds of CHD. This study investigates the minimum signal duration required for the automatic classification of heart sounds. It also investigates the optimum values of two signal quality assessment indicators: the root mean square of successive differences (RMSSD) and the zero-crossing rate (ZCR). Mel-frequency cepstral coefficient (MFCC) features are used as input to a Transformer-based residual one-dimensional convolutional neural network, which then classifies the heart sounds. The study shows that 0.4 is the ideal threshold on the RMSSD and ZCR indicators for obtaining suitable signals. Moreover, a minimum signal length of 5 s is required for effective heart sound classification. Shorter signals (3 s) do not carry enough information to categorize heart sounds accurately, while longer signals (15 s) may contain more noise. The best accuracy, 93.69%, is obtained for the 5 s signal.
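The two signal quality indicators named in this abstract have standard definitions, sketched below. It is an assumption of this example that both are computed directly on amplitude samples of a segment; the abstract does not say how the signal is normalized before the 0.4 threshold is applied, nor in which direction the threshold is used, so no filtering rule is shown.

```python
import math


def rmssd(signal):
    """Root mean square of successive differences of a sample sequence."""
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ.

    Samples equal to zero are treated as non-negative.
    """
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)
```

Both functions expect at least two samples; a rapidly alternating segment scores high on both indicators, while a flat segment scores zero on both.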

2403.11782 2026-05-19 cs.LG stat.ML 版本更新

A tutorial on learning from preferences and choices with Gaussian Processes

基于高斯过程的偏好与选择学习教程

Alessio Benavoli, Dario Azzimonti

发表机构 * School of Computer Science and Statistics, Trinity College Dublin(都柏林三一学院计算机科学与统计学院) SUPSI, Dalle Molle Institute for Artificial Intelligence (IDSIA)(SUPSI Dalle Molle人工智能研究所,IDSIA)

AI总结 本文介绍了利用高斯过程进行偏好学习的框架,结合经济学和决策理论原理,提出新颖的模型以填补现有文献的空白。

详情
AI中文摘要

偏好建模处于经济学、决策理论、机器学习和统计学的交汇点。通过理解个体的偏好和选择方式,可以构建更符合预期的产品,推动在广泛领域内更高效和个性化应用的发展。本文旨在介绍一个连贯且全面的高斯过程(GPs)偏好学习框架,展示如何将理性原则无缝融入学习过程。通过适当调整似然函数,该框架能够构建包含随机效用模型、辨别极限以及多重冲突效用场景的偏好学习模型。本文在已有研究基础上,同时引入了一些新的基于高斯过程的模型,以解决现有文献中的特定缺口。

英文摘要

Preference modelling lies at the intersection of economics, decision theory, machine learning and statistics. By understanding individuals' preferences and how they make choices, we can build products that closely match their expectations, paving the way for more efficient and personalised applications across a wide range of domains. The objective of this tutorial is to present a cohesive and comprehensive framework for preference learning with Gaussian Processes (GPs), demonstrating how to seamlessly incorporate rationality principles (from economics and decision theory) into the learning process. By suitably tailoring the likelihood function, this framework enables the construction of preference learning models that encompass random utility models, limits of discernment, and scenarios with multiple conflicting utilities for both object- and label-preference. This tutorial builds upon established research while simultaneously introducing some novel GP-based models to address specific gaps in the existing literature.

2401.03717 2026-05-19 cs.LG cs.AI 版本更新

Universal Time-Series Representation Learning: A Survey

通用时间序列表示学习:综述

Patara Trirat, Yooju Shin, Junhyeok Kang, Youngeun Nam, Jihye Na, Minyoung Bae, Joeun Kim, Byunghyun Kim, Jae-Gil Lee

发表机构 * KAIST(韩国科学技术院)

AI总结 本文综述了时间序列数据表示学习方法,探讨了深度学习在提取隐藏模式中的优势,并提出了新的分类方法以指导未来研究。

Comments Accepted by ACM Computing Surveys. Extended version: 41 pages, 7 figures

详情
AI中文摘要

时间序列数据存在于现实世界的各个方面,从天空中的卫星到身上的可穿戴设备。通过提取和推断有价值的信息来学习表示对于理解复杂现象的动力学和做出明智决策至关重要。深度学习在无需手动特征工程的情况下展示了在时间序列数据中提取隐藏模式和特征的卓越性能。本文首先提出了一种基于三种基本要素的新分类方法,用于设计最先进的通用表示学习方法。根据该分类法,本文全面回顾了现有研究,讨论了这些方法如何提高学习表示的质量。最后,作为未来研究的指南,本文总结了常用的实验设置和数据集,并讨论了几个有前途的研究方向。相关资源可在https://github.com/itouchz/awesome-deep-time-series-representations上找到。

英文摘要

Time-series data exists in every corner of real-world systems and services, ranging from satellites in the sky to wearable devices on human bodies. Learning representations by extracting and inferring valuable information from these time series is crucial for understanding the complex dynamics of particular phenomena and enabling informed decisions. With the learned representations, we can perform numerous downstream analyses more effectively. Among several approaches, deep learning has demonstrated remarkable performance in extracting hidden patterns and features from time-series data without manual feature engineering. This survey first presents a novel taxonomy based on three fundamental elements in designing state-of-the-art universal representation learning methods for time series. According to the proposed taxonomy, we comprehensively review existing studies and discuss their intuitions and insights into how these methods enhance the quality of learned representations. Finally, as a guideline for future studies, we summarize commonly used experimental setups and datasets and discuss several promising research directions. An up-to-date corresponding resource is available at https://github.com/itouchz/awesome-deep-time-series-representations.

2310.07983 2026-05-19 cs.LG math.OC stat.ML 版本更新

Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization

通过ProxSkip在分布式随机优化中实现线性加速

Luyao Guo, Sulaiman A. Alghunaim, Kun Yuan, Laurent Condat, Jinde Cao

发表机构 * School of Computer Science and Engineering, Suzhou University of Technology(苏州科技大学计算机科学与工程学院) Department of Electrical Engineering, Kuwait University(科威特大学电气工程系) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) School of Mathematics, Southeast University(东南大学数学学院) Purple Mountain Laboratories(紫金山实验室)

AI总结 本文研究了ProxSkip在非凸设置下的收敛性,证明其在节点数量上实现线性加速,并展示了局部更新对通信效率的提升作用。

详情
AI中文摘要

ProxSkip算法在分布式优化中因其减少通信的效果而受到越来越多的关注。然而,现有分析仅限于强凸设置,无法实现节点数量的线性加速。本文重新审视去中心化ProxSkip,回答了其在非凸设置下的行为及线性加速的可实现性问题。我们为随机非凸、凸和强凸问题提供了统一的收敛分析,揭示了梯度噪声、局部更新、网络连通性和数据异质性如何共同决定收敛行为。到目前为止,这是首次证明去中心化ProxSkip在随机梯度下实现节点数量线性加速的分析。此外,我们的结果表明,局部更新可以有效减少通信频率并提高通信效率。

英文摘要

The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit decentralized ProxSkip and answer these questions affirmatively. We provide a unified convergence analysis for stochastic non-convex, convex, and strongly convex problems, revealing how gradient noise, local updates, network connectivity, and data heterogeneity jointly determine the convergence behavior. To the best of our knowledge, this is the first analysis showing that decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients. Moreover, our results demonstrate that local updates can effectively reduce communication frequency and improve communication efficiency.

2309.05646 2026-05-19 cs.CR cs.LG cs.NI 版本更新

Lightweight CNN-Based DDoS Detection for Resource-Constrained Edge Networks

轻量级基于CNN的DDoS检测用于资源受限的边缘网络

Vedanth Ramanathan, Krish Mahadevan, Sejal Dua

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Green River College(绿河学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出一种轻量级监督深度学习方法,利用CNN检测DDoS攻击,通过提取包流特征并进行分类,实现低延迟的边缘网络检测。

详情
AI中文摘要

分布式拒绝服务(DDoS)攻击仍然是互联网服务、边缘网络和网络物理基础设施可用性的持续威胁。尽管最近的AI安全工作越来越多地关注基础模型、自主代理和对抗鲁棒性,但许多运营防御任务仍然需要靠近网络边缘的低延迟分类,其中云规模分析可能太慢或昂贵。本文提出了一种轻量级监督深度学习方法,使用卷积神经网络(CNN)对来自CIC-DDoS2019基准数据集的包流表示进行训练。所提出的流程从PCAP流量中提取包流,将其标准化为固定长度的表示,并使用紧凑的CNN架构(包含卷积、丢弃、池化和Sigmoid分类层)将每个流分类为良性或恶意。在测试集上,模型在0.28秒内处理评估的测试流,达到0.9883的准确率、0.9864的精确率、0.9784的召回率和0.9824的F1分数。这些结果表明,紧凑的神经模型可以为面向边缘的DDoS检测提供有用的早期预警信号。我们进一步讨论了部署限制、基准限制以及跨数据集评估、硬件感知分析和与缓解管道集成的未来方向。

英文摘要

Distributed Denial of Service (DDoS) attacks remain a persistent threat to the availability of Internet services, edge networks, and cyber-physical infrastructure. Although recent AI-security work has increasingly focused on foundation models, autonomous agents, and adversarial robustness, many operational defense tasks still require low-latency classification close to the network edge, where cloud-scale analysis may be too slow or expensive. This paper presents a lightweight supervised deep learning approach for DDoS detection using a convolutional neural network (CNN) trained on packet-flow representations derived from the CIC-DDoS2019 benchmark dataset. The proposed pipeline extracts packet flows from PCAP traffic, normalizes them to fixed-length representations, and classifies each flow as benign or malicious using a compact CNN architecture with convolution, dropout, pooling, and sigmoid classification layers. On a held-out test set of previously unseen flows, the model achieves 0.9883 accuracy, 0.9864 precision, 0.9784 recall, and 0.9824 F1 score, while processing the evaluated test flows in 0.28 seconds. These results suggest that compact neural models can provide useful early-warning signals for edge-oriented DDoS detection. We further discuss deployment constraints, benchmark limitations, and future directions for cross-dataset evaluation, hardware-aware profiling, and integration with mitigation pipelines.

2305.10721 2026-05-19 cs.LG cs.AI 版本更新

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

重新审视长期时间序列预测:对线性映射的调查

Zhe Li, Shiyi Qi, Yiduo Li, Zenglin Xu

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳),中国)

AI总结 本文研究了长期时间序列预测中线性映射的有效性,揭示了仿射映射在周期信号预测中的关键作用,并探讨了可逆归一化和输入时间跨度对模型鲁棒性的影响。

详情
Journal ref
Li, Zhe, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting Long-Term Time Series Forecasting: an Investigation on Affine Mapping. Academia AI and Applications 2, no. 2 (2026)
AI中文摘要

引言:长期时间序列预测(LTSF)近年来获得了广泛关注。尽管存在各种专门设计来捕捉时间依赖性的方法,但近期研究表明,甚至一个单一的线性层也能取得竞争性的性能。本文研究了近期LTSF方法的内在有效性,并揭示了仿射映射在周期信号预测中的关键作用。材料和方法:我们对模拟和现实世界的数据集进行了全面实验,以分析最先进模型的组成部分。我们提供了理论分析,解释仿射映射在周期信号预测中的工作机制。我们评估了可逆归一化和输入时间跨度扩展对模型鲁棒性的影响。结果:我们发现(1)仿射映射在常用的基准测试中主导了预测性能,模型从输入到输出学习了相似的转换矩阵;(2)仿射映射能够有效捕捉周期性模式,但在非周期性信号或具有不同周期的时序数据中表现较差;(3)可逆归一化显著增强了趋势预测,通过将非周期性趋势转换为周期性模式;(4)增加输入时间跨度提高了多通道数据的性能。代码可在:https://github.com/plumprc/RTSF获得。结论:我们的发现为LTSF模型的工作机制提供了理论和实验见解,突显了线性方法的优势和局限性。结果表明,未来模型的发展应关注处理跨通道周期变化和非周期性成分。

英文摘要

Introduction: Long-term time series forecasting (LTSF) has gained significant attention in recent years. While various specialized designs exist for capturing temporal dependency, recent studies have shown that even a single linear layer can achieve competitive performance. This paper investigates the intrinsic effectiveness of recent LTSF approaches and reveals the critical role of affine mapping. Materials and methods: We conduct comprehensive experiments on both simulated and real-world datasets to analyze the components of state-of-the-art models. A theoretical analysis is provided to explain the working mechanisms of affine mapping in periodic signal forecasting. We evaluate the impact of reversible normalization and input horizon extension on model robustness. Results: We find that (1) affine mapping dominates forecasting performance across commonly utilized benchmarks, with models learning similar transition matrices from input to output; (2) affine mapping effectively captures periodic patterns but struggles with non-periodic signals or time series with varying periods across channels; (3) reversible normalization significantly enhances trend forecasting by transforming non-periodic trends into periodic-like patterns; (4) increasing input horizon improves performance on multi-channel data with different periods. Code is available at: https://github.com/plumprc/RTSF. Conclusions: Our findings provide theoretical and experimental insights into the working mechanisms of LTSF models, highlighting both the strengths and limitations of linear approaches. The results suggest that future model development should focus on handling cross-channel period variations and non-periodic components.
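One way to see why an affine map can forecast a periodic signal: if the signal has an integer period no longer than the input window, every future value is a copy of some past value, so a fixed weight matrix reproduces it exactly. The sketch below builds such a matrix by hand rather than learning it; the function names and the zero default bias are assumptions of this example, not the paper's code.

```python
def periodic_forecast_matrix(n_in, n_out, period):
    """Build W (n_out x n_in) such that y = W x forecasts a signal
    whose integer `period` does not exceed the input length `n_in`."""
    if period > n_in:
        raise ValueError("period must not exceed the input length")
    weights = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_out):
        # Future index n_in + j equals a past index shifted back by
        # whole periods until it falls inside the input window.
        src = n_in + j
        while src >= n_in:
            src -= period
        weights[j][src] = 1.0
    return weights


def apply_affine(weights, x, bias=None):
    """Compute y = W x + b for list-based W, x and optional b."""
    if bias is None:
        bias = [0.0] * len(weights)
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
```

If different channels had different periods, no single selection matrix of this kind would fit all of them, which is consistent with result (2) in the abstract.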

2212.05155 2026-05-19 cs.DC cs.LG 版本更新

Cost-aware Duration Prediction for Software Upgrades in Datacenters

面向数据中心软件升级的成本感知持续时间预测

Yi Ding, Aijia Gao, Thibaud Ryden, Michal Sedlak, Essam Ewaisha, Igor Marnat, Henry Hoffmann

发表机构 * Meta

AI总结 本文提出Acela框架,通过考虑不对称预测成本和选择最佳模型,提升数据中心软件升级调度效率和吞吐量,实测提升升级窗口利用率1.25倍,升级数量增加33%。

Comments 18 pages, 25 figures. The 9th MLSys Conference (Industry Track), Bellevue, WA, USA, 2026

详情
AI中文摘要

软件升级是维护数据中心服务器可靠性的重要环节。尽管作业持续时间预测和调度已广泛研究,但软件升级带来的独特挑战仍被低估。本文首次深入研究数据中心级别的软件升级调度。我们首先刻画各种升级类型,然后将调度任务建模为约束优化问题。为解决此问题,我们引入Acela,一种成本感知的持续时间预测框架,旨在提高升级调度效率和吞吐量,同时满足服务等级目标(SLOs)。Acela考虑不对称的预测成本,战略性地选择最佳预测模型,并缓解滞后效应导致的预测过高。在Meta生产数据中心系统的评估中,Acela显著提高了现有升级调度器的效率,通过提升升级窗口利用率1.25倍,增加计划和完成的升级数量33%和41%,并减少取消率2.4倍。代码和数据集将在论文通过后发布。

英文摘要

Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela significantly increases efficiency of the existing upgrade scheduler by improving upgrade window utilization by 1.25X, increasing the number of scheduled and completed upgrades by 33% and 41%, and reducing cancellation rates by 2.4X. The code and data sets will be released after paper acceptance.

2605.16640 2026-05-19 cs.LG 版本更新

Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

在混合DeltaNet-注意力解码器中证明更短的scratchpad

Tomasz Steifer

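Affiliations: Centre for Credible AI; Warsaw University of Technology

AI summary: Studies the expressive power of hybrid recurrent-attention decoders and proves that the hybrid architecture has a formal advantage in expressivity and efficiency: under a constant-precision assumption, a Qwen-style hybrid solves the parity-conditioned retrieval task with a constant scratchpad, whereas pure Gated DeltaNet admits no similar solution and pure Gated Attention requires at least a polynomial scratchpad.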

Comments Under review at a ML conference

详情
AI中文摘要

我们研究了混合递归-注意力解码器的表达能力,这类架构用于最近的开源语言模型如Qwen3-Next及其后续版本。这些模型结合了门控注意力头和递归门控DeltaNet头。是否存在这样的混合架构在模型表达性或效率上有正式优势?我们证明确实存在。我们定义了parity-conditioned检索任务,并显示在常数精度假设下,Qwen风格的混合门控DeltaNet和门控注意力解码器能以常数scratchpad解决该任务,或等价于$O(1)$的思维链步骤。相比之下,纯门控DeltaNet模型没有类似的解决方案,而纯门控注意力模型至少需要多项式级别的scratchpad。

英文摘要

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.

2605.16639 2026-05-19 cs.LG 版本更新

MedMIX: Modality-Internal Expert Fusion for Multimodal Medical Diagnosis

MedMIX:多模态医学诊断中的模态内部专家融合

Seungik Cho, Anqi Li, Wei Qiu

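Affiliations: Department of Physics and Astronomy; Department of Electrical and Computer Engineering; Rice University

AI summary: MedMIX improves the robustness of multimodal medical prediction by combining intra-modality expert fusion, learned inter-modality fusion, and training-only large-small model collaboration, and is suited to settings with missing modalities.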

详情
AI中文摘要

多模态临床预测面临三个挑战:每种模态有多个互补基础模型、训练和测试时存在普遍缺失模态、以及模态贡献的样本特异性变化。我们引入MedMIX,一种多模态框架,结合模态内部专家融合、学习跨模态融合以及训练时的大-小模型协作,以在不完整模态下实现稳健的医学预测。在每种模态内,MedMIX聚合多个小型专家模型的互补嵌入;跨模态时,它在可用模态上执行学习融合;训练时,它利用大教师模型来改进部署的表示,而无需额外推理成本。在三个异质基准(OpenI、MIMIC-IV-MM和MMIST-ccRCC)上,MedMIX在保持缺失模态扰动下的鲁棒性的同时实现了持续强劲的性能,并进一步在MIMIC-III上展示了跨队列转移的持续鲁棒性。这些结果突显了MedMIX作为一种实用框架,它统一了模态内部专家协作、样本特异性跨模态融合以及高效的大小模型协作,同时在不完整模态下保持稳健。

英文摘要

Multimodal clinical prediction faces three challenges: multiple foundation models (FMs) with complementary strengths per modality, pervasive missing modalities at training and test time, and sample-specific variation in modality contributions. We introduce MedMIX, a multimodal framework that combines intra-modality expert fusion, learned inter-modality fusion, and training-only large-small model collaboration for robust medical prediction under incomplete modalities. Within each modality, MedMIX aggregates complementary embeddings from multiple small expert models; across modalities, it performs learned fusion over available modalities; and during training, it leverages large teacher models to improve deployed representations without additional inference cost. Across three heterogeneous benchmarks (OpenI, MIMIC-IV-MM, and MMIST-ccRCC), MedMIX achieves consistently strong performance while remaining robust under controlled missing-modality perturbations, and further demonstrates sustained robustness under cross-cohort shift on MIMIC-III. These results highlight MedMIX as a practical framework that unifies within-modality expert collaboration, sample-specific cross-modality fusion, and efficient large-small model collaboration while remaining robust to incomplete modalities.

2605.16632 2026-05-19 cs.LG cs.AI cs.LO 版本更新

Learning How to Cube

学习如何求立方

Ferhat Erata, Sam Kouteili, Thanos Typaldos, Timos Antonopoulos, Robert B. Jones, Byron Cook, Ruzica Piskac

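Affiliations: Yale University; AWS Agentic AI

AI summary: This paper proposes a neuro-symbolic post-training framework that, using an MCTS-based data curation pipeline and symbolic heuristics, enables a 4B-parameter model to reach a pass@5 score of 53 on SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4.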

Comments 33 pages, preprint

详情
AI中文摘要

尽管Cube-and-Conquer(C&C)在解决具有挑战性的布尔可满足性(SAT)问题上非常有效,但之前的工作没有展示基于Transformer的模型能够学习有效的求立方启发式方法。我们介绍了一种神经符号后训练框架。我们设计了一个基于MCTS的数据整理管道,利用符号启发式方法在SAT竞赛公式上探索分割决策,生成基于求解器统计信息的偏好数据,并辅以教师模型的推理轨迹。我们的两阶段后训练,监督微调(SFT)后接直接偏好优化(DPO),使4B参数模型在100个SAT竞赛基准上取得53的pass@5分数,超越了前沿LLM如Claude-Sonnet-4(50)并匹配最佳符号启发式(53)。消融实验显示,SFT单独将pass@5提升至51,DPO增加2个基准;对实际首次立方决策的熵/一致消融显示,SFT而非DPO导致根层决策多样性,产生互补的运行覆盖。这表明Transformer可以在传统由符号方法主导的领域中被训练出有效的求立方决策。

英文摘要

Despite the effectiveness of Cube-and-Conquer (C&C) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.

2605.16622 2026-05-19 cs.LG math.OC stat.ML 版本更新

Does Weight Decay Enhance Training Stability?

权重衰减是否增强训练稳定性?

Marius Saether, Amir Kolic, Tomaso Poggio, Pierfrancesco Beneventano

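Affiliations: NTNU; MIT

AI summary: This paper studies the mechanism by which weight decay affects the stability of training dynamics, finding that it acts through changes in parameter-space dynamics and loss sharpness, and reveals an architecture-dependent phase transition.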

Comments 24 pages, 16 figures

详情
AI中文摘要

在现代深度学习中,权重衰减常被归功于

英文摘要

In modern deep learning, weight decay is often credited with "stabilizing" training dynamics, diverging from its classical role as a static regularization penalty. We investigate a fundamental question: *does weight decay stabilize training dynamics, and if so, through which mechanism?* Indeed, training stability is understood through different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the *Edge of Stability* (EoS). We show that weight decay robustly slows *progressive sharpening*. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical $\frac{2}{\eta}$ boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Last, this shows that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.

2605.16620 2026-05-19 cs.LG 版本更新

SCOUT: Cyclic Causal Discovery Under Soft Interventions with Unknown Targets

SCOUT:在软干预下未知目标的循环因果发现

Alpar Turkoglu, Muralikrishnna G. Sethuraman, Faramarz Fekri

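Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology

AI summary: SCOUT is a new framework that recovers graph structure by maximizing the data log-likelihood, using contractive residual flows and neural spline flows to learn nonlinear cyclic causal relationships from soft interventional data, and outperforms existing methods.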

详情
AI中文摘要

学习变量间的因果关系是跨学科的重要研究领域。现有因果发现算法通常假设系统是无环的,外生噪声变量是高斯分布的,并且数据生成实验的干预目标是已知的。然而这些假设在现实系统中不成立。大多数现有方法要么假设底层模型是线性的,要么受限于有限的干预设置。为此,我们提出SCOUT,一种新的因果发现框架,用于从具有未知目标的软干预数据中学习非线性循环因果关系。我们的方法通过最大化数据对数似然来恢复图结构,使用两种归一化流架构:合同残差流和神经样条流。通过在合成和现实数据上的实验,我们证明SCOUT在各种干预和噪声设置中,比现有方法在因果图恢复和未知目标恢复上均表现更优。

英文摘要

Learning causal relationships between variables from data is a fundamental research area with many applications across disciplines. Most existing causal discovery algorithms rely on the assumptions that (i) the underlying system is acyclic, (ii) the exogenous noise variables are Gaussian, and (iii) the intervention targets for the data-generating experiments are known. While these assumptions simplify the analysis, they are violated in real-life systems. Most existing methods that address these issues either assume the underlying model is linear or are constrained to operate in limited interventional settings. To that end, we propose SCOUT, a novel causal discovery framework for learning nonlinear cyclic causal relationships from soft interventional data with unknown targets. Our approach maximizes the data log-likelihood to recover the graph structure, using two normalizing-flow architectures: contractive residual flows and neural spline flows. Through experiments on synthetic and real-world data, we show that SCOUT outperforms state-of-the-art methods in both causal graph recovery and unknown target recovery across various interventional and noise settings.

2605.16616 2026-05-19 cs.LG 版本更新

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

MLReplicate:用于机器学习可重复性自主研究系统的基准测试

Sasi Kiran Gaddipati, Diyana Muhammed, Farhana Keya, Gollam Rabby, Sören Auer

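Affiliations: TIB—Leibniz Information Centre for Science and Technology; L3S Research Center, Leibniz University Hannover

AI summary: This paper introduces the MLReplicate benchmark for evaluating autonomous research systems on machine learning reproducibility, finding that compute scale is not the decisive factor and that a substantial gap remains between current systems and genuine scientific rigor.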

详情
AI中文摘要

能够生成完整科学论文的自主研究系统已取得显著进展,但稳健且真实的评估框架仍未跟上。为弥合这一差距,我们引入MLReplicate,一个端到端的基准测试,评估自主研究系统在机器学习可重复性方面的表现。该基准测试基于ICML 2025中表现突出的论文,重新制定为标准化的输入规范,并在六个最先进的研究系统上进行评估:AI SCIENTIST-V1、AI SCIENTIST-V2、AGENT LABORATORY、CYCLERESEARCHER、AI RESEARCHER和TINY SCIENTIST,生成了45篇论文,其中3篇实验失败。输出通过双协议方法评估,结合自动化会议式评审和结构化专家人工评估,同时跟踪计算成本、运行时间和所需的人类干预量。自动化会议式评审接受了37份有效提交中的10份。另外8份提交在评审前被桌面拒绝,因未达到最低页面阈值。与自动化评审相比,人工评审员一致发现所有系统中存在方法学缺陷、幻觉实验结果和可重复性失败,并且59%的接受自动化评审包含伪造或未经证实的声明。我们进一步发现,token预算和计算成本并不能预测输出质量:最便宜的系统在人工评估中优于最资源密集的系统,尽管输入token数量相差38倍。因此,我们证明自主研究工作流设计比计算规模更重要。MLReplicate揭示了当前自主研究系统与真实科学严谨性之间的显著差距,并建立了一个实用且可扩展的评估框架,以系统性推进可信的人工智能驱动科学发现。

英文摘要

Autonomous research systems capable of generating complete scientific manuscripts have advanced rapidly, yet robust and realistic evaluation frameworks have failed to keep pace. To bridge this gap, we introduce MLReplicate, an end-to-end benchmark evaluating autonomous research systems on machine learning reproducibility. The benchmark was constructed from ICML 2025 outstanding papers reformulated into standardized input specifications and evaluated across 6 state-of-the-art research systems: AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, and TINY SCIENTIST, yielding 45 generated manuscripts, with 3 failed experiments. Outputs are assessed using a dual-protocol approach that combines automated conference-style review and structured expert human evaluation, while tracking computational cost, runtime, and the amount of required human intervention. The automated conference-style review accepted 10 out of 37 valid submissions. An additional 8 submissions were desk-rejected before review for failing to meet the minimum page threshold. In contrast to automated reviews, human reviewers consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures across all systems, and 59% of accepted automated reviews contained fabricated or unsupported claims. We further find that neither token budget nor computational cost predicts output quality: the cheapest system outperforms the most resource-intensive system in human evaluation, despite a 38-fold difference in input tokens. We thus demonstrate that autonomous research workflow design matters more than the scale of compute. MLReplicate exposes a substantial gap between current autonomous research systems and genuine scientific rigor, and establishes a practical, extensible evaluation framework for systematic progress toward trustworthy AI-driven scientific discovery.

2605.16615 2026-05-19 cs.LG 版本更新

Learning What Evaluators Value: A Reliable Approach to Modeling Evaluator Preferences

学习评估者所重视的内容:一种可靠建模评估者偏好的方法

Madeline Celi Kitch, Nihar B. Shah

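Affiliations: Carnegie Mellon University

AI summary: This paper studies how to learn evaluator preferences and proposes a robust algorithm that learns preferences effectively under model mismatch, validated on synthetic and real-world data.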

详情
AI中文摘要

在许多应用中,人类和LLM评估者通过相关标准的评估来创建整体评估。例如,在招生中,委员会根据测试成绩、GPA和研究经验等属性来评估候选人的整体匹配度。在医疗护理中,医生通过患者症状报告来考虑初步诊断并评估风险。每个场景都涉及将多个标准映射到整体评估——这一过程反映了评估者的潜在偏好。我们关注学习这些偏好的根本问题。许多此类问题的应用对评估者偏好做出特定的建模假设,这些假设在现实世界中可能被大幅违反。我们做出最小的假设,即偏好函数是逐坐标非递减的,这在大量评估场景中是合理的。我们理论上表征了多种常见假设的模型不匹配严重性,并表明这可能导致学习评估者偏好及其他重要下游任务的重大问题。然后,我们提出了一种学习评估者偏好的算法,该算法对模型不匹配具有鲁棒性。我们证明了当线性假设成立时,我们的算法可以学习任何偏好函数而不牺牲性能。通过合成模拟和实际数据对我们的算法进行评估,证实了其在学习偏好方面的鲁棒性,并展示了LLM和人类偏好的关键方面。

英文摘要

In many applications, human and LLM evaluators use assessments of relevant criteria to create an overall evaluation for an item or individual. For example, in admissions, committees assess candidates on attributes such as test scores, GPA, and research experience to evaluate their overall fit for the program. Another example arises in medical care where clinicians use patient reports of symptoms to consider preliminary diagnoses and assess risks. Each setting involves mapping multiple criteria to an overall evaluation -- a process that reflects the evaluator's underlying preferences. We focus on the fundamental question of learning these preferences. Many applications of this problem make specific modeling assumptions on evaluator preferences that may be substantially violated in the real world. We make the minimal assumption that the preference function is coordinate-wise non-decreasing, which is reasonable in a large number of evaluation settings. We theoretically characterize the severity of model mismatch for many common assumptions and show that it can lead to significant issues for learning evaluator preferences and other important downstream tasks. We then present an algorithm for learning evaluators' preferences that is robust to model mismatch. We prove theoretically that our algorithm can learn any preference function without sacrificing performance when the linearity assumption holds. Evaluations of our algorithm with synthetic simulations and real-world data confirm its ability to learn preferences robustly and illustrate key aspects of LLM and human preferences.

2605.16610 2026-05-19 cs.LG cs.GL 版本更新

Tensor Cookbook: Mastering Tensors through Diagrams

张量烹饪书:通过图表掌握张量

Beheshteh T. Rakhshan, Guillaume Rabusseau

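AI summary: This paper presents tensor networks and their use in tensor algebra through diagrammatic language, showing how graphical methods simplify high-dimensional data manipulation and gradient derivations.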

详情
AI中文摘要

高维数据在许多科学和工程领域自然出现,包括机器学习、信号处理、计算物理和统计学。此类数据通常表示为张量,即矩阵的多维推广。尽管张量为多模态结构提供了自然表示,但随着阶数增长,直接操作变得困难:参数数量呈指数增长,涉及多个索引的代数表达式难以理解和实现。张量网络(TNs)提供了解决这些挑战的有效框架。最初由Penrose引入并在量子物理中得到广泛发展,张量网络的图形语言将收缩表示为图中的边,减少了符号开销并揭示了被索引表示掩盖的结构特性。尽管高维张量在现代机器学习和数值分析中扮演核心角色,但张量网络图在量子计算之外仍被低估,部分原因是缺乏一个自包含的数学参考,可供广泛的技术受众使用。本文提供了一个自包含的张量网络指南及其在张量代数中的应用。我们介绍了张量的主要操作,如收缩、乘积和重塑,通过图形符号,并展示经典张量分解及相关计算如何自然地表达在此框架中。我们还说明了张量网络如何简化梯度推导和高维概率分布的操作。在整个过程中,我们展示了图示方法能够产生真正更简洁和透明的证明,证明经典恒等式、秩界和梯度公式,否则需要繁琐的索引操作。

英文摘要

High-dimensional data arise naturally in many areas of science and engineering, including machine learning, signal processing, computational physics, and statistics. Such data are often represented as tensors, multi-dimensional generalizations of matrices. While tensors provide a natural representation for multi-modal structure, their direct manipulation quickly becomes challenging as the order grows: the number of parameters increases exponentially, and algebraic expressions involving many indices become difficult to interpret and implement. Tensor networks (TNs) provide an effective framework for addressing these challenges. Originally introduced by Penrose and developed extensively in quantum physics, the graphical language of tensor networks encodes contractions as edges in a graph, reducing notational overhead and revealing structural properties obscured by index notation. Despite the central role of high-dimensional tensors in modern machine learning and numerical analysis, tensor network diagrams remain underutilized outside quantum computing, partly due to the lack of a self-contained mathematical reference accessible to a broad technical audience. This manuscript provides a self-contained guide to tensor networks and their use in tensor algebra. We present the main operations on tensors (contractions, products, and reshaping) through graphical notation, and show how classical tensor decompositions and related computations are naturally expressed in this framework. We also illustrate how tensor networks simplify the derivation of gradients and the manipulation of high-dimensional probability distributions. Throughout, we show that the diagrammatic approach yields genuinely shorter and more transparent proofs of classical identities, rank bounds, and gradient formulas that would otherwise require laborious index manipulation.

2605.16604 2026-05-19 cs.LG 版本更新

R2V Agent: Teaching SLMs When to Ask for Help

R2V Agent:教SLMs何时请求帮助

Raghu Vamshi Hemadri, Humaira Firdowse Mohammed, Rishabh Maheshwary, Srivatsava Daruru, Sagar Davasam, Vikas Yadav, Srinivas Sunkara, Sai Rajeswar

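AI summary: R2V-Agent improves the reliability of interactive agents with a risk-calibrated SLM-LLM routing framework that combines a small language model policy, a stronger teacher LLM, a lightweight process verifier, and a calibrated step-level router, significantly improving performance and cost efficiency across multiple benchmarks.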

详情
AI中文摘要

高效的智能体系统应仅在本地模型可能失败的决策中承担昂贵的前沿模型成本。现有LLM级联通常在执行前路由整个查询,但任务难度在轨迹中途变化——在不稳定的工具调用、截断观察或叠加的本地错误后发生,使预执行路由变得脆弱。我们引入R2V-Agent,一种用于交互智能体的风险校准SLM-LLM路由框架。R2V结合四个组件:一个蒸馏的小语言模型(SLM)策略、一个更强的教师LLM、一个轻量级的过程验证器,该验证器在每一步评分候选动作,以及一个校准的步骤级路由器。路由器是我们的核心贡献:在SLM训练后,它在每个步骤估计残余失败风险,并仅在教师干预必要时升级。为了使路由问题明确界定,我们首先使用标准离线流水线训练一个稳定的本地SLM:通过行为克隆(BC)在教师轨迹上进行训练,随后通过验证器引导的直接偏好优化(DPO)进行优化,结合一致性正则化。然后,路由器在该固定策略的残余失败上进行训练,使用Brier校准的概率估计和一个条件价值-at-风险(CVaR)约束的目标,该目标惩罚所有扰动种子下的最坏情况失败。在HumanEval+、TextWorld和TerminalBench四个SLM骨干网络上,R2V改进了可靠性-成本前沿:它在HumanEval+上达到94.3%的成功率,仅需0.60%的LLM升级,将TextWorld从64.6%的SLM-only成功提升到98.2%在41.7%的升级率下,最终在TerminalBench上达到93.3%的成功率,在33.9%的LLM调用中,大致是启发式路由器成本的一半。

英文摘要

Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory (after flaky tool calls, truncated observations, or compounding local errors), making pre-execution routing brittle. We introduce R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves $94.3\%$ HumanEval+ success with $0.60\%$ LLM escalation, recovers TextWorld from $64.6\%$ SLM-only success to $98.2\%$ at $41.7\%$ escalation, and reaches $93.3\%$ TerminalBench success at $33.9\%$ LLM calls, roughly half the heuristic-router cost.

2605.16600 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

预训练写入,对齐读取:Transformer权重空间的不对称性

Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

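AI summary: The study reveals an asymmetry between pretraining and alignment in transformer weight space: by analyzing how weight deltas align with residual-stream activation subspaces and the prediction subspace, it finds that alignment deltas concentrate in the read pathway along principal directions of attention-input activations, while remaining near-isotropic in the write pathway relative to the prediction subspace.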

详情
AI中文摘要

交叉熵预训练和偏好对齐更新相同的Transformer权重,但留下几何上不同的痕迹。我们通过相对子空间分数探针来刻画这种不对称性,追踪权重变化如何与残差流激活子空间和由去嵌入定义的预测子空间对齐。对齐变化集中在读路径(W_Q,W_K)上,沿着注意力输入激活的主方向,而写路径(W_O,W_2)相对于预测子空间则保持近各向同性。我们通过各向异性梯度积累来解释这种模式:对矩阵W的更新是外积δ_t a_t^T之和,继承自哪一侧的协方差集中。对于读路径矩阵,这一侧是输入激活a_t,其协方差在训练过的Transformer中呈尖峰状,因此产生与目标无关的集中。对于写路径矩阵,相关的一侧是上游梯度δ_t,其各向异性取决于损失。交叉熵提供标准的每样本信号,诱导预训练期间写路径的预测几何;对齐目标通常在写路径上添加很少的进一步集中。我们通过检查点内轨迹、渐进对比目标控制以及闭合形式的秩1干预与匹配方向控制来支持这一解释,为所提出的权重空间几何提供因果证据。

英文摘要

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $δ_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $δ_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
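
As a worked illustration of the outer-product structure this abstract refers to, the following is the standard backpropagation identity for a single linear map, not the authors' full analysis. The symbols $y_t$, $\mathcal{L}$, and the learning rate $\eta$ are notation assumed here, and plain gradient descent is assumed (adaptive optimizers rescale the update entrywise).

```latex
% Linear map y_t = W a_t, upstream gradient \delta_t = \partial\mathcal{L} / \partial y_t
\frac{\partial \mathcal{L}}{\partial W} \;=\; \sum_t \delta_t\, a_t^\top ,
\qquad
\Delta W \;=\; -\,\eta \sum_t \delta_t\, a_t^\top .
```

Each term is rank one, so the row space of $\Delta W$ lies in the span of the inputs $a_t$ and its column space lies in the span of the upstream gradients $\delta_t$. Under these assumptions, this is consistent with the abstract's description of an update inheriting directional structure from whichever side has concentrated covariance.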

2605.16594 2026-05-19 math.NA cs.LG cs.NA 版本更新

fPINN-DeepONet: A Physics-Informed Operator Learning Framework for Multi-term Time-fractional Mixed Diffusion-wave Equations

fPINN-DeepONet:一种用于多项时间分数混合扩散波方程的物理信息运算学习框架

Binghang Lu, Zhaopeng Hao, Christian Moya, Guang Lin

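AI summary: This paper proposes the fPINN-DeepONet framework, which combines operator learning with an L2 approximation to solve fractional partial differential equations efficiently; it applies to both fixed and variable fractional-order PDEs, and experiments with dynamically varying fractional orders and noisy data demonstrate its accuracy, robustness, and efficiency.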

详情
AI中文摘要

本文开发了一种用于求解多项时间分数混合扩散波方程(TFMDWEs)的物理信息深度运算学习框架。我们首先推导出L2近似,实现Caputo分数导数阶β∈(1,2)的第一阶精度。在此基础上,我们提出fPINN-DeepONet框架,一种将运算学习与L2近似结合的新型方法,有效求解分数偏微分方程(FPDEs)。该框架成功应用于固定和变分数阶PDE,展示了其多样性和广泛适用性。为评估所提模型的性能,我们进行了系列数值实验,涉及空间和时间中动态变化的分数阶以及噪声数据场景。这些结果突显了fPINN-DeepONet框架的准确性、鲁棒性和效率。

英文摘要

In this paper, we develop a physics-informed deep operator learning framework for solving multi-term time-fractional mixed diffusion-wave equations (TFMDWEs). We begin by deriving an $L_2$ approximation, which achieves first-order accuracy for the Caputo fractional derivative of order $β\in (1,2)$. Building upon this foundation, we propose the fPINN-DeepONet framework, a novel approach that integrates operator learning with the $L_2$ approximation to efficiently solve fractional partial differential equations (FPDEs). Our framework is successfully applied to both fixed and variable fractional-order PDEs, demonstrating the framework's versatility and broad applicability. To evaluate the performance of the proposed model, we conduct a series of numerical experiments that involve dynamically varying fractional orders in both space and time, as well as scenarios with noisy data. These results highlight the accuracy, robustness, and efficiency of the fPINN-DeepONet framework.

2605.16581 2026-05-19 cs.LG 版本更新

Structure-Aware Masking for Protein Representation Learning

基于结构的掩码策略用于蛋白质表示学习

Thomas Walton, Ayan Goel, Amirali Aghazadeh

AI总结 本文提出结构感知掩码策略,通过三维空间接近性选择残基组,改进蛋白质表示学习,提升下游预测任务性能,实现14%的提升。

详情
AI中文摘要

Masked language modeling (MLM) 是训练蛋白质语言模型的标准目标,通常通过随机掩码单个残基在固定比例(例如15%)实现。这种做法隐含假设所有序列位置对表示学习贡献相等。然而在下游适应性预测任务中,蛋白质序列受三维结构依赖和长程残基接触影响,导致残基间存在强非局部耦合。我们引入Bucket Masking,一种结构感知的掩码策略,根据三维空间接近性选择残基组,在训练中优先掩码结构耦合区域。通过将掩码分布条件于残基接触,Bucket Masking将学习目标转向建模对蛋白质功能至关重要的长程相互作用。在四个下游蛋白质适应性预测任务中,Bucket Masking相比标准随机掩码实现最高14%的提升,擅长预测高阶突变相互作用。通过受控消融实验,我们证明这些改进源于掩码位置而非跨度大小,确立掩码作为位置归纳偏置。

英文摘要

Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, we show that these improvements arise from mask placement rather than span size, establishing masking as a positional inductive bias.
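
To make the idea of masking by spatial proximity concrete, here is a minimal sketch. It is not the authors' Bucket Masking procedure, whose details are not given in the abstract: the function name `spatial_mask`, the single-seed nearest-neighbour rule, and the use of one coordinate per residue are all assumptions made for illustration.

```python
import math
import random


def spatial_mask(coords, mask_rate=0.15, rng=None):
    """Pick a random seed residue and mask it with its nearest 3D neighbours.

    coords: one (x, y, z) tuple per residue (hypothetical input format).
    Returns the sorted indices of the residues to mask.
    """
    rng = rng or random.Random(0)
    n = len(coords)
    k = max(1, round(mask_rate * n))  # same budget as random masking
    seed = rng.randrange(n)
    # Rank every residue by Euclidean distance to the seed (seed itself first).
    by_dist = sorted(range(n), key=lambda j: math.dist(coords[seed], coords[j]))
    return sorted(by_dist[:k])


if __name__ == "__main__":
    # Toy hairpin: residues 0 and 9 are far apart in sequence but close in space.
    hairpin = [(float(i), 0.0, 0.0) for i in range(5)] + \
              [(float(4 - i), 1.0, 0.0) for i in range(5)]
    print(spatial_mask(hairpin, mask_rate=0.3))
```

Unlike uniform random masking, the masked set here can include residues that are distant in sequence but adjacent in the folded structure, which is the kind of nonlocal coupling the abstract describes.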

2605.16573 2026-05-19 cs.LG cs.AI physics.flu-dyn 版本更新

Wavelet Flow Matching for Multi-Scale Physics Emulation

小波流匹配用于多尺度物理模拟

Gabriele Accarino, Juan Nathaniel, Carla Roesch, Pierre Gentine, Sara Shamekh, Duncan Watson-Parris, Viviana Acquaviva

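AI summary: This paper proposes Wavelet Flow Matching, which performs optimal transport directly in multi-scale wavelet space to balance stability and accuracy when emulating multi-scale physical systems, enabling more efficient generative emulation.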

详情
AI中文摘要

准确模拟由偏微分方程支配的多尺度物理系统需要保持长期自回归滚动的稳定性同时保留细尺度结构的模型。确定性模拟器产生过于平滑的预测,而生成方法能更好地捕捉细节但成本高。潜在空间生成模型作为折中方案,但需额外训练自动编码器。我们提出小波流匹配(WFM),一种新型生成模拟器,通过在多尺度小波空间中直接进行最优传输,克服了当前成本与能力之间的权衡。WFM 不学习潜在压缩,而是利用 U-Net 的层次结构,共同预测指定小波表示的传输速度。在三个具有挑战性的混沌流体动力学系统上,WFM 在长期稳定性、准确性和频谱一致性方面优于现有最佳模型。我们的结果清楚地表明,小波空间作为无训练的表示,在复杂物理动态的生成模拟中是有效的。

英文摘要

Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly smoothed predictions, while generative approaches better capture details but are costly. Latent-space generative models have emerged as a compromise but incur the additional cost of separately pre-trained autoencoders. We propose Wavelet Flow Matching (WFM), a novel generative emulator that overcomes current trade-offs between cost and skill by performing optimal transport directly in the multi-scale wavelet space. Rather than learning a latent compression, WFM leverages the hierarchical structure of a U-Net to jointly predict transport velocities of a prescribed wavelet representation. On three challenging systems of chaotic fluid dynamics, WFM achieves superior long-horizon stability, accuracy and spectral coherence compared to state-of-the-art models. Our results clearly position the wavelet space as an effective training-free representation for generative emulation of complex physical dynamics.

2605.16571 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models

非递减生存回归:从深度Cox模型中校准生存分布

Anchit Jain, Kevin Zhang, Stephen Bates

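AI summary: This paper proposes an isotonic regression method for calibrating the survival probabilities of Deep Cox models, supported by theoretical guarantees and experimental validation.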

详情
AI中文摘要

时间到事件数据在生命科学和工程中普遍存在,但通常伴随删失,这使得标准机器学习方法的应用复杂化。深度Cox模型因能优雅处理删失并可与无结构数据如临床文本报告、基因组序列和病理图像结合而成为分析时间到事件数据的流行方法。然而,其预测的生存概率往往校准不良,限制了实际应用。本文提出了一种新颖的后验校准方法,利用非递减回归来改进预测生存概率而不影响判别能力。我们建立了有利的理论保证,包括双重鲁棒性属性和渐近校准。在合成和真实世界临床数据上的实验展示了我们方法的实证有效性。

英文摘要

Time-to-event data is widespread across the life sciences and engineering, but it is typically encountered together with censoring, which complicates the application of standard machine learning methods. Deep Cox models have emerged as a popular method for analyzing time-to-event data because they gracefully handle censoring and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, thus limiting their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.
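
For readers unfamiliar with isotonic regression as a post hoc calibrator, the sketch below shows the basic mechanism in a deliberately simplified setting. It assumes a single fixed time horizon and fully observed (uncensored) binary outcomes, so it does not reproduce the paper's handling of censoring or its theoretical guarantees. The function names `pava` and `calibrate` are illustrative.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # each block: [mean, number of points pooled]
    for yi in y:
        blocks.append([float(yi), 1])
        # Merge backwards while the non-decreasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fit = []
    for mean, count in blocks:
        fit.extend([mean] * count)
    return fit


def calibrate(pred, event):
    """Map predicted event probabilities to isotonic-calibrated ones.

    pred:  model-predicted probability of the event by the horizon.
    event: 1 if the event occurred by the horizon, else 0 (no censoring assumed).
    """
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    fitted = pava([event[i] for i in order])
    out = [0.0] * len(pred)
    for rank, i in enumerate(order):
        out[i] = fitted[rank]
    return out


if __name__ == "__main__":
    print(calibrate([0.9, 0.1, 0.5, 0.4], [1, 0, 0, 1]))
```

Because the fitted map is non-decreasing in the original prediction, the ordering of individuals is never reversed (though ties can be introduced), which is why this family of calibrators is usually described as leaving discrimination essentially intact.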

2605.16567 2026-05-19 cs.LG cs.AI cs.DB 版本更新

Automatic Unsupervised Ensemble Outlier Model Selection--Extended Version

自动无监督集成异常检测模型选择——扩展版

Hong-Phuc Phan, Tuan-Anh Vu, Tung Kieu, Son Ha Xuan, Bin Yang, Christian S. Jensen

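AI summary: This paper proposes MetaEns, a framework that learns a model predicting marginal ensemble gains to automatically select high-quality ensembles of outlier detection models without labeled data; experiments on 39 real-world datasets show strong performance.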

Comments 25 pages. An extended version of "Automatic Unsupervised Ensemble Outlier Model Selection" accepted at ICML 2026

详情
AI中文摘要

无监督异常检测因其无需标注数据而具有吸引力。此外,多模型集成可提高检测鲁棒性。然而,无标注数据下构建集成具有挑战性。简单集成可能因冗余或不可靠的检测模型导致饱和问题。我们提出MetaEns,一种自动无监督框架,用于选择异常检测模型的集成。利用标注元数据集,MetaEns学习预测边际增益模型,估计添加候选模型到部分构建集成的预期改进。在测试时,该学习信号结合子模函数启发的代理目标,通过多样性感知折扣和家族级风险正则化,实现贪心顺序选择与自适应提前停止。结果表明,MetaEns可在无真实标签的情况下构建紧凑高质量的集成。在39个真实数据集上的实验显示,MetaEns在平均精度上优于现有无监督选择器和集成基线,同时使用更少的模型。

英文摘要

Unsupervised outlier detection is attractive because it eliminates the need for labeled data. Moreover, forming multi-model ensembles can improve detection robustness. However, composing an ensemble without labeled data is challenging. Naively composed ensembles can suffer from ensemble saturation, where redundant or unreliable detection models degrade performance and incur unnecessary computation. We propose MetaEns, an automatic unsupervised framework for selecting ensembles of outlier detection models. Using labeled meta-datasets, MetaEns learns a model that predicts marginal ensemble gains, estimating the expected improvement from adding a candidate model to a partially constructed ensemble. At test time, this learned signal is combined with a submodular-inspired proxy objective that enforces diminishing returns through diversity-aware discounting and family-level risk regularization, thereby enabling greedy sequential selection with adaptive early stopping. As a result, MetaEns constructs compact, high-quality ensembles without access to ground-truth labels. Experiments on 39 real-world datasets show that MetaEns consistently outperforms state-of-the-art unsupervised selectors and ensemble baselines, achieving higher average precision while using fewer models.

2605.16550 2026-05-19 cs.CV cs.LG 版本更新

Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

基于注意力的变换器聚合网络用于视频眼周识别

Luiz G F Carreira, Breno A Mariano, Victor H C de Melo, David Menotti, William Robson Schwartz

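AI summary: This paper proposes a transformer-based aggregation network for video periocular recognition, with feature embedding and aggregation modules that improve recognition robustness; on the COX Face dataset it outperforms naive aggregation schemes, reaching 99.8% TPR@1e-1 and 96.6% Rank-5 in the best scenario.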

Comments Accepted at ICIP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication

详情
AI中文摘要

视频眼周识别是基于个体眼睛周围区域识别身份的任务。眼周区域是人脸最具有区分性的区域之一,使其适合识别任务。其作为生物特征模态的应用在监控环境中逐渐兴起,尤其是在传统生物特征如面部或虹膜识别因非受限采集条件而不可行时。本文提出了一种针对监控环境的视频眼周识别的注意力感知方法。该框架包含两个主要模块:特征嵌入和聚合。特征嵌入模块是一个深度卷积神经网络,将眼周数据映射到特征向量。聚合模块是一个仅含编码器的变换器,能够自适应地将帧级特征聚合为单一视频表示和静态参考图像的特征向量。在公开可用的COX Face数据集上的实验表明,所提方法的鲁棒性,一致优于传统聚合方案。在最佳情况下,该方法实现了99.8%的TPR@1e-1和96.6%的Rank-5。

英文摘要

Video periocular recognition is the task of recognizing an individual's identity based on the region around the eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show that the proposed method is robust and consistently outperforms naive aggregation schemes. In the best scenario, the approach achieves a TPR@$10^{-1}$ of $99.8\%$ and a Rank-5 of $96.6\%$.

2605.16547 2026-05-19 cs.LG 版本更新

World Model-Enabled Causal Digital Twins for Semantic Communications in Physical AI Systems

面向物理人工智能系统的语义通信世界模型增强因果数字孪生

Lingyi Wang, Tingyu Shui, Walid Saad, Pascal Adjakple

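AI summary: This paper proposes a world-model-enabled causal digital twin framework that addresses long-term return-per-bit maximization for semantic communication in physical AI systems, using a causal information value metric and policy training to improve return-per-kbit and navigation success rate.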

详情
AI中文摘要

语义通信作为一种面向目标的网络范式,但现有解决方案多用于单次任务,无法支持闭环物理人工智能系统。本文研究面向物理人工智能系统的语义通信,将其建模为无线比特预算约束下的长期回报每比特最大化问题,提出因果信息价值指标评估语义标记的边际贡献,并通过世界模型增强的因果数字孪生框架捕捉闭环系统动态,实现对长周期模拟运行的反事实推理。基于这些模拟运行,通过因果信息价值每比特评估训练策略和语义标记选择器,实验表明该框架在返回每kbit和导航成功率方面优于现有强化学习方案。

英文摘要

Semantic communication has emerged as a promising paradigm for enabling goal-oriented networking. However, most existing semantic communication solutions are tailored to one-shot tasks and optimize instantaneous performance. Hence, they cannot be used to support closed-loop dynamic systems with physical artificial intelligence (AI), in which the transmitted semantics affect not only the current inference outcome but also future control actions, state evolution, and ultimately long-horizon task performance. To address this gap, this paper investigates goal-oriented semantic communications for physical AI systems with closed-loop sensing-communication-inference-control. In particular, the problem of semantic communications is formulated as a long-term return-per-bit maximization under wireless bit-budget constraints while capturing both control efficiency and communication efficiency. To solve this problem, a novel causal information value (CIV) metric is introduced to evaluate the marginal contribution of each semantic token to the expected long-term return by transmission interventions. Then, a world-model-enabled causal digital twin (WM-CDT) framework is proposed to capture the dynamics of closed-loop physical AI systems and enable counterfactual reasoning for long-horizon imagined rollouts. Based on these imagined rollouts, an actor-critic policy is trained for long-horizon agent control with high data efficiency, while the semantic token selector is trained through CIV-per-bit evaluation. Extensive simulations on an AirSim-Sionna-based unmanned aerial vehicle (UAV) navigation simulator show that the proposed WM-CDT framework achieves significant improvement in return-per-kbit and navigation success rate compared to existing reinforcement learning solutions.
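To make the idea of ranking semantic tokens by value per bit under a bit budget concrete, here is a small sketch. It is only an illustration under assumptions: the abstract describes a trained selector, whereas this uses a plain greedy ratio rule, and the token names, values, and bit costs are hypothetical.

```python
from typing import List, Tuple


def select_tokens(
    tokens: List[Tuple[str, float, int]], bit_budget: int
) -> List[str]:
    """Pick tokens by descending value-per-bit until the budget is used.

    Each token is (name, estimated value, cost in bits); costs must be > 0.
    Tokens with non-positive estimated value are never transmitted.
    """
    ranked = sorted(tokens, key=lambda t: t[1] / t[2], reverse=True)
    chosen: List[str] = []
    used = 0
    for name, value, bits in ranked:
        if value <= 0:
            continue
        if used + bits <= bit_budget:
            chosen.append(name)
            used += bits
    return chosen


# Hypothetical tokens for a navigation frame.
frame_tokens = [
    ("obstacle", 4.0, 100),
    ("horizon", 1.0, 200),
    ("texture", 0.5, 50),
    ("sky", -0.1, 80),
]
sent = select_tokens(frame_tokens, bit_budget=200)
```

A greedy ratio rule is not optimal for knapsack-style budgets in general; it is used here only because it is short and easy to follow.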

2605.16532 2026-05-19 cs.LG econ.GN q-fin.EC 版本更新

Boundedly Rational Meta-Learning in Sequential Consumer Choice

有界理性元学习在序列消费者选择中的应用

Mehrzad Khosravi, Max Kleiman-Weiner, Hema Yoganarasimhan

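AI summary: This paper studies cross-context knowledge transfer when consumers make repeated choices under uncertainty, introduces the boundedly rational meta dynamic programming policies BRMDP(D), and finds that consumers learn across contexts through coarse representations of prior uncertainty.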

详情
AI中文摘要

许多消费者决策是在不确定环境下重复选择。标准模型利用贝叶斯学习和动态规划来捕捉这些决策:消费者根据反馈更新信念,并利用这些信念指导未来的选择。然而,在许多市场中,当消费者进入新情境时,学习不会重置:先前与品牌、产品或提供商的经验会塑造后续相关决策中的信念。我们研究了序列选择中的跨情境知识转移,或元学习。我们设计了一个分层实验室任务,参与者在多个路线中反复选择航空公司并观察噪声二元结果。实证证据表明,参与者不仅在路线内改进,还在跨路线中改进:他们更早选择更好的航空公司并在后续路线中减少伪遗憾。为了识别这种转移的机制,我们比较了人类选择与无转移基准和完全整合的贝叶斯元学习基准。特别是,我们引入了一类有界理性元动态规划策略BRMDP(D),通过有限数量的超后验抽样(记为D)近似完全整合。试次级似然比较显示,低D的有界理性元学习,特别是BRMDP(1),比无转移和完全整合的贝叶斯转移拟合参与者行为更好。因此,消费者通过粗略的先验不确定性表示在跨情境中转移品牌层面的规律性。研究结果表明,消费者学习模型应允许近似的跨情境转移,并且基于无转移或完全整合学习的管理反事实可能具有误导性。

英文摘要

Many consumer decisions are repeated choices under uncertainty. Standard models capture these decisions using Bayesian learning and dynamic programming: consumers update beliefs from feedback and use those beliefs to guide future choices. In many markets, however, learning does not restart when consumers enter a new context: prior experience with a brand, product, or provider can shape beliefs in later, related decisions. We study this cross-context knowledge transfer, or meta-learning, in sequential choice. We design a hierarchical laboratory task in which participants repeatedly choose among airlines across routes and observe noisy binary outcomes. Reduced-form evidence shows that participants improve not only within routes, but also across routes: they choose better airlines earlier in later routes and reduce pseudo-regret. To identify the mechanism behind this transfer, we compare human choices to a no-transfer benchmark and a fully integrated Bayesian meta-learning benchmark. In particular, we introduce a class of boundedly rational meta dynamic programming policies, BRMDP(D), that approximate full integration using a limited number of hyper-posterior draws, denoted by D. Trial-by-trial likelihood comparisons show that low-D boundedly rational meta-learning, especially BRMDP(1), fits participant behavior better than both no transfer and fully integrated Bayesian transfer. Consumers, therefore, transfer brand-level regularities across contexts, but through coarse representations of prior uncertainty. The findings imply that models of consumer learning should allow for approximate cross-context transfer, and that managerial counterfactuals based on either no-transfer or fully integrated learning can be misleading.

2605.16529 2026-05-19 cs.LG math.OC 版本更新

Multiscale Supervised Unbalanced Optimal Transport Flow Matching

多尺度监督不平衡最优传输流匹配

Qiangwei Peng, Lezhi Chen, Peijie Zhou

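AI summary: This paper proposes MUST-FM, a framework that exploits multiscale data structure and known transition priors to reduce computational cost and achieve robust trajectory inference, enabling dynamic modeling of large-scale single-cell datasets.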

详情
AI中文摘要

不平衡最优传输(UOT)为建模单细胞转换和出生-死亡动态提供了系统框架,但其高计算成本限制了大规模数据集的应用。尽管单细胞数据通常包含层次注释和已知的转移先验知识,现有UOT近似方法很少利用这种多尺度结构或先验知识。我们引入多尺度监督不平衡最优传输流匹配(MUST-FM),一种无需模拟的框架,通过利用层次数据结构扩展UOT。MUST-FM进一步支持可选的监督形式,将转移先验知识(如细胞谱系)纳入以指导位移场和质量变化的学习。实验表明,MUST-FM在降低计算开销的同时实现了稳健且具有生物学意义的轨迹推断,能够对大尺度单细胞数据集进行动态建模。

英文摘要

Unbalanced optimal transport (UOT) provides a principled framework for modeling single-cell transitions and birth-death dynamics, but its high computational cost limits scalability to large-scale datasets. Although single-cell data often contain hierarchical annotations and known transition priors, existing UOT approximations rarely exploit this multiscale structure or prior knowledge. We introduce Multiscale Supervised Unbalanced Optimal Transport Flow Matching (MUST-FM), a simulation-free framework that scales UOT by leveraging hierarchical data structure. MUST-FM further supports an optional supervised formulation that incorporates transition priors, such as cell lineages, to guide the learning of displacement fields and mass variations. Experiments show that MUST-FM reduces computational overhead while achieving robust and biologically meaningful trajectory inference, enabling dynamic modeling of atlas-scale single-cell datasets.

2605.16527 2026-05-19 cs.LG cs.AI 版本更新

Hypergraph Pattern Machine: Compositional Tokenization for Higher-Order Interactions

超图模式机:用于高阶交互的组合分词

Kyrie Zhao, Zehong Wang, Tianyi Ma, Fang Wu, Xiangru Tang, Pietro Lio, Sheng Wang, Yanfang Ye

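AI summary: This paper proposes the Hypergraph Pattern Machine, which learns the compositional patterns of subsets to better model higher-order interactions, matching or exceeding existing methods on hypergraph benchmarks and handling a real-world adverse-event case.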

详情
AI中文摘要

超图模型高阶关系,从药物处方到推荐。数据中的核心结构信号是交互组合性:高阶关系是否是组合、涌现或抑制性的。在多药治疗中,制度决定是否停药、保留或排除:组合药物三元组可安全简化,涌现三元组需联合所有药物,抑制三元组标志干扰现有交互的药物。现有超图学习方法仅传播观测超边消息,未建模此信号,导致危险组合被误分类。为此,本文提出超图模式机(HGPM),从消息传递转向学习子集的组合模式。它分词组合子集,组织成包含 DAG,并训练掩码重建的包含意识 Transformer。在十个超图基准上,HGPM 匹配或超越现有方法。值得注意的是,在真实不良事件预测案例中,HGPM 正确识别出抑制副作用的药物添加,而现有方法无法区分。代码和数据见 https://github.com/KryieZhao/HGPM.git.

英文摘要

Hypergraphs model higher-order relations that drive real-world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher-order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion-aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state-of-the-art methods. Notably, in a real adverse-event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature-identical candidates, a discrimination existing methods cannot make. The code and data are in https://github.com/KryieZhao/HGPM.git.
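The phrase "organizes them in an inclusion DAG" can be pictured with a short sketch that enumerates the non-empty subsets of one hyperedge and links each subset to its immediate supersets. This is an assumed, simplified construction for illustration; how HGPM actually tokenizes subsets and builds its DAG is not specified in the abstract.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def inclusion_dag(hyperedge: List[str]) -> Dict[FrozenSet[str], List[FrozenSet[str]]]:
    """Map every non-empty subset of `hyperedge` to its immediate supersets.

    An edge s -> t exists when t is s plus exactly one more node.
    """
    nodes = set(hyperedge)
    subsets = [
        frozenset(combo)
        for size in range(1, len(nodes) + 1)
        for combo in combinations(sorted(nodes), size)
    ]
    dag: Dict[FrozenSet[str], List[FrozenSet[str]]] = {s: [] for s in subsets}
    for s in subsets:
        for extra in sorted(nodes - s):
            dag[s].append(s | {extra})
    return dag


# A drug triple yields 7 subsets; the full triple has no supersets.
triple_dag = inclusion_dag(["A", "B", "C"])
```

The number of subsets grows as 2^n, so a practical system would need to restrict which subsets are materialized.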

2605.16520 2026-05-19 cs.LG 版本更新

Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

通过扩散风格平滑实现采样基于非凸优化的全局收敛

Zeji Yi, Chaoyi Pan, Guanya Shi, Guannan Qu

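AI summary: This paper analyzes sampling-based optimization through the lens of smoothing, shows how smoothing helps escape local minima, and proposes the DIDA algorithm, which provably converges to the global optimum.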

Comments 57 pages, 5 figures

详情
AI中文摘要

采样优化(SBO)如交叉熵方法和进化算法在无梯度非凸问题中取得成功,但其收敛性理解有限。本文通过平滑视角建立非渐近收敛分析,将SBO重新解释为对平滑目标的梯度下降,类似于扩散模型中的噪声条件得分上升。我们分析了平滑目标的景观,证明平滑通过扩大局部凸区域帮助逃逸局部极小值,但引入了最优性间隙。基于此,我们为SBO算法在全局极小值邻域内建立非渐近收敛保证,并提出Diffusion-Inspired Dual-Annealing(DIDA)算法,可证明收敛到全局最优。通过大量数值实验验证景观结果,并展示DIDA在梯度自由优化方法中的优异性能。最后讨论了结果对扩散模型的影响。

英文摘要

Sampling-based optimization (SBO), such as the cross-entropy method and evolutionary algorithms, has achieved many successes in solving non-convex problems without gradients, yet its convergence is poorly understood. In this paper, we establish a non-asymptotic convergence analysis for SBO through the lens of smoothing. Specifically, we recast SBO as gradient descent on a smoothed objective, mirroring noise-conditioned score ascent in diffusion models. Our first contribution is a landscape analysis of the smoothed objective, demonstrating how smoothing helps escape local minima and uncovering a fundamental coverage-optimality trade-off: smoothing renders the landscape more benign by enlarging the locally convex region around the global minimizer, but at the cost of introducing an optimality gap. Building on this insight, we establish non-asymptotic convergence guarantees for SBO algorithms to a neighborhood of the global minimizer. Furthermore, we propose an annealed SBO algorithm, Diffusion-Inspired Dual-Annealing (DIDA), which is provably convergent to the global optimum. We conduct extensive numerical experiments to verify our landscape results and also demonstrate the compelling performance of DIDA compared to other gradient-free optimization methods. Lastly, we discuss implications of our results for diffusion models.
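The view of SBO as gradient descent on a Gaussian-smoothed objective can be sketched in one dimension. The example below is an illustration under assumptions, not DIDA: the test function `f`, the step size, and the geometric annealing schedule are made up, and Gauss-Hermite quadrature stands in for Monte Carlo sampling so that the run is reproducible. It uses the identity that the gradient of E[f(x + sigma Z)] equals E[f(x + sigma Z) Z] / sigma for standard normal Z, which needs only evaluations of f.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes/weights for a standard normal expectation.
# hermegauss weights sum to sqrt(2*pi), so normalize them to sum to 1.
_Z, _W = hermegauss(64)
_W = _W / np.sqrt(2.0 * np.pi)


def f(x):
    """Hypothetical non-convex objective: global minimum at x = 0."""
    return x**2 + 2.0 * (1.0 - np.cos(5.0 * x))


def smoothed_grad(x, sigma):
    """Gradient of the smoothed objective E[f(x + sigma * Z)], Z ~ N(0, 1).

    Uses only function values of f (no derivatives of f).
    """
    return float(np.sum(_W * f(x + sigma * _Z) * _Z) / sigma)


def descend(x0, sigmas, lr=0.01):
    """Gradient descent on the smoothed objective, one step per sigma."""
    x = x0
    for sigma in sigmas:
        x -= lr * smoothed_grad(x, sigma)
    return x


steps = 300
# Almost no smoothing: behaves like plain descent on the rugged f.
x_fixed = descend(3.0, [0.01] * steps)
# Heavy smoothing first, then gradually sharpen (annealing).
x_annealed = descend(3.0, [0.99**k for k in range(steps)])
```

With this particular function, the lightly smoothed run is expected to stall in a local minimum well away from the origin, while the annealed run should end close to the global minimizer at zero; other functions or schedules may behave differently.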

2605.16515 2026-05-19 cs.CV cs.LG 版本更新

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

SeamCam:通过多线索视觉可探测性量化无缝伪装

Amin Karimi Monsefi, Abolfazl Meyarian, Mridul Khurana, Shuheng Wang, Pouyan Navard, Cheng Zhang, Anuj Karpatne, Wei-Lun Chao, Rajiv Ramnath

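AI summary: SeamCam frames camouflage evaluation as a visual localization problem and proposes a metric that quantifies how well an animal is camouflaged; the metric is validated in a human study and used as a preference signal for fine-tuning a diffusion model.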

详情
AI中文摘要

动物被描述为有效伪装时,能够无缝融入周围环境,但目前缺乏标准化的量化措施。本文通过将伪装评估转化为视觉定位问题:伪装良好的动物在已知类别时仍难以检测。引入SeamCam指标,量化动物的可探测性。给定图像和目标物种,SeamCam生成类别条件的检测提案,提取分割掩码,并识别其子集,其联合覆盖最大IoU与真实掩码。SeamCam分数是最大可恢复定位信号的补数,分数越高伪装越强(即可探测性越低)。在94名参与者和2390次比较的人类二择一强制选择研究中,SeamCam与人类伪装难度判断达成78.82%的一致性,优于现有最先进方法约25%。随后展示了SeamCam作为直接偏好优化(DPO)的偏好信号,用于微调基于扩散的修复模型以生成伪装。这提供了一种经济的训练方法,其目标专门适用于伪装生成,不同于典型的扩散模型。为支持严格基准测试,进一步引入CamFG-1.5k数据集,包含1521张高分辨率图像,在伪装生成前动物完全可见,使评估更公平,通过控制现有数据集中存在的遮挡伪影。

英文摘要

Animals are described as effectively camouflaged when they blend seamlessly with their surroundings, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming the state of the art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. https://7amin.github.io/SeamCam/
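The scoring rule stated in the abstract (one minus the best IoU achievable by the union of any subset of proposal masks) can be written down directly. The sketch below assumes the proposal masks are already available as boolean arrays and searches subsets exhaustively, which is only feasible for a handful of proposals; the function names are hypothetical, and the actual SeamCam search procedure may differ.

```python
from itertools import combinations

import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def seamcam_score(proposals, gt_mask: np.ndarray) -> float:
    """One minus the best IoU over unions of proposal subsets.

    Higher score means the animal is harder to localize.
    """
    best = 0.0
    for size in range(1, len(proposals) + 1):
        for subset in combinations(proposals, size):
            merged = np.logical_or.reduce(subset)
            best = max(best, iou(merged, gt_mask))
    return 1.0 - best


# Toy 4x4 example: the animal occupies the top two rows.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, :] = True
left = np.zeros((4, 4), dtype=bool)
left[0:2, 0:2] = True
right = np.zeros((4, 4), dtype=bool)
right[0:2, 2:4] = True
background = np.zeros((4, 4), dtype=bool)
background[2:4, :] = True
```

In the toy data, `left` and `right` together cover the animal exactly, so a detector that finds both pieces leaves no camouflage credit, while finding only one piece leaves half.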

2605.16486 2026-05-19 stat.ML astro-ph.IM cs.LG 版本更新

StAD: Stein Amortized Divergence for Fast Likelihoods with Diffusion and Flow

StAD:基于Stein算子的 amortized 散度用于具有扩散和流的快速似然

Gurjeet Jagwani, Stephen Thorp, Sinan Deger, Hiranya Peiris

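AI summary: This paper proposes StAD, which uses the Langevin-Stein operator to predict and learn the divergence of the PF-ODE without computing the Jacobian, improving the speed and variance of likelihood predictions.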

Comments 24 pages, 10 figures

详情
AI中文摘要

扩散和流基模型广泛用于生成建模和密度估计。它们允许确定性概率流常微分方程(PF-ODE),类似于连续归一化流(CNFs),描述了概率质量的传输。从这些模型中获得似然对于许多工作流程至关重要,尤其是贝叶斯分析,这需要求解雅可比矩阵的迹来计算学习PF-ODE的发散性,这要么是$\mathcal{O}(D^2)$精确计算,要么是$\mathcal{O}(D)$的噪声估计。我们引入StAD,一种新的蒸馏方法,利用兰格vin-斯坦算子预测和学习PF-ODE的发散性,而无需计算雅可比矩阵。我们证明我们的方法在CIFAR-10、ImageNet和其他密度估计任务上与Hutchinson和Hutch++竞争,一致提高了似然预测的方差和速度,优于Hutchinson。我们还证明我们的方法可以推广到各种生成模型,且在某些正则性条件下,这些学习的向量场可以满足斯坦类。

英文摘要

Diffusion and flow-based models are ubiquitously used for generative modelling and density estimation. They admit a deterministic probability flow ordinary differential equation (PF-ODE), analogous to continuous normalizing flows (CNFs), which describes the transport of the probability mass. Obtaining the likelihood from these models is of interest to many workflows, especially Bayesian analysis, and requires computing the trace of the Jacobian to obtain the divergence of the learned PF-ODE, which costs either $\mathcal{O}(D^2)$ to compute exactly or $\mathcal{O}(D)$ with a noisy estimate. We introduce StAD, a new distillation method to predict and learn the divergence of the PF-ODE using the Langevin-Stein operator without ever computing the Jacobian. We show that our method is competitive with the Hutchinson and Hutch++ estimators on CIFAR-10, ImageNet and other density estimation tasks, consistently improving the variance and speed of the likelihood predictions compared to the Hutchinson estimator. We additionally show that our method generalizes to a varied class of generative models, and that under some regularity conditions these learned vector fields can be made to satisfy the Stein class.
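For context on the baseline mentioned here, the Hutchinson estimator approximates the trace of a Jacobian J from products J v with random probe vectors v, since E[v^T J v] = tr(J) when the entries of v are independent with zero mean and unit variance. The sketch below uses Rademacher probes and an explicit matrix-vector callback; it illustrates the noisy per-probe estimate, not StAD, and the function name is our own.

```python
import numpy as np


def hutchinson_trace(matvec, dim: int, num_probes: int, rng) -> float:
    """Estimate tr(J) using only products J @ v.

    `matvec(v)` must return J @ v; probes v have i.i.d. +/-1 entries.
    """
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += float(v @ matvec(v))
    return total / num_probes


rng = np.random.default_rng(0)

# Diagonal Jacobian: every probe returns the exact trace.
diag = np.array([1.0, 2.0, 3.0])
exact_case = hutchinson_trace(lambda v: diag * v, 3, 5, rng)

# Off-diagonal entries introduce variance around the true trace of 3.
dense = np.array([[1.0, 0.5], [0.5, 2.0]])
noisy_case = hutchinson_trace(lambda v: dense @ v, 2, 200, rng)
```

Each probe costs one Jacobian-vector product, whereas the exact trace needs one such product per dimension, which is the cost gap the abstract refers to.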

2605.16477 2026-05-19 cs.LG cs.CV 版本更新

Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning

寻求不熟悉但难忘的概念:作为元学习的概念创造力

Mengye Ren

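AI summary: This paper frames conceptual creativity as meta-learning: a Creator generates candidate stimuli and an Appraiser adapts to them, yielding novel content that is quickly learnable.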

Comments 25 pages

详情
AI中文摘要

什么是创造新概念而不是检索熟悉概念的意义?重复采样生成模型在同一提示下会产生风格相似但内容典型的变化。我们提出创造力是产生对适应观察者最初不熟悉的刺激,但通过少量暴露即可学习。我们将此形式化为创作者-评估者对:创作者生成候选,评估者适应几轮学习步骤,评估者的改进成为创作者优化的奖励。我们用扩散模型作为创作者,MNIST上的自动编码器作为评估者,以及带有低秩适配器的CLIP作为自然图像的评估者。扩散模型保持冻结,无额外语言条件;元学习梯度足以产生基础模型无法单独生成的风格变化和概念组合。

英文摘要

What does it mean to create a new concept, rather than retrieve a familiar one? Repeatedly sampling a generative model at the same prompt produces variations with similar styles and typical content. We propose that creativity is the production of stimuli that are unfamiliar to an adaptive observer at first sight, but quickly learnable from a few exposures. We formalize this as a Creator-Appraiser pair: a Creator generates a candidate, an Appraiser adapts to it for a few inner-loop learning steps, and the Appraiser's improvement becomes the reward the Creator optimizes through. We instantiate the framework with diffusion as the Creator, an autoencoder Appraiser on MNIST, and a CLIP Appraiser with a low-rank adapter for natural images. The diffusion model remains frozen with no additional language conditioning; the meta-learning gradient is enough to produce both stylistic variations and concept compositions that the base model does not generate on its own.

2605.16476 2026-05-19 eess.IV cs.CV cs.LG 版本更新

Deep Learning for MRI Slice Interpolation: The Critical Role of Problem Formulation

深度学习在MRI切片插值中的应用:问题建模的关键作用

Shamit Savant

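AI summary: This paper examines deep learning methods for interpolating intermediate MRI slices in prostate imaging and finds that problem formulation affects performance far more than architectural complexity; reformulating the task to use adjacent slices improves SSIM.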

Comments 10 pages main text, 21 pages total with supplementary, 8 figures, supplementary material included

详情
AI中文摘要

在临床MRI中,通过平面分辨率通常比平面内分辨率更粗糙,限制了诊断价值。本文研究了深度学习方法用于插值中间MRI切片,有效将通过平面分辨率翻倍。评估了五种架构(CNN、U-Net、两种GAN变体和DDPM),发现问题建模对性能的影响远大于架构复杂度。通过将插值任务改用相邻切片(i-1,i+1)而非远距离切片(i-2,i+2),在所有确定性架构上实现了58%的SSIM提升。U-Net模型在PSNR为30.08 dB和SSIM为0.898,比线性插值基线提升了10.1%。DDPM也进行了评估,但因随机生成与确定性重建需求不匹配而表现不佳。这些发现表明,在医学影像任务中,问题建模的影响是架构复杂度的290倍。

英文摘要

Through-plane resolution in clinical MRI is typically much coarser than in-plane resolution, limiting diagnostic utility. This work investigates deep learning approaches to interpolate intermediate MRI slices in prostate imaging, effectively doubling through-plane resolution. I evaluated five architectures (CNN, U-Net, two GAN variants, and DDPM) and discovered that problem formulation has dramatically more impact than architectural complexity. By reformulating the interpolation task to use adjacent slices (i-1, i+1) rather than distant slices (i-2, i+2), I achieved a 58% improvement in SSIM performance across all deterministic architectures. The U-Net model achieved the best results with a PSNR of 30.08 dB and an SSIM of 0.898, representing a 10.1% improvement over the linear interpolation baseline. A DDPM was also evaluated but showed poor reconstruction quality due to a fundamental mismatch between stochastic generation and deterministic reconstruction requirements. These findings demonstrate that problem formulation can have 290x more impact than architectural sophistication in medical imaging tasks.
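The two problem formulations and the linear interpolation baseline can be sketched as follows. This is a hedged illustration: the `gap` parameter (1 for slices i-1 and i+1, 2 for slices i-2 and i+2), the function names, and the assumption that intensities are scaled to [0, 1] are ours, not details taken from the paper.

```python
import numpy as np


def make_triplets(volume: np.ndarray, gap: int = 1):
    """Yield (before, after, target) slices from a (slices, H, W) volume.

    gap=1 uses neighbors i-1 and i+1; gap=2 uses i-2 and i+2.
    """
    for i in range(gap, volume.shape[0] - gap):
        yield volume[i - gap], volume[i + gap], volume[i]


def linear_interpolate(before: np.ndarray, after: np.ndarray) -> np.ndarray:
    """Baseline: average the two input slices."""
    return 0.5 * (before + after)


def psnr(reference: np.ndarray, estimate: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    mse = float(np.mean((reference - estimate) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range**2 / mse))


# Synthetic volume whose intensity changes linearly across slices,
# so the linear baseline reconstructs every target exactly.
ramp = np.stack([np.full((4, 4), 0.125 * i) for i in range(5)])
scores = [psnr(t, linear_interpolate(a, b)) for a, b, t in make_triplets(ramp)]
```

Real anatomy does not vary linearly between slices, which is where a learned interpolator can improve on this baseline.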

2605.16473 2026-05-19 stat.ML cs.LG cs.NA math.NA math.PR 版本更新

Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures

预处理退火 Langevin 动力学在多模高斯混合中的维度均匀离散化分析

Lorenzo Baldassari, Josselin Garnier, Knut Solna, Maarten V. de Hoop

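AI summary: This paper studies the stability of preconditioned annealed Langevin dynamics for Gaussian mixtures, contrasts Euler-Maruyama discretization with an exponential-integrator scheme, and proves a dimension-uniform upper bound on the KL divergence for the exponential integrator under explicit spectral conditions.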

详情
AI中文摘要

在高维和无穷维设置中,获得稳定的扩散基采样器具有挑战性,因为高频率坐标上的误差累积会使动力学在有限维近似细化时变得不稳定。离散化是此类误差的典型来源,而使用合适的谱衰减预处理是控制其累积的一种方法。本文研究了预处理退火 Langevin 动力学(ALD)应用于高斯混合时的问题。我们首先证明 Euler-Maruyama(EM)离散化通过将退火分数的刚性线性部分用前向 Euler 步处理,施加了将预处理器与退火协方差尺度耦合的稳定性约束。结合确保退火动力学维度均匀控制的条件,该约束迫使初始平滑分布在不同维度上保持均匀接近目标。然后我们考虑了对退火分数的刚性线性部分进行精确积分的指数积分方案。在满足耦合平滑协方差、组件协方差谱和预处理器的显式谱可求和条件时,我们证明了该方案的 KL 散度具有维度均匀的上界。此上界可通过允许足够时间进行退火并相应细化时间网格来使其任意小。重要的是,这些条件允许 KL 散度在不同维度上发散的区域,表明 EM 限制是方案依赖的,而非 ALD 的固有属性。

英文摘要

Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
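The contrast between a forward Euler treatment of a stiff linear term and its exact integration can be seen on a single scalar mode. The sketch below is a simplified stand-in, not the paper's preconditioned ALD: it tracks the second moment of the Ornstein-Uhlenbeck process dX = -lam X dt + sqrt(2) dW, whose stationary variance is 1/lam, under the two update rules. All names and parameter values are illustrative assumptions.

```python
import math


def second_moment(lam: float, h: float, steps: int, scheme: str, m0: float = 1.0) -> float:
    """Propagate E[X^2] for dX = -lam * X dt + sqrt(2) dW.

    Both schemes have the form X_next = a * X + noise with variance q,
    so E[X^2] follows m_next = a^2 * m + q.
    """
    if scheme == "em":
        # Euler-Maruyama: forward Euler on the linear drift.
        a = 1.0 - lam * h
        q = 2.0 * h
    elif scheme == "exp":
        # Exponential integrator: the linear part is integrated exactly.
        a = math.exp(-lam * h)
        q = (1.0 - math.exp(-2.0 * lam * h)) / lam
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    m = m0
    for _ in range(steps):
        m = a * a * m + q
    return m


# A stiff (high-frequency) mode with lam * h = 5 > 2.
em_stiff = second_moment(lam=100.0, h=0.05, steps=50, scheme="em")
exp_stiff = second_moment(lam=100.0, h=0.05, steps=50, scheme="exp")
```

Forward Euler is stable only when |1 - lam h| < 1, so a fixed step size fails once lam grows, while the exponential update contracts for every lam and reproduces the stationary variance 1/lam. This is the scalar analogue of the step-size coupling the abstract attributes to Euler-Maruyama.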

2605.16470 2026-05-19 cs.LG cs.AI 版本更新

Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

战略性过参数化以实现通用的低秩适应

Jing Gao, Zhong-Yi Lu, Pan Zhang, Ze-Feng Gao

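AI summary: This paper proposes LoRA-Over, a framework that enriches the optimization landscape during training and collapses the added parameters at inference to improve the generalization of low-rank adaptation; experiments show it outperforms vanilla LoRA across multiple tasks.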

详情
AI中文摘要

本文提出LoRA-Over框架,通过训练时丰富优化景观并推理时压缩,提升低秩适应的泛化能力,实验显示其在多个任务上优于传统LoRA。

英文摘要

Adapting large language models (LLMs) to downstream tasks via full fine-tuning is increasingly impractical due to its computational and memory demands. Parameter-efficient fine-tuning (PEFT) approaches such as Low-Rank Adaptation (LoRA) mitigate this by confining updates to a compact set of trainable parameters, but this aggressive reduction often sacrifices generalization, especially under transfer across heterogeneous tasks and domains. We revisit the tension between parameter efficiency and adaptation capacity, and ask whether the two are truly at odds. We answer in the negative by introducing LoRA-Over, a framework grounded in a simple principle: enrich the optimization landscape during training, then collapse the enrichment at inference. LoRA-Over injects auxiliary parameters into the low-rank adapters during training to broaden the effective hypothesis space, and through a decomposition-based reformulation folds them back into a standard low-rank structure with negligible reconstruction error, keeping inference cost identical to vanilla LoRA. Since not all weight matrices benefit equally from added capacity, we further propose two scheduling strategies, one statically predefined and one dynamically determined at runtime, that direct extra capacity where most needed. We evaluate LoRA-Over on language understanding (GLUE, T5-Base), dialogue (MT-Bench), arithmetic reasoning (GSM8K), and code generation (HumanEval), using LLaMA 2-7B and LLaMA 3.1-8B. Across all benchmarks and scales, LoRA-Over consistently outperforms vanilla LoRA, showing that principled over-parameterization designed to vanish at inference is an effective lever for improving PEFT generalization. Code will be released upon acceptance.
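One way to picture "fold them back into a standard low-rank structure" is a truncated SVD of the trained weight update, which gives the best rank-r factorization in Frobenius norm. The sketch below is an assumption-laden illustration: the abstract does not specify the decomposition used by LoRA-Over, and the auxiliary matrix `m` inserted between the LoRA factors is a hypothetical form of over-parameterization chosen so that the fold is exact.

```python
import numpy as np


def fold_to_rank(delta_w: np.ndarray, rank: int):
    """Return (B, A, relative_error) with B @ A the best rank-`rank` fit.

    B has shape (out, rank) and A has shape (rank, in), matching the
    usual LoRA factor layout.
    """
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    b = u[:, :rank] * s[:rank]  # scale columns by singular values
    a = vt[:rank]
    rel_err = np.linalg.norm(delta_w - b @ a) / np.linalg.norm(delta_w)
    return b, a, float(rel_err)


rng = np.random.default_rng(0)
r, d = 4, 32
b0 = rng.normal(size=(d, r))
a0 = rng.normal(size=(r, d))
m = rng.normal(size=(r, r))  # hypothetical auxiliary training-time parameters

# Training-time update with extra parameters; its rank is still at most r.
delta_train = b0 @ (np.eye(r) + m) @ a0
b, a, err = fold_to_rank(delta_train, r)
```

Because the folded factors have the same shapes as ordinary LoRA factors, inference cost is unchanged; if the training-time update had rank above r, the truncation error would equal the energy in the discarded singular values.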

2605.16468 2026-05-19 cs.CV cs.AI cs.CL cs.LG q-bio.NC 版本更新

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

可解释的神经编码机制揭示人类视觉皮层的精细功能选择性

Idan Daniel Grosbard, Mor Geva, Galit Yovel

AI总结 本文提出MINE框架,通过机制可解释工具揭示自然图像中驱动皮层 voxel 活动的特征,验证了特征对 voxel 响应的因果影响,并揭示了视觉皮层中精细的功能选择性。

Comments 40 pages, 28 figures

详情
AI中文摘要

理解人类视觉的核心目标是揭示驱动神经活动的视觉特征。已有研究利用人工神经网络作为编码模型预测皮层对自然图像的响应,揭示了激活类别选择区域的视觉内容。然而,现有方法多为相关性分析,将编码器视为黑箱,无法确定哪些图像特征驱动每个 voxel 的响应。本文提出机制可解释神经编码(MINE)框架,通过机制可解释工具定位自然图像中驱动毫米级(voxel 级)活动的特征。MINE利用语言对齐的图像表示预测每个 voxel 的响应,并生成语义可解释的特征描述,用于 voxel 的激活。进一步将这些 per-image 特征泛化为 per-voxel 功能轮廓。为验证 per-image 描述,我们显示它们足以生成激发 voxel 响应与原始图像响应匹配的图像,其准确性优于随机或低贡献控制生成的图像。此外,通过反事实插入或移除预测特征,可使激活在预期方向变化,提供因果证据。由 voxel 激活轮廓指导的反事实编辑产生更强的激活变化,表明轮廓忠实捕捉每个 voxel 的选择性。最后,将 MINE 应用于研究充分的类别选择脑区,显示其恢复了已知的类别偏好,同时揭示了每个区域内的精细 voxel 结构。总体而言,我们的结果确立了机制可解释性作为发现和验证神经功能精细假设的路径。

英文摘要

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

2605.16454 2026-05-19 cs.LG eess.SP quant-ph 版本更新

QuChaTeR: A Hybrid Quantum-Chaotic Temporal Framework for Earthquake Prediction

QuChaTeR:一种混合量子-混沌时间框架用于地震预测

Emir Kaan Özdemir

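AI summary: QuChaTeR combines wavelet preprocessing, chaotic maps, and variational quantum circuits with recurrent structures to improve temporal feature extraction from seismic signals; it performs well on real-world seismic datasets, though scalability and quantum hardware limitations remain challenges.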

Comments Accepted at 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP 2026). This is the accepted version of the paper. The final published version will appear in the IEEE proceedings. Proc. IEEE ICASSP 2026, Barcelona, Spain, 2026

详情
AI中文摘要

地震预测仍面临挑战,因其信号具有高度非线性和混沌动态。尽管经典深度学习模型如LSTM和CNN能捕捉局部时间特征,而量子模型提供更丰富的状态表示,但将混沌驱动机制与之结合仍不充分。我们引入QuChaTeR,一种混合架构,结合基于小波的预处理、混沌映射和变分量子电路与递归结构,以增强时间特征提取能力。QuChaTeR使用PyTorch和PennyLane实现,并在经典(LSTM、GRU、RNN、1D-CNN、Reservoir Computing)和量子启发(Quantum LSTM)基线模型上进行基准测试。在真实世界地震数据集上,QuChaTeR在多个评估标准上均表现出色,收敛速度更快。尽管结果令人鼓舞,但可扩展性和量子硬件限制仍是挑战。总体而言,本工作展示了量子-混沌混合方法如何为更准确和稳健的地震预测提供实用路径。

英文摘要

Seismic prediction remains challenging due to the highly nonlinear and chaotic dynamics of earthquake signals. While classical deep learning models such as LSTMs and CNNs capture local temporal features, and quantum models offer richer state representations, their integration with chaos-driven mechanisms is underexplored. We introduce QuChaTeR, a hybrid architecture that combines wavelet-based preprocessing, chaotic maps, and variational quantum circuits with recurrent structures to enhance temporal feature extraction. Implemented in PyTorch and PennyLane, QuChaTeR is benchmarked against classical (LSTM, GRU, RNN, 1D-CNN, Reservoir Computing) and quantum-inspired (Quantum LSTM) baselines. On real-world seismic datasets, QuChaTeR consistently converges faster and achieves superior performance across multiple evaluation criteria. Despite promising results, scalability and quantum hardware limitations remain challenges. Overall, this work demonstrates how quantum-chaotic hybridization provides a practical pathway toward more accurate and robust earthquake prediction.

2605.16452 2026-05-19 cs.LG cs.AI 版本更新

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

峰值检测器:通过指令调优的大语言模型实现可解释的多模态峰值检测

Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou

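AI summary: This paper proposes Peak-Detector, a framework that uses instruction-tuned large language models for cross-modal, explainable peak detection; a peak-representation technique condenses time-series data and improves detection accuracy, and the generated rationales support verification and error analysis.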

详情
AI中文摘要

准确检测多种心脏生理信号(如心电图、脉搏波容积图、球状心图和体震图)中的峰值对心血管监测至关重要,但常受伪影和信号变异影响。传统算法通常基于专家知识针对单一信号模态设计,限制了通用性。相比之下,深度学习方法缺乏可解释性,限制了专家验证和人机交互。为此,我们引入Peak-Detector框架,利用指令调优的大语言模型(LLMs)实现稳健、跨模态且可解释的峰值检测。框架的核心创新是“峰表示”技术,将时间序列数据转换为压缩格式,在保留关键事件信息的同时显著减少信号长度。此表示提供关键的归纳偏差,引导LLM在生理有意义的事件上推理而非原始噪声数据。模型通过监督微调(SFT)后接强化学习(RL)的多目标奖励函数进行优化。模型的自解释能力通过在自建的Peak-Explanation数据集上微调来培养。在四个模态(ECG、PPG、BCG和BSG)覆盖七个数据集(六个公开基准加一个真实世界队列)上,Peak-Detector展示了强大的跨模态性能,实现了临床相关时间容忍度下的最佳或并列最佳检测。除了准确性外,生成的解释性内容揭示了失败模式并支持验证和错误分析。

英文摘要

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.
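As a rough picture of what a condensed "peak-representation" might look like, the sketch below reduces a sampled signal to a list of (index, amplitude) pairs for its local maxima. The exact representation used by Peak-Detector is not given in the abstract, so the local-maximum rule, the `min_height` filter, and the function name are all assumptions made for illustration.

```python
from typing import List, Tuple


def peak_representation(
    signal: List[float], min_height: float = 0.0
) -> List[Tuple[int, float]]:
    """Condense a signal to (index, amplitude) pairs of local maxima.

    A sample counts as a peak if it rises above its left neighbor and is
    not exceeded by its right neighbor; on a flat top the first sample wins.
    """
    peaks: List[Tuple[int, float]] = []
    for i in range(1, len(signal) - 1):
        rises = signal[i] > signal[i - 1]
        holds = signal[i] >= signal[i + 1]
        if rises and holds and signal[i] >= min_height:
            peaks.append((i, signal[i]))
    return peaks


# Toy beat: two large peaks and one small ripple.
beat = [0.0, 1.0, 0.0, 0.2, 0.1, 3.0, 0.0]
all_peaks = peak_representation(beat)
strong_peaks = peak_representation(beat, min_height=0.5)
```

The condensed list is much shorter than the raw samples, which is the property that makes such a representation attractive as input to a language model with a limited context.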

2605.16449 2026-05-19 cs.LG cs.AI 版本更新

PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting

PESD-TSF:一种周期感知和显式结构分解框架,用于长期时间序列预测

Hua Wang, Xianhao Jiao, Fan Zhang

AI总结 PESD-TSF通过引入周期性门控机制、多尺度编码器和跨尺度协作注意力,解决深度网络中周期感知减弱和变量间依赖破坏的问题,提升多变量时间序列预测性能。

Comments 23 pages, 9 figures, 13 tables

详情
AI中文摘要

随着网络深度增加,深度预测模型常出现周期感知减弱以及趋势与噪声表示相互纠缠的问题。此外,广泛采用的通道独立范式虽然提高了训练稳定性,却破坏了变量间内在的动态协调,阻碍了对多变量时间序列中跨变量一致性的建模。为此,我们提出PESD-TSF,一种受物理启发的结构化分解框架,用于长期时间序列预测,兼顾可解释性和预测准确性。PESD-TSF引入三个关键设计:首先,乘法周期性门控机制引入连续时间先验,动态调节信号幅度,在深层网络中保持周期结构;其次,多尺度结构化编码器将去趋势注意力与分层采样相结合,显式分离长期趋势与高频变化,同时保留细粒度时间语义;第三,为恢复被破坏的变量间依赖,我们提出跨尺度协作注意力(CSCA)与RLC正则化方案,在深层特征空间中重构全局变量间拓扑,并通过正交性和一致性约束实现物理一致的协作。在多个领域的基准数据集上进行的大量实验表明,PESD-TSF始终取得最先进的性能,在涉及复杂变量间耦合的多变量预测任务上提升尤为显著,突显其优越的结构建模能力和泛化能力。

英文摘要

Deep forecasting models often suffer from attenuated periodic perception and entangled trend-noise representations as network depth increases. Moreover, the widely adopted channel-independent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross-variable consistency in multivariate time series. To address these issues, we propose PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that jointly emphasizes interpretability and predictive accuracy. PESD-TSF introduces three key designs. First, a Multiplicative Periodic Gating mechanism incorporates continuous-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers. Second, a multi-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long-term trends from high-frequency variations while retaining fine-grained temporal semantics. Third, to recover disrupted inter-variable dependencies, we propose Cross-Scale Collaborative Attention (CSCA) together with an RLC regularization scheme, which reconstructs global inter-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD-TSF consistently achieves state-of-the-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter-variable coupling, highlighting its superior structural modeling capability and generalization.

2605.16443 2026-05-19 cs.LG cs.AI 版本更新

Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning

二值对称循环矩阵:在深度学习中的应用

Jayakrishna Amathi, Venkata Prasanth Yanambaka, Saraju P. Mohanty, Elias Kougianos

AI总结 本文提出二值对称循环矩阵(TVSCM),每层仅使用两个权重,形成极稀疏结构,显著降低存储需求。实验中,MNIST上的参数从623,290减至7,852,MIT-BIH上从24,709减至942,精度分别由97.6%降至93.5%和93.1%,适用于边缘计算和低功耗系统。

详情
AI中文摘要

尽管深度神经网络在视觉、医疗诊断和物联网场景中取得了成功,但由于存储需求高、计算复杂度大、占用空间大,将其部署到资源受限的平台上仍面临严峻挑战。尤其是全连接层需要大量权重,边缘设备难以容纳。为应对这些挑战,本文提出二值对称循环矩阵(TVSCM),这是一种极为稀疏的架构,每层仅使用两个权重,并保持矩阵的循环性和对称性。与传统的全权重存储相比,这种极端形式的结构化稀疏架构的存储成本几乎可以忽略。与低秩近似、剪枝等传统稀疏学习技术不同,该架构无需额外硬件和附加处理阶段,即可实现极端稀疏和极低的存储需求。仿真研究表明,模型参数减少80倍以上:在MNIST数据集上从623,290减少到7,852,在MIT-BIH心律失常数据集上从24,709减少到942;精度在MNIST上由97.6%降至93.5%,在MIT-BIH上由97.6%降至93.1%,仍处于可比水平。由于架构需求极小、功耗很低,该架构非常适合边缘计算平台、微型机器学习(tiny-ML)平台、IoMT系统和电池供电系统。

英文摘要

Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges associated with limited platforms, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer to keep it circulant and symmetric. The extreme form of structured sparse architecture provides negligible storage costs compared to traditional full-weight storage. Instead of hardware and additional stages of other traditional sparse learning techniques, such as low-rank approximation and pruning approaches, this architecture provides an extreme form of sparsity, achieving very minimal storage requirements. The simulation study demonstrates more than 80$\times$ reduction in model parameters, reducing parameters from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset, while maintaining comparable accuracy from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be ideal for edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.
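The structure named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two weights are placed inside the matrix, so the `pattern` argument below is an assumption made purely for illustration, as are the function name and the example values. A circulant matrix is fixed by its first row, and it is symmetric when the entry at circular offset k equals the entry at offset n - k.

```python
def two_valued_symmetric_circulant(n, a, b, pattern):
    """Build an n x n circulant matrix whose entries are only a or b.

    pattern[k] picks the value at circular offset k (0 -> a, 1 -> b).
    The placement rule is hypothetical; the source abstract does not give one.
    Symmetry requires pattern[k] == pattern[(n - k) % n].
    """
    if len(pattern) != n:
        raise ValueError("pattern must have length n")
    for k in range(n):
        if pattern[k] != pattern[(n - k) % n]:
            raise ValueError("pattern is not symmetric under k -> n - k")
    first_row = [b if pattern[k] else a for k in range(n)]
    # Row i of a circulant matrix is the first row shifted right by i.
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


# Only two trainable scalars (a, b) describe the whole 5 x 5 weight matrix.
W = two_valued_symmetric_circulant(5, 0.5, -0.25, [0, 1, 0, 0, 1])
```

Under this reading, storage per layer is two floats plus the fixed pattern, regardless of the layer width.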

2605.16442 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

环境感知的长时域船舶轨迹预测分层两阶段框架

Ganeshaaraj Gnanavel, Tharindu Fernando, Sridha Sridharan, Clinton Fookes

AI总结 本文提出分层两阶段框架,通过分层融合机制将粗粒度长期预测器与网格感知短期预测器结合,提升船舶轨迹预测精度,实验显示其在ADE和FDE上分别比现有最优方法提升25%和17%。

详情
AI中文摘要

在真实海洋条件下进行长时域船舶轨迹预测,对避碰、交通管理和航线规划至关重要。然而,由于长程时间依赖以及海流、风、浪等动态环境因素,准确预测颇具挑战。为此,我们提出一种分层两阶段框架,通过分层融合机制将粗粒度的长期预测器与网格感知的短期预测器结合起来。短期分支在离散化的海域网格单元上使用时空图Transformer捕捉局部动态,长期分支则编码总体航行意图。集成的环境模块引入表层海流、风矢量和有效波高等海洋学参数,通过跨模态注意力和逐特征调制,对不同海况作出自适应响应。此外,可学习的Savitzky-Golay平滑层增强了融合轨迹的时间连贯性。我们在澳大利亚船舶跟踪系统(Craft Tracking System,CTS)的西北海域数据上评估了该方法,数据与Copernicus Marine Service产品对齐,采用3小时输入和10小时预测时域。实验结果表明,我们的框架在平均位移误差(ADE)上比现有最优方法提升25%,在最终位移误差(FDE)上提升17%。消融研究进一步验证了各组件的贡献。

英文摘要

Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.
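For readers unfamiliar with the two reported metrics, the sketch below computes them in their usual form: ADE is the mean point-wise Euclidean distance over the prediction horizon, and FDE is the distance at the final step. It assumes trajectories are already expressed in planar coordinates with consistent units; the abstract does not state the units or the projection used, and the function name is illustrative.

```python
import math


def ade_fde(predicted, actual):
    """Return (ADE, FDE) for two equal-length lists of (x, y) positions."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    ade = sum(distances) / len(distances)  # average over all horizon steps
    fde = distances[-1]                    # error at the last horizon step
    return ade, fde


ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])
```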

2605.16441 2026-05-19 cs.LG cs.AI 版本更新

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition

DeepArrhythmia:基于选择性证据获取的片段上下文化ECG心律失常分类

Jiahui Li, Ruili Fang, Zishuai Liu, WenZhan Song, Jin Lu, Fei Dou

AI总结 DeepArrhythmia通过选择性证据获取实现片段上下文化的心搏级ECG心律失常分类,结合原始ECG信号和渲染的波形图像,利用专门工具将生理测量与证据整合解耦,并依据片段级置信度在精简与丰富证据状态之间路由。

详情
AI中文摘要

心搏级心电图(ECG)心律失常检测旨在为记录中的每个心搏分配一个心律失常类别,但许多现有系统将心搏视为孤立的局部实例。这带来了局限,因为心搏标签往往取决于多心搏的节律上下文,包括时序、代偿间歇和逐搏形态一致性。我们提出DeepArrhythmia,一种以工具为依托的多模态框架,用于片段上下文化的心搏级ECG心律失常分类。给定一个多心搏ECG片段,DeepArrhythmia结合原始ECG信号和渲染的波形图像,定位R峰以识别各个心搏,并生成结构化的心搏级预测。该框架借助专门工具将生理测量与证据整合解耦,这些工具分别用于心搏定位、数值化的节律-形态特征提取和侧重形态的文本分析。由于更丰富的生理证据并非总是有用,DeepArrhythmia利用片段级置信度在精简证据状态与丰富证据状态之间路由。这种智能体式设计将节律上下文、显式的生理依据和选择性证据获取结合起来用于决策。

英文摘要

Beat-level Electrocardiography (ECG) arrhythmia detection aims to assign an arrhythmia class to each beat in a recording, yet many existing systems treat beats as isolated local instances. This is limiting because beat labels often depend on multi-beat rhythm context, including timing, compensatory pauses, and beat-to-beat morphological consistency. We present DeepArrhythmia, a tool-grounded multimodal framework for segment-contextualized beat-level ECG arrhythmia classification. Given a multi-beat ECG segment, DeepArrhythmia combines the raw ECG signal and a rendered waveform image, localizes R peaks to identify beat instances, and produces structured beat-level predictions. The framework decouples physiological measurement from evidence integration using specialized tools for beat localization, numerical rhythm-morphology extraction, and morphology-focused textual analysis. DeepArrhythmia uses segment-level confidence to route between minimal and rich evidence states, since richer physiological evidence is not uniformly useful. This agentic design integrates rhythm context, explicit physiological grounding, and selective evidence acquisition for decision making.

2605.16438 2026-05-19 cs.LG cs.AI 版本更新

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

基于量子退火器上QUBO客户端选择的拜占庭鲁棒联邦学习

Andras Ferenczi, Sutapa Samanta, Dagen Wang, Jason Qizhe Qin

AI总结 本文提出利用量子退火解决联邦学习中的拜占庭容错问题,通过将客户端选择转化为二次无约束二元优化问题,提升对恶意更新的检测能力。

Comments 9 pages, 6 figures, 8 tables

详情
AI中文摘要

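联邦学习(FL)在保护数据隐私的同时,在分散的客户端上训练全局模型,但在大规模部署时易受恶意更新攻击。MultiKrum等拜占庭鲁棒聚合方法依据最近邻对梯度打分,可能漏掉那些保留了诚实更新统计特性的恶意更新。本文提出一种量子退火方法,将客户端选择重新表述为二次无约束二元优化(QUBO)问题,把成对距离编码进代价函数,并由量子退火器求解。与MultiKrum逐客户端的贪心打分不同,QUBO在所有子集上联合优化,寻找彼此最接近的$m$个客户端。在小规模(15个客户端)下,QUBO在最具挑战性的拜占庭攻击上优于MultiKrum,例如Advanced LIE的检测准确率在MNIST上为95.11%对81.33%,在CIFAR-10上为97.78%对75.56%;但在MultiKrum表现出色的较简单攻击上,QUBO表现不佳,两种方法因此互补。QUBO的求解质量还会随客户端数量增加而下降。为此,本文引入MultiSignal集成方法,利用基于欧几里得和余弦Krum分数差距的双特征路由门,将攻击划分为四种情形,并把规避型攻击路由至带可疑度惩罚、采用一致性投票的QUBO。在MNIST上100个客户端的设置下,MultiSignal的平均检测准确率为95.3%,经典MultiKrum为91.8%,其中Sparse Lie(72.0%升至95.2%)和Advanced Lie(80.4%升至85.2%)提升最大。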

英文摘要

Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients against their nearest neighbors and can miss malicious updates that preserve the statistical properties of honest ones. We propose a quantum annealing approach that reformulates client selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem, encoding pairwise distances into a cost function solved by quantum annealers (QA). Unlike MultiKrum's greedy per-client scoring, the QUBO formulation jointly optimizes over all subsets to find the mutually closest group of $m$ clients. At small scale (15 clients), QUBO outperforms MultiKrum on the most challenging Byzantine attacks: e.g., Advanced LIE is detected with 95.11% accuracy versus 81.33% on MNIST and 97.78% versus 75.56% on CIFAR-10. QUBO fares poorly on simpler attacks where MultiKrum excels, so the two methods are complementary. QUBO quality also degrades as the number of clients grows. To address this, we introduce a MultiSignal ensemble that uses a dual-feature routing gate based on Euclidean and cosine Krum score gaps to classify attacks into four regimes and routes evasion attacks to a suspicion-penalized QUBO with agreement voting. At 100 clients on MNIST, MultiSignal achieves 95.3% average detection accuracy versus 91.8% for classical MultiKrum, with the largest gains on Sparse Lie (72.0% to 95.2%, +23.2 points) and Advanced Lie (80.4% to 85.2%, +4.8 points). These results show that QUBO-based quantum annealing with MultiSignal is a principled and scalable defense against the most challenging Byzantine strategies in federated learning.
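The idea of selecting the mutually closest group of $m$ clients can be sketched as a small QUBO. The energy below, pairwise squared distances plus a quadratic penalty that enforces exactly $m$ selected clients, is one plausible formulation and not necessarily the authors' cost function; the exhaustive search is a classical stand-in for the quantum annealer and only works for a handful of clients. All names and the toy updates are illustrative.

```python
import math
from itertools import product


def qubo_energy(x, dist, m, penalty):
    """Sum of pairwise distances among selected clients plus cardinality penalty."""
    n = len(x)
    pair_cost = sum(dist[i][j] * x[i] * x[j]
                    for i in range(n) for j in range(i + 1, n))
    return pair_cost + penalty * (sum(x) - m) ** 2


def select_clients(updates, m):
    """Return indices of the m updates that minimise the QUBO energy."""
    n = len(updates)
    dist = [[math.dist(u, v) ** 2 for v in updates] for u in updates]
    # Penalty larger than any achievable pairwise cost, so exactly m are chosen.
    penalty = 1.0 + sum(map(sum, dist))
    best = min(product((0, 1), repeat=n),
               key=lambda x: qubo_energy(x, dist, m, penalty))
    return [i for i, bit in enumerate(best) if bit]


# Four similar (honest-looking) updates and two outliers.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.0], [1.0, 1.1], [5.0, -4.0], [-6.0, 7.0]]
selected = select_clients(updates, m=4)
```

Because the objective is evaluated on whole subsets, an update is kept only if it is close to the rest of the selected group, which is the contrast with per-client scoring that the abstract draws.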

2605.16435 2026-05-19 cs.LG cs.AI 版本更新

GPU-Accelerated Deep Learning for Heatwave Prediction and Urban Heat Risk Assessment

基于GPU的深度学习用于热浪预测和城市热风险评估

Adis Alihodžić

AI总结 本文提出基于GPU的深度学习框架,用于预测城市热条件和评估热风险,采用MODIS和Open-Meteo数据,验证了ConvLSTM混合损失函数的有效性,提升了预测精度与效率。

详情
AI中文摘要

热浪是城市面临的重要问题,气候变化又使这一问题更加棘手。本文提出一种基于GPU的深度学习框架,用于城市热状况的次日预测和热风险评估。研究以萨拉热窝为对象,使用MODIS地表温度数据和Open-Meteo预报数据。我们测试了多种模型,包括卷积模型和时空模型,其中采用混合损失函数的ConvLSTM效果最好,得到MAE = 0.2293、RMSE = 0.3089、$R^2$ = 0.8877。实验还表明,使用更长的时间序列和更多气象变量可以改善结果。由于框架在GPU上实现并采用混合精度训练,执行时间得以缩短。基于预测的温度场,还可以将危险性信息与暴露度和脆弱性数据结合,生成城市热风险地图。所提框架可作为城市热分析的实用基础。

英文摘要

Heatwaves are an important problem in cities, and climate change makes this problem more difficult. In this paper, we present a GPU-based deep learning framework for next-day prediction of urban thermal conditions and for heat risk assessment. The study was carried out in Sarajevo by using MODIS land surface temperature data and Open-Meteo forecast data. We tested several models, including convolutional models and spatiotemporal models. Among them, ConvLSTM with a mixed loss function gave the best results. The obtained values were MAE = 0.2293, RMSE = 0.3089, and $R^2$ = 0.8877. The experiments also showed that results can be improved by using longer temporal series and additional meteorological variables. Since the framework was implemented on a GPU and trained with mixed precision, the execution time was reduced. Based on the predicted temperature fields, it was also possible to combine hazard information with exposure and vulnerability data in order to generate city heat risk maps. The proposed framework can be used as a practical basis for city heat analysis.
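The three reported scores follow standard definitions, sketched below for reference. The abstract does not say whether the temperatures were normalized before scoring, so the reported MAE and RMSE cannot be read as degrees without that information; the function name and toy values here are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```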

2605.16433 2026-05-19 cs.LG cs.AI 版本更新

Edge-AI-Driven Learning-to-Rank for Decentralized Task Allocation in Circular Smart Manufacturing

边缘AI驱动的排序学习用于循环智能制造中的去中心化任务分配

Mohammadhossein Ghahramani, Yan Qiao, Mengchu Zhou

AI总结 本文提出一种边缘AI驱动的去中心化任务分配框架,通过排序感知的协商实现高效资源分配,在高负载下改善延迟和截止期限遵守情况,在更紧的约束下提高能效。

详情
Journal ref
Under review at IEEE IoT J, 2026
AI中文摘要

在智能制造系统中,任务分配需要在去中心化决策、动态负载和共享资源约束下进行。在循环制造环境中,还需在运营效率与资源和能源可持续性之间取得平衡,这进一步加剧了上述挑战。尽管已有基于学习的方法,但许多方法侧重于预测绝对性能指标;由于去中心化分配取决于候选机器的相对排序,这类预测未必能改善分配结果。本文提出一种基于排序感知协商的边缘AI驱动去中心化任务分配框架,将轻量级决策智能嵌入机器层面,无需集中控制即可实现低延迟协调。该框架分步构建:首先,资源感知的启发式方法建立去中心化的投标结构;随后,基于边缘AI的回归模型给出学习得到的本地投标近似;最后,排序感知的建模方式重塑学习目标,使之与基于排序的赢家选择相一致。每台机器利用本地信息评估到达的任务,包括处理能力、队列状态和资源竞争情况。该框架通过离散事件仿真在高负载和紧截止期限场景下进行评估,指标包括延迟、截止期限违规、吞吐量和能耗。结果表明,高负载下的延迟和截止期限遵守情况得到改善,更紧约束下的能效有所提高,从而实现更节约资源的运行,符合循环制造目标。这些发现表明,使学习目标与去中心化决策结构保持一致,对于有效的协商驱动任务分配至关重要。

英文摘要

Task allocation in smart manufacturing systems needs to operate under decentralized decision-making, dynamic workloads, and shared resource constraints. In circular manufacturing settings, these challenges are further intensified by the need to balance operational efficiency with resource and energy sustainability. While learning-based approaches have been explored, many focus on predicting absolute performance metrics that do not necessarily translate into improved allocation outcomes, since decentralized assignment is governed by the relative ordering of candidate machines. This work proposes an Edge-AI-driven decentralized task allocation framework based on ranking-aware negotiation, where lightweight decision intelligence is embedded at the machine level to enable low-latency coordination without centralized control. The framework is developed progressively: a resource-aware heuristic first establishes the decentralized bidding structure, an Edge-AI-based regression model then provides learned local bid approximation, and a ranking-aware formulation finally reshapes the learning objective to align with the ordering-based nature of winner selection. Each machine evaluates incoming tasks using local information, including processing capability, queue state, and resource contention. The framework is evaluated via discrete-event simulation under high-load and tight-deadline scenarios using delay, deadline violations, throughput, and energy consumption. Results show improved delay and deadline adherence under high load, and enhanced energy efficiency under tighter constraints, leading to more resource-efficient operation aligned with circular manufacturing objectives. These findings demonstrate that aligning learning objectives with decentralized decision structures is critical for effective negotiation-driven task allocation.
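The argument that winner selection depends on the ordering of bids rather than on their absolute accuracy can be made concrete with a toy example. The sketch below assumes the lowest bid wins and uses a generic pairwise logistic ranking loss; the abstract does not specify the authors' ranking objective, and all numbers are made up.

```python
import math


def winner(bids):
    """Index of the winning machine, assuming the lowest bid wins."""
    return min(range(len(bids)), key=bids.__getitem__)


def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pairwise_ranking_loss(pred, true):
    """Logistic loss over machine pairs; small when pred orders pairs like true."""
    loss, pairs = 0.0, 0
    for i in range(len(true)):
        for j in range(len(true)):
            if true[i] < true[j]:  # machine i should receive the lower bid
                loss += math.log1p(math.exp(pred[i] - pred[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0


true_cost = [10.0, 10.5, 20.0]
pred_a = [10.4, 10.2, 20.0]  # small absolute error, but swaps the top two machines
pred_b = [12.0, 13.0, 23.0]  # large absolute error, correct ordering
```

Here `pred_a` has the lower squared error yet selects the wrong machine, while `pred_b` is far off in absolute terms but picks the true winner and has the lower ranking loss.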

2605.16429 2026-05-19 cs.LG cs.AI 版本更新

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow:用于连续强化学习中Fokker-Planck策略优化的量子振幅估计

Abraham Itzhak Weinberg

AI总结 QuantFPFlow将量子振幅估计用于连续强化学习中的Fokker-Planck策略优化,把配分函数估计的代价从O(1/ε²)降至O(1/ε),实现二次加速,并在多模态奖励景观中更频繁地找到全局最优。

详情
AI中文摘要

我们提出QuantFPFlow,一种将量子振幅估计融入随机策略优化的Fokker-Planck(FP)表述的强化学习框架。经典的连续空间RL智能体必须以$\mathcal{O}(1/\varepsilon^{2})$的代价估计FP配分函数$Z=\int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$;QuantFPFlow改用Grover放大的振幅估计器,代价为$\mathcal{O}(1/\varepsilon)$,实现可证明的二次加速。完整的量子加速需要容错硬件,但本文展示的量子启发式经典模拟已体现出$\mathcal{O}(1/\varepsilon)$的算法结构。估计得到的稳态分布$\rho^{*}$驱动一个有理论依据的探索奖励$R_{\mathrm{aug}}=R_{\mathrm{env}}+\alpha\log(1/\rho^{*}(s))$。该奖励将智能体引向多模态奖励景观中的全局最优区域,同时通过FP扩散匹配约束策略方差。在一个专门用于暴露局部最优失败的连续控制任务上,QuantFPFlow的平均奖励为$1{,}295.7\pm 423.2$,Soft Actor-Critic(SAC)为$1{,}284.0\pm 474.0$,且找到全局最优的频率相对高出10.4%(33.9%对30.7%)。策略熵在整个训练过程中保持在$H(\pi)\approx 6.5$纳特附近,而SAC降至1.5纳特,证实FP扩散匹配能主动防止过早收敛。维度实验进一步显示,QuantFPFlow的计算量按$\mathcal{O}(d^{0.35})$增长,而经典FP估计为$\mathcal{O}(d^{0.76})$。

英文摘要

We introduce QuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker-Planck (FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\mathcal{O}(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\mathcal{O}(1/\varepsilon)$, a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\mathcal{O}(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rho^{*}$ drives a theoretically grounded exploration bonus $R_{\mathrm{aug}} = R_{\mathrm{env}} + \alpha\log(1/\rho^{*}(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic (SAC), while discovering the global optimum 10.4% more frequently (33.9% vs. 30.7%). Policy entropy remains near $H(\pi)\approx 6.5$ nats throughout training, whereas SAC collapses to $1.5$ nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\mathcal{O}(d^{0.35})$ for QuantFPFlow versus $\mathcal{O}(d^{0.76})$ for classical FP estimation.
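The classical baseline and the exploration bonus from the abstract can be sketched in a few lines. This is a one-dimensional illustration only: the double-well potential, the bounded integration interval, and all parameter values are assumptions, and plain Monte Carlo stands in for the classical estimator whose error shrinks like the inverse square root of the sample count. No quantum amplitude estimation is implemented here.

```python
import math
import random


def potential(x):
    """Hypothetical double-well potential with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def estimate_partition(diffusion, lo, hi, n_samples, rng):
    """Monte Carlo estimate of Z = integral of exp(-V(x)/D) over [lo, hi]."""
    total = sum(math.exp(-potential(rng.uniform(lo, hi)) / diffusion)
                for _ in range(n_samples))
    return (hi - lo) * total / n_samples


def augmented_reward(r_env, state, z, diffusion, alpha):
    """R_aug = R_env + alpha * log(1 / rho*(s)), with rho* = exp(-V/D) / Z."""
    rho = math.exp(-potential(state) / diffusion) / z
    return r_env + alpha * math.log(1.0 / rho)


rng = random.Random(0)
z_hat = estimate_partition(diffusion=0.5, lo=-3.0, hi=3.0, n_samples=20000, rng=rng)
bonus_rare = augmented_reward(0.0, 0.0, z_hat, 0.5, alpha=0.1)    # barrier top
bonus_common = augmented_reward(0.0, 1.0, z_hat, 0.5, alpha=0.1)  # well bottom
```

States with low stationary density, such as the barrier between the wells, receive the larger bonus, which is how the term encourages exploration.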

2605.16420 2026-05-19 cs.CV cs.LG 版本更新

Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance

基于扩散模型的图像到视频生成与轨迹引导的视频重建

Stelio Bompai, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

AI总结 本文提出利用预训练的图像到视频扩散模型,通过GPS轨迹引导生成无人机视频的缺失或丢失帧,无需领域特定微调,展示了在低纹理和小目标条件下视频重建的有效性。

Comments Accepted at the 1st Workshop on Multi-Sensor Trajectory Knowledge Discovery and Extraction (MuseKDE 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (IEEE MDM 2026)

详情
AI中文摘要

本文研究自主水面航行器执行结构化海上机动时,俯拍无人机视频中缺失或丢失帧的重建问题。我们提出一种流程,使用预训练的图像到视频扩散模型,将原始GPS遥测数据和单个参考帧转换为轨迹引导的视频序列,无需领域特定微调。来自船载遥测日志的GPS坐标经等距柱状映射投影到图像空间,生成逐船的运动线索,作为SG-I2V扩散模型的条件。生成的帧使用感知、时间和轨迹三类指标与真实视频对比评估,并与光流外推和RIFE插值基线比较。在所有方法中,SG-I2V生成的帧最自然(BRISQUE 25.52,最接近真实视频的23.64),运动幅度最真实(时间平滑度1.14,真实视频为1.42),GPS轨迹贴合度最强(9.31px,真实视频为28.70px,后者反映的是视频与GPS日志之间仅近似的时间对齐,而非生成误差),表明在低纹理、小目标的困难条件下,轨迹引导的扩散合成是海上视频重建的可行方法。

英文摘要

This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most naturally appearing frames among all methods (BRISQUE 25.52, closest to ground-truth 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. ground truth 1.42), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground-truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.
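A minimal sketch of an equirectangular GPS-to-pixel mapping is given below. It assumes a nadir-pointing camera, a north-up image, a known ground resolution in metres per pixel, and a reference GPS fix whose pixel position is known; none of these details are stated in the abstract, and the function and parameter names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius


def gps_to_pixel(lat, lon, ref_lat, ref_lon, ref_pixel, metres_per_pixel):
    """Map a GPS fix to (u, v) pixel coordinates near a reference point."""
    # Equirectangular approximation: scale longitude by cos(reference latitude).
    east_m = (EARTH_RADIUS_M * math.radians(lon - ref_lon)
              * math.cos(math.radians(ref_lat)))
    north_m = EARTH_RADIUS_M * math.radians(lat - ref_lat)
    u = ref_pixel[0] + east_m / metres_per_pixel
    v = ref_pixel[1] - north_m / metres_per_pixel  # image rows grow downward
    return u, v


# A vessel 0.0001 degrees north of the reference moves up by about 111 pixels
# at 0.1 m per pixel.
u, v = gps_to_pixel(37.9001, 23.7000, 37.9000, 23.7000, (640.0, 360.0), 0.1)
```

The approximation is adequate over the few hundred metres a drone frame typically covers; larger areas would need a proper map projection.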

2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG 版本更新

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

通过分布偏移下的分阶段偏好优化减少视觉-语言模型中的幻觉

Qinwu Xu

AI总结 本文提出分阶段偏好优化框架,通过构建针对幻觉问题的偏好数据,增强视觉-语言模型基于视觉依据的推理,减少幻觉并提高回答的信息量。

详情
AI中文摘要

幻觉仍然是视觉-语言模型(VLMs)面临的根本挑战:在联合概率建模下进行似然最大化,自回归生成可能产生语言上合理、但物理上不一致或缺乏视觉依据的回答。我们提出一种分阶段偏好优化框架,通过有针对性的多模态数据构建来减少幻觉。该方法不直接在通用指令跟随数据上优化,而是在已知的失败边界附近逐步构建聚焦幻觉的偏好对。框架重点关注模糊的空间朝向、物体关系、OCR不确定性以及对抗性错误前提训练。幻觉负样本由扰动极小但与视觉不一致的替代回答生成,使直接偏好优化(DPO)能更好地区分有依据的推理与貌似合理的幻觉。在开源基准和真实多模态评估场景上的实验表明,该方法提高了视觉依据的一致性,减少了幻觉,并给出信息量更大的有依据回答。跨模型定性评估进一步显示,在模糊空间推理和对抗性错误前提等设置下,所提出的多模态LLM DPO框架给出的回答比若干前沿专有VLM更有视觉依据。结果表明,幻觉可能不仅源于模型容量有限,还源于自回归概率生成的固有倾向:在视觉依据较弱时偏好语言上合理的续写。未来工作可探索物理一致性建模、不确定性感知的多模态推理,以及标准自回归解码之外的架构方案。

英文摘要

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.
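For context, the standard DPO objective that the abstract builds on can be written for a single preference pair as below. The sketch takes sequence-level log-probabilities as plain numbers; in the setting described, the chosen response would be the grounded answer and the rejected one its minimally perturbed hallucinated variant. The log-probability values and the `beta` default are illustrative, not taken from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


neutral = dpo_loss(-12.0, -12.5, -12.0, -12.5)    # policy equals the reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.5)   # policy prefers the grounded answer
regressed = dpo_loss(-14.0, -11.0, -12.0, -12.5)  # policy prefers the hallucination
```

Because the rejected response differs only slightly from the chosen one, the gradient concentrates on the tokens that carry the visual inconsistency, which matches the abstract's motivation for minimally perturbed negatives.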

2605.16401 2026-05-19 cs.CV cs.LG 版本更新

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

CADS:用于成本高效图像分类的保形自适应决策系统

Turkoglu Mikael, Bary Tim, Thielens Vincent, Dausort Manon, Macq Benoît

AI总结 CADS利用保形预测在运行时量化不确定性,将样本动态路由到由轻到重的模型级联,在提升效率与准确率的同时,计算成本最多可比重型模型推理低12倍。

Comments 6 pages, 2 figures, 1 table, Accepted at ICIP 2026

详情
AI中文摘要

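尽管高容量AI模型不断刷新最先进性能,但其实际部署常受限于高推理成本、环境影响,以及忽视样本复杂度差异的“一刀切”做法。例如在临床场景中,把计算资源浪费在常规病例上是实现可持续AI的一大障碍。本文提出保形自适应决策系统(CADS),这是一种顺序式多模型算法,根据估计的数据复杂度高效选用模型,以优化资源分配。CADS利用保形预测在运行时量化图像的不确定性,并提供一个有数学依据的框架来权衡成本与准确率:样本被动态路由经过一个模型级联,从轻量级的“Scout”模型到高容量的“Oracle”架构。在两个数据集上的验证表明,CADS兼具更高的效率和准确率,计算成本最多可比重型模型推理低12倍。通过依据实时复杂度准确路由样本,CADS在保证高诊断可靠性的同时,大幅降低AI的经济和环境足迹。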

英文摘要

While high-capacity AI models have advanced state-of-the-art performance, their practical deployment is often hindered by high inference costs, environmental impact, and a "one-size-fits-all" approach that ignores varying sample complexity. In clinical settings for instance, the waste of computational resources on routine cases is a significant barrier to sustainable AI. In this paper, we introduce the Conformal Adaptive Decision System (CADS), a sequential multi-model algorithm designed to optimize resource allocation by efficiently sampling models based on the estimated data complexity. CADS leverages conformal prediction to quantify image uncertainty at runtime. CADS provides a mathematically grounded framework for balancing the cost-accuracy dilemma that dynamically routes samples through a model cascade, ranging from lightweight "Scout" models to high-capacity "Oracle" architectures. Validated on two datasets, CADS demonstrated superior efficiency and accuracy at a computational cost that can be up to 12 times lower than heavy-model inference. By accurately routing samples based on real-time complexity, CADS ensures high diagnostic reliability while drastically reducing the economic and environmental footprint of AI.
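One way conformal prediction can drive such a cascade is sketched below. It uses split conformal prediction with the score 1 - p(true class) and escalates a sample whenever the cheap model's prediction set is not a single label. The escalation rule, the thresholds, and the stand-in models are assumptions for illustration; the abstract does not describe the actual CADS routing criterion.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true label) on calibration data."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    return scores[rank - 1] if rank <= n else float("inf")


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


def cascade_predict(x, models, thresholds):
    """Try models from cheapest to most expensive; return (label, model index)."""
    for idx, (model, threshold) in enumerate(zip(models, thresholds)):
        probs = model(x)
        confident = len(prediction_set(probs, threshold)) == 1
        if confident or idx == len(models) - 1:
            label = max(range(len(probs)), key=probs.__getitem__)
            return label, idx


# Stand-in models that read precomputed class probabilities from the sample.
scout = lambda x: x["scout"]
oracle = lambda x: x["oracle"]
threshold = conformal_threshold(
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]], [0, 0, 1, 0], alpha=0.25)
easy = {"scout": [0.95, 0.05], "oracle": [0.99, 0.01]}
hard = {"scout": [0.55, 0.45], "oracle": [0.10, 0.90]}
```

An empty prediction set is also treated as uncertain here, so both ambiguous and atypical samples reach the heavier model.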

2605.16399 2026-05-19 cs.CV cs.LG 版本更新

Stable and Near-Reversible Diffusion ODE Solvers for Image Editing

稳定且近可逆的图像编辑扩散ODE求解器

Barbora Barancikova, Daniil Shmelev, Cristopher Salvi

AI总结 本文提出近可逆Runge-Kutta方法以提升图像编辑的稳定性与精度,平衡可逆性与数值稳定性,保留背景保真优势。

详情
AI中文摘要

扩散模型的反演在图像编辑中起核心作用。代数可逆的ODE求解器消除了基于DDIM的编辑流程中固有的反演误差,为文本引导图像编辑中的扩散反演提供了有吸引力的途径。然而,实证结果表明仅有可逆性并不够:当编辑需要更大的语义或视觉变化时,可逆扩散求解器常出现不稳定,输出质量急剧下降。本文表明,精确可逆性与数值稳定性之间的权衡,在图像编辑中表现为背景保持与提示对齐之间的权衡。随后,我们研究了近可逆Runge-Kutta方法,将其作为精确可逆扩散格式的更稳定替代。与向量场平滑策略结合后,所得方法提高了编辑保真度,在大幅编辑下保持稳定,并在很大程度上保留了可逆求解器在背景保持方面的优势。

英文摘要

The inversion of diffusion models plays a central role in image editing. Algebraically reversible ODE solvers provide an appealing approach to diffusion inversion for text-guided image editing, by eliminating the inversion error inherent in DDIM-based editing pipelines. However, empirical results indicate that reversibility alone is insufficient. As edits require larger semantic or visual changes, reversible diffusion solvers often exhibit instabilities and suffer sharp drops in output quality. In this paper, we show that the trade-off between exact reversibility and numerical stability manifests empirically as a trade-off between background preservation and prompt alignment in image editing. We then investigate the use of near-reversible Runge-Kutta methods as a more stable alternative to exactly reversible diffusion schemes. When combined with a vector-field smoothing strategy, the resulting approach improves edit fidelity, remains stable under large edits, and largely retains the background-preservation benefits of reversible solvers.

2605.16396 2026-05-19 cs.CV cs.LG 版本更新

Beyond MMSE: Enhancing PnP Restoration with ProxiMAP

超越MMSE:通过ProxiMAP增强PnP修复

Kenta Vert, Giacomo Meanti, Scott Pesme, Michael Arbel, Julien Mairal

AI总结 本文提出ProxiMAP,通过调整噪声调度使去噪器保持分布内,实现更稳定的图像重建,适用于去模糊、补全、超分辨率和相位恢复等任务。

详情
AI中文摘要

即插即用(Plug-and-Play,PnP)方法用MMSE去噪器替代难以求解的最大后验(MAP)去噪器,已成为求解成像逆问题的标准工具。这种不匹配长期被视为不可避免,而近期工作试图借助扩散模型的分数来逼近MAP,以缩小这一差距。本文指出这在实践中存在问题:学习到的分数与真实分数并不一致,以MAP为目标的迭代因此会收敛到卡通化的图像而非真实感图像,在收敛之前停止反而效果更好。我们将这一观察转化为设计原则,提出ProxiMAP,一种迭代式MAP近似方法,其噪声调度使迭代点的残余噪声与去噪器的训练噪声保持匹配。这样去噪器始终工作在分布内,其分数可靠,同时带来隐式的提前停止,避免上述失败模式。ProxiMAP可作为模块直接替换标准PnP算法中的MMSE去噪器,并在去模糊、图像修补、超分辨率和相位恢复任务中一致地使重建结果更清晰。基于同一原则,我们还提出一种混合变体,仅在去噪器最可靠的PnP后期迭代中使用ProxiMAP,其效果与完全替换的变体相当或更优,而成本只是后者的一小部分。

英文摘要

Plug-and-Play (PnP) methods have become standard tools for solving imaging inverse problems by replacing the intractable maximum a posteriori (MAP) denoiser with the MMSE one. While this mismatch has been widely treated as unavoidable, recent works have sought to close this gap by targeting the MAP with diffusion-model scores. We show this is problematic in practice: learned scores do not match the true ones, so MAP-targeting iterations converge to cartoon-like images rather than realistic ones, and better results are obtained by stopping short of convergence. We turn this observation into a design principle and introduce ProxiMAP, an iterative MAP approximation whose noise schedule keeps the iterate's residual noise matched to the denoiser's training noise. This keeps the denoiser in-distribution where its score is reliable, and yields implicit early stopping that avoids the failure mode above. ProxiMAP is a modular drop-in replacement for MMSE denoisers in standard PnP algorithms and consistently sharpens reconstructions across deblurring, inpainting, super-resolution, and phase retrieval. Building on the same principle, we propose a hybrid variant that applies ProxiMAP only in the late iterations of PnP, where the denoiser is most reliable -- matching or exceeding the full-replacement variant at a fraction of the cost.

2605.16395 2026-05-19 cs.RO cs.LG 版本更新

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

OrbiSim:作为具身智能的可微物理引擎的世界模型

Jiajian Li, Jingyuan Huang, Junru Gong, Qi Wang, Xiaokang Yang, Yunbo Wang

AI总结 OrbiSim提出了一种新的机器人仿真范式,将世界模型重新定义为完全可微的物理引擎,通过统一的物理基础路径连接结构化场景资产、神经动力学和下游强化学习,提升预测精度和控制性能。

Comments Project page: https://jjleejj85.github.io/projects/orbisim

详情
AI中文摘要

我们提出了OrbiSim,一种新的机器人仿真范式,将世界模型重新定义为完全可微的物理引擎,用于具身智能。不同于以往专注于潜在域或视觉域中无约束想象的世界模型,OrbiSim建立了一个统一的、基于物理的路径,连接结构化场景资产、神经动力学和下游强化学习。通过在整个仿真循环中实现端到端的可微性——从显式状态转换到视觉观察生成——OrbiSim支持传统经典模拟器难以处理的任务,如可微接触建模、稀疏奖励下的基于梯度的策略优化和直观的物理推理。实证结果表明,OrbiSim在预测保真度和控制性能方面显著优于最先进的世界模型。此外,其对资产配置和物理参数的一致响应表明其作为增强机器人仿真和策略训练的可微工具的潜力。

英文摘要

We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically-grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop -- spanning from explicit state transitions to visual observation generation -- OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

2605.16392 2026-05-19 q-bio.QM cs.CV cs.LG 版本更新

Bridging the Modality Bottleneck in Pathology MIL through Virtual Molecular Staining

通过虚拟分子染色弥合病理MIL中的模态瓶颈

Yucheng Xing, Pei Liu, Jingying Ma, Ruping Hong, Jiangdong Qiu, Tianyu Liu, Kai He, Ling Huang, Mengling Feng

AI总结 本文提出MIST方法,通过虚拟分子染色提升病理MIL中投影层性能,改进240/256配置,平均提升3.5%,在生存预测、组织分型和生物标志物预测中分别提升5.2%、3.3%和2.6%。

详情
AI中文摘要

多实例学习(MIL)是计算病理学中全切片图像分析的主流框架,通常由冻结的图像块编码器、投影层和切片级聚合器组成。编码器和聚合器已得到广泛研究,而投影层在很大程度上仍是只利用形态学信息的瓶颈。这限制了生物标志物状态、生存等终点的预测,因为这些终点取决于H&E形态无法完全反映的分子状态。我们提出分子信息引导的染色变换(MIST),它可直接替换MIL投影层,仅在训练期间使用配对的空间转录组学数据来构建虚拟分子染色。MIST将基因表达谱聚类为跨模态原型,将其锚定在冻结的基础模型特征空间中,并利用它们沿分子引导的轴重新组织H&E图像块特征。它在推理阶段不需要转录组学数据,并可插入标准MIL聚合器之前。我们在23个下游任务和8种MIL聚合器上评估了MIST。与标准投影层相比,MIST在256种配置中的240种上取得提升,平均增益为+3.5%,且在各类终点上表现一致:生存预测+5.2%,组织亚型分型+3.3%,生物标志物预测+2.6%。消融实验证实,基因衍生的原型是增益的主要来源;空间、生物学和病理学分析表明,跨模态原型亲和度仅凭H&E即可捕捉空间上连贯的分子程序。

英文摘要

Multiple instance learning (MIL) is the dominant framework for whole-slide image analysis in computational pathology, typically combining a frozen patch encoder, a projection layer, and a slide-level aggregator. While encoders and aggregators have been extensively studied, the projection layer remains a largely morphology-only bottleneck. This limits endpoints such as biomarker status and survival, which are governed by a molecular state that is not fully captured by H&E morphology. We introduce Molecularly Informed Staining Transform (MIST), a plug-in replacement for the MIL projection layer that uses paired spatial transcriptomics only during training to construct virtual molecular stains. MIST clusters gene expression profiles into cross-modal prototypes, anchors them in the frozen foundation model feature space, and uses them to reorganize H&E patch features along molecularly guided axes. It requires no transcriptomics at inference and can be inserted before standard MIL aggregators. We evaluate MIST across 23 downstream tasks and 8 MIL aggregators. MIST improves 240 of 256 configurations over the standard projection layer, with an average gain of +3.5%, observed consistently across endpoint types: +5.2% on survival prediction, +3.3% on tissue subtyping, and +2.6% on biomarker prediction. Ablations confirm that gene-derived prototypes are the primary source of the gains, while spatial, biological, and pathological analyses show that cross-modal prototype affinities capture spatially coherent molecular programs from H&E alone.

2605.16391 2026-05-19 eess.SP cs.AI cs.LG cs.RO 版本更新

Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning

通过扩散生成学习克服MEMS惯性测量单元的固有性能限制

Jiarui Lv, Feng Zhu, Xiaohong Zhang

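AI summary This paper proposes a diffusion-based generative learning framework that synthesizes high-fidelity virtual IMU data from low-cost IMU measurements, improving positioning and attitude estimation, and shows that the model transfers to airborne mapping.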

详情
AI中文摘要

惯性测量单元(IMUs)是多源集成导航系统中的基本传感组件,其性能直接决定导航解的精度和可靠性。然而,低成本IMU的精度受硬件固有限制。最近,生成式人工智能在建模复杂数据分布和重建高保真信号方面表现出色。受此启发,我们提出了一种基于扩散的生成学习框架,用于从低成本IMU测量中合成高保真虚拟IMU数据。具体而言,基于U-Net架构构建了条件扩散模型,其中高等级IMU测量用作真值先验,低成本IMU测量作为条件输入。模型生成的虚拟IMU数据用于后续导航和定位任务。实验结果表明,生成的虚拟IMU数据在定位和姿态估计方面均显著优于原始低成本IMU测量。此外,我们将模型迁移到航空测绘实验中,所提出的方法产生了更薄且更一致的点云。总体而言,所提出的框架突破了低成本IMU的性能限制,并展示了基于扩散的生成学习在生成虚拟高等级IMU数据方面的潜力。

英文摘要

Inertial measurement units (IMUs) are fundamental sensing components in multi-source integrated navigation systems, and their performance directly determines the accuracy and reliability of solutions. However, the precision of low-cost IMUs is inherently constrained by hardware limitations. Recently, generative artificial intelligence has demonstrated remarkable capability in modeling complex data distributions and reconstructing high-fidelity signals. Motivated by this, we propose a diffusion-based generative learning framework for synthesizing high-fidelity virtual IMU data from low-cost IMU measurements. Specifically, a conditional diffusion model based on a U-Net architecture is constructed, where high-grade IMU measurements are utilized as ground-truth priors and low-cost IMU measurements are employed as conditional inputs. The virtual IMU data generated by the model is used for subsequent navigation and localization tasks. Experimental results demonstrate that the generated virtual IMU data significantly outperform the original low-cost IMU measurements in both positioning and attitude estimation. Furthermore, we transfer the model to airborne mapping experiments, where the proposed method produces thinner and more consistent point clouds. Overall, the proposed framework breaks the performance limits of low-cost IMU and demonstrates the potential of diffusion-based generative learning for virtual high-grade IMU data.

2605.16390 2026-05-19 cs.CV cs.LG stat.ML 版本更新

Inducing Spatial Locality in Vision Transformers through the Training Protocol

通过训练协议在视觉变换器中诱导空间局部性

Eduardo Santiago Toledo, Asael Fabian Martínez

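AI summary By comparing training protocols, the study finds that CutMix makes attention in the early layers of a Vision Transformer more local, lowering Mean Attention Distance (MAD), which suggests that CutMix promotes the emergence of local attention.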

详情
AI中文摘要

我们研究了是否可以通过训练协议在从头训练的视觉变换器(ViT)的早期层中诱导空间局部性,而无需大规模预训练。在CIFAR-10、CIFAR-100和Tiny-ImageNet上,我们比较了基线协议与现代协议(AutoAugment/ColorJitter、CutMix和Label Smoothing),通过均值注意力距离(MAD)和归一化熵来表征每个注意力头。在所有三个数据集中,现代协议在早期层产生更局部和更集中的注意力;在CIFAR-100上,最小MAD从0.316(基线)降至0.008(现代)。为了确定这种效果的来源,我们在CIFAR-100上进行了消融研究,分别添加或移除每个组件。结果表明CutMix是实验中的决定性组件:所有包含CutMix的条件均显示MAD为0.024,而所有不包含CutMix的条件仍保持在MAD 0.210。AutoAugment和Label Smoothing对局部性无独立影响。总体而言,这些发现表明,由CutMix诱导的从部分图像区域进行分类的压力,可以促进视觉变换器中局部注意力的出现。

英文摘要

We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
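
The abstract reports MAD values without giving the formula, so the following is only a minimal sketch of how a mean attention distance could be computed for a single head. It assumes patches laid out row-major on a square grid, Euclidean distance between patch positions, and normalization by the grid diagonal; the function name and the normalization are illustrative assumptions, not the authors' definition.

```python
import math

def mean_attention_distance(attn, grid):
    """Attention-weighted mean distance between query and key patches.

    attn: list of rows, attn[q][k] = weight query patch q puts on key patch k
          (each row sums to 1); patches are indexed row-major on a grid x grid layout.
    Returns a value in [0, 1]: distances are divided by the grid diagonal
    (an assumed normalization, not taken from the paper).
    """
    n = grid * grid
    diag = math.hypot(grid - 1, grid - 1) or 1.0
    total = 0.0
    for q in range(n):
        qr, qc = divmod(q, grid)
        for k in range(n):
            kr, kc = divmod(k, grid)
            total += attn[q][k] * math.hypot(qr - kr, qc - kc)
    return total / (n * diag)

grid = 4
n = grid * grid
# A perfectly local head attends only to its own patch; a uniform head spreads evenly.
local = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]
mad_local = mean_attention_distance(local, grid)
mad_uniform = mean_attention_distance(uniform, grid)
```

Under this definition a head that only attends to its own patch scores 0, and spreading attention over the image raises the value, which matches the direction of the reported drop from 0.316 to 0.008.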

2605.16380 2026-05-19 cs.LG cs.AI 版本更新

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

ReTAMamba:基于Mamba的可靠性感知时间聚合用于不规则临床时间序列预测

Jinwoong Kim, Sangjin Park

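AI summary ReTAMamba reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and integrates short- and long-term temporal information through Chronological Weaving, improving prediction on irregular time series.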

Comments 11 pages

详情
AI中文摘要

临床时间序列数据难以用常规方法建模,因其表现出不规则采样、频繁缺失值和变量异质性。现有方法通常使用观测掩码和时间间隔信息,但无法持续捕捉过去观测的衰减可靠性或在聚合过程中保持一致的时序上下文。为此,我们提出了Reliability-aware Temporal Aggregation with Mamba(ReTAMamba),将临床时间序列重建为时间变量标记序列,从缺失性和经过时间估计观测可靠性,并将区间总结与统计描述符相结合。通过时间编织整合短期和长期时间信息,并应用预算标记路由器约束序列长度同时保留信息性总结。在MIMIC-IV、eICU和PhysioNet 2012上的实验表明,ReTAMamba在强基线模型上一致提升了AUPRC,平均相对提升分别为7.51%、7.80%和10.15%。eICU的队列和患者层面分析显示,学习到的动态信号(如心率和血压)的均值衰减比相对静态信号(如实验室变量)大24.3%。这些发现表明,有效预测不规则临床时间序列需要建模不仅测量了什么,还要何时以及如何观测,包括信息新鲜度和观测及时性。

英文摘要

Clinical time-series data are difficult to model with methods designed for regular sequences because they exhibit irregular sampling, frequent missing values, and heterogeneous observation patterns across variables. Existing approaches commonly use observation masks and time-gap information, but they do not continuously capture the decaying reliability of past observations or consistently organize multi-resolution information within a coherent temporal context during aggregation. To address these limitations, we propose Reliability-aware Temporal Aggregation with Mamba (ReTAMamba), which reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and augments interval summaries with statistical descriptors. Chronological Weaving is used to integrate short- and long-term temporal information within a coherent temporal context, and a budgeted token router is applied to constrain sequence length while preserving informative summaries. Experiments on MIMIC-IV, eICU, and PhysioNet 2012 show that ReTAMamba consistently improves AUPRC over strong baselines, with average relative gains of 7.51%, 7.80%, and 10.15%, respectively. Cohort-level and patient-level analyses on eICU further showed that the learned mean decay for more dynamic signals, such as heart rate and blood pressure, was 24.3% larger than that for relatively static signals, such as laboratory test variables. These findings suggest that effective prediction in irregular clinical time series requires modeling not only what was measured, but also when and how it was observed, including information freshness and observation timeliness.
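
One way to picture "observation reliability from missingness and elapsed time" is an exponential decay applied to the last observed value. The sketch below assumes that form and a fixed per-variable decay rate; in the paper the decay is learned, and the function name and the rates used here are hypothetical.

```python
import math

def observation_reliability(times, observed, decay_rate):
    """Reliability of the last-known value of one variable at each time step.

    times: increasing timestamps (e.g. hours); observed: booleans, True where
    the variable was actually measured; decay_rate: assumed per-variable rate.
    Reliability is 1.0 at a fresh observation, decays as exp(-rate * elapsed)
    while the value is carried forward, and is 0.0 before any observation.
    """
    reliability = []
    last_seen = None
    for t, is_obs in zip(times, observed):
        if is_obs:
            last_seen = t
        if last_seen is None:
            reliability.append(0.0)
        else:
            reliability.append(math.exp(-decay_rate * (t - last_seen)))
    return reliability

times = [0.0, 1.0, 2.0, 5.0]
observed = [False, True, False, False]
heart_rate = observation_reliability(times, observed, decay_rate=0.5)   # dynamic signal
lab_value = observation_reliability(times, observed, decay_rate=0.05)   # slowly varying signal
```

A larger decay rate for dynamic signals such as heart rate means a stale reading loses trust faster than a stale laboratory value, which is the qualitative pattern the eICU analysis describes.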

2605.16379 2026-05-19 cs.LG cs.AI cs.IT math.IT 版本更新

An Information-Theoretic Criterion for Efficient Data Synthesis

一种信息论准则用于高效数据合成

Hanyu Li, Zhengqi Sun, Xiaotie Deng

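AI summary This paper proposes an information-openness criterion for the generation-training loop: synthetic data helps only when external signals inject task-relevant information, and efficiency and generalization then depend on the meta-level of supervision.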

Comments 12 pages. Camera-ready version for ICML 2026

详情
AI中文摘要

合成数据在大语言模型训练中变得至关重要,但其效果高度不一致。本文从信息论角度解释这种不一致:合成数据只有在生成-训练循环信息开放(即由外部信号塑造)时,才能提升模型性能。当循环信息封闭(依赖模型自身输出)时,数据处理不等式确保任务相关信息只能减少,导致崩溃。在信息开放管道中,效率和泛化依赖于元级监督:较粗的信号如二元正确性将所有可接受输出视为等同,因此其教导的行为不绑定特定领域或表层形式,能自然泛化到不同任务和领域。这些观察得出指导性论点:学习倾向于收敛到最信息高效的信号组件,当该组件为预期时加速学习,但当存在伪模式时导致奖励黑客。

英文摘要

Synthetic data has become crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.
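
The closed-versus-open distinction can be written out with the standard data processing inequality. The symbols below are assumptions introduced for illustration and are not taken from the paper: $T$ is the task-relevant variable, $M$ the current model, $E$ an external signal, and $D$ the synthetic data.

```latex
% Closed loop: D is generated from the model alone, so T -> M -> D is a Markov chain.
I(T; D) \le I(T; M)

% Open loop: D depends on the model and an external signal, so T -> (M, E) -> D.
I(T; D) \le I(T; M, E) = I(T; M) + I(T; E \mid M)
```

In the closed case the data can never carry more task-relevant information than the model already has. In the open case the bound grows by $I(T; E \mid M)$, which is positive exactly when the external signal tells the loop something about the task that the model does not already encode.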

2605.16378 2026-05-19 cs.LG cs.AI 版本更新

Mixing Times of Glauber Dynamics on Masked Language Models

掩码语言模型上Glauber动力学的混合时间

Suvadip Sana, Sami Wolf, Neer Mehta, Alina Shah, Aitzaz Shaikh, Janna Goodman, Lionel Levine

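AI summary The paper studies the global distributional behavior of masked language models under iterative generation, analyzes mixing times by modeling resampling as a Glauber dynamics Markov chain, and reveals a phase transition in mixing behavior across temperatures.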

Comments 21 pages, 7 figures

详情
AI中文摘要

掩码语言模型(MLMs)定义了令牌的局部条件分布,但通常不对应任何一致的序列联合分布。这提出了一个根本性问题:当此类条件在生成中迭代使用时,会诱导出何种全局分布行为?本文通过将迭代的掩码令牌重采样建模为离散令牌序列上的Glauber动力学马尔可夫链来回答这一问题。我们首先证明MLM条件本质上是不相容的:引入了一个矩形测试来验证这种不相容性,并实证验证其在现代MLM中的普遍性。然后我们对由此诱导的马尔可夫链进行了理论分析。在有限的跨令牌影响下,我们建立了高温度收缩结果,表明混合时间为O(n log n),其中n是序列长度。相反,在均匀局部边际条件下,链表现出 metastability,低温下缓慢逃离语义盆地。实证上,我们展示了混合行为随温度和序列长度的变化呈现相变,与理论预测一致。我们进一步通过语义轨迹表征诱导的平稳行为,识别出持久结构如长寿命陷阱和复发语义盆地,政治内容作为可测量的案例研究。

英文摘要

Masked language models (MLMs) define local conditional distributions over tokens but do not, in general, correspond to any consistent joint distribution over sequences. This raises a fundamental question: what global distributional behavior is induced when such conditionals are used iteratively for generation? We address this question by modeling iterative masked-token resampling as a Glauber dynamics Markov chain on the discrete space of token sequences. We first show that MLM conditionals are intrinsically incompatible: we introduce a rectangle test that certifies this incompatibility and empirically verify its prevalence across modern MLMs. We then provide a theoretical analysis of the induced Markov chain. Under bounded cross-token influence, we establish a high-temperature contraction result implying $O(n\log n)$ mixing time where $n$ is the sequence length. In contrast, we prove that under a uniform local margin condition, the chain exhibits metastability, with exponentially slow escape from semantic basins at low temperatures. Empirically, we demonstrate a phase transition in mixing behavior as a function of temperature and sequence length, consistent with the theoretical predictions. We further characterize the induced stationary behavior through semantic trajectories, identifying persistent structures such as long-lived traps and recurrent semantic basins, with political content serving as a measurable case study.
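
As a rough illustration of iterative masked-token resampling viewed as Glauber dynamics, the sketch below resamples one uniformly chosen position per step from a conditional distribution sharpened by temperature. The `toy_conditional` function is a hypothetical stand-in for an MLM (it simply favors agreeing with neighboring tokens), and raising probabilities to the power 1/T is the usual equivalent of dividing logits by T; none of this is the authors' code.

```python
import random

def glauber_step(seq, conditional, temperature, rng):
    """One Glauber update: pick a position uniformly, resample its token
    from the (temperature-scaled) conditional given all other positions."""
    i = rng.randrange(len(seq))
    probs = conditional(seq, i)                      # dict: token -> probability
    tokens = list(probs)
    weights = [probs[t] ** (1.0 / temperature) for t in tokens]
    new_seq = list(seq)
    new_seq[i] = rng.choices(tokens, weights=weights, k=1)[0]
    return new_seq

def toy_conditional(seq, i):
    """Hypothetical stand-in for an MLM: favors agreeing with neighbors."""
    neighbors = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
    scores = {tok: 1.0 + 4.0 * neighbors.count(tok) for tok in ("a", "b")}
    z = sum(scores.values())
    return {tok: s / z for tok, s in scores.items()}

rng = random.Random(0)
state = ["a", "b", "a", "b", "a", "b"]
for _ in range(200):
    state = glauber_step(state, toy_conditional, temperature=1.0, rng=rng)
```

Lowering `temperature` makes the neighbor-agreeing token increasingly dominant, so the chain tends to linger in all-"a" or all-"b" configurations, a small-scale analogue of the slow escape from basins that the abstract describes at low temperature.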

2605.16377 2026-05-19 cs.DL cs.AI cs.LG 版本更新

CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

CheckSupport:一种基于本地LLM的自动化手稿提交检查清单选择与完成工具

Satvik Tripathi, Don Enwerem, Kevin Song, Kristian Quevada, Jacinta Arnold, Tessa S. Cook

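AI summary This paper presents CheckSupport, which uses local LLMs to automate the selection and completion of reporting checklists, supporting more transparent and reproducible scientific reporting. Using a staged prompting strategy on CPU-only hardware, it reaches 90% accuracy for checklist recommendation and 88% for item-level completion, at an average of 12.5 seconds per manuscript.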

详情
AI中文摘要

透明和标准化的报告对于可重复的科学研究至关重要,但因手动选择和完成检查清单的劳动强度,遵循报告指南仍不一致。我们提出了CheckSupport,一种开源、本地可部署的系统,利用大语言模型自动化推荐报告检查清单并完成清单。CheckSupport采用分阶段提示策略,将报告流程分解为受约束的推理任务,优先提取忠实信息而非生成文本合成。所有推理均在本地使用指令调优模型完成,保护数据隐私并实现可重复、可审计的工作流程。在同行评审手稿语料库上评估,CheckSupport在清单推荐上达到90%的整体准确率,在项目级完成上达到88%的整体准确率,运行在仅CPU硬件上。平均而言,每篇手稿的墙钟时间为12.5秒,包括检查清单推荐和完整检查清单完成。这些结果表明,当大语言模型作为结构化推理组件应用时,可以减少报告负担,支持跨学科更透明和可重复的科学研究报告。

英文摘要

Transparent and standardized reporting is essential for reproducible scientific research, yet adherence to reporting guidelines remains inconsistent because of the manual effort required to select and complete checklists. We present CheckSupport, an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models, preserving data privacy and enabling reproducible, auditable workflows. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion while operating on CPU-only hardware. On average, the wall-clock time per manuscript was 12.5 seconds, including the checklist recommendation and full checklist completion. These results demonstrate that large language models, when applied as structured inference components, can reduce reporting burden and support more transparent and reproducible scientific reporting across disciplines.

2605.16376 2026-05-19 eess.IV cs.CV cs.DC cs.LG cs.MM 版本更新

Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

Kelvin v1.0:一种用于H.264的神经预编码器:一种符合标准的学得预处理程序,在UVG上实现-27.62%的BD-VMAF

Marco Graziano

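AI summary Kelvin v1.0 applies content-adaptive pixel adjustments ahead of an unmodified libx264 encoder and reports a mean BD-VMAF of -27.62% on UVG and -27.70% on MCL-JCV (two diagnosable failures removed) relative to baseline libx264, handling the non-differentiability of H.264 with a hybrid codec proxy.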

详情
AI中文摘要

Kelvin是一种轻量级学得预编码器,位于未修改的libx264编码器之前。它应用内容自适应的像素调整,每个通道限制在±1/255以内,使编码器将比特分配到最需要感知的区域,同时输出兼容所有现有解码器、播放器和CDN的标准H.264位流。在七序列1080p UVG基准上,Kelvin v1.0实现平均BD-VMAF为-27.62%(7/7胜),BD-VMAF-NEG为-5.18%(6/7胜)。在30序列MCL-JCV公开数据集上,相同检查点在28/30片段上胜过基准libx264,去除两个可诊断失败后,平均BD-VMAF为-27.70%,与UVG一致。核心工程挑战是H.264的非可微性:我们描述了一种混合编码器代理,结合校准的可微率估计器(与真实libx264的每像素比特数斯皮尔曼_rho=0.986)和在真实编码器输出上训练的U-Net失真代理。我们发布完整的每序列率失真数据,MCL-JCV上的命名失败模式分类(率下限违规、分布偏移、指标饱和),以及五个基准的合理性面板(hqdn3d、unsharp、-tune psnr、-tune ssim、x265 medium),并诚实定位:x265 medium在相同数据集上每项指标均胜过Kelvin。因此,Kelvin是为在H.264上保持是约束而非选择的工作负载设计的。

英文摘要

Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.
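
The +/-1/255 per-channel bound can be illustrated with a simple clamp. The sketch below assumes pixel values normalized to [0, 1] and a hard clip on the network's raw residual; the abstract does not say how Kelvin enforces the bound, so the clipping choice, the function name, and the sample values are all assumptions.

```python
def apply_bounded_adjustment(pixels, residual, bound=1.0 / 255.0):
    """Add a learned residual to pixel values in [0, 1], clamping each
    per-channel change to +/- bound and the result to the valid range."""
    out = []
    for p, r in zip(pixels, residual):
        delta = max(-bound, min(bound, r))
        out.append(max(0.0, min(1.0, p + delta)))
    return out

pixels = [0.0, 0.5, 1.0, 0.25]
residual = [-0.2, 0.001, 0.3, -0.01]    # raw network output (hypothetical values)
adjusted = apply_bounded_adjustment(pixels, residual)
```

Because every channel moves by at most one 8-bit step, the adjusted frame stays visually close to the source while still changing what the downstream encoder sees.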

2605.16375 2026-05-19 cs.LG cs.NI 版本更新

M$^2$FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices

M$^2$FedAQI: 多模态联邦学习用于异构边缘设备上的空气质量预测

Manjil Nepal, Kimsie Phan, Tamoghna Ojha, Aritra Dutta, M Krishna Siva Prasad

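AI summary This paper proposes M$^2$FedAQI, a multimodal federated framework that fuses visual and tabular modalities for air quality prediction on heterogeneous edge devices. Experiments show gains over existing approaches in Accuracy, AUC, F1-score, and $R^2$, lower MAE and RMSE, efficient resource use on edge devices, and TLS-based authentication for secure communication.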

详情
AI中文摘要

准确的空气质量预测对公共健康、环境监测和工业安全至关重要。然而,现有方法多依赖集中学习范式,导致分布式物联网环境中可扩展性、隐私保护和通信开销等问题。此外,当前基于联邦学习(FL)的解决方案大多使用单模态数据,限制了其捕捉复杂环境模式的能力。为解决这些限制,我们提出M$^2$FedAQI,一种轻量级多模态联邦框架,用于在异构边缘设备上进行去中心化空气质量指数(AQI)预测。所提出的框架通过基于特征调制的融合机制整合视觉和表格模态,实现高效的跨模态交互,同时保持低计算开销。M$^2$FedAQI在PM25Vision和TRAQID两个基准数据集上进行评估,针对分类和回归任务,在集中式和联邦学习设置下进行测试。实验结果表明,M$^2$FedAQI在准确率、AUC、F1-score和R²等指标上均优于现有方法,达到最高11.0%的准确率提升,3.53%的AUC提升,12.2%的F1-score提升和18.0%的R²提升,同时将MAE和RMSE分别降低25.4%和20.4%。此外,在异构边缘设备上的部署显示了在通信开销、内存足迹和计算成本方面的高效资源利用率。为增强通信安全,采用TLS认证机制,确保客户端参与的安全性并保护联邦学习通信通道免受未经授权第三方访问,而无需修改底层联邦学习协议。

英文摘要

Accurate air quality prediction is essential for public health, environmental monitoring, and industrial safety. However, most existing approaches rely on centralized learning paradigms, which introduce challenges related to scalability, privacy preservation, and communication overhead in distributed Internet of Things (IoT) environments. Moreover, current federated learning (FL) based solutions predominantly utilize unimodal data, limiting their capability to capture complex environmental patterns. To address these limitations, we propose M$^2$FedAQI, a lightweight multimodal federated framework for decentralized Air Quality Index (AQI) prediction across heterogeneous edge devices. The proposed framework integrates visual and tabular modalities through a feature modulation based fusion mechanism that enables efficient cross-modal interaction while maintaining low computational overhead. M$^2$FedAQI is evaluated on two benchmark datasets, PM25Vision and TRAQID, for both classification and regression tasks under centralized and federated settings. Experimental results demonstrate that M$^2$FedAQI consistently outperforms existing approaches, achieving improvements of up to 11.0% in Accuracy, 3.53% in AUC, 12.2% in F1-score, and 18.0% in $R^2$, while reducing MAE and RMSE by up to 25.4% and 20.4%, respectively, compared with the strongest baselines. Furthermore, deployment on heterogeneous edge devices demonstrates efficient resource utilization in terms of communication overhead, memory footprint, and computational cost. To enhance communication security, TLS-based authentication is incorporated to ensure secure client participation and protect the FL communication channel from unauthorized third-party access without modifying the underlying FL protocol.

2605.16374 2026-05-19 cs.LG cs.AI 版本更新

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

丢失或隐藏?监督连续学习中的概念层面遗忘

Katarzyna Filus, Kamil Faber, Roberto Corizzo, Christopher Kanan

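AI summary This paper proposes a diagnostic framework that uses sparse autoencoders to analyze concept-level forgetting, finding that a significant part of forgetting stems from changes in representational accessibility rather than complete information erasure.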

详情
AI中文摘要

持续学习研究模型如何在适应新任务的同时保留先前知识。尽管已有多种方法缓解灾难性遗忘,但该领域仍以性能为导向,缺乏对视觉模型表征空间中遗忘本质的理解。本文提出利用稀疏自编码器定义任务锚定的潜在特征空间,分析任务特定信息在更细粒度下的演变。我们分解遗忘为显性概念删除、可恢复性和解码性。结果显示,大量看似丢失的概念信息在线性假设下可恢复,而随着任务增加,概念解码性下降。总体而言,我们的发现表明,概念层面遗忘主要归因于表征可访问性变化而非完全信息擦除。

英文摘要

Continual learning studies how models can adapt to new tasks while retaining previously acquired knowledge. Although a broad spectrum of methods has been proposed to mitigate catastrophic forgetting, the field remains predominantly performance-driven, with limited insight into what forgetting actually corresponds to within the vision model's representation space. Prior work has primarily analyzed forgetting through task-level performance or coarse measures of representational drift, without disentangling output-level accessibility from changes in finer-grained internal structure. To this end, we propose a diagnostic framework that leverages Sparse Autoencoders (SAEs) to define a task-anchored latent feature space, enabling analysis of how task-specific information evolves at a finer granularity, where individual SAE latents are treated as concept proxies for recurring and relatively disentangled visual patterns in the model's internal computations. Within this framework, we decompose forgetting into apparent concept deletion, recoverability, and decodability. We show that a large portion of seemingly lost concept-level information can often be recovered under a linearity assumption, with concept decodability degrading as more tasks are introduced. Overall, our findings suggest that a significant part of concept-level forgetting can be attributed to changes in representational accessibility rather than complete information erasure.

2605.16373 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

跨源监督在双模态PET-CT骨感染分割中的应用

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

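AI summary This paper proposes a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through early fusion to segment bone infections under annotation discrepancy, and evaluates it with patient-level 3D volumetric evaluation and cross-validation to avoid inflated performance estimates.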

详情
AI中文摘要

早期和准确诊断骨感染及病变定位对临床治疗至关重要。PET-CT结合了CT的解剖信息和PET的代谢信息,是诊断骨感染的重要成像模态。然而,由于病变边界不清晰和不同专家或自动化系统生成的标注不一致,准确的病变分割仍具挑战性。本文研究了在标注不一致下的多模态分割。我们开发了一个双模态端到端分割框架,通过早融合多模态表示整合PET代谢信号和CT骨窗解剖信息。为了缓解小数据集中小切片相关性导致的性能膨胀,本研究弃用传统二维评估方法,采用严格的患者级3D体积评估和交叉验证。此外,我们提出了一种解耦的双源学习框架,其中并行模型在由高灵敏度和高特异性临床意图驱动的独立专家标注上进行训练。实验结果客观报告了患者级性能变化(均值±标准差和均值-标准差),证明了多模态PET-CT融合的有效性。交叉评估矩阵定量揭示了模型如何成功内化不同的专家诊断哲学,提供了一种稳健且保持多样性的临床AI部署范式,用于骨感染分割。

英文摘要

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

2605.16372 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench:评估转向图像表示的正交性

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

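AI summary This paper introduces SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks, with new evaluation notions of cross-concept robustness and collateral damage. A linear SVM shows superior separability and orthogonality but fails to achieve zero collateral damage, often trailing sparse autoencoders.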

详情
AI中文摘要

在推理时间对模型表示进行干预以校正预测对于AI可解释性和安全性至关重要,但现有评估协议局限于模糊的语言建模任务。为填补这一空白,我们引入SwordBench,一个用于评估视觉模型在多个backbone和概念移除任务中转向表示的基准。除了统一的基准测试套件外,我们还提出了新的评估概念,揭示了概念激活向量正交性对实用转向的二次影响。具体而言,交叉概念鲁棒性衡量在针对替代概念正交化输入上概念检测性能的稳定性,而collateral damage量化在缺乏偏见的输入上转向是否意外影响下游任务的模型性能。我们发现尽管线性支持向量机在分离性和正交性上表现优异,但无法实现零collateral damage,通常落后于稀疏自编码器。在更简单的环境中,标准基线和优化方法均无法实现完美的转向。源代码将很快在GitHub上发布。

英文摘要

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
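
Orthogonalization among concept vectors and linear concept removal both reduce to projecting out a direction. The following is a minimal sketch under that assumption; the vectors are made-up three-dimensional examples and the helper names are hypothetical, so it illustrates the geometry only, not the benchmark's actual pipeline.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(h, v):
    """Remove the component of h along direction v (linear concept removal)."""
    scale = dot(h, v) / dot(v, v)
    return [a - scale * b for a, b in zip(h, v)]

concept_a = [1.0, 1.0, 0.0]      # hypothetical concept activation vectors
concept_b = [1.0, 0.0, 0.0]
concept_a_orth = project_out(concept_a, concept_b)   # a made orthogonal to b

h = [2.0, 3.0, 4.0]              # hypothetical image representation
h_steered = project_out(h, concept_a_orth)
```

Steering along the orthogonalized direction removes concept A from `h` while leaving its projection onto concept B unchanged, which is the kind of second-order effect that cross-concept robustness is meant to measure.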

2605.16365 2026-05-19 cs.LG cs.DB 版本更新

Machine Learning-Based Pre-Test Risk Stratification for PCR-Confirmed Chlamydia Using Patient-Reported Data and Urine Biomarkers

基于机器学习的PCR确诊衣原体感染检测前风险分层:利用患者报告数据和尿液生物标志物

Mehrab Mahdian, Marko Lehes, Katrin Krolov, Tamas Pardy

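AI summary The study evaluates machine-learning models for pre-test risk stratification of PCR-confirmed Chlamydia trachomatis infection, combining patient-reported data and urine biomarkers to help prioritize testing in screening.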

详情
AI中文摘要

早期识别沙眼衣原体(Chlamydia trachomatis)感染高风险个体,有助于在资源有限的筛查中优化分子检测的使用。本文评估了利用机器学习模型进行检测前风险分层(PTRS)的可行性,模型使用常规可得的非侵入性临床数据进行训练。我们分析了一个经过整理、包含93个尿液样本并带有PCR参考标签的数据集,使用三个特征组:患者报告的病史和症状、来自标准尿液分析的尿液生物标志物,以及二者的组合。评估了五个监督分类器,使用分层五折交叉验证和折外概率估计。性能通过受试者工作特征曲线下面积(AUC)和阈值依赖的指标评估,不确定性通过自助法置信区间量化。仅使用患者报告数据的模型显示出中等判别能力(AUC最高达0.72)。基于尿液生物标志物的模型显示出略低的峰值判别能力但更一致的性能,集成方法表现出最强的结果。结合特征组略微提高了峰值AUC并减少了模型间的性能变异,表明鲁棒性有所提升。研究结果表明,尿液生物标志物为PTRS提供了可靠的预测信号,与患者报告信息互补,而特征整合增强了鲁棒性。本研究支持将非侵入性、常规可得的信息整合到筛查流程中,包括去中心化或居家PCR检测情境,以优化检测优先级。

英文摘要

Early identification of individuals at elevated risk of Chlamydia trachomatis infection may enable optimal use of molecular testing in resource-aware screening. We evaluate the feasibility of pre-test risk stratification (PTRS) using machine-learning models trained on routinely available, non-invasive clinical data. A curated dataset of 93 urine samples with PCR reference labels was analyzed using three feature groups: patient-reported history and symptoms, urine biomarkers from standard urinalysis, and their combination. Five supervised classifiers were evaluated using stratified 5-fold cross-validation with out-of-fold probability estimates. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-dependent metrics, with uncertainty quantified via bootstrap confidence intervals. Models using only patient-reported data showed moderate discrimination (AUC up to 0.72). Urine biomarker-based models demonstrated slightly lower peak discrimination but more consistent performance, with ensemble methods yielding the strongest results. Combining feature groups marginally increased the peak AUC and reduced performance variability across models, indicating improved robustness. Findings indicate that urine biomarkers provide a reliable predictive signal for PTRS that is complementary to patient-reported information, while feature integration enhances robustness. This work supports the integration of non-invasive, routinely available information for PTRS into screening workflows, including decentralized or home-based PCR contexts, to optimize testing prioritization.
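
The evaluation pairs AUC with bootstrap confidence intervals. The sketch below shows one common way to do this, a percentile bootstrap over out-of-fold scores, using only the Python standard library. The labels and scores are synthetic placeholders rather than the study's data, and the percentile method is an assumption since the abstract does not name the bootstrap variant.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; resamples that lack a class are skipped."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]      # synthetic, not the study's data
scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1, 0.3, 0.8, 0.6, 0.35]
point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores)
```

With only 93 samples, intervals of this kind are typically wide, which is why the abstract reports uncertainty alongside the point estimates.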

2605.16363 2026-05-19 cs.LG cs.CY 版本更新

ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage

ORACLE:从流式应用使用的部分轨迹中预见诈骗

Wenbo Gao, Songbai Tan, Zhongan Wang, Fei Shen, Gang Xu, Huiping Zhuang, Yunyun Yang, Ming Li, Xiaofeng Zhu

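AI summary This paper proposes ORACLE, a framework that anticipates scams from partial streaming app-usage trajectories, using a self-evolving context manager and an on-policy self-distillation scheme to improve early scam anticipation.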

详情
AI中文摘要

智能手机诈骗日益普遍,通常表现为多阶段、跨应用过程,意图逐渐显现。有效的干预需要在意图明确前预见诈骗,这极具挑战性,因为决策必须依赖部分轨迹和时间分布的证据。本文提出ORACLE在线推理框架,首个针对流式应用使用轨迹的早期诈骗预见框架。我们构建了一个现实世界长周期基准,涵盖12种诈骗类型,平均15天,涉及95种应用,交织正常与诈骗行为。为解决碎片化证据,我们引入自适应上下文管理器,动态整合实体中心交互,提升跨时间证据重建能力。为增强对潜在早期信号的敏感度,我们提出一种在线自蒸馏方案,教师模型基于总结的反诈骗反思和线索监督学生模型。实验表明,该方法在真实流式场景中有效提升早期诈骗预见,及时预警并减少误报。

英文摘要

Smartphone scams are increasingly prevalent and typically manifest as multi-stage, cross-application processes with gradually emerging intent. Effective intervention thus requires anticipating scams before the intent becomes explicit. This is inherently challenging, as decisions must rely on partial trajectories with temporally distributed evidence. In this paper, we propose ORACLE (Online Reasoning for Anticipating Cross-temporal Latent thrEats), the first agentic framework for early scam anticipation from streaming app-usage trajectories. To support this setting, we curate a real-world long-horizon benchmark of streaming app-usage trajectories, covering 12 scam types, spanning extended periods (15 days on average), involving diverse applications (95 apps), and interleaving normal and scam behaviors. To address fragmented evidence, we introduce a self-evolving context manager that adaptively consolidates entity-centric interactions over time, enabling more effective reconstruction of cross-temporal evidence from partial observations. To enhance sensitivity to latent early-stage signals, we propose an on-policy self-distillation scheme in which a teacher model, conditioned on summarized anti-scam reflections and clues by skills, supervises a student model without access to such reflections. This scheme thereby distills evidence-informed knowledge and improves recognition of emerging fraud patterns from partial trajectories. Experiments show that ORACLE consistently improves early scam anticipation, yielding timely warnings while reducing false alerts in realistic streaming scenarios.

2605.16361 2026-05-19 cs.LG cs.AI stat.ML 版本更新

TailedTS: Benchmark Dataset for Heavy-Tailed Time Series Prediction and Periodicity Quantification

TailedTS:用于重尾时间序列预测和周期性量化的大规模基准数据集

Xinyu Chen, HanQin Cai, Lijun Ding, Jinhua Zhao

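AI summary TailedTS is a benchmark dataset for testing time series forecasting models under heavy-tailed, zero-inflated, and non-Gaussian conditions. A sparse autoregression framework shows that frequently viewed pages have weaker periodic structure, and the dataset provides standardized prediction benchmarks under non-Gaussian loss functions.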

详情
AI中文摘要

我们介绍了TailedTS,一个基于2024年全年维基百科每小时页面浏览观测数据的大规模基准数据集,专门用于测试时间序列预测模型在重尾、零膨胀和非高斯条件下的性能。该数据集包含约246.9亿个数据点,每月覆盖约300万个不同的维基百科页面,以高效的Apache Parquet格式存储。维基百科流量遵循明显的幂律分布,其中约5%的页面贡献了超过70%的总浏览量,为检验模型在极端波动下的鲁棒性提供了一个自然且严谨的测试环境,而这类条件在M4、M5和UCI电力数据集等现有基准中缺失或代表性不足。TailedTS支持多个研究任务:首先,我们引入了一个基于带稀疏性和非负约束的稀疏自回归的周期性量化框架,揭示高浏览量页面的周期性结构显著弱于低浏览量页面,这对大型数字平台的服务器分配和流量预测有直接意义。其次,我们提供了在一系列非高斯损失函数下的标准化预测基准,包括$\ell_1$范数、Huber、分位数和$\ell_p$范数损失,表明基于高斯的标准估计器在高流量页面类别中性能显著下降,而鲁棒替代方案在所有流量规模上均提供一致的提升。TailedTS可在https://doi.org/10.5281/zenodo.17070469公开获取。

英文摘要

We present TailedTS, a large-scale benchmark dataset derived from Wikipedia hourly page view observations throughout 2024, specifically designed to test time series forecasting models under heavy-tailed, zero-inflated, and non-Gaussian conditions. The dataset comprises approximately 24.69 billion data points spanning roughly 3 million unique Wikipedia pages per month, stored in high-efficiency Apache Parquet format. Wikipedia traffic follows a pronounced power-law distribution where roughly 5% of pages account for over 70% of total page views, creating a natural and rigorous testbed for model robustness against extreme volatility, a condition that is absent from or underrepresented in existing benchmarks such as the M4, M5, and UCI electricity datasets. TailedTS enables several research tasks. First, we introduce a periodicity quantification framework based on sparse autoregression with sparsity and non-negativity constraints, revealing that frequently viewed pages exhibit significantly weaker periodic structure than their less-viewed counterparts, with direct implications for server allocation and traffic forecasting on large digital platforms. Second, we provide standardized prediction benchmarks evaluated under a suite of non-Gaussian loss functions, including $\ell_1$-norm, Huber, quantile, and $\ell_p$-norm losses, demonstrating that standard Gaussian-based estimators degrade substantially on high-volume page categories, while robust alternatives provide consistent gains across all traffic scales. TailedTS is publicly available at https://doi.org/10.5281/zenodo.17070469.
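
To make the contrast between Gaussian-style and robust objectives concrete, the sketch below implements textbook per-sample forms of the Huber, quantile (pinball), and $\ell_p$ losses next to the squared error. The threshold `delta`, the quantile `q`, the exponent `p`, and the example errors are illustrative choices, not settings from the benchmark.

```python
def huber_loss(error, delta=1.0):
    """Quadratic near zero, linear in the tails."""
    a = abs(error)
    return 0.5 * error * error if a <= delta else delta * (a - 0.5 * delta)

def quantile_loss(error, q=0.5):
    """Pinball loss; error = target - prediction."""
    return q * error if error >= 0 else (q - 1.0) * error

def lp_loss(error, p=1.5):
    return abs(error) ** p

errors = [0.5, -0.5, 100.0]           # last entry mimics a heavy-tailed traffic spike
squared = [e * e for e in errors]
huber = [huber_loss(e) for e in errors]
pinball_90 = [quantile_loss(e, q=0.9) for e in errors]
```

A single spike contributes 10000 to the squared error but only 99.5 to the Huber loss, which is why estimators built on squared error are pulled around by the few pages that dominate traffic.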

2605.16360 2026-05-19 cs.LG cs.AI 版本更新

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV:跨模型代理剪枝用于高效长上下文LLM推理

Junjie Li, Jiong Lou, Jie Li

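AI summary ProxyKV is a cross-model proxy pruning framework that addresses the KV cache memory bottleneck in long-context LLM inference by offloading importance scoring to a lightweight small-model proxy, balancing accuracy against scoring cost and speeding up prefilling.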

详情
AI中文摘要

高效长上下文推理在大型语言模型(LLM)中受到键值(KV)缓存内存瓶颈的严重限制,而现有剪枝方法在低延迟启发式和高精度重建方法之间做出取舍。为弥合评分成本与精度之间的差距,我们提出了ProxyKV,一种跨模型代理剪枝框架,将重要性评分卸载到轻量级的同族小型模型代理上,该代理异步执行于大型模型目标。为弥合异构模型之间的架构差距,我们设计了HybridAxialMapper,将时间特征提取与跨头对齐解耦,并设计了多粒度混合损失,将学习目标从刚性回归转向相对排名一致性。在Llama-3.1、Qwen-2.5和Qwen-3家族上,针对LongBench、SCBench和RULER等基准测试,ProxyKV在聚合层面(恢复约98.7%的平均精度)与KVZip相当,同时在Llama-3.1-8B上实现了高达3.21倍的预填充加速(双GPU;约1.5倍共享单GPU),并在Qwen-2.5-7B上支持高达170k tokens的上下文长度。

英文摘要

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost--accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering $\sim$$98.7\%$ of its mean accuracy) while delivering up to a $3.21\times$ prefilling speedup on Llama-3.1-8B (dual-GPU; $\sim$$1.5\times$ shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B.

2605.16358 2026-05-19 cs.LG cs.AI 版本更新

LEAF: A Living Benchmark for Event-Augmented Forecasting

LEAF:一个用于事件增强预测的活体基准

Mingtian Tan, Mihir Parmar, Palash Goyal, Chun-Liang Li, Nanyun Peng, Thomas Hartvigsen, Jinsung Yoon, Tomas Pfister

AI总结 本文提出LEAF,首个用于事件增强预测的活体基准,通过递归检索代理系统和双代理交叉验证,提供全面相关文本辅助预测,评估LLM在复杂真实场景中的预测能力。

Comments 12 tables, 6 figures, 39 pages

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于预测。为了评估这一能力并缓解预训练数据污染,已提出几种活体基准。然而,现有基准要么因数据稀缺缺乏多维事件,要么聚焦于相对封闭环境。为评估LLM在复杂真实场景中的预测能力,我们提出LEAF,首个用于事件增强预测任务的活体基准,包括未来事件概率、趋势和时间序列预测。LEAF利用递归检索代理系统配以双代理交叉验证,提供全面相关辅助文本。评估最新专有和开源LLMs发现,这些模型能利用复杂事件提取的信号提升预测性能。在股票领域,发现LLM在自信识别为更可预测的股票上表现更好。此外,事件与目标股票呈现强相关性。为此,LEAF提供必要的动态更新测试环境,持续跟踪和推动事件驱动预测任务的进步。

英文摘要

Large Language Models (LLMs) are increasingly applied to forecasting. To evaluate this capability while mitigating pre-training data contamination, several living benchmarks have been proposed. However, existing benchmarks either lack the multidimensional events essential for accurate forecasting due to data scarcity, or focus on relatively closed environments. To assess the predictive capabilities of LLMs in complex, real-world scenarios, we propose LEAF, the first living benchmark for event-augmented forecasting tasks, including future event probabilities, trend and time series forecasting. LEAF utilizes a recursive retrieval agent system paired with dual-agent cross-validation to provide comprehensive and relevant auxiliary text for forecasting. Evaluating state-of-the-art proprietary and open-weight LLMs, we find that these models can leverage signals extracted from complex events to enhance predictive performance. In the stock domain, we find that LLMs achieve better performance on equities they confidently identify as more predictable. Furthermore, the events demonstrate a strong correlation with the target equities. To this end, LEAF provides a necessary, dynamically updating testbed to continuously track and drive progress in event-driven forecasting tasks.

2605.16354 2026-05-19 cs.LG cs.AI cs.CL cs.HC stat.ML 版本更新

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

通过LLM裁判增强人类评估:你真的需要多少人类评审?

Jane Paik Kim

AI总结 本文提出通过LLM作为辅助裁判来增强人类评估,通过两阶段抽样设计确定人类和LLM评审样本量,以实现目标统计功效。

Comments 10 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作AI系统的自动评估者,包括在高风险应用中。在这一角色中,LLMs用于生成关于模型输出质量、适当性甚至安全性的判断。这种做法受到实际限制的驱动。专家人类评分成本高且难以扩展,而LLM评分可以快速低成本地生成。然而,当前部署LLM评估者的方法是随意的,通常仅限于报告人类和LLM裁判之间的一致性度量作为替代人类评分的正当性,且缺乏正式的研究设计基础。本文(1)将LLM裁判的角色从替代性转为辅助性,并(2)将LLM作为裁判范式制定为通过两阶段抽样设计增强人类评估的一种方法,其中在第一阶段对所有观察进行LLM评估,在第二阶段对子样本进行部分人类评分。我们提出使用来自缺失数据文献的双重鲁棒估计器,利用预测模型的鲁棒性属性,因为缺失性模型是设计已知的。使用该估计器的渐近方差,我们提出如何确定人类和LLM评分的样本量以达到目标统计功效。我们还展示通过分配更多人类评分给LLM评分预测性不高的评估类型,可以高效地设计研究。据我们所知,关于在验证基准时应保留多少人类监督的指导非常有限。

英文摘要

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
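The two-stage design can be sketched with the standard augmented inverse-probability-weighted (doubly robust) estimator of a mean. This is the generic textbook form, assumed here for illustration; the paper's exact estimator and notation may differ, and the names `predictions`, `sampled`, and `probs` are hypothetical.

```python
def doubly_robust_mean(predictions, human_ratings, sampled, probs):
    """Estimate the mean human rating from a two-stage sample.

    predictions:   model-based prediction of the human rating for every item,
                   computed from the LLM rating (stage 1, all items).
    human_ratings: human rating per item, or None where not collected.
    sampled:       True where the item was selected for human rating (stage 2).
    probs:         known selection probability for each item (fixed by design).
    """
    n = len(predictions)
    total = 0.0
    for m, y, r, p in zip(predictions, human_ratings, sampled, probs):
        total += m  # prediction term, available for all items
        if r:
            total += (y - m) / p  # inverse-probability-weighted residual
    return total / n
```

Because the selection probabilities are known by design, the weighted residual term corrects the prediction term in expectation even when the prediction model is poor. In the test case below, every prediction is 0.5 while every observed human rating is 1.0, and the estimate is still 1.0.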

2605.16352 2026-05-19 cs.IR cs.AI cs.LG 版本更新

LARGER: Lexically Anchored Repository Graph Exploration and Retrieval

LARGER: 词典锚定的仓库图探索与检索

Yuntong Hu, Tongli Su, Liang Zhao, Bowen Zhu, Hasibul Haque

AI总结 LARGER通过词典锚定的结构化定位方法提升代码仓库文件定位精度,实现测试生成和代码库理解任务的性能提升。

详情
AI中文摘要

仓库级别的编码代理必须首先定位与任务相关的文件和符号;此阶段的失败会影响从补丁生成到测试编写和代码库问答的下游目标。现有代理主要通过词汇搜索导航仓库,常遗漏结构关系如导入、调用链、类型层次和代码-测试链接。基于图的检索可恢复此类依赖,但现有方法常需要单独的图工具或遍历阶段,打断代理的交互循环。我们正式将仓库上下文定位定义为词典锚定的结构化定位,其成功取决于将词汇匹配转化为高精度的结构入口点,并在代理现有搜索循环中暴露最有用的置信度过滤局部邻域。我们引入LARGER(词典锚定的仓库图探索与检索),一种以词汇锚定的主动集检索框架,从词汇匹配开始,将其对齐到图锚点,并在代理现有搜索循环中执行置信度过滤的局部扩展。LARGER直接集成到现有CLI编码代理中,无需外部图数据库或专用图接口。在四个涵盖定位、测试生成和代码库理解的基准测试中,LARGER在LocBench上通过调整超参数将文件级Acc@5提升13.9点,即使在固定超参数下仍比最强基线提升11.8点,并在MuLocBench、SWE-Atlas测试编写和SWE-Atlas代码库问答任务上提供一致的提升。

英文摘要

Repository-level coding agents must first localize the files and symbols relevant to a task; failures at this stage can cascade across downstream objectives ranging from patch generation to test writing and codebase question answering. Existing agents navigate repositories primarily through lexical search, often missing structural relations such as imports, call chains, type hierarchies, and code-test links. Graph-based retrieval can recover such dependencies, but existing approaches often require separate graph tools or traversal stages that fragment the agent's interaction loop. We formalize repository context localization as Lexically Anchored Structural Localization, where success depends on turning lexical matches into high-precision structural entry points and exposing the most useful confidence-filtered local neighborhoods within the agent's existing search loop. We introduce LARGER (Lexically Anchored Repository Graph Exploration and Retrieval), a lexically anchored active-set retrieval framework that starts from lexical matches, aligns them to graph anchors, and performs confidence-filtered local expansion within the agent's existing search loop. LARGER integrates directly into existing CLI coding agents without requiring external graph databases or specialized graph interfaces. Across four benchmarks spanning localization, test generation, and codebase understanding, LARGER improves file-level Acc@5 on LocBench by +13.9 points with tuned hyperparameters and still gains +11.8 points with fixed hyperparameters over the strongest baseline, while delivering consistent gains on MuLocBench, SWE-Atlas Test Writing, and SWE-Atlas Codebase QA.

2605.16351 2026-05-19 cs.LG cs.AI 版本更新

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

PIMSM:基于物理的多尺度Mamba用于在分布偏移下稳定的神经表示

Sangyoon Bae, Shinjae Yoo, Jiook Cha

AI总结 本文提出PIMSM,一种基于物理的多尺度Mamba架构,通过时间尺度对齐提升科学基础模型在分布偏移下的鲁棒性和表示稳定性,实验证明其在fMRI和气象预测中的有效性。

Comments 9 pages, 2 figures

详情
AI中文摘要

科学基础模型旨在在数据集、获取协议和部署领域变化时重用表示,但许多序列骨干网络将科学时间结构视为无约束模式进行拟合。本文认为,这忽略了自然动力系统的核心特性:神经和大气时间序列由跨多个物理时间尺度的相互过程组织,未能保留这种多尺度结构会加剧分布偏移下的脆性。本文将这种失败模式正式定义为时间核不匹配,即模型使用与信号物理时间尺度无关的有效记忆策略拟合分布内动态,导致表示漂移和转移性能下降。本文提出物理约束的多尺度Mamba(PIMSM),一种状态空间架构,将频域估计的过渡点(膝频)映射到尺度特定的离散化参数并锚定到获取时间单位。在人类连接组计划fMRI上,PIMSM在严重时间上下文截断、极端低资源转移和静息态到任务态泛化中提升了鲁棒性和表示稳定性。在无需模态特定适应的情况下,相同架构在Weather-5K持出站空间分布外预测中取得了所有报告范围和变量的最低变量加权MAE。这些结果支持时间尺度对齐作为科学基础模型的实用归纳偏置,这些模型必须在部署偏移下保持结构,而非仅拟合相关性。

英文摘要

Scientific foundation models are expected to reuse representations under changes in dataset, acquisition protocol, and deployment domain, yet many sequence backbones treat scientific temporal structure as an unconstrained pattern to be fitted. We argue that this misses a central property of natural dynamical systems: neural and atmospheric time series are organized by interacting processes across multiple physical timescales, and failure to preserve this multiscale structure contributes to brittleness under distribution shift. We formalize this failure mode as temporal kernel mismatch, where a model fits in-distribution dynamics with an effective memory policy that is not anchored to the signal's physical timescales, leading to representation drift and degraded transfer. We propose Physics-Informed Multi-Scale Mamba (PIMSM), a state-space architecture that maps spectrum-estimated transition points between frequency regimes (knee frequencies) to scale-specific discretization parameters and anchors them to acquisition time units. On Human Connectome Project fMRI, PIMSM improves robustness and representation stability under severe temporal-context truncation, extreme low-resource transfer, and resting-state-to-task-state generalization. Without modality-specific adaptation, the same architecture also attains the lowest variable-wise MAE across all reported horizons and variables on Weather-5K held-out-station spatial out-of-distribution forecasting. These results support temporal-scale alignment as a practical inductive bias for scientific foundation models that must preserve structure, not only fit correlations, under deployment shift.

2605.16350 2026-05-19 cs.LG cs.AI 版本更新

Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation

联邦嵌套学习:为测试时间适应性协作训练自参考记忆

Hong Chen, Pengcheng Wu, Yuanguo Lin, Peilin Zhao, Xiuze Zhou, Fan Lin, Han Yu

AI总结 本文提出FedNL框架,通过嵌套优化系统实现协作学习优化规则,提升非独立同分布数据下的推理与检索性能,保持恒定推理内存。

详情
AI中文摘要

我们从嵌套学习视角重新思考联邦学习(FL),将核心挑战定为如何协作学习优化规则而非静态模型,以应对非独立同分布客户端数据。为此,我们提出联邦嵌套学习(FedNL),一种新的框架,将FL重新表述为三层嵌套优化系统。FedNL嵌入基于Titans的线性注意力机制到FL中,使客户端能够通过将delta规则视为在线梯度步骤进行轻量级零样本测试时间适应。在非独立同分布MMLU和长上下文基准测试中,FedNL在短上下文推理中取得竞争性性能,增强了长上下文检索和流式交叉熵的性能,并保持恒定的推理内存。

英文摘要

We rethink Federated Learning (FL) from a nested learning perspective, framing the core challenge as how to collaboratively learn optimization rules, not just static models, to tackle Non-IID client data. To address this, we propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system. FedNL embeds Titans-based linear attention into FL, enabling clients to perform lightweight, zero-shot test-time adaptation by treating a delta rule as an online gradient step. Experiments on Non-IID MMLU and long-context benchmarks show that FedNL achieves competitive performance in short-context reasoning, enhances the performance of long-context retrieval and streaming Cross-Entropy, and maintains constant inference memory.
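The phrase "treating a delta rule as an online gradient step" can be made concrete with a small sketch. It assumes an associative memory matrix `M` that maps a key `k` to a value `v` and the squared-error loss 0.5 * ||M k - v||^2. The actual Titans-based memory update used in FedNL may include further terms, so this shows only the generic delta rule.

```python
def delta_rule_step(M, k, v, lr):
    """One delta-rule update, i.e. one gradient step on 0.5 * ||M k - v||^2.

    M is a list of rows (d_v x d_k), k a key of length d_k, v a value of length d_v.
    The gradient with respect to M is the outer product (M k - v) k^T.
    """
    pred = [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]
    err = [p_i - v_i for p_i, v_i in zip(pred, v)]
    return [
        [m_ij - lr * e_i * k_j for m_ij, k_j in zip(row, k)]
        for row, e_i in zip(M, err)
    ]


def read_memory(M, k):
    """Retrieve the value currently associated with key k."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]
```

With a unit-norm key and `lr = 1`, a single step stores the pair exactly, so `read_memory` returns `v`. Smaller learning rates move the stored value only part of the way, which is what allows adaptation at test time without a separate training loop.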

2605.16349 2026-05-19 cs.LG 版本更新

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

MoE专业化中的几何不对称性:功能去相关与表征重叠

Feilong Liu

AI总结 研究揭示MoE专业化中功能去相关与表征重叠共存的几何结构,通过实验表明路由稀疏性影响功能分离与子空间分歧。

详情
AI中文摘要

混合专家(MoE)架构通过稀疏路由实现可扩展容量,但专家专业化的几何结构仍不明确。本文提出统一的雅可比-PCA-格拉斯曼框架,分析MoE层在函数空间和表征空间中的结构。在预训练的MoE Transformer(Mistral、Qwen)中发现一致的结构不对称:专家表现出强功能去相关(跨专家雅可比对齐始终较低,接近零),而其路由表征占据不同但部分重叠的子空间。这表明功能去相关与表征重叠共存而非重合。受控路由实验进一步表明,路由稀疏性似乎是塑造此几何结构的关键因素:top-k路由导致更尖锐的功能分离和更大的子空间分歧,而完全软路由产生更纠缠的专家结构。这些结果表明,MoE层可视为在共享表征流形的重叠子流形上实现局部去相关的算子,并提供了研究现代Transformer架构中条件计算的通用诊断框架。

英文摘要

Mixture-of-Experts (MoE) architectures achieve scalable capacity through sparse routing, yet the geometric structure of expert specialization remains poorly understood. We introduce a unified Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space and representation space. Across pretrained MoE Transformers (Mistral, Qwen), we find a consistent structural asymmetry: experts exhibit strong functional decorrelation (consistently low, near-zero cross-expert Jacobian alignment) while their routed representations occupy distinct but partially overlapping subspaces. This indicates that functional decorrelation and representation overlap coexist rather than coincide in MoE specialization. Controlled routing experiments further indicate that routing sparsity appears to be a key factor shaping this geometry: top-k routing induces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. Together, these results suggest a geometric interpretation in which MoE layers may be viewed as implementing locally decorrelated operators over overlapping submanifolds on a shared representation manifold, and provide a general diagnostic framework for studying conditional computation in modern Transformer architectures.

2605.16348 2026-05-19 cs.LG cs.AI 版本更新

Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field

Flow-Direct: 通过非参数指导场实现反馈高效且可重用的流模型指导

Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong

AI总结 本文提出Flow-Direct框架,通过持续指导场提升流模型的反馈效率和可重用性,利用累积的奖励评估样本构建非参数估计器,实现高效优化和多目标样本生成。

详情
AI中文摘要

免训练指导使预训练的扩散和流模型能够利用外部黑盒奖励函数的反馈来优化应用特定的目标。然而,现有方法反馈效率低,因为奖励反馈仅临时用于指导局部梯度近似或离散搜索决策,随后被丢弃。为解决这一限制,我们提出Flow-Direct框架,通过持续的指导场引导生成过程。理论上,该指导场是从基础分布与奖励加权目标分布的对数密度比解析推导得出的;它将预训练分布传输到目标分布。实践中,该场被实现为一个由所有累积奖励评估样本构建的非参数估计器。随着优化过程中样本的增加,该经验指导场变得越来越准确。这种持续性的形式化表述带来了两个主要优势。首先,Flow-Direct具有高度的反馈效率:因为每个评估样本都用于细化全局指导场,没有奖励信息被浪费。其次,该框架具有自然的可重用性:一旦优化完成,收集的数据集定义了一个可重用的指导场,用于生成新的目标样本而无需额外的奖励评估,并且不同的指导场可以结合以生成同时满足多个目标的样本。

英文摘要

Training-free guidance enables pre-trained diffusion and flow models to optimize application-specific objectives using feedback from external black-box reward functions. However, existing methods are feedback-inefficient because reward feedback is used only transiently to inform a localized gradient approximation or a discrete search decision, and is subsequently discarded. To address this limitation, we propose Flow-Direct, a framework that guides the generation process via a persistent guidance field. Theoretically, this guidance field is analytically derived from the log-density ratio between the base and reward-weighted target distributions; it transports the pre-trained distribution to the target distribution. In practice, the field is implemented as a non-parametric estimator constructed from all accumulated reward-evaluated samples. As more samples are collected during optimization, this empirical guidance field becomes increasingly accurate. This persistent formulation yields two major advantages. First, Flow-Direct is highly feedback-efficient: because every evaluated sample is used to refine the global guidance field, no reward information is wasted. Second, the framework is naturally reusable: once optimization is complete, the collected dataset defines a reusable guidance field for generating novel target samples without additional reward evaluations, and distinct guidance fields can be combined to generate samples that simultaneously satisfy multiple objectives.

2605.16347 2026-05-19 cs.LG 版本更新

HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support

HPC-LLM: 实用领域适应与检索增强生成用于HPC支持

Nourin Shahin, Izzat Alsmadi

AI总结 本文提出HPC-LLM,通过检索增强和领域适应支持HPC工作流,利用QLoRA实现轻量级领域适应,实验表明其在低资源下性能接近大模型。

详情
AI中文摘要

现代科学研究日益依赖高性能计算(HPC)基础设施,但许多研究者在与集群环境、作业调度器、GPU资源和并行计算框架交互时面临显著操作障碍。通用大语言模型(LLMs)提供有用的编码帮助,但通常缺乏可靠HPC支持所需的领域特定操作知识。本文提出HPC-LLM,一个检索增强和领域适应的助手,支持常见的HPC工作流,包括Slurm调度、MPI执行、GPU利用、文件系统管理和集群故障排除。所提出的框架整合了自动化文档摄入、密集检索、轻量级领域适应使用QLoRA和本地推理,形成模块化编排管道。为支持领域适应,我们从公开可用的大学HPC文档、精选的操作示例和从检索HPC内容生成的合成指令-回答对构建了HPC导向的语料库。所得到的数据集包含约9,000至24,000个HPC相关的训练示例,涵盖作业调度、GPU计算、分布式训练、存储系统和集群管理主题。我们使用QLoRA微调Llama 3.1 8B,并在JetStream2基础设施上评估结果模型,与多个开源权重基线在检索增强设置下进行比较。实验结果表明,适应的8B模型在显著较低的GPU内存需求和推理延迟下,性能接近大幅更大的通用模型。特别是,适应的模型在性能上接近Qwen 2.5 14B,同时需要显著较少的计算资源。

英文摘要

Modern scientific research increasingly depends on High-Performance Computing (HPC) infrastructures, yet many researchers face significant operational barriers when interacting with cluster environments, job schedulers, GPU resources, and parallel computing frameworks. General-purpose large language models (LLMs) provide useful coding assistance but often lack the domain-specific operational knowledge required for reliable HPC support. This paper presents HPC-LLM, a retrieval-augmented and domain-adapted assistant designed to support common HPC workflows including Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The proposed framework integrates automated documentation ingestion, dense retrieval, lightweight domain adaptation using QLoRA, and local inference within a modular orchestration pipeline. To support domain adaptation, we construct an HPC-oriented corpus from publicly available university HPC documentation, curated operational examples, and synthetic instruction-answer pairs generated from retrieved HPC content. The resulting dataset contains approximately 9,000 to 24,000 HPC-focused training examples spanning job scheduling, GPU computing, distributed training, storage systems, and cluster administration topics. We fine-tune Llama 3.1 8B using QLoRA and evaluate the resulting model against several open-weight baselines under retrieval-augmented settings on JetStream2 infrastructure. Experimental results indicate that the adapted 8B model achieves performance comparable to substantially larger general-purpose models while operating with significantly lower GPU memory requirements and inference latency. In particular, the adapted model approaches the performance of Qwen 2.5 14B while requiring substantially fewer computational resources.

2605.16346 2026-05-19 cs.LG cs.AI cs.CR 版本更新

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

PropGuard: 通过传播感知探索与修复保障LLM-MAS

Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Xiaozhe Zhang, Litian Zhang

AI总结 PropGuard通过构建双视角时空图,结合响应中心风险评估与全状态证据保存,实现对LLM-MAS中传播性攻击的检测与修复,有效降低攻击成功率并保持高任务防御效率。

详情
AI中文摘要

基于大型语言模型的多智能体系统(LLM-MAS)通过角色专业化、工具使用、记忆和协作推理,已成为解决复杂任务的有前景范式。然而,这些交互带来了新的安全风险:通过消息、工具或记忆注入的恶意指令可能在智能体和轮次之间传播,导致系统级失陷。现有防御主要依赖局部过滤或基于图的异常检测,但往往无法追踪细粒度传播路径或在不干扰良性协作的情况下修复污染状态。我们提出PropGuard,一种传播感知框架,用于保障LLM-MAS。PropGuard构建了双视角时空图,结合响应中心风险评估与全状态证据保存。受这些风险先验引导,一个经GE-GRPO训练的检查器依次探索全状态图,以恢复紧凑的可疑传播子图。PropGuard随后通过子图感知诊断验证有害传播,并应用源引导修复以纠正上游污染并重放受影响的下游交互。在四个通信架构和五个攻击设置上的实验表明,PropGuard在持续降低攻击成功率的同时保持高任务级防御成功率,实现了有利的效果-效率权衡。

英文摘要

LLM-based multi-agent systems (LLM-MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks: malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system-level compromise. Existing defenses largely rely on local filtering or graph-based anomaly detection, but they often fail to trace fine-grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation-aware framework for safeguarding LLM-MAS. PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO-trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task-level defense success, achieving a favorable effectiveness--efficiency trade-off.

2605.16345 2026-05-19 cs.LG cs.AI 版本更新

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

面向目标的监督学习用于大语言模型微调

Shijun Li, Kaiwen Dong, Xiang Gao, Joydeep Ghosh

AI总结 本文提出目标条件监督学习(GCSL)框架,通过将反馈信号作为显式目标,利用监督学习生成高质量响应,改进了传统监督微调和直接偏好优化的局限性。

详情
AI中文摘要

大型语言模型通常需要微调以更好地与用户意图对齐。现有方法可分为在线和离线范式。在线方法如基于强化学习的对齐方法可直接优化结果质量,但通常依赖外部奖励模型和迭代滚动,使其成本高且难以部署。离线方法更高效,但现有方法如监督微调(SFT)和直接偏好优化(DPO)仍有局限:SFT通常将分级反馈转化为二元监督,而DPO依赖配对偏好数据,此类数据往往不可用或昂贵。本文提出目标条件监督学习(GCSL)作为大语言模型的离线微调框架。我们的核心思想是将反馈信号直接视为显式目标,并通过监督学习训练模型生成实现该目标的响应。为更好地利用分级反馈,我们进一步引入一种新的目标公式,将学习定义为持续追求超过目标质量阈值的成果,而非模仿选定高质量子集中的样本。此设计通过显式引导模型学习质量的定向进步,缓解了SFT和经典GCSL的有限学习效应。我们还提出了自然语言目标表示,以更好地利用大语言模型的语义理解和推理能力。我们在三个任务上评估了我们的方法:非毒性生成、代码生成和推荐系统中的大语言模型。结果表明,我们的方法在保持监督学习的效率、可扩展性和简单数据需求的同时,始终优于标准离线微调基线。

英文摘要

Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment, can directly optimize outcome quality but typically rely on external reward models and iterative rollouts, making them costly and difficult to deploy in many cases. Offline methods are more efficient, but prevailing approaches such as supervised fine-tuning (SFT) and direct preference optimization (DPO) remain limited: SFT typically collapses graded feedback into binary supervision, while DPO depends on paired preference data that is often unavailable or expensive to construct. In this paper, we propose goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs. Our core idea is to treat feedback signals directly as an explicit goal and train the model, purely through supervised learning, to generate responses that achieve that goal. To better exploit graded feedback, we further introduce a novel goal formulation that defines learning as consistently pursuing outcomes above a target quality threshold, rather than imitating samples from a selected high-quality subset. This design mitigates the bounded-learning effect of SFT and classic GCSL by explicitly guiding the model to learn the directional progression of quality. We also propose natural-language goal representations to better leverage the semantic understanding and reasoning capabilities of LLMs. We evaluate our method on three tasks: non-toxic generation, code generation, and LLM for recommendation. Results show that our approach consistently outperforms standard offline fine-tuning baselines while retaining the efficiency, scalability, and simple data requirements of supervised learning.

2605.16344 2026-05-19 cs.IR cs.LG 版本更新

A Production-Ready RL Framework for Personalized Utility Tuning with Pareto Sweeping in Pinterest Recommender Systems

面向Pinterest推荐系统的一种生产级强化学习框架:用于个性化效用调节的帕累托扫描

Yichu Zhou, Mehdi Ben Ayed, Lin Yang, Jiacong He, Andreanne Lemay, Jiaye Wang, Jaewon Yang, Josie Zeng, Dhruvil Deven Badani, Yijie Dylan Wang, Jiajing Xu, Charles Rosenberg

AI总结 本文提出PRL-PUTS框架,通过将效用调节转化为价值基强化学习问题,实现个性化效用权重调节,利用帕累托前沿扫描可视化性能并支持实时政策更新,提升了推荐系统用户参与度。

详情
AI中文摘要

大规模推荐系统通过将多个预测结果整合为单一效用评分来编码多目标权衡。尽管该效用层可独立于排序器更新,但权重调节仍主要依赖手动调整,具有全局应用、适应性差、难以随优先级变化而管理等问题。本文提出PRL-PUTS,一种生产级、与排序器无关的强化学习框架,用于个性化效用权重调节的帕累托扫描。我们将效用调节视为一个一步价值基强化学习问题:给定请求上下文,智能体选择一个效用权重向量,重新加权排序器预测以最大化请求级参与奖励。为可视化跨权衡谱的性能并允许决策者实时更新部署策略,我们采用推理时的帕累托前沿扫描通过标量参数,生成一组策略和一个经验帕累托前沿,作为操作策略选择的治理工具。PRL-PUTS与排序推理并行运行,不增加服务延迟。我们通过离线分析使用无偏探索日志和在线实验验证PRL-PUTS,结果显示在Pinterest Homefeed上,PRL-PUTS相比基线方案显著提高了参与度,如成功会话增加0.13%。

英文摘要

Large-scale recommenders encode multi-objective trade-offs by combining multiple predicted outcomes into a single utility score. Although this utility layer can be updated independently of the ranker, weight tuning remains largely manual, globally applied, slow to adapt to changing environments and business needs, and hard to govern as priorities shift. We propose PRL-PUTS, a Production-ready, ranker-independent RL framework for Personalized Utility-weight Tuning with Pareto Sweeping. We cast utility tuning as a one-step, value-based RL problem: given request context, an agent selects a utility-weight vector that re-weights ranker predictions to maximize request-level engagement rewards. To visualize performance across the trade-off spectrum and allow decision makers to update the deployed operating policy instantly, we adopt inference-time Pareto frontier sweeping via a scalarization parameter, producing a family of policies and an empirical Pareto frontier used as a governance artifact for operating policy selection. PRL-PUTS runs in parallel with ranking inference without adding serving latency. We validate PRL-PUTS with offline analysis using unbiased exploration logs and with online experiments on Pinterest Homefeed, where PRL-PUTS showed significant increases in engagement compared with the baseline, such as a +0.13% increase in successful sessions, a core metric for user engagement.
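A rough sketch of the utility layer and the scalarized selection described above, under stated assumptions: two objectives, a small set of candidate weight vectors, and per-candidate value estimates supplied by some learned critic. All names here (`utility_score`, `select_weights`, `q_values`, `lam`) are hypothetical and do not come from the PRL-PUTS system.

```python
def utility_score(predictions, weights):
    """Combine the ranker's predicted outcomes into one utility score."""
    return sum(w * p for w, p in zip(weights, predictions))


def select_weights(candidates, q_values, lam):
    """Pick the candidate weight vector with the best scalarized value.

    q_values[i] holds the estimated value of candidate i for two objectives;
    lam in [0, 1] is the scalarization parameter swept at inference time.
    """
    def scalarized(i):
        return (1.0 - lam) * q_values[i][0] + lam * q_values[i][1]

    best = max(range(len(candidates)), key=scalarized)
    return candidates[best]


def rank_items(items, weights):
    """Order items by utility under the selected weights (highest first)."""
    return sorted(
        items,
        key=lambda item: utility_score(item["preds"], weights),
        reverse=True,
    )
```

Sweeping `lam` from 0 to 1 and recording the resulting outcome pairs would trace an empirical trade-off curve. The ranker's predictions are never recomputed, only re-weighted.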

2605.16343 2026-05-19 cs.LG cs.AI 版本更新

LoopQ: Quantization for Recursive Transformers

LoopQ: 递归变换器的量化

Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

AI总结 本文提出LoopQ框架,针对递归变换器的量化挑战,通过激活缩放、选择性变换等方法提升模型精度与效率。

详情
AI中文摘要

Looped语言模型(LoopLMs)通过递归重用Transformer块提高参数效率,但在训练后量化(PTQ)中易出现脆弱性。本文首次系统研究LoopLMs的量化问题,识别出三个挑战:角色间的分布偏移、循环转换中的状态重用以及递归误差累积。提出LoopQ框架,通过共享量化主干和轻量级适应,结合激活缩放、选择性变换、跨循环状态对齐和轨迹感知优化,减少循环内的分布不匹配和跨循环的误差累积。实验表明,在W4A4量化下,LoopQ在七个基准测试中平均下游准确率提升68.8%,平均困惑度降低87.7%。

英文摘要

Looped language models (LoopLMs) improve parameter efficiency by recursively reusing Transformer blocks, enabling deeper computation under a fixed model size. However, this reuse makes LoopLMs more fragile under post-training quantization (PTQ). We present the first systematic study of quantization in LoopLMs and identify three challenges: distribution shift across roles, state reuse across loop transitions, and recursive error accumulation. To address these challenges, we propose LoopQ, a loop-aware PTQ framework that preserves a shared quantized backbone while introducing lightweight adaptations. LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization to reduce distributional mismatch within loops and error accumulation across loops. Experiments across seven benchmarks show that, under W4A4 quantization, LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.

2605.16342 2026-05-19 cs.LG cs.AI cs.CL 版本更新

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO:去噪感知的信用分配用于扩散语言模型中的强化学习

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

AI总结 本文提出DACA-GRPO,通过引入去噪进度评分和分层掩码似然,改进扩散语言模型中强化学习的信用分配,提升数学推理、代码生成等任务性能。

详情
AI中文摘要

扩散大语言模型是自回归模型的有力替代品,但现有强化学习方法将所有去噪步骤视为同等重要,并依赖于有偏、高方差的似然估计。我们识别出两个根本性弱点:去噪轨迹中缺乏时间信用分配,以及用于策略优化的均场似然估计存在系统偏差。为了解决这些问题,我们提出了Denoising-Aware Credit Assignment for GRPO(DACA-GRPO),一种轻量级、即插即用的增强方法,适用于任何GRPO风格的训练器。DACA-GRPO引入了两个互补机制:去噪进度评分,从中间预测中提取每token的重要性权重,无需额外前向成本;分层掩码似然,将token位置分为层次,使每个token在大部分序列作为上下文的情况下进行预测,从而减少均场偏差。在三种GRPO基础方法上应用DACA-GRPO,使其在七个基准测试中取得一致提升,涵盖数学推理、代码生成、约束满足和受约束生成等任务,在数学推理中提升达5.6个百分点,在代码生成中提升7.4个百分点,在约束满足中提升36.3个百分点,在JSON schema符合性中提升5.9个百分点。

英文摘要

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.
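The idea behind Stratified Masking Likelihood, partitioning token positions into strata so that each token is scored with most of the sequence visible, can be sketched as follows. The interleaved partition and the `[MASK]` placeholder are assumptions made for illustration; the paper's actual stratification scheme may differ.

```python
def stratify_positions(seq_len, num_strata):
    """Partition positions 0..seq_len-1 into interleaved strata."""
    return [list(range(s, seq_len, num_strata)) for s in range(num_strata)]


def masked_views(tokens, num_strata, mask_token="[MASK]"):
    """Build one masked copy of the sequence per stratum.

    In each view only the positions of one stratum are hidden, so a token is
    predicted with roughly (num_strata - 1) / num_strata of the sequence as context.
    Returns a list of (masked_tokens, hidden_positions) pairs.
    """
    views = []
    for stratum in stratify_positions(len(tokens), num_strata):
        hidden = set(stratum)
        masked = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
        views.append((masked, stratum))
    return views
```

Each position is hidden in exactly one view, so summing per-token log-probabilities over the views would score every token once. This is in contrast to masking the entire sequence in a single pass.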

2605.16341 2026-05-19 cs.LG 版本更新

Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization

Orth-Dion:消除分布式低秩谱优化中的几何不匹配

Tatsuhiro Nakamori, Laura Gomezjurado Gonzalez, Ganesh Talluri, Ansh Tiwari, Hideyuki Kawashima, Ioannis Mitliagkas, Guillaume Rabusseau, Hiroki Naganuma

AI总结 Orth-Dion通过替换列归一化为右因子的QR正交化,解决分布式低秩谱优化中的几何不匹配问题,实现与Dion相同通信成本下的最优收敛率。

Comments 24 pages, 3 figures, 11 tables

详情
AI中文摘要

低秩梯度压缩通过用秩-r因子表示更新来减少分布式训练中的通信开销。Dion是一种近似Muon(一种正交化动量的谱优化器)的方法,通过一次幂迭代后进行列归一化(将右因子的每一列重新缩放为单位长度)。这使其兼容完全分片数据并行训练,但收敛速度比全秩谱方法更慢。我们证明这种差距是几何性的:列归一化并未产生Muon隐式目标的秩-r极因子,因此所得到的方向违反了低秩谱几何的对偶范数约束,即使梯度的低秩近似准确,收敛率仍多了一个√r因子。同样的不匹配也影响了平滑项和误差反馈递归的分析,从而对经验性能产生连锁影响。我们提出Orth-Dion,其将列归一化替换为右因子的QR正交化。在非欧几里得平滑性下,设L_r为沿秩-r方向的曲率常数,Orth-Dion获得收敛率O(√(L_r/T)),在与Dion相同的每步通信成本下达到与精确谱方法相同的性能。证明通过自洽的固定点论证消除了先前误差反馈分析中常见的有界漂移假设,并使用时间平均收缩,仅要求误差序列平均收缩而非每一步都收缩。在大规模语言模型预训练实验中验证了预测的√r缩放,并显示Orth-Dion在Dion的通信成本下关闭了与Muon的收敛差距。

英文摘要

Low-rank gradient compression reduces communication in distributed training by representing updates with rank-$r$ factors. Dion is a recent method that approximates Muon, a spectral optimizer that orthogonalizes momentum, using one step of power iteration followed by column normalization (rescaling each column of the right factor to unit length). This makes it compatible with fully sharded data parallel training, but it converges more slowly than full-rank spectral methods. We show that this gap is geometric: column normalization does not yield the rank-$r$ polar factor that Muon implicitly targets, so the resulting direction violates the dual-norm constraint of the low-rank spectral geometry, and the rate picks up an extra factor of $\sqrt{r}$ even though the low-rank approximation of the gradient itself is accurate. The same mismatch enters the smoothness term and the error-feedback recursion in the analysis, which has a knock-on effect on empirical performance. We propose Orth-Dion, which replaces column normalization with QR orthogonalization of the right factor. Under non-Euclidean smoothness, with $L_r$ the curvature constant along rank-$r$ directions, Orth-Dion attains rate $O(\sqrt{L_r/T})$, matching exact spectral methods at the same per-step communication cost as Dion. The proof removes the bounded-drift assumption common in prior error-feedback analyses via a self-consistent fixed-point argument, and uses a time-averaged contraction that only requires the error sequence to contract on average rather than at every step. Experiments on large-scale language model pre-training validate the predicted $\sqrt{r}$ scaling and show that Orth-Dion closes the convergence gap to Muon at Dion's communication cost.
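The geometric difference between column normalization and QR orthogonalization can be checked on a tiny example. The sketch below uses classical Gram-Schmidt on a list of columns as a stand-in for a library QR routine and assumes linearly independent columns; it illustrates only the normalization step and omits power iteration, error feedback, and the rest of the Dion and Orth-Dion updates.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def column_normalize(cols):
    """Rescale every column to unit length (the Dion-style step)."""
    return [[x / norm(c) for x in c] for c in cols]


def qr_orthogonalize(cols):
    """Return orthonormal columns spanning the same space (Q of a thin QR)."""
    basis = []
    for c in cols:
        w = list(c)
        for q in basis:
            coeff = dot(q, w)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        n = norm(w)
        basis.append([wi / n for wi in w])
    return basis


def gram(cols):
    """Gram matrix of the columns; the identity means orthonormal columns."""
    return [[dot(a, b) for b in cols] for a in cols]
```

For the columns `[1, 0, 0]` and `[1, 1, 0]`, column normalization leaves an off-diagonal Gram entry of about 0.707, whereas the orthogonalized columns have an identity Gram matrix. This is the property a rank-r polar factor requires and that unit-length columns alone do not guarantee.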

2605.16339 2026-05-19 cs.LG 版本更新

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

奖励模型中的偏好不稳定性:通过稀疏自编码器进行检测与缓解

Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce

AI总结 本文研究了大型语言模型中奖励模型的偏好不稳定性问题,提出通过稀疏自编码器检测并缓解这种不稳定性,通过隔离不稳定特征来提升模型的偏好准确性。

详情
AI中文摘要

在大型语言模型中,偏好学习依赖于奖励模型作为人类判断的代理。然而,这些模型经常表现出偏好不稳定性,对细微的、保持意义的输入变化产生矛盾的偏好分配。我们分析了三种语义保持扰动类型(改写、模式注入和后门触发)下的表示层面的不稳定性,并将其归因于对预测但脆弱的特征的过度依赖,即不稳定特征。我们通过稀疏自编码器(SAEs)在稀疏潜在空间中隔离这些特征,使良性输入和扰动输入激活明显分离的模式。基于这种分离性,我们提出了两种基于SAEs的不稳定性缓解策略:SAE特征引导,通过识别并抑制异常激活的特征来减少不稳定性;以及SAE残差校正,通过学习适应性调整SAE特征以恢复正确的偏好。我们的方法显著减少了在无害性和幻觉基准上的错误偏好分配,同时在其他任务上保持了良性性能和通用效用,而无需重新训练奖励模型。我们的代码和数据可在https://github.com/shunchang-liu/pisa中获得。

英文摘要

Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three semantic-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over-reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE-based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features to restore correct preferences. Our methods substantially reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving benign performance and general utility on other tasks, without retraining the reward model. Our code and data are available at https://github.com/shunchang-liu/pisa.

2605.16326 2026-05-19 q-bio.QM cs.AI cs.LG eess.SP 版本更新

A Machine Learning Framework for EEG-Based Prediction of Treatment Efficacy in Chronic Neck Pain

一种基于EEG的慢性颈部疼痛治疗效果预测机器学习框架

Xiru Wang, Aiden Li, Hongzhao Tan, Stevie Foglia, Aimee Nelson, Zhen Gao

AI总结 本文提出利用EEG数据预测慢性颈部疼痛治疗效果的机器学习框架,通过严格的数据预处理和文献综述,旨在开发支持个性化医疗的鲁棒预测模型。

Comments 15 pages, 7 figures

详情
AI中文摘要

慢性颈部疼痛是全球导致残疾的主要原因之一,当前的治疗选择仍主要依赖于试错。我们提出了一种机器学习框架,利用脑电图(EEG)预测慢性颈部疼痛患者的治疗效果,旨在支持个性化治疗并减轻医疗系统负担。该框架的核心是针对每种EEG记录类型特征的严格数据预处理阶段。对于静息态EEG,预处理流程包括基线信号去除、坏通道识别和排除、重新参考、带通和陷波滤波、独立成分分析和功率谱密度分析。对于运动执行和运动想象记录,应用相同的初始步骤后,信号与触发事件对齐,以便量化事件相关去同步(ERD)和事件相关同步(ERS)。同步记录的肌电数据经过带通滤波和移动平均平滑,然后与相应的EEG通道相关联,以表征尝试运动期间的EEG-EMG关系。同时,我们进行了广泛的文献综述,回顾了应用于临床EEG的机器学习模型(最初筛选763条记录,保留16项患者研究和47项健康对照研究),以指导后续处理策略。通过这种结合的预处理和综述工作,我们旨在开发一个鲁棒的预测模型,以支持慢性疼痛管理中的个性化医疗策略。

英文摘要

Chronic neck pain is a leading cause of disability worldwide, and current treatment selection remains largely trial and error. We present a machine learning framework that uses electroencephalography to predict treatment efficacy in patients with chronic neck pain, with the goal of supporting individualized therapy and reducing the burden on healthcare systems. The framework centers on a rigorous data preprocessing stage tailored to the characteristics of each EEG recording type. For resting-state EEG, the preprocessing pipeline comprises baseline signal removal, bad channel identification and exclusion, re-referencing, bandpass and notch filtering, Independent Component Analysis, and power spectral density analysis. For motor execution and motor imagery recordings, the same initial steps are applied, after which signals are aligned to trigger events so that event-related desynchronization (ERD) and event-related synchronization (ERS) can be quantified. Synchronously recorded electromyography data are bandpass filtered and smoothed with a moving average, then correlated with the corresponding EEG channels to characterize the EEG-EMG relationship during attempted movement. In parallel, we performed an extensive literature review of machine learning models applied to clinical EEG (763 records initially screened, 16 patient and 47 healthy-control studies retained), to inform the post-processing strategy. Through this combined preprocessing and review effort, we aim to develop a robust predictive model that can support personalized healthcare strategies in chronic pain management.

2605.16325 2026-05-19 cs.LG cs.AI 版本更新

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

驱动信息系统的相变:从学习理论和非平衡化学的角度看两种场视角

Truong Xuan Khanh

AI总结 本文从学习理论和非平衡化学角度,提出驱动信息系统的两种场框架,引入对抗破裂阈值和自参照耦合阈值,探讨相变现象的普遍性类与可验证预测。

Comments 29 pages, 2 figures

详情
AI中文摘要

深度学习中的相变现象(如grokking、涌现能力、上下文转移下的本体重构)已通过表征压缩、奇异学习理论和信息论进步度量等视角研究。同时,非平衡统计物理在预生物选择下的驱动化学反应网络中识别出相变,其经验特征难以在单一场梯度模型中复现。本文提出一种视角,将两类现象共同描述为驱动信息系统:由两个梯度场(熵产率Sigma和信息准势Phi_I := -ln p*,其中p*是稳态密度)支配的随机过程。在此框架中引入两个候选序参数:对抗破裂阈值alpha_dagger和自参照耦合阈值kappa_c。(alpha_dagger, kappa_c)的联合缩放定义了一个候选普遍性类,具有指数(gamma_1, gamma_2)。本文概述了该框架的几何结构,识别出可区分其与单一场替代方案的可验证预测,并展示其与2024-2026年最近的实证发现(如对齐相变、对抗破裂缩放和大语言模型部分自我反思)的一致性。

英文摘要

Phase-transition phenomena in deep learning (grokking, emergent capabilities, and ontological reorganization under context shift) have been studied through several lenses, including representational compression, singular learning theory, and information-theoretic progress measures. Independently, non-equilibrium statistical physics has identified phase transitions in driven chemical reaction networks underlying prebiotic selection, with empirical signatures that are difficult to reproduce within single-field gradient accounts. We propose a perspective in which both classes of phenomena admit a common description as driven informational systems: stochastic processes governed by two gradient fields, an entropy production rate Sigma and an information quasi-potential Phi_I := -ln p*, where p* is the stationary density. Within this framework we introduce two candidate order parameters: an adversarial breakdown threshold alpha_dagger and a self-referential coupling threshold kappa_c. The joint scaling of (alpha_dagger, kappa_c) defines a candidate universality class with exponents (gamma_1, gamma_2). We outline the geometric structure of this framework, identify falsifiable predictions distinguishing it from single-field alternatives, and show consistency with recent empirical findings (2024--2026) on alignment transitions, adversarial breakdown scaling, and partial introspection in large language models.

2605.16324 2026-05-19 cs.LG cs.CE q-fin.ST 版本更新

Bi-Level Chaotic Fusion Based Graph Convolutional Network for Stock Market Prediction Interval

双层混沌融合基于图卷积网络的股票市场预测区间

Eshwar Sai Kandimalla, Sravan Chowdary Kankanala, Sumana Bhimineni, Hem Sundhar Korukunda, Vivek Yelleti

AI总结 本文提出双层混沌融合图卷积网络,用于解决股票市场预测区间中不确定性表示问题,通过时空图结构和动态市场制度感知机制提升预测精度和置信度。

详情
AI中文摘要

金融市场预测本质上具有不确定性,但大多数深度学习方法依赖于点预测,仅提供单一值估计而无法量化不确定性。此类预测不足以支持风险意识决策,因为它们无法捕捉可能结果的范围和预测的置信度。通过预测区间可以获取预测的上下限,从而在模型中表示不确定性。然而,当前方法往往忽视资产之间的关系或无法同时确保在动态变化的市场制度中预测区间的良好校准和精确性。在本文中,我们提出了一种基于时空图的双层混沌融合方法,通过非线性变换函数分别估计区间中心和宽度,并采用波动率感知门控机制使预测依赖于市场所处的制度。通过嵌入图结构并按时间顺序建模来考虑时间依赖性。训练根据下上界估计(LUBE)目标进行。实验结果表明,与现有基线(LSTM、GRU、GCN、HGNN)相比,在2016至2026年43家领先公司在NSE八个行业的数据上,本方法在Winkler评分(0.0778)、最紧的预测区间(PIAW=0.1407)和最高的覆盖率(PICP=96.6%)方面均显著改进,所有差异均在统计上显著(p<0.001)根据Diebold-Mariano检验。

英文摘要

Financial market forecasting is inherently uncertain, yet most deep learning approaches rely on point predictions that provide only single-value estimates without quantifying uncertainty. Such predictions are insufficient for risk-aware decision-making, as they fail to capture the range of possible outcomes and the associated confidence of forecasts. The problem can be solved using prediction intervals, which allow obtaining an upper and lower bound for the prediction, thus enabling uncertainty representation in the model. Yet, the current methods tend to disregard relationships between assets or cannot simultaneously ensure good calibration and sharpness of the resulting intervals in dynamically changing market regimes. In our work, we propose a spatio-temporal graph-based approach with a bi-level chaotic fusion technique to solve this problem. Our model uses separate nonlinear transformation functions to estimate the interval center and width. Additionally, a volatility-aware gating mechanism is used to make predictions dependent on the regime in which the market operates. Temporal dependencies are considered by embedding graph structures and sequentially modeling them. Training is conducted according to a Lower-Upper Bound Estimation (LUBE) objective. Our experimental results show significant improvements compared to existing baselines (LSTM, GRU, GCN, HGNN) when applied to data from 2016 to 2026 with 43 leading companies in eight sectors of the NSE. It provides the lowest Winkler score (0.0778), tightest prediction intervals (PIAW = 0.1407), and highest coverage (PICP = 96.6%), with all differences statistically significant (p < 0.001) according to the Diebold-Mariano test.
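The three interval metrics named in the abstract above can be sketched as follows, using their common textbook definitions: coverage probability (PICP), average width (PIAW), and the Winkler score at miscoverage level `alpha`. The paper may normalize widths or scores differently, so the numbers here are not comparable to the reported values, and the toy data are invented for illustration.

```python
def picp(y, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    inside = sum(1 for t, l, u in zip(y, lower, upper) if l <= t <= u)
    return inside / len(y)

def piaw(lower, upper):
    """Average interval width (no normalization applied)."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)

def winkler(y, lower, upper, alpha):
    """Mean Winkler score: interval width plus a penalty of (2 / alpha)
    times the distance by which a target falls outside its interval."""
    total = 0.0
    for t, l, u in zip(y, lower, upper):
        score = u - l
        if t < l:
            score += (2.0 / alpha) * (l - t)
        elif t > u:
            score += (2.0 / alpha) * (t - u)
        total += score
    return total / len(y)

# Invented toy data: the third target falls below its interval.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.0]
upper = [1.5, 2.5, 4.5, 5.0]

coverage = picp(y_true, lower, upper)
avg_width = piaw(lower, upper)
score = winkler(y_true, lower, upper, alpha=0.1)
print(coverage, avg_width, score)
```

Lower Winkler scores are better: the score rewards narrow intervals but penalizes misses sharply, which is why it is often used to summarize calibration and sharpness in a single number.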

2605.16321 2026-05-19 cs.LG 版本更新

Language Game: Talking to Non-Human Systems

语言游戏:与非人类系统交谈

Yanbo Zhang, Michael Levin

AI总结 本文提出通过游戏机制与非人类系统对话,利用强化学习政策的核心动态,使系统通过自身行为回应,展示不同系统在对话中的收敛行为及特性影响。

Comments 29 pages, 12 figures, 7 tables

详情
AI中文摘要

语言承载人类的思想与协调,但很少延伸至多样智能的谱系。然而,非神经系统——从基因调控网络到真菌——正被视作计算、决策和记忆的载体,使与非人类智能对话成为可能。目前,这种对话仅通过代理实现:大型语言模型代表系统发言,而系统本身保持沉默。本文探讨系统是否能以自身声音说话。受维特根斯坦影响,将沟通视为与系统进行的游戏。系统内部动态冻结为强化学习策略的非线性核心,仅训练线性输入输出接口。通过使用与奖励,系统状态和响应在游戏内获得意义,使游戏成为说话。不同架构玩同一游戏优化相同奖励,其行为可视为追求该奖励;游戏成为跨不同表示的通用语言。给定人类提示,语言模型将其路由至语义最匹配的游戏,并设计环境状态,使期望动作成为理性响应,让系统通过自身行为回应。应用于多样基因调控网络和强化学习任务,该框架实现流畅对话而不改变任何系统参数,展示不同起源的训练代理收敛于相似行为,并揭示特定GRN特性使系统更容易或更难对话——这是自身水库的归纳偏差。本文框架开辟了与任何动态系统在自身术语下对话的新途径。

英文摘要

Language carries thought and coordination among humans but rarely reaches further along the spectrum of diverse intelligence. Yet non-neural systems -- from gene regulatory networks and microbial consortia to fungi -- are increasingly recognized as substrates of computation, decision-making and memory, making dialogue with non-human intelligence newly conceivable. Today such dialogue is attempted only by proxy: a large language model speaks on the system's behalf, so any intelligence on display originates from the model while the system itself remains silent. Here we ask whether the system can speak in its own voice. Following Wittgenstein, who located meaning in use, we treat communication as a game played with the system. Its internal dynamics are frozen as the nonlinear core of a reinforcement-learning policy, with only linear input and output interfaces trained. Through use and reward, the system's states and responses acquire meaning within the game, so playing becomes speaking. Because different architectures playing the same game optimize the same reward, their behaviors can all be read as pursuit of that reward; the game serves as a lingua franca across otherwise irreconcilable representations. Given a human prompt, a language model routes it to the game whose semantics best match it and designs an environmental state for which the desired action is the rational response, letting the system reply through its own behavior. Applied across diverse gene regulatory networks and reinforcement-learning tasks, the framework yields fluent dialogue without altering any system parameter, shows that well-trained agents of disparate origin converge on similar behavior, and reveals that specific GRN properties make a system easier or harder to talk with -- an inductive bias of the reservoir itself. Our framework opens a new route to conversing with any dynamical system on its own terms.

2605.16320 2026-05-19 cs.LG cs.AI 版本更新

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph:一种克服维度诅咒并促进科学发现的图原生聚类算法

Ahmed Elmahdi

AI总结 AdaGraph是一种基于图的聚类算法,通过结构导向的计算方法克服维度诅咒,其在10个合成数据集上表现优异,且在基因表达、文本聚类和材料科学领域实现了科学发现。

Comments 12 pages, 4 figures, 1 table. Full paper in preparation for KDD 2027

详情
AI中文摘要

AdaGraph是一种基于图的聚类算法,通过结构导向的计算方法克服维度诅咒,其在10个合成数据集上表现优异,且在基因表达、文本聚类和材料科学领域实现了科学发现。

英文摘要

We present AdaGraph, a graph-native clustering algorithm born from the Structure-Centric Machine Learning (SC-ML) paradigm -- a new field of unsupervised learning that replaces geometry-centric (distance-based) computation with structure-centric (topology-based) computation, fundamentally dissolving the curse of dimensionality. AdaGraph operates entirely within the kNN graph topology, a representation that retains meaningful relational structure in arbitrarily high dimensions where Euclidean distance metrics become uninformative. AdaGraph requires no a priori specification of the number of clusters k, handles noise natively, and scales via the SLCD (Sample-Learn-Calibrate-Deploy) prototype-deployment framework. As its unsupervised tuning objective, AdaGraph pairs with Graph-SCOPE, the topology-based cluster validity index introduced as a separate SC-ML contribution. On 10 synthetic benchmarks spanning d=10 to d=5000, Graph-SCOPE achieves mean ARI=0.900 and correctly selects k on 9/10 datasets -- outperforming Silhouette, Davies-Bouldin, and Calinski-Harabasz -- while maintaining Kendall tau >= 0.92 with ground-truth cluster quality across all dimensionalities (Silhouette: tau ~= 0.46). We validate AdaGraph across three scientific domains: (1) gene co-expression discovery in hepatocellular carcinoma (GSE14520, 10,000 genes, 488 patients, no dimensionality reduction), where AdaGraph identifies condition-specific gene modules that WGCNA, ICA, NMF, and Spectral Biclustering fail to resolve; (2) natural language text clustering, where AdaGraph achieves ARI=0.751 on 20NG-6cat versus HDBSCAN's 0.464 (62% relative improvement); (3) materials science clustering of superconductors (145-dimensional Magpie features), perovskites, and JARVIS-DFT materials, where AdaGraph achieves the highest Graph-SCOPE on all three datasets.

2605.16319 2026-05-19 cs.LG stat.AP stat.ML 版本更新

Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories

中长期阿尔茨海默病进展预测:基于残差间隙感知的变换器用于ADNI临床和生物标志物历史的24个月CDR-SB变化

Ran Tong, Tong Wang, Lanruo Wang, Yin Ni

AI总结 本文提出残差间隙感知变换器,结合混合效应统计参考与变换器残差学习,用于预测24个月CDR-SB变化,提升了预测精度和相关性。

Comments Preprint; includes appendix, 4 figures, and 6 tables

详情
AI中文摘要

中长期阿尔茨海默病进展预测困难,因为未来临床评分可能与基线严重程度保持一致,而生物标志物历史不规则且不完全观察。我们开发了一种基于锚点的分析,利用统一的阿尔茨海默病神经影像计划(ADNI)表格分析24个月临床痴呆评定总箱数(CDR-SB)变化。每个标记样本在轻度认知障碍访问时锚定,仅使用在锚点之前观察到的临床和生物标志物历史,并将响应定义为在18-30个月窗口内最接近24个月的未来访问的CDR-SB减去锚定CDR-SB。分析队列包含来自858名受试者的2,600个标记锚点和7,276个纵向行。我们提出了一种残差间隙感知变换器,结合混合效应统计参考与基于变换器的残差学习,从预锚点的临床和生物标志物历史中学习。模型使用受试者层面的随机截距在混合效应参考中,观察层面的三元组标记化用于不规则历史,并在自注意力中学习非负时间间隙惩罚。我们比较了所提模型与通过贝叶斯信息准则选择的线性混合效应基线、GRU-D和STraTS在重复的受试者层面训练-测试分割下的表现。在五个受试者层面随机种子下,所提模型在所有报告指标上实现了最佳的平均测试性能,相对于混合效应基线,将MSE降低了13.1%,预测-观测相关性提高了26.4%。此外,它在均值误差和相关性上优于GRU-D和STraTS。这些结果表明,统计锚定和间隙感知残差学习为中长期阿尔茨海默病进展预测提供了一个有用的结构。

英文摘要

Medium-horizon Alzheimer's disease progression prediction is difficult because future clinical scores can remain tied to baseline severity, while biomarker histories are irregular and incompletely observed. We develop an anchor-based analysis of 24-month Clinical Dementia Rating Sum of Boxes (CDR-SB) change using harmonized Alzheimer's Disease Neuroimaging Initiative (ADNI) tables. Each labeled sample is anchored at a mild cognitive impairment visit, uses only clinical and biomarker history observed at or before that anchor, and defines the response as CDR-SB at the future visit closest to 24 months within an 18--30 month window minus anchor CDR-SB. The analytic cohort contains 2,600 labeled anchors from 858 participants and 7,276 longitudinal rows. We propose a residual gap-aware transformer that combines a mixed-effects statistical reference with transformer-based residual learning from pre-anchor clinical and biomarker histories. The model uses participant-level random intercepts in the mixed-effects reference, observation-level triplet tokenization for irregular histories, and a learned nonnegative time-gap penalty inside self-attention. We compare the proposed model with a Bayesian-information-criterion-selected linear mixed-effects baseline, GRU-D, and STraTS under repeated participant-level train--test splits. Across five participant-level random seeds, the proposed model achieves the best mean test performance across all reported metrics, reducing MSE by 13.1% and increasing prediction--observation correlation by 26.4% relative to the mixed-effects baseline. It also improves over both GRU-D and STraTS in mean error and correlation. These results show that statistical anchoring and gap-aware residual learning provide a useful structure for medium-horizon Alzheimer's disease progression prediction.
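The response definition in the abstract above (CDR-SB at the future visit closest to 24 months within an 18 to 30 month window, minus the anchor CDR-SB) can be sketched as a small helper. The function name, the input layout, the inclusive window endpoints, and the tie-breaking rule (first visit in list order) are assumptions made for illustration and are not taken from the paper.

```python
def cdrsb_change(anchor_score, future_visits, target=24.0, window=(18.0, 30.0)):
    """Return the CDR-SB change for one anchor, or None if unlabeled.

    future_visits: list of (months_after_anchor, cdr_sb) tuples.
    Window endpoints are treated as inclusive (an assumption).
    """
    lo, hi = window
    eligible = [(m, s) for m, s in future_visits if lo <= m <= hi]
    if not eligible:
        return None  # no visit in the window, so the anchor gets no label
    # Pick the visit whose timing is closest to the 24-month target.
    _, score = min(eligible, key=lambda visit: abs(visit[0] - target))
    return score - anchor_score

# Invented visit history for one anchor with CDR-SB 1.5.
visits = [(6.0, 1.5), (12.0, 2.0), (20.0, 2.5), (26.0, 3.0), (36.0, 4.0)]
change = cdrsb_change(1.5, visits)
unlabeled = cdrsb_change(1.5, [(6.0, 1.0), (36.0, 2.0)])
print(change, unlabeled)
```

Here the 26-month visit is chosen over the 20-month visit because it is 2 months from the target rather than 4, giving a change of 1.5. An anchor with no visit inside the window yields no label.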

2605.16318 2026-05-19 cs.LG 版本更新

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

在强化学习中探究循环神经网络的动作编码

Matthew Schlegel, Volodymyr Tkachuk, Adam White, Martha White

AI总结 本文探讨了在强化学习中如何通过修改循环神经网络架构来整合动作信息,评估了不同方法在多个示例领域中的效果,并讨论了未来发展的挑战。

Comments Published in TMLR in 2023, https://openreview.net/forum?id=K6g4MbAC1r. Transactions on Machine Learning Research (2023)

详情
AI中文摘要

构建和维护状态以学习策略和价值函数对于在现实世界中部署强化学习(RL)智能体至关重要。循环神经网络(RNNs)已成为解决状态构建问题的关键焦点,许多大规模强化学习智能体都集成了循环网络。尽管RNNs在许多RL应用中已成为主流,但许多关键设计选择和实现细节,这些对性能提升有贡献的,往往并未被报告。在本工作中,我们讨论了RNN架构可以(并且已经)被修改的一个轴线,即如何将动作信息整合到循环单元的状态更新函数中。我们讨论了在使用动作信息时的几种选择,并在一组示例领域中对由此产生的架构进行了实证评估。最后,我们讨论了开发循环单元的未来工作,并讨论了特定于RL环境的挑战。

英文摘要

Building and maintaining state to learn policies and value functions is critical for deploying reinforcement learning (RL) agents in the real world. Recurrent neural networks (RNNs) have become a key point of interest for the state-building problem, and several large-scale reinforcement learning agents incorporate recurrent networks. While RNNs have become a mainstay in many RL applications, many key design choices and implementation details responsible for performance improvements are often not reported. In this work, we discuss one axis on which RNN architectures can be (and have been) modified for use in RL. Specifically, we look at how action information can be incorporated into the state update function of a recurrent cell. We discuss several choices in using action information and empirically evaluate the resulting architectures on a set of illustrative domains. Finally, we discuss future work in developing recurrent cells and discuss challenges specific to the RL setting.

2605.16315 2026-05-19 cs.LG cs.AI 版本更新

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

决策能力阈值在自我对战强化学习中的崩溃中起作用

Arahan Kujur

AI总结 研究揭示决策能力阈值决定自我对战强化学习代理在不对称规则扰动下的崩溃,通过消除所有正可达条件决策导致快速收敛到确定性利用吸引子,而保留单个正可达条件决策可防止崩溃。

Comments 18 pages, 7 figures

详情
AI中文摘要

我们展示了一个阈值在决策能力中决定自我对战强化学习代理在不对称规则扰动下的崩溃。在扑克变种、矩阵游戏、骰子游戏和多种学习算法中,消除所有正可达条件决策导致快速收敛到确定性利用吸引子,一个在接近最大损失处的固定点。保留甚至一个正可达条件决策点可防止此崩溃。冻结基线和固定对手控制确认该机制是受约束下的共适应,而非扰动本身。该现象是时间不变的,一旦恢复行动即可完全可逆,并在函数逼近下加剧。这些结果确立了一个在测试领域中精确的阈值,即零可达加权条件行动能力,其严重性通过可达加权能力连续缩放。

英文摘要

We show that a threshold in decision capacity determines whether self-play reinforcement learning agents collapse under asymmetric rule perturbations. Across poker variants, matrix games, a dice game, and multiple learning algorithms, eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, a fixed point at near-maximal loss. Preserving even a single positive-reach contingent decision point prevents this collapse. A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself. The phenomenon is timing-invariant, fully reversible upon action restoration, and intensifies under function approximation. These results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously via reach-weighted capacity in the tested domains.

2605.16312 2026-05-19 cs.LG cs.AI 版本更新

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

当动作消失时:自我对战强化学习中的对抗性动作移除

Arahan Kujur

AI总结 研究了自我对战强化学习中的对抗性动作遮蔽,发现学习的遮蔽比随机遮蔽和学习扰动基线更具破坏性,揭示了动作可用性作为自我对战RL中的新鲁棒性表面。

Comments 17 pages, 2 figures, 18 tables

详情
AI中文摘要

我们研究了自我对战强化学习中的对抗性动作遮蔽:攻击者会选择性地从受害者的行为集移除合法动作。不同于观察或动作扰动,移除是在智能体行动前消除了决策选项。在从6到5531个信息状态的扑克游戏以及两个非扑克领域中,学习的遮蔽造成的损害比随机遮蔽和学习扰动基线要大得多。攻击在Q-learning、PPO、NFSP、神经NFSP和DQN受害者中持续存在;能够跨智能体转移;在自我对战中被放大;并在延长的遮蔽训练下没有恢复。机理上,攻击者针对高价值决策点,通过可达加权条件动作容量(CAC_w)和价值加权细化CAC_v来捕捉。这些结果将动作可用性识别为自我对战RL中的一个新鲁棒性表面。

英文摘要

We study adversarial action masking in self-play reinforcement learning: an attacker selectively removes legal actions from a victim's action set. Unlike observation or action perturbations, removal eliminates decision options before the agent acts. Across poker games scaling from 6 to 5,531 information states and two non-poker domains, learned masking causes substantially more damage than random masking and learned perturbation baselines. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC$_w$) and a value-weighted refinement CAC$_v$. These results identify action availability as a distinct robustness surface in self-play RL.

2605.16311 2026-05-19 cs.LG cs.DC 版本更新

SignMuon: Communication-Efficient Distributed Muon Optimization

SignMuon:通信高效的分布式Muon优化

Neel Mishra, Kushagara Trivedi, Pawan Kumar

AI总结 本文提出Sign-Muon优化器,结合signSGD的符号聚合与Muon的极步框架,通过矩阵感知的1位优化方法,在减少通信开销的同时提升训练效率,实验表明其在多个数据集上均取得最佳性能。

Comments 40 pages, 9 figures

详情
AI中文摘要

大规模神经网络的分布式训练受限于全精度梯度通信和忽略权重张量矩阵结构的坐标优化器。我们提出Sign-Muon,一种1位、矩阵感知的优化器,结合signSGD的多数投票符号聚合与Muon的极步框架。每个工作者通过牛顿-施卢茨迭代取动量的极因子形成Muon式方向,仅传输元素符号并进行多数投票聚合;可选的局部极步进一步在不增加通信开销的情况下强制正交性。在谱范数光滑性和有界方差随机梯度下,谱范数归一化的符号步在非凸问题中达到O(1/√T)的收敛速率。在单峰对称噪声下,多数投票跨M个工作者将随机项降低1/√M,与signSGD匹配。在α-β模型中,分布式Sign-Muon每个迭代仅需一次整数sum-allreduce;所有正交化本地完成,相比float32带来32倍的带宽减少(int8为4倍)。在330个CIFAR-10/ResNet-50配置中,Sign-Muon达到最佳验证准确率(92.15%);其4-GPU多数投票变体在匹配有效批次下以37%更少的训练时间达到92.02%。在nanoGPT上,Sign-Muon在 perplexity 和 anytime 性能上优于其他基于符号的基线,弱标度性能在16 GPU时表现良好。

英文摘要

Distributed training of large neural networks is bottlenecked by full-precision gradient communication and by coordinatewise optimizers that ignore the matrix structure of weight tensors. We propose Sign-Muon, a 1-bit, matrix-aware optimizer that combines majority-vote sign aggregation from signSGD with the polar-step framework of Muon. Each worker forms a Muon-style direction by taking the polar factor of its momentum via a Newton--Schulz iteration, transmits only the entrywise signs, and aggregates by majority vote; an optional local polar step further enforces orthogonality at no extra communication cost. Under spectral-norm smoothness and bounded-variance stochastic gradients, the spectral-norm normalized sign step yields an $\mathcal{O}(1/\sqrt{T})$ nonconvex rate for an $\ell_1$-based stationarity measure. With unimodal symmetric noise, majority vote across $M$ workers cuts the stochastic term by $1/\sqrt{M}$, matching signSGD. In the $α$-$β$ model, distributed Sign-Muon needs only one integer sum-allreduce per iteration; all orthogonalization is local, giving a $32\times$ bandwidth reduction over float32 ($4\times$ for int8). Across 330 CIFAR-10/ResNet-50 configurations Sign-Muon attains the best validation accuracy (92.15%); its 4-GPU majority-vote variant reaches 92.02% with 37% less training time at matched effective batch. On nanoGPT, Sign-Muon achieves lower perplexity and better anytime performance than other sign-based baselines, with favorable weak-scaling up to 16 GPUs.
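The majority-vote sign aggregation described in the abstract above can be sketched in a few lines. This covers only the aggregation step: the Newton-Schulz polar step is omitted, the worker updates are invented numbers, and mapping an exact zero to +1 is an assumption made here for simplicity.

```python
def sign(x):
    """Entrywise sign; exact zeros map to +1 (an assumption of this sketch)."""
    return 1 if x >= 0 else -1

def majority_vote(worker_updates):
    """Aggregate per-worker directions by majority vote over entrywise signs.

    worker_updates: list of equally sized flat lists of floats, one per worker.
    Summing the +1/-1 votes plays the role of the integer sum-allreduce.
    """
    votes = [[sign(x) for x in update] for update in worker_updates]
    totals = [sum(column) for column in zip(*votes)]
    return [sign(t) for t in totals]

# Three hypothetical workers, three parameters each.
updates = [
    [0.3, -1.2, 0.1],
    [0.5, 0.4, -0.2],
    [-0.1, -0.7, -0.9],
]
direction = majority_vote(updates)
print(direction)
```

Each worker contributes one sign per entry instead of a 32-bit float, which is where the stated bandwidth reduction over float32 comes from. With an odd number of workers the vote total is never zero, so the tie rule only matters for even worker counts.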

2605.16290 2026-05-19 cs.CY cs.AI cs.LG 版本更新

MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling

通过数据驱动的认知画像建模学习者异质性以预测多项选择题难度

Dhriti Krishnan, Jaromir Savelka

AI总结 本文提出基于学习者异质性的数据驱动认知画像框架,通过隐类分析识别行为画像并模拟响应分布,结合主题上下文和岭回归模型预测IRT难度参数,提升难度预测精度。

详情
AI中文摘要

预测多项选择题(MCQ)难度对有效评估至关重要,但当前方法通常假设学生能力分布单峰,忽视学生误解的异质性。本文提出一种基于角色的框架,用数据驱动的认知画像替代理论能力采样。利用EEDI数据集中的学生互动,通过潜在类分析(LCA)识别行为画像,然后将大语言模型(LLM)调制以模拟每个画像的响应分布。这些信号与主题上下文结合,输入岭回归模型预测项目反应理论(IRT)难度参数。通过五折交叉验证,本文方法在MSE上优于最近的基线(0.367到0.274;R2:0.525到0.686)。发现的画像具有可解释性,并提供了关于项目难度原因的见解,潜在应用于诊断评估设计。

英文摘要

Predicting the difficulty of multiple-choice questions (MCQs) is important for effective assessment, yet current methods typically assume a unimodal student ability distribution, overlooking the heterogeneous nature of student misconceptions. We propose a persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling. Using student interactions from the EEDI dataset, we identify behavioral personas via latent class analysis (LCA), then condition a large language model (LLM) to simulate response distributions for each persona. These signals are aggregated with topic context and fed into a Ridge Regression model to predict the item response theory (IRT) difficulty parameter. With five-fold cross-validation, our method improves over a recent baseline (MSE: 0.367 to 0.274; R2: 0.525 to 0.686). The discovered personas are interpretable and offer insights into why items are difficult, with potential applications to diagnostic assessment design.

2605.16268 2026-05-19 cs.HC cs.AI cs.LG 版本更新

Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

帮助陷入困境的客户:一个基于LLM的代理,能够对话、探测和分流

Alankar Atreya, Stefan Sylvius Wanger, Devesh Batra, Robert Hankache, Cristovao Iglesias, Patrick Sinclair, Giulio Pelosio, Michael McMillan, Greig A. Cowan, Raad Khraishi

AI总结 本文提出一个基于LLM的AI分流代理,通过多轮对话和提问提高客户问题分类准确性,提升银行客户服务效率。

详情
AI中文摘要

银行每年收到数百万起欺诈、诈骗和争议交易报告,准确将客户引导至合适专业团队极具挑战性。现有的人工流程对客户和员工都缓慢且压力大。为此,我们开发了一个面向客户的AI分流代理,利用大型语言模型(LLMs)进行多轮对话、提问和分类,以实现准确的政策引导分流,嵌入客户旅程中。为了评估和持续改进代理,模拟了真实的客户数字孪生,生成基于历史数据的真实、标注对话,以测试各种现实场景。本文详细介绍了分流代理的建模方法、与政策、安全护栏和推理框架的整合、使用合成代理进行可扩展评估,以及AI系统在准确性、鲁棒性和合规性方面的发现。结果表明,代理成功提高了历史案例的分流效果,实现分类准确率提升30.6%,我们的领域专家报告了高水平的满意度,突显了针对性探测在大规模银行运营中的有效分流作用。

英文摘要

Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing manual process is slow and stressful for both customers and staff. To address this, we develop a customer-facing, AI-powered triaging agent that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing, and that is embedded in the customer journey. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent's modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system's accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.

2605.16266 2026-05-19 cs.GR cs.CV cs.LG 版本更新

Patchwork: A compact representation for 3D polygonal shapes

Patchwork: 3D多边形形状的紧凑表示

Ruichen Zheng, Biao Zhang, Michael Birsak, Mikhail Skopenkov, Peter Wonka

AI总结 Patchwork提出了一种新的通用形状表示方法,通过少量参数建模2D和3D几何,提供可证明的复杂度界和任意精度近似能力,结合高效梯度优化和新型正则化损失,实现高紧凑性,适用于几何学习与重建任务。

详情
AI中文摘要

我们介绍了Patchwork,一种新的通用形状表示方法,能够用少量参数建模2D和3D几何。Patchwork建立在严谨的数学框架上,提供可证明的复杂度界,并能在任意维度中以任意精度近似任意形状。我们提出了一种高效的基于梯度的优化方案,用于拟合2D和3D数据,同时提出一种新的正则化损失,逐步剔除冗余元素,收敛后获得高紧凑性。我们的方法提供了快速拟合性能,参数数量比现有方法少很多,并原生支持内外分类,使其成为几何学习和重建任务的通用且紧凑的表示方法,未来可能用于3D生成。我们的实现可在:https://github.com/Ankbzpx/patchwork-experiment 获取。

英文摘要

We introduce Patchwork, a new general-purpose shape representation capable of modeling 2D and 3D geometry with a small number of parameters. Patchwork is grounded in a rigorous mathematical framework, providing provable complexity bounds and the ability to approximate arbitrary shapes with arbitrary precision in any dimension. We propose an efficient gradient-based optimization scheme to fit Patchwork representations to 2D and 3D data, along with a novel regularization loss that progressively prunes redundant elements, yielding high compactness after convergence. Our approach offers fast fitting performance, a fraction of the required parameters compared to existing alternatives, and native support for inside-outside classification, making it a versatile and compact representation for geometric learning and reconstruction tasks, with future potential for 3D generation. Our implementation is available at: https://github.com/Ankbzpx/patchwork-experiment.

2605.16262 2026-05-19 cs.LG math.OC 版本更新

Mirror Descent-Type Algorithms for the Variational Inequality Problem with Functional Constraints

镜像下降型算法用于带有功能约束的变分不等式问题

Mohammad S. Alkousa, Fedor S. Stonyakin, Belal A. Alashqar, Seydamet S. Ablaev

AI总结 本文提出了一种镜像下降型算法,根据功能约束值在迭代中的变化切换生产与非生产步骤,以提高求解带功能约束的变分不等式问题的效率和收敛速度。

详情
AI中文摘要

变分不等式在机器学习研究中起着关键作用,例如生成对抗网络、强化学习、对抗训练和生成模型。本文致力于解决带有功能约束(不等式类型约束)的受限变分不等式问题。我们提出了一些镜像下降型算法,根据迭代中功能约束值的大小切换生产与非生产步骤,使用多种不同的步长规则和停止准则。我们分析了所提出的算法,并证明了其在有界和单调算子以及利普希茨凸功能约束问题中达到所需精度的最优收敛率。此外,我们提出了一种修改算法,在生产步骤中考虑每个功能约束的计算,以及第一个违反可行性的约束。这种修改可以在有大量功能约束时节省算法运行时间。此外,我们还对所提出的算法进行了δ-单调算子的分析,使我们能够将所提出的算法作为特殊情况应用于没有确切子梯度信息的受限最小化问题。还给出了说明所提算法的工作和性能的数值实验。

英文摘要

Variational inequalities play a key role in machine learning research, in areas such as generative adversarial networks, reinforcement learning, adversarial training, and generative models. This paper is devoted to constrained variational inequality problems with functional constraints (inequality-type constraints). We propose mirror descent-type algorithms that switch between productive and non-productive steps depending on the values of the functional constraints at the iterates, with several different step size rules and stopping criteria. We analyze the proposed algorithms and prove their optimal convergence rate to achieve a solution with desired accuracy, for problems with bounded and monotone operators and Lipschitz convex functional constraints. In addition, we propose a modification of the proposed algorithms by considering each functional constraint in the calculation when we have a productive step, as well as the first constraint that violates the feasibility. This modification can save the running time of algorithms when we have many functional constraints. We also provide an analysis of the proposed algorithms for $δ$-monotone operators, allowing us to apply the proposed algorithms, as a special case, to constrained minimization problems when we do not have access to the exact information about the subgradient of the objective function. Numerical experiments that illustrate the work and performance of the proposed algorithms are also given.

2605.16259 2026-05-19 cs.LG cs.AI cs.DC 版本更新

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

苹果M3 Ultra上的实时扩散模型推断系统优化

Yoichi Ochiai

AI总结 本文针对苹果M3 Ultra平台系统性优化扩散模型实时推断,通过多种技术结合实现22.7 FPS的实时图像转换,揭示了与NVIDIA GPU不同的优化特性。

详情
AI中文摘要

尽管扩散模型在NVIDIA GPU上的实时图像生成技术迅速发展,但针对非CUDA平台如苹果硅芯片的系统性优化研究极为有限。本文在苹果M3 Ultra(60核心GPU,512GB统一内存)上进行了10个阶段的全面优化实验,目标是实现实时摄像头img2img转换。我们探索了包括CoreML转换、量化、令牌合并、神经引擎利用、紧凑模型探索、帧内插、基于kNN搜索的合成、pix2pix-turbo、光学流帧跳过和知识蒸馏等多种技术,并定量评估了每种方法的有效性。最终,通过结合CoreML转换的蒸馏专用模型SDXS-512与三线程摄像头流水线,在512x512分辨率下实现了22.7 FPS的实时摄像头img2img转换。本文的主要贡献是系统地证明了在CUDA上建立的优化见解不一定适用于苹果硅芯片的统一内存架构。我们揭示了一个与NVIDIA GPU截然不同的优化景观,包括量化无提速、并行推断无效以及神经引擎不适合大规模模型等特性,并为苹果硅芯片上的扩散模型推断提供了实用指南。

英文摘要

While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models -- and provide practical guidelines for diffusion model inference on Apple Silicon.

2605.14624 2026-05-19 cs.LG cs.AI cs.NE 版本更新

An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

比较神经求解器与启发式求解器在组合优化中的平均效率阈值

Sohaib Afifi

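AI summary: This paper examines the energy cost of neural solvers in combinatorial optimization and proposes the Amortized Efficiency Threshold (AET) framework; experiments show that a neural solver uses less total energy than a heuristic baseline once the deployment volume exceeds the threshold, which gives a new way to compare the two.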

Comments 13 pages, 3 figures, 1 table. Code and benchmark pipeline at https://github.com/sohaibafifi/aet. v1: initial release with CVRP n=50

详情
AI中文摘要

神经组合优化求解器常被批评其能耗高于CPU启发式方法,因其在GPU上训练的成本高。本文探讨了从

英文摘要

A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the CVRP environment at n=50 customers with the attention-based autoregressive solver of Kool et al. (2019), trained for 100 epochs on 20,000 instances over five random seeds, and HGS via PyVRP as the heuristic baseline. The measured operational crossover sits near 4.56e3 deployed instances at the median of a six-point baseline-budget sweep; the per-instance neural-to-heuristic ratio is 2.29e-3. The contribution is the framework, the open instrumentation, and the end-to-end measurement protocol. Code and benchmark pipeline are available at https://github.com/sohaibafifi/aet.
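The break-even reasoning in the abstract can be made concrete with a small sketch. It covers only the operational-energy part under a simplifying assumption: a fixed training cost plus constant per-instance costs. The embodied-carbon term and the solution-quality constraint are left out, and the function names and numbers are hypothetical, not the paper's measurements.

```python
import math

def amortized_efficiency_threshold(train_energy, neural_per_instance, heuristic_per_instance):
    """Deployment volume at which cumulative energies are equal.

    Returns math.inf when the network does not win per instance,
    because the fixed training cost is then never recovered.
    """
    gain = heuristic_per_instance - neural_per_instance
    if gain <= 0:
        return math.inf
    return train_energy / gain

def cumulative_energy_ratio(n, train_energy, neural_per_instance, heuristic_per_instance):
    """Neural-to-heuristic ratio of total energy after n deployed instances."""
    return (train_energy + n * neural_per_instance) / (n * heuristic_per_instance)

# Hypothetical units: training costs 1000, neural inference 1, heuristic 5 per instance.
threshold = amortized_efficiency_threshold(1000.0, 1.0, 5.0)
ratio_at_threshold = cumulative_energy_ratio(threshold, 1000.0, 1.0, 5.0)
ratio_in_the_limit = cumulative_energy_ratio(10**9, 1000.0, 1.0, 5.0)
```

As the deployment volume grows, the training term is amortized away and the ratio approaches the per-instance ratio (here 1/5), which matches the abstract's claim that the limit is a constant strictly below one whenever the network wins per instance.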

2605.14068 2026-05-19 cs.CV cs.LG 版本更新

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

CurveBench:一种用于嵌套乔丹曲线精确拓扑推理的基准

Amirreza Mohseni, Mona Mohammadi, Morteza Saghafian, Naser Talebizadeh Sardari

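AI summary: CurveBench is a benchmark for hierarchical topological reasoning from visual input. It contains 756 images of pairwise non-intersecting Jordan curves and poses a structured-prediction task: recover the rooted containment tree induced by the curves.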

详情
AI中文摘要

我们介绍了CurveBench,一种用于从视觉输入中进行层次拓扑推理的基准。CurveBench包含756张图像,这些图像中的乔丹曲线在易、多边形、地形启发、迷宫状和密集计数配置下成对不相交。每张图像都标注了一个根树,编码平面区域之间的包含关系。我们将任务定义为结构预测:给定一张图像,模型必须恢复由曲线诱导的完整根树。尽管该任务在视觉上简单,但最强的评估模型Gemini 3.1 Pro在CurveBench-Easy上仅达到71.1%的树生成准确率,在CurveBench-Hard上仅为19.1%。我们进一步通过RLVR风格的微调展示了基准的实用性。我们的训练Qwen3-VL-8B模型在CurveBench-Easy上将Qwen-3-VL-8B-Thinking的树生成准确率从2.8%提升到33.3%,超过GPT-5.4和Claude Opus 4.5。剩余的差距,尤其是在CurveBench-Hard上,表明精确的拓扑感知视觉推理仍远未解决。

英文摘要

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
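To illustrate what a rooted containment tree is, the sketch below builds one for the simplest kind of Jordan curve, a circle. It is an illustration only and not the benchmark's annotation pipeline: the `(x, y, r)` encoding, the function name `containment_tree`, and the use of `-1` for the unbounded outer region are assumptions.

```python
import math

def containment_tree(circles):
    """Return parent[i] for each circle (x, y, r).

    The parent of a circle is the smallest circle that strictly contains it;
    -1 means the circle sits directly in the unbounded outer region (the root).
    Circles are assumed to be pairwise non-intersecting.
    """
    parents = []
    for i, (xi, yi, ri) in enumerate(circles):
        best, best_radius = -1, math.inf
        for j, (xj, yj, rj) in enumerate(circles):
            if i == j:
                continue
            # Circle i lies strictly inside circle j when
            # centre distance + r_i < r_j.
            inside = math.hypot(xi - xj, yi - yj) + ri < rj
            if inside and rj < best_radius:
                best, best_radius = j, rj
        parents.append(best)
    return parents

# Three concentric circles plus one small circle between the two outer ones.
parents = containment_tree([(0, 0, 10), (0, 0, 5), (0, 0, 1), (7, 0, 1)])
```

Here the outermost circle hangs off the root, the radius-5 circle and the off-centre circle are both children of the outermost one, and the innermost circle is a child of the radius-5 circle.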

2605.13161 2026-05-19 cs.CV cs.LG 版本更新

A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

A₃B₂:一种自适应非对称适配器,用于缓解视觉-语言图像分类中的分支偏差

Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen

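AI summary: This paper proposes the A₃B₂ adapter, which introduces an uncertainty-aware adapter dampening mechanism to alleviate Branch Bias in few-shot learning; experiments show that it outperforms existing baselines across multiple datasets.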

Comments Accepted by IJCAI 2026

详情
AI中文摘要

高效的迁移学习方法为大规模视觉-语言模型(例如CLIP)提供了强大的少样本迁移能力,但现有适配方法遵循固定微调范式,隐含假设图像和文本分支的重要性是均匀的,这一假设在图像分类中未被系统研究。通过深入分析,我们揭示了视觉-语言图像分类中的分支偏差问题:在分布外设置下,适配图像编码器并不总能提高性能。受此启发,我们提出了A₃B₂,一种自适应非对称适配器,用于缓解少样本学习中的分支偏差。A₃B₂引入了不确定性感知适配器阻尼(UAAD),在预测不确定性较高时自动抑制图像分支适配,实现软且数据驱动的控制,无需手动干预。在架构上,A₃B₂采用了一种轻量级非对称设计,受混合专家启发,结合负载平衡正则化。在三个少样本图像分类任务上,对11个数据集的广泛实验表明,A₃B₂在多个数据集上一致优于11个竞争的提示和适配基线方法。

英文摘要

Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.

2605.09730 2026-05-19 cs.LG cs.SE 版本更新

RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

RubricRefine: 通过无训练预执行细化提升工具使用代理的可靠性

Will LeVine, Brendan Evers, Sam Saltwick, Abhay Venkatesh

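AI summary: RubricRefine improves tool-use agent reliability through pre-execution semantic contract verification with zero execution attempts, reaching 0.86 on M3ToolEval averaged across seven models, with up to 2.6× lower latency than prior inference-time baselines.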

详情
AI中文摘要

迭代自我细化是一种流行的推理时间可靠性技术,但其在代码模式工具使用中的有效性严重依赖反馈信号的结构:无结构的批评在不同模型间效果不一致,即使使用真实执行反馈进行修订也只能小幅提升(0.75 vs. 0.65基线)。主导的失败是跨工具合同违规(错误输出形状、错误工具路由、断裂的参数来源),这些失败会运行至结束而不引发错误,使运行时反馈不足。我们引入RubricRefine,一种无训练的预执行语义合同验证方法,生成任务和注册表特定的评分标准,对候选代码进行显式合同检查评分,并在任何执行发生前迭代修复失败。RubricRefine在M3ToolEval上达到0.86(七个模型的平均值),且无需任何执行尝试,优于现有推理时间基线,同时延迟最多降低2.6倍。在以单步任务为主的API-Bank上性能持平,与方法对跨工具合同结构的依赖一致。评分类别消融和校准分析进一步阐明了该方法何时及为何有效。

英文摘要

Iterative self-refinement is a popular inference-time reliability technique, but its effectiveness in code-mode tool use depends heavily on the structure of the feedback signal: unstructured critique helps inconsistently across models, and even revision with real execution feedback improves only modestly ($0.75$ vs. $0.65$ baseline). The dominant failures are inter-tool contract violations (wrong output shape, incorrect tool routing, broken argument provenance) that run to completion without raising errors, making runtime feedback insufficient. We introduce RubricRefine, a training-free method for pre-execution semantic contract verification that generates task- and registry-specific rubrics, scores candidate code against explicit contract checks, and iteratively repairs failures before any execution occurs. RubricRefine reaches $0.86$, averaged across seven models, on M3ToolEval with zero execution attempts, improving over prior inference-time baselines with up to $2.6\times$ lower latency. Performance remains flat on the predominantly single-step API-Bank, consistent with the method's reliance on inter-tool contract structure. A rubric-category ablation and calibration analysis further characterize when and why the method works.

2605.08475 2026-05-19 cs.LG cs.AI cs.NA math.NA math.OC 版本更新

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Transformer 可实现用于上下文高斯核回归的预条件Richardson迭代

Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang

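AI summary: This paper studies in-context kernel ridge regression and proves that a standard softmax-attention transformer can approximate the Gaussian-kernel regression predictor by implementing preconditioned Richardson iteration, revealing a functional decomposition within the transformer architecture.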

详情
AI中文摘要

对上下文学习(ICL)的机制解释已识别出用于线性回归及相关线性预测任务的迭代算法,通常使用线性或ReLU注意力变体。对于非线性ICL,先前工作将softmax和核化注意力与功能梯度型动态相关联,但尚不清楚标准softmax注意力transformer能否实现具有端到端预测误差保证的收敛求解器。本文研究了具有高斯核的上下文核岭回归(KRR),并证明标准softmax-注意力transformer在正向传递过程中可通过在关联的核线性系统上实现预条件Richardson迭代来近似KRR预测器。在数据有界假设下,我们构建了一个具有O(log(1/ε))个块和MLP宽度O(√(N/ε))的单头transformer,实现了对长度为N的提示的ε精度预测。我们的构造揭示了transformer架构中的功能分解:softmax注意力产生用于跨token交互的行归一化高斯核算子,而ReLU MLP层局部近似所需的intra-token标量算术。通过训练GPT-2风格的transformer进行高斯过程回归任务进一步测试预条件Richardson解释。通过线性探测,我们比较transformer的层间预测与经典KRR求解器的逐步输出,发现其误差谱与预条件Richardson迭代最一致。消融研究进一步支持这一解释。共同,我们的理论和实验识别出预条件Richardson迭代作为softmax-注意力transformer实现非线性上下文高斯核回归的明确机制。

英文摘要

Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/ε))$ blocks and MLP width $O(\sqrt{N/ε})$ that achieves $ε$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
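For readers unfamiliar with the solver named in the abstract, here is a minimal sketch of preconditioned Richardson iteration on the kernel ridge regression system (K + ridge·I)·alpha = y. It is not the paper's transformer construction. The diagonal row-sum preconditioner is an assumption, chosen because it mirrors the row normalization that softmax attention produces, and the names `richardson_krr` and `predict` are hypothetical.

```python
import math

def gaussian_kernel(a, b, bandwidth):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def richardson_krr(xs, ys, bandwidth=1.0, ridge=0.5, steps=500):
    """Iterate alpha <- alpha + D^{-1} (y - A alpha), with A = K + ridge * I.

    D is the diagonal matrix of row sums of A (assumed preconditioner).
    Returns the coefficient vector alpha and the system matrix A.
    """
    n = len(xs)
    A = [[gaussian_kernel(xs[i], xs[j], bandwidth) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in A]
    alpha = [0.0] * n
    for _ in range(steps):
        residual = [ys[i] - sum(A[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        alpha = [alpha[i] + residual[i] / row_sums[i] for i in range(n)]
    return alpha, A

def predict(x_new, xs, alpha, bandwidth=1.0):
    """KRR prediction at x_new from the fitted coefficients."""
    return sum(a * gaussian_kernel(x_new, x, bandwidth) for a, x in zip(alpha, xs))

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [math.sin(x) for x in xs]
alpha, A = richardson_krr(xs, ys)
max_residual = max(abs(ys[i] - sum(A[i][j] * alpha[j] for j in range(len(xs))))
                   for i in range(len(xs)))
```

Because A is symmetric positive definite with non-negative entries, the row-normalized matrix D⁻¹A has eigenvalues in (0, 1], so the error contracts at every step; the geometric contraction is what makes a block count logarithmic in 1/ε plausible.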

2605.03183 2026-05-19 cs.LG eess.SP 版本更新

Enhancing AI-Based ECG Delineation with Deep Learning Denoising Techniques

利用深度学习去噪技术增强基于AI的ECG分割

Jeff Breeding-Allison, Emil Walleser

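AI summary: This paper proposes an autoencoder-based neural network model for denoising ECG signals, improving the accuracy and robustness of canine ECG analysis.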

Comments 26 pages, 8 figures

详情
AI中文摘要

评估犬类心电图(ECG)具有挑战性,因为噪声可能掩盖临床相关的心脏电活动。常见干扰源包括呼吸、肌肉活动、导联接触不良和外部电噪声。传统信号去噪技术如滤波和小波方法难以抑制多样的噪声模式同时保留对准确ECG分割至关重要的形态学特征。我们提出了一种基于自编码器的神经网络模型和训练策略,作为犬类ECG分析的预处理步骤。该模型被训练从噪声输入中重建干净的心脏信号,从而实现有效的去噪而不影响诊断重要的波形。我们的方法在噪声和清洁的ECG记录中均表现出色,表明其对各种信号条件的鲁棒性和对后续分割任务的适用性。

英文摘要

Evaluating canine electrocardiograms (ECGs) is challenging due to noise that can obscure clinically relevant cardiac electrical activity. Common sources of interference include respiration, muscle activity, poor lead contact, and external electrical artifacts. Classical signal denoising techniques, such as filtering and wavelet-based methods, struggle to suppress diverse noise patterns while preserving morphological features critical for accurate ECG delineation. We propose an autoencoder-based neural network model and training strategy for ECG denoising as a preprocessing step for canine ECG analysis. The model is trained to reconstruct clean cardiac signals from noisy inputs, enabling effective noise reduction without degrading diagnostically important waveforms. Our approach demonstrates strong performance across both noisy and clean ECG recordings, indicating robustness to varying signal conditions and suitability for downstream delineation tasks.

2604.28010 2026-05-19 cs.LG cs.AI 版本更新

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

从分歧中学习:临床医生的覆盖作为价值医疗中临床AI的隐含偏好信号

Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son

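AI summary: This paper proposes a framework that treats clinician overrides of AI recommendations as implicit preference data, introducing a five-category override taxonomy and a dual learning architecture that addresses suppression bias, with the aim of improving AI decision-making in value-based care.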

Comments 22 pages, 2 tables, 1 figure

详情
AI中文摘要

我们重新将临床医生对临床AI建议的覆盖视为隐含偏好数据——与强化学习从人类反馈(RLHF)中利用的相同信号结构,但更丰富:标注者是领域专家,替代方案具有实际后果,下游结果是可观察的。我们提出了一个扩展标准偏好学习的正式框架,包含三个贡献:五类覆盖分类法,将覆盖类型映射到不同的模型更新目标;一个基于患者状态s、组织背景c和临床医生能力κ的偏好公式,其中κ分解为执行能力κ-exec和对齐能力κ-align;以及一个双学习架构,通过交替优化联合训练奖励模型和能力模型,防止我们称为抑制偏差的系统性问题——当临床医生能力低于执行阈值时,系统性地压制正确但困难的建议。我们论证,在基于结果的支付合同下慢性病管理产生具有独特有利属性的覆盖数据——纵向密度、集中决策空间、结果标签和自然能力变化,并认为结合纵向结果测量与对齐的财务激励的训练环境是学习与患者轨迹而非就诊经济相一致的奖励模型的必要条件。此框架源于改进价值医疗部署中临床医生能力的运营工作。

英文摘要

We reframe clinician overrides of clinical AI recommendations as implicit preference data: the same signal structure exploited by reinforcement learning from human feedback (RLHF), but richer, because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable. We present a formal framework extending standard preference learning with three contributions: a five-category override taxonomy mapping override types to distinct model update targets; a preference formulation conditioned on patient state s, organizational context c, and clinician capability kappa, where kappa decomposes into execution capability kappa-exec and alignment capability kappa-align; and a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, preventing a failure mode we term suppression bias, the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold. We argue that chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties (longitudinal density, concentrated decision space, outcome labels, and natural capability variation) and that training environments combining longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics. This framework emerged from operational work to improve clinician capability in a live value-based care deployment.

2604.28005 2026-05-19 cs.LG stat.ML 版本更新

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

核化优势估计:从非参数统计到大语言模型推理

Shijin Gong, Kai Ye, Jin Zhu, Xinyu Zhang, Hongyi Zhou, Chengchun Shi

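AI summary: This paper applies nonparametric statistical methods to advantage estimation for LLM reasoning, using kernel smoothing for efficient value-function estimation and policy optimization to improve the quality of policy learning.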

Comments 45 pages, 5 figures

详情
AI中文摘要

近年来,大语言模型(LLM)在增强推理能力方面越来越多地依赖强化学习(RL)。三种方法被广泛采用:第一种依赖深度神经网络估计学习策略的价值函数以减少策略梯度的方差,但估计和维护此类价值网络会带来显著的计算和内存开销。第二种方法通过样本平均近似价值函数,但每个提示需要大量推理轨迹样本以实现准确的价值函数近似,这使计算成本很高。第三种方法每个提示仅采样一个推理轨迹,这降低了计算成本,但样本效率低下。本文聚焦于一个实际且资源受限的场景,其中每个提示只能采样少量推理轨迹,同时低方差梯度估计对于高质量策略学习至关重要。为解决这一挑战,我们引入经典的非参数统计方法,这些方法在计算和统计上都具有效率,应用于LLM推理。我们采用核平滑作为价值函数估计的具体例子,并进行后续的策略优化。数值和理论结果表明,我们的方法实现了准确的价值和梯度估计,从而提升了策略优化。

英文摘要

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
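Kernel smoothing for value estimation can be sketched with a Nadaraya-Watson estimator. The abstract does not specify the paper's estimator, so everything below is an assumption for illustration: prompts are represented by hypothetical feature vectors, the kernel is Gaussian, and the advantage is the observed reward minus the smoothed value.

```python
import math

def kernel_value(query, features, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate of the expected reward at `query`.

    Rewards of nearby prompts (in feature space) get larger weights, so a
    few traces per prompt can borrow strength from similar prompts.
    Note: no guard against all weights underflowing to zero.
    """
    weights = []
    for feat in features:
        sq_dist = sum((q - f) ** 2 for q, f in zip(query, feat))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / total

def kernel_advantages(features, rewards, bandwidth=1.0):
    """Advantage of each trace: its reward minus the smoothed value baseline."""
    return [r - kernel_value(f, features, rewards, bandwidth)
            for f, r in zip(features, rewards)]

# Two traces from the same (hypothetical) prompt embedding, one success, one failure.
advantages = kernel_advantages([[0.0], [0.0]], [1.0, 0.0])
midpoint_value = kernel_value([0.5], [[0.0], [1.0]], [0.0, 1.0])
```

The bandwidth controls the bias-variance trade-off: a small bandwidth approaches per-prompt sample averaging, while a large one approaches a single global baseline.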

2604.16400 2026-05-19 cs.DC cs.AI cs.LG 版本更新

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

CoLLM:面向共享GPU集群的SLO感知LLM服务连续适应

Shaoyuan Huang, Yunfeng Zhao, Na Yan, Tiancheng Zhang, Xiaokai Wang, Xiaofei Wang, Wenyu Wang, Yansha Deng

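AI summary: CoLLM unifies federated parameter-efficient fine-tuning and inference to enable continuous adaptation of LLM serving on shared GPU clusters, improving model quality and efficiency; experiments show up to 3x higher goodput than state-of-the-art LLM systems.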

详情
AI中文摘要

随着大型语言模型(LLM)在边缘智能中被越来越多地用于驱动领域特定应用和个性化服务,LLM训练后的质量与效率,包括微调和推理,因资源受限而变得至关重要。尽管最近在联邦参数高效微调(FL PEFT)和低延迟推理方面的进展提高了单个任务性能,但微调和推理仍被视为孤立的工作负载,忽略了它们的相互依赖性,导致冗余部署和推理质量提升延迟。为了解决这些限制,我们引入了一个新的共执行框架,并将其实例化为CoLLM,一个将FL PEFT和推理统一在共享边缘副本和模型参数上的系统。CoLLM通过在副本和集群层面解决关键挑战,实现了高效模型参数重用和工作负载平衡,从而联合优化长期模型质量增益和短期推理效率。在多样化的LLM和真实世界跟踪上进行的广泛评估显示,CoLLM在吞吐量上比最先进的LLM系统高出多达3倍,证明了其在边缘智能中无缝LLM训练后处理的有效性。

英文摘要

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluations across diverse LLMs and real-world traces show that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.

2604.02178 2026-05-19 cs.CL cs.AI cs.LG 版本更新

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

专家反击:在专家层面解读混合专家语言模型

Jeremy Herbst, Stefan Wermter, Jae Hee Lee

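AI summary: Using k-sparse probing to compare MoE experts with dense FFNs, the study finds that expert neurons are less polysemantic, proposes the expert as the unit of analysis, and shows that experts are fine-grained task specialists rather than broad domain specialists or simple token-level processors.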

Comments 8 pages, 7 Figures. Accepted at ICML 2026. Improved writing, changed author order, updated citations

详情
AI中文摘要

混合专家(MoE)架构已成为扩展大语言模型(LLMs)的主导选择,每个token仅激活部分参数。尽管MoE主要用于计算效率,但其稀疏性是否使其比密集前馈网络(FFN)更容易解释仍存疑问。通过k稀疏探测比较MoE专家与密集FFN,发现专家神经元始终更单语义,随着路由稀疏性增加,差距扩大。这表明稀疏性迫使神经元和整个专家朝单语义方向发展。基于此发现,我们从神经元层面转向专家层面作为更有效的分析单位。通过自动解读数百个专家,验证了这一方法。此分析解决了关于专业化争论:专家既非广领域专家(如生物学)也非简单token处理者。相反,它们作为细粒度任务专家,专门处理语言操作或语义任务(如闭合LaTeX括号)。我们的发现表明,MoE在专家层面具有内在可解释性,为大规模模型可解释性提供了更清晰路径。代码见:https://github.com/jerryy33/MoE_analysis。

英文摘要

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.

2603.20421 2026-05-19 cs.CR cs.AR cs.LG cs.NA math.NA 版本更新

Hawkeye: Reproducing GPU-Level Non-Determinism

Hawkeye:重现GPU级非确定性

Erez Badash, Dan Boneh, Ilan Komargodski, Megha Srivastava

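AI summary: Hawkeye re-executes GPU matrix multiplication on a CPU, enabling exact reproduction of ML model training and inference workflows while avoiding the computational overhead and non-robustness of earlier verification approaches.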

Comments Accepted to MLSys 2026

详情
AI中文摘要

我们提出了Hawkeye系统,用于分析和重现GPU级别的算术运算。通过我们的框架,任何人都可以在CPU上重新执行机器学习模型训练或推理流程中底层的矩阵乘法运算,而不会有任何精度损失。这与以往的可验证机器学习方法形成鲜明对比,后者要么对原始模型所有者引入显著的计算开销,要么导致非鲁棒性和质量退化。Hawkeye的主要技术贡献是系统性的精心设计测试序列,研究矩阵乘法中舍入方向、亚正常数处理以及非结合性积累的顺序,针对NVIDIA的Tensor Cores。我们在多种NVIDIA GPU架构(Ampere、Hopper和Lovelace)和精度类型(FP16、BFP16、FP8)上测试和评估了我们的框架。在所有测试用例中,Hawkeye都能在CPU上完美重现矩阵乘法,为高效且可信的第三方审计ML模型训练和推理铺平了道路。我们提供了Hawkeye的源代码,网址为https://github.com/badasherez/gpu-simulator。

英文摘要

We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU the exact matrix multiplication operations underlying a machine learning model training or inference workflow that was executed on an NVIDIA GPU, without any precision loss. This is in stark contrast to prior approaches to verifiable machine learning, which either introduce significant computation overhead to the original model owner, or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that study rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA's Tensor Cores. We test and evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BFP16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference. We provide source code for Hawkeye at https://github.com/badasherez/gpu-simulator.
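The abstract's point about non-associative accumulation can be seen even in ordinary CPU double precision. The sketch below is only an analogy and does not model Tensor Core behaviour: it sums the same four numbers in two different orders and gets two different answers, neither of which is the exact sum. The helper names are hypothetical, and an explicit loop is used instead of the built-in `sum`, whose float handling can differ between Python versions.

```python
def sequential_accumulate(values):
    """Left-to-right accumulation: ((v0 + v1) + v2) + v3 ..."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_accumulate(values):
    """Tree-shaped accumulation: (v0 + v1) + (v2 + v3) ..."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_accumulate(values[:mid]) + pairwise_accumulate(values[mid:])

values = [1e16, 1.0, -1e16, 1.0]   # exact sum is 2.0
left_to_right = sequential_accumulate(values)
tree_order = pairwise_accumulate(values)
```

Near 1e16 the spacing between adjacent doubles is 2, so adding 1.0 is rounded away; which additions get absorbed depends on the order. Reproducing a hardware result bit for bit therefore requires knowing the exact accumulation order and rounding behaviour, which is what the tests described in the abstract are designed to uncover.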

2603.19470 2026-05-19 cs.LG cs.AI 版本更新

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

自适应分层扰动:统一LLM RL中的非策略修正

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

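AI summary: This paper proposes Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates, mitigating policy staleness and training-inference mismatch and improving training stability and exploration.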

详情
AI中文摘要

非策略问题如策略老化和训练-推理不匹配已成为LLM RL训练稳定性及进一步探索的主要瓶颈。由于增强推理效率的技术,推理策略与更新策略的分布差距扩大,导致重要性比率呈重尾分布。当策略局部尖锐时,重尾比率出现,进一步放大梯度并可能使更新超出信任区域。为解决此问题,我们提出自适应分层扰动(ALP),在更新过程中向每一层的输入隐藏状态注入小的可学习扰动,并将由此产生的扰动策略作为重要性比率的分子,与未改变的推理策略进行比较。直观上,通过向中间表示添加受控噪声,ALP防止更新策略过于偏离推理策略,并扩大策略家族以覆盖推理时的不匹配噪声。因此,扁平化的分布可自然缩小更新策略与推理策略之间的差距,并减少重要性比率的尾部,从而维持训练稳定性。这通过实验证实。在单轮数学和多轮工具集成推理任务中的实验表明,ALP不仅提高了最终性能,还避免了重要性比率尾部的爆炸和KL尖峰,同时提升了探索能力。消融实验显示,跨所有层的表示级扰动效果最佳,显著优于部分层和logits-only变体。

英文摘要

Off-policy problems such as policy staleness and training--inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows because of the techniques to enhance inference efficiency, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates and uses the resulting perturbed policy as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover inference-time mismatch noise. Hence, the flattened distribution can naturally tighten the gap between the updated and inference policies and reduce the tail of importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-up in the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.

2603.03538 2026-05-19 cs.LG 版本更新

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

链式思维验证器的在线可学习性:正确性与完备性的权衡

Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma

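AI summary: This paper proposes an online learning framework for chain-of-thought verifiers that check the correctness of solutions, addressing the distribution shift caused by the feedback loop between generator and verifier, and introduces new extensions of the Littlestone dimension that characterize the mistake bounds for learning a verifier.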

详情
AI中文摘要

大型语言模型(LLMs)通过链式思维生成在解决复杂推理和规划任务中展现出巨大潜力。然而,当前LLMs的输出不完全可靠,需要仔细验证。即使LLMs随时间变得更准确,学习的验证器可以帮助提高信任度,执行安全约束,并确保与个人偏好一致。然而,学习验证器的主要挑战在于,当其输出被生成器用来改进推理时,生成器与验证器之间的反馈循环可能产生显著的分布偏移。受此挑战启发,我们提出了一种在线学习框架,用于学习链式思维验证器,给定一个问题和一系列推理步骤,检查解决方案的正确性。我们强调了正确性错误(未能捕捉推理轨迹中的错误)和完备性错误(将正确的推理步骤标记为错误)的不对称作用,并引入了新的Littlestone维度扩展,紧密刻画了在可实现设置中学习验证器的错误界。我们提供了最优算法,用于找到帕累托前沿(给定声音错误预算下的最小总错误数)以及最小化不对称成本的线性组合。此外,我们进一步展示了如何利用学习的验证器来提高弱生成器的准确性,并使生成的证明超越其初始训练内容。在假设其中一个生成器能够以某些最小概率生成下一个推理步骤的前提下,我们展示了如何学习一个具有小误差和回避率的强生成器。

英文摘要

Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks. However, the output of current LLMs is not fully reliable and needs careful verification. Even if LLMs get more accurate over time, learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences. A major challenge in learning verifiers, however, especially when their output will be used by the generator to improve its reasoning, is that the feedback loop between generator and verifier may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness errors (failure in catching errors in a reasoning trace) and completeness errors (flagging correct reasoning steps as wrong), we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak generators, and enable generation of proofs beyond what they were initially trained on. With the mild assumption that one of the generators can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.

2603.02218 2026-05-19 cs.LG cs.AI cs.CL cs.IT math.IT 版本更新

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

仅在自我合成管道确保可学习信息增益时,自我博弈才会进化

Wei Liu, Siya Qi, Yali Du, Yulan He

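AI summary: Through experiments, this paper shows that sustainable self-evolution requires a self-synthesised data pipeline whose learnable information increases across iterations; it identifies the triadic roles of self-evolving LLMs and three system designs that address the plateau of self-play.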

Comments 10 pages, 6 figures, 7 formulas, accepted by ICML 2026 position paper track

详情
AI中文摘要

大型语言模型(LLMs)使构建通过自我进化循环改进的系统成为可能,但许多现有方案更倾向于自我博弈且易陷入停滞。核心失败模式是循环生成更多数据但未增加下一轮的可学习信息。通过自我博弈编程任务实验,我们发现可持续自我进化需要具有可学习信息递增的自我合成数据管道。我们识别出自我进化LLM的三重角色:生成任务的Proposer、尝试解决方案的Solver以及提供训练信号的Verifier,并提出三种系统设计共同针对三重角色视角下的可学习信息增益。不对称共进化在角色间形成弱到强到弱的循环。容量增长扩展参数和推理时间预算以匹配上升的可学习信息。主动信息寻求引入外部上下文和新任务来源以防止饱和。这些模块共同提供从脆弱自我博弈动态到持续自我进化的可测量系统级路径。

英文摘要

Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.

2602.16473 2026-05-19 cs.LG cs.FL cs.LO 版本更新

Synthesis and Verification of Transformer Programs (Technical Report)

变换器程序的合成与验证(技术报告)

Hongjian Jiang, Matthew Hague, Philipp Rümmer, Anthony Widjaja Lin

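AI summary This paper develops new algorithms for automatically verifying C-RASP programs and a new local-search method for learning C-RASP programs from examples, with applications to transformer program optimization and constrained learning of transformer programs.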

详情
AI中文摘要

本文提出新算法自动验证C-RASP程序,并提供学习C-RASP程序的新方法,应用于变换器程序优化与约束学习。

英文摘要

C-RASP is a simple programming language that was recently shown to capture concepts expressible by transformers. In this paper, we develop new algorithmic techniques for automatically verifying C-RASPs. To this end, we establish a connection to the verification of synchronous dataflow programs in Lustre, which enables us to exploit state-of-the-art model checkers utilizing highly optimized SMT-solvers. Our second contribution addresses learning a C-RASP program in the first place. To this end, we provide a new algorithm for learning a C-RASP from examples using local search. We demonstrate the efficacy of our implementation on benchmarks of C-RASPs from the literature, in particular in connection with the following applications: (1) transformer program optimization, and (2) constrained learning of transformer programs (based on a partial specification).

2601.21941 2026-05-19 cs.LG cs.AI 版本更新

Robust Multimodal Representation Learning in Healthcare

医疗领域鲁棒多模态表征学习

Xiaoguang Zhu, Linxiao Gong, Lianlong Sun, Yang Liu, Haoyu Wang, Jing Liu

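AI summary This paper proposes a Dual-Stream Feature Decorrelation Framework that uses structural causal analysis to handle systematic biases in medical multimodal data and improve generalization; experiments on MIMIC-IV, eICU, and ADNI show consistent performance improvements.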

详情
AI中文摘要

医疗多模态表征学习旨在将异构数据整合为统一的患者表示以支持临床结果预测。然而,真实世界医疗数据集通常包含来自多个来源的系统性偏差,这对医疗多模态表征学习提出了重大挑战。现有方法通常专注于有效的多模态融合,忽视了影响泛化能力的固有偏见特征。为解决这些挑战,我们提出了一种双流特征去相关框架,通过引入由潜在混杂因素引入的结构因果分析来识别和处理偏见。我们的方法采用因果偏见去相关框架,结合双流神经网络,将因果特征与虚假相关性分离,利用广义交叉熵损失和互信息最小化实现有效去相关。该框架模型无关,可集成到现有医疗多模态学习方法中。在MIMIC-IV、eICU和ADNI数据集上的全面实验显示了一致的性能提升。

英文摘要

Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion, neglecting inherent biased features that affect the generalization ability. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that uses structural causal analysis to identify and handle the biases introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.

2512.12572 2026-05-19 cs.LG stat.ML 版本更新

On the Accuracy of Newton Step and Influence Function Data Attributions

关于牛顿步和影响函数数据归因的准确性

Ittai Rubinstein, Samuel B. Hopkins

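AI summary This paper analyzes the accuracy of Newton Step (NS) and Influence Function (IF) data attribution, derives scaling laws for their errors, and explains why NS is often more accurate than IF.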

详情
AI中文摘要

数据归因旨在通过估计移除某些训练点时预测的变化来解释模型预测,广泛应用于可解释性、信用分配、遗忘和隐私等领域。即使在逻辑回归这种相对简单的案例中,现有对影响函数(IF)和单步牛顿步(NS)等主流数据归因方法的数学分析仍存在两个关键局限:首先,它们依赖于全局强凸性假设,这在实践中往往不成立;其次,所得的界限在参数数量(d)和移除样本数量(k)方面表现极差。因此,这些分析不够精确,无法回答诸如“每种方法的渐进行为误差如何”或“给定数据集哪种方法更准确”等基本问题。本文引入了针对凸学习问题的NS和IF数据归因方法的新分析。据我们所知,这是首个不假设全局强凸性且解释了[KATL19]和[RH25a]观察到NS数据归因常比IF更准确的分析。我们证明,对于足够良好的逻辑回归,我们的界限在多项对数因子范围内渐近紧致,从而得到平均样本移除情况下的误差缩放定律。[公式]

英文摘要

Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy. Even in the relatively simple case of logistic regressions, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ($d$) and the number of samples removed ($k$). As a result, these analyses are not tight enough to answer fundamental questions such as "what is the asymptotic scaling of the errors of each method?" or "which of these methods is more accurate for a given dataset?" In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regressions, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors in the average-case sample removals: \[ \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T - \hat{\theta}_T^{\mathrm{NS}}\|_2 \bigr] = \widetilde{\Theta}\!\left(\frac{k d}{n^2}\right), \qquad \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T^{\mathrm{NS}} - \hat{\theta}_T^{\mathrm{IF}}\|_2 \bigr] = \widetilde{\Theta}\!\left( \frac{(k + d)\sqrt{k d}}{n^2} \right). \]
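
As a quick reading of the two rates in this abstract (a worked comparison that is not part of the abstract itself, and that ignores the poly-logarithmic factors hidden in the tilde notation), dividing the NS-versus-IF gap by the NS error gives

\[ \frac{(k + d)\sqrt{k d}\,/\,n^2}{k d\,/\,n^2} = \frac{k + d}{\sqrt{k d}} = \sqrt{\frac{k}{d}} + \sqrt{\frac{d}{k}} \;\ge\; 2, \]

with equality only at $k = d$, by the AM-GM inequality. Under these rates the distance between the IF and NS estimates is therefore never of smaller order than the NS error, and it grows like $\sqrt{d/k}$ when few samples are removed relative to the number of parameters. This is consistent with the stated observation that NS is often more accurate than IF, though the exact constants are not given in the abstract.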

2512.03280 2026-05-19 cs.LG cs.AI 版本更新

BlendedNet++: A dataset and benchmark for field-resolved aerodynamics and inverse design of blended wing body aircraft

BlendedNet++:一种用于混合翼身融合飞机场解气动学和逆设计的数据集和基准

Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Matthew C. Jones, Faez Ahmed

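AI summary This paper introduces BlendedNet++, a dataset of 12,492 unique BWB geometries with integrated forces and dense surface fields from steady RANS simulations. It benchmarks geometric deep learning surrogates for real-time surface-field prediction, finding Transolver the most accurate, and demonstrates a generative inverse design pipeline.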

详情
AI中文摘要

Blended Wing Body (BWB)飞机的概念设计常受限于高维设计空间中复杂气动学的高计算成本。尽管深度学习为快速气动预测和逆设计提供了途径,但其在航空航天工程中的应用受限于缺乏大规模、场解训练数据。本文通过引入BlendedNet++,一个包含12,492种独特BWB几何体的综合气动学数据集,每种几何体均通过稳态雷诺平均纳维-斯托克斯(RANS)模拟进行评估,以提供集成力和密集表面场(Cp,Cf)。利用此数据,我们建立了两个关键工程任务的稳健框架:(1)利用几何深度学习模型实时预测表面气动场;(2)生成性逆设计。我们基准测试了五种替代架构,发现Transolver在场预测中最为准确。此外,我们通过条件扩散模型结合梯度基优化方法演示了生成性逆设计流程。这种混合方法被证明能够生成多个可行设计,满足特定升阻比目标,精度高(R^2 > 0.99),经计算流体动力学(CFD)模拟验证。这些资源使早期阶段BWB设计从迭代分析转向直接生成。

英文摘要

The conceptual design of Blended Wing Body (BWB) aircraft is often constrained by the high computational cost of resolving complex aerodynamics over a high-dimensional design space. While deep learning offers a pathway to rapid aerodynamic prediction and inverse design, its adoption in aerospace engineering is limited by a lack of large-scale, field-resolved training data. This work addresses this gap by introducing BlendedNet++, a comprehensive aerodynamic dataset comprising 12,492 unique BWB geometries, each evaluated using steady Reynolds-Averaged Navier--Stokes (RANS) simulations to provide integrated forces and dense surface fields (Cp, Cf). Leveraging this data, we establish a robust framework for two critical engineering tasks: (1) real-time prediction of surface aerodynamic fields using geometric deep learning models, and (2) generative inverse design. We benchmark five surrogate architectures, identifying Transolver as the most accurate for field predictions. Furthermore, we demonstrate a generative inverse design pipeline using conditional diffusion models combined with gradient-based refinement. This hybrid approach is shown to generate multiple feasible designs that satisfy specific lift-to-drag targets with high accuracy (R^2 > 0.99), as confirmed by computational fluid dynamics (CFD) simulation. These resources enable a shift from iterative analysis to direct generation in early-stage BWB design.

2510.21523 2026-05-19 cs.LG stat.ML 版本更新

Interpretable epistemic uncertainty decomposition in sequential generative models via polynomial chaos surrogates

通过多项式混沌代理实现序列生成模型中可解释的epistemic不确定性分解

Ramón Nartallo-Kaluarachchi, Shashanka Ubaru, Małgorzata J Zimoń, Dongsung Huh, Robert Manson-Sawko, Lior Horesh, Yoshua Bengio

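AI summary This paper fits polynomial chaos expansions to small ensembles of trained GFlowNets to decompose epistemic uncertainty in sequential generative models, showing which reward components drive which generative decisions. The abstract states that this decomposition is not available from deep ensembles, Bayesian neural networks, or Monte Carlo dropout, and the framework is evaluated on three real-world tasks.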

Comments 37 pages, 15 figures

详情
AI中文摘要

条件于不确定奖励的序列生成模型在AI驱动的科学发现中至关重要,但其继承的epistemic不确定性仍无法量化。我们通过拟合多项式混沌展开(PCE)到小规模训练模型集合,将不确定性传播通过生成流网络(GFlowNets)。PCE系数产生分析Sobol敏感性指数,提供首次可解释的分解,揭示哪些奖励组件驱动哪些生成决策,这一能力无法由深度集合、贝叶斯神经网络或蒙特卡洛dropout提供。理论上建立了收敛保证,并在Lean 4证明助手中正式验证了四分之五。在三个真实任务中,该框架揭示了无法被集合单独发现的可操作结构。在Doyle-Dreher Buchwald-Hartwig催化剂选择任务中,催化剂选择稳健(D_catalyst≈71),而添加剂选择脆弱(D_additive≈179,2.5倍更高)。在基于片段的分子设计中,连接位置是最敏感的(D_linker≈28),而装饰位置是最稳健的(D≈14-18),逆转了传统支架稳健/装饰脆弱的假设。在Sachs蛋白质信号网络中,MAPK级联边和PKA/PKC枢纽边分离到不同的敏感性区域,为扰动实验提供靶向地图。95%置信度下的校准覆盖率达到0.97-1.00,且代理在毫秒内评估10,000个策略样本,比穷举重新训练快10^3-10^4倍。

英文摘要

Sequential generative models conditioned on uncertain rewards are central to AI-driven scientific discovery, yet the epistemic uncertainty they inherit from imperfect reward estimates remains unquantified. We propagate this uncertainty through generative flow networks (GFlowNets) by fitting polynomial chaos expansions (PCEs) to small ensembles of trained models. The PCE coefficients yield analytical Sobol sensitivity indices, providing the first interpretable decomposition of which reward components drive which generative decisions, a capability unavailable from deep ensembles, Bayesian neural networks, or Monte Carlo dropout. Convergence guarantees are established theoretically and four of five are formally verified in the Lean 4 proof assistant. Across three real-world tasks, the framework reveals actionable structure invisible to ensembles alone. On the Doyle-Dreher Buchwald-Hartwig dataset, catalyst selection is robust ($D_{\mathrm{catalyst}}\approx 71$) while additive selection is fragile ($D_{\mathrm{additive}}\approx 179$, $2.5\times$ higher). In fragment-based molecular design, the linker position is the most sensitive ($D_{\mathrm{linker}}\approx 28$) while decoration positions are the most robust ($D\approx 14$-$18$), reversing the conventional scaffold-robust / decoration-fragile assumption. On the Sachs protein signalling network, MAPK-cascade edges and PKA/PKC hub edges separate into distinct sensitivity regimes, providing a targeted map for perturbation experiments. Calibration coverage at the 95% level reaches 0.97-1.00 across the dominant steps, and the surrogate evaluates 10,000 policy samples in milliseconds, $10^{3}$ to $10^{4}\times$ faster than exhaustive retraining.
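
The step from PCE coefficients to Sobol indices can be illustrated with a minimal sketch. It assumes an orthonormal polynomial basis, so that the surrogate's variance is the sum of squared non-constant coefficients. The function name, the coefficient layout (a dict keyed by multi-index), and the two-input example are hypothetical, and this is not the authors' implementation or their $D$ statistic.

```python
from typing import Dict, List, Tuple

MultiIndex = Tuple[int, ...]


def first_order_sobol(coeffs: Dict[MultiIndex, float]) -> List[float]:
    """First-order Sobol indices from coefficients of an orthonormal PCE.

    coeffs maps a multi-index (polynomial degree per input) to its coefficient.
    """
    dim = len(next(iter(coeffs)))
    # Total variance: squared coefficients of every non-constant basis term.
    variance = sum(c * c for alpha, c in coeffs.items() if any(alpha))
    if variance == 0.0:
        raise ValueError("surrogate is constant; Sobol indices are undefined")
    indices = []
    for i in range(dim):
        # Keep only terms that depend on input i and on no other input.
        partial = sum(
            c * c
            for alpha, c in coeffs.items()
            if alpha[i] > 0 and all(a == 0 for j, a in enumerate(alpha) if j != i)
        )
        indices.append(partial / variance)
    return indices


# Hypothetical two-input surrogate: a constant, two main effects, one interaction.
example = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 1.0, (1, 1): 0.5}
sobol = first_order_sobol(example)
```

In this toy example the first-order indices sum to less than one; the remainder is the share of variance carried by the interaction term.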

2509.15357 2026-05-19 cs.CV cs.LG 版本更新

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

MaskAttn-SDXL:可控区域级文本到图像生成

Yu Chang, Jiahao Chen, Anzhe Cheng, Paul Bogdan

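AI summary This paper proposes MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax, aiming to improve compositional reliability in multi-object text-to-image generation without changing the SDXL pipeline.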

Comments Accepted to the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

详情
AI中文摘要

扩散模型在文本到图像生成中取得了强劲成果,但随着提示变得更为结构化和多对象,仍存在重要限制。在架构方面,U-Net主干高效稳定,但其局部性使全局协调更困难,而基于Transformer的扩散模型虽改善了全局交互,但计算和内存成本显著增加。同时,组合可靠性仍较弱:模型常混合对象属性、违反空间关系或遗漏请求实体,且这些错误无法可靠地通过FID或基于CLIP的指标反映。为解决这些问题而无需更改SDXL流程,我们提出MaskAttn-SDXL,一种插件模块,通过在softmax前注入token条件空间门控到交叉注意力logits中。该门控将token到位置的交互稀疏化,以抑制无关绑定,同时保留预训练主干和标准采样过程,无需外部监督或推理时编辑。

英文摘要

Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions but at substantially higher compute and memory cost. In parallel, compositional reliability remains weak: models often mix attributes across objects, violate spatial relations, or omit requested entities, and these errors are not reliably reflected by global metrics such as FID or CLIP-based scores. To address these issues without changing the SDXL pipeline, we propose MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.

2508.06524 2026-05-19 cs.CL cs.AI cs.CY cs.DC cs.LG 版本更新

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling:扩展神经缩放定律以用于大型语言模型中的碳足迹

Lei Jiang, Fan Chen

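AI summary This paper presents CarbonScaling, a hardware-aware analytical framework that combines neural scaling laws with distributed training strategies to model the carbon footprint of frontier LLM training and to estimate feasible hardware configurations and their emissions.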

Comments 8 pages

详情
AI中文摘要

大型语言模型(LLMs)日益遵循将性能提升与快速扩展计算预算联系起来的神经缩放定律,这引发了对前沿规模训练可持续性的担忧。现有碳估计方法主要依赖于历史运行的回归分析,无法捕捉关键系统级因素,包括硬件异质性、分布式并行性、通信开销和架构稀疏性。我们提出了CarbonScaling,一种硬件意识的分析框架,用于建模前沿LLM训练的碳缩放行为。该框架整合了神经缩放定律、分布式训练策略、加速器和互连建模,以及操作和嵌入碳会计,以估计可行的硬件配置和相关排放。CarbonScaling同时建模张量、流水线、数据和专家并行性,同时纳入内存、带宽、利用率和运行时间约束。实验验证显示其比基于回归的基线具有显著更高的保真度,并突显了在万亿参数规模下嵌入碳的重要性。源代码:https://github.com/UnchartedRLab/CarbonScaling。

英文摘要

Large language models (LLMs) increasingly follow neural scaling laws that tie performance gains to rapidly expanding computational budgets, raising concerns about the sustainability of frontier-scale training. Existing carbon-estimation methods largely depend on regression over historical runs and fail to capture critical system-level factors, including hardware heterogeneity, distributed parallelism, communication overhead, and architectural sparsity. We present CarbonScaling, a hardware-aware analytical framework for modeling the carbon scaling behavior of frontier LLM training. The framework integrates neural scaling laws, distributed training strategies, accelerator and interconnect modeling, and operational and embodied carbon accounting to estimate feasible hardware configurations and associated emissions. CarbonScaling jointly models tensor, pipeline, data, and expert parallelism while incorporating memory, bandwidth, utilization, and runtime constraints. Experimental validation demonstrates substantially higher fidelity than regression-based baselines and highlights the growing importance of embodied carbon at trillion-parameter scales. Source code: https://github.com/UnchartedRLab/CarbonScaling.

2506.12648 2026-05-19 math.OC cs.LG 版本更新

Glocal Smoothness: Line search and adaptive step sizes can help in theory too!

全球局部平滑性:线搜索和自适应步长在理论上也能有所帮助!

Curtis Fox, Aaron Mishkin, Sharan Vaswani, Mark Schmidt

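AI summary This paper introduces glocal (global and local) smoothness, a characterization that depends only on properties of the function. It yields iteration-complexity bounds in terms of iterate-independent constants, shows the advantage of line searches over fixed step sizes, and shows that in some settings gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes.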

详情
AI中文摘要

本文提出全球局部平滑性概念,通过函数属性定义,允许用迭代无关常数界迭代复杂度,展示线搜索优于固定步长,且在某些情况下梯度下降比加速方法更优。

英文摘要

Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well as other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.

2502.18663 2026-05-19 cs.LG cs.DM cs.SI math.CO math.GR 版本更新

CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs

CayleyPy RL:Cayley图上的路径寻找与强化学习

A. Chervov, M. Obozov, A. Soibelman, S. Lytkin, I. Kiselev, S. Fironov, A. Lukyanenko, A. Dolgorukova, A. Ogurtsov, F. Petrov, S. Krymskii, M. Evseev, L. Grunvald, D. Gorodkov, G. Antiufeev, G. Verbii, V. Zamkovoy, L. Cheldieva, I. Koltsov, A. Sychev, A. Eliseev, S. Nikolenko, N. Narynbaev, R. Turtayev, N. Rokotyan, S. Kovalev, A. Rozanov, V. Nelin, S. Ermilov, L. Shishina, D. Mamayeva, A. Korolkova, K. Khoruzhii, A. Romanov

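AI summary This paper combines reinforcement learning with a diffusion distance approach for pathfinding on Cayley graphs, benchmarks the key building blocks, provides support for the OEIS-A186783 diameter conjecture for a Cayley graph of the symmetric group, and launches Kaggle challenges to encourage crowdsourced contributions.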

Comments 32+16 pages

详情
AI中文摘要

本文是关于开发高效人工智能方法用于在极大规模图(如10^70个节点)上路径寻找的一系列研究中的第二篇,重点研究Cayley图和数学应用。CayleyPy项目是研究的核心部分。本文提出了一种新的强化学习方法与更直接的扩散距离方法的结合。我们的分析包括对方法关键构建块的各种选择进行基准测试:神经网络架构、随机游走生成器和束搜索路径寻找。我们将其与经典计算机代数系统GAP进行比较,证明其在所考虑的例子中超越了GAP。作为特定的数学应用,我们研究了对称群的Cayley图,其生成元为循环移位和置换。我们通过机器学习和数学方法为OEIS-A186783猜想提供有力支持,即直径等于n(n-1)/2。我们识别了猜想中的最长元素并生成其分解。我们证明了直径下界为n(n-1)/2 -n/2,上界为n(n-1)/2 + 3n,并通过给定复杂度的算法证明。我们还提出了由数值实验激发的若干猜想,包括关于中心极限现象(增长近似由Gumbel分布)、图谱的均匀分布以及排序网络的数值研究。为了促进众包活动,我们在Kaggle平台创建挑战并邀请贡献以改进和基准测试Cayley图路径寻找及其他任务的方法。

英文摘要

This paper is the second in a series of studies on developing efficient artificial intelligence-based approaches to pathfinding on extremely large graphs (e.g. $10^{70}$ nodes) with a focus on Cayley graphs and mathematical applications. The open-source CayleyPy project is a central component of our research. The present paper proposes a novel combination of a reinforcement learning approach with a more direct diffusion distance approach from the first paper. Our analysis includes benchmarking various choices for the key building blocks of the approach: architectures of the neural network, generators for the random walks and beam search pathfinding. We compared these methods against the classical computer algebra system GAP, demonstrating that they "overcome the GAP" for the considered examples. As a particular mathematical application we examine the Cayley graph of the symmetric group with cyclic shift and transposition generators. We provide strong support for the OEIS-A186783 conjecture that the diameter is equal to $n(n-1)/2$ by machine learning and mathematical methods. We identify the conjectured longest element and generate its decomposition of the desired length. We prove a diameter lower bound of $n(n-1)/2 - n/2$ and an upper bound of $n(n-1)/2 + 3n$ by presenting the algorithm with given complexity. We also present several conjectures motivated by numerical experiments, including observations on the central limit phenomenon (with growth approximated by a Gumbel distribution), the uniform distribution for the spectrum of the graph, and a numerical study of sorting networks. To stimulate crowdsourcing activity, we create challenges on the Kaggle platform and invite contributions to improve and benchmark approaches on Cayley graph pathfinding and other tasks.
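
For very small n, the diameter discussed in this abstract can be checked by exhaustive breadth-first search. The sketch below is not CayleyPy code. It assumes the generators are a cyclic shift, its inverse, and the transposition of the first two positions; the abstract does not spell out this convention, and the function names are hypothetical. Brute force of this kind stops being feasible well before the graph sizes the paper targets.

```python
from collections import deque


def lrx_neighbors(p):
    """Neighbors of permutation p under the three assumed generators."""
    if len(p) < 2:
        return []
    left = p[1:] + p[:1]         # cyclic shift left
    right = p[-1:] + p[:-1]      # cyclic shift right (inverse of left)
    swap = (p[1], p[0]) + p[2:]  # transposition of the first two entries
    return [left, right, swap]


def cayley_diameter(n):
    """Return (diameter, number of reached permutations) via BFS from the identity.

    A Cayley graph is vertex-transitive, so the largest BFS distance from the
    identity equals the diameter.
    """
    start = tuple(range(n))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for q in lrx_neighbors(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return max(dist.values()), len(dist)
```

Calling `cayley_diameter(n)` for small `n` returns the diameter together with the number of permutations reached, which should equal n! because a transposition and an n-cycle generate the whole symmetric group.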

2502.17007 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Uncertainty Quantification as a Principled Foundation for Explainable Artificial Intelligence: A Case Study of Counterfactual Explanations

不确定性量化作为可解释人工智能的原理性基础:反事实解释的案例研究

Kacper Sokol, Santo M. A. R. Thies, Eyke Hüllermeier

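AI summary Using counterfactual explainability as a case study, this paper argues that uncertainty quantification can serve as a principled unifying framework, builds two variants of an explainer on it, and reports performance that is highly competitive with many state-of-the-art methods.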

详情
AI中文摘要

本文认为,透明性研究忽视了人工智能的基础概念。以反事实可解释性中的不确定性量化为例,证明其广泛应用能解决领域关键挑战。通过将核心反事实属性用不确定性表达,构建两种解释器变体,并展示框架在性能上优于现有方法。文章进一步表明,将人工智能基础融入透明性研究能产生更可靠、稳健和易懂的预测模型。提出使人工智能可解释性真正不确定性感知是实现该目标的第一步。

英文摘要

In this paper we argue that, to its detriment, transparency research overlooks many foundational concepts of artificial intelligence. As an illustrating example we focus on uncertainty quantification in the context of counterfactual explainability, demonstrating that its broader adoption could address key challenges in the field. To this end, we show how uncertainty can provide a principled unifying framework for counterfactual explainability by expressing the core counterfactual properties in terms of uncertainty, allowing us to build two variants of an explainer upon them -- one based solely on uncertainty estimates and another pairing them with distance measured in the feature space. Our comprehensive experiments illustrate highly competitive performance of our framework when compared to many state-of-the-art methods despite its radically simple design. More broadly, the paper demonstrates that integrating artificial intelligence fundamentals into transparency research promises to yield more reliable, robust and understandable predictive models. We posit that making artificial intelligence explainability truly uncertainty-aware is the first step towards this goal.

2306.12282 2026-05-19 cs.DS cs.LG math.OC 版本更新

Online Resource Allocation with Convex-set Machine-Learned Advice

在线资源分配与凸集机器学习建议

Negin Golrezaei, Patrick Jaillet, Zijie Zhou

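AI summary This paper develops a framework for online resource allocation with machine-learned advice given as a convex uncertainty set, balancing consistency and robustness through adaptive protection levels that respond to uncertainty in the advice.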

Comments 77 pages, 8 figures

详情
AI中文摘要

决策者往往能够获得关于未来需求的机器学习预测,这些预测可以帮助指导在线资源分配决策。然而,这些预测可能不准确。我们开发了一个在线资源分配框架,该框架可以处理潜在不可靠的机器学习建议,其中建议以需求向量的凸不确定性集形式表示,而不是单一点估计。我们引入了一类参数化的帕累托最优在线算法,以平衡一致性和鲁棒性。一致性比率衡量在建议准确时的性能,而鲁棒比率衡量在对抗性需求下建议不准确时的性能。对于目标一致性水平C,我们的算法在满足至少一致性水平C的条件下最大化鲁棒性。我们的方法通过引入自适应保护水平扩展了经典保护水平算法,这些保护水平能够动态响应建议中的不确定性。我们还提供了一种计算最大可实现一致性水平的方法。数值实验表明,我们的算法在有效平衡最坏情况和平均情况性能方面优于基准方法,包括仅基于点预测的方法。

英文摘要

Decision-makers often have access to machine-learned predictions about future demand that can help guide online resource allocation decisions. However, such predictions may be inaccurate. We develop a framework for online resource allocation with potentially unreliable machine-learned advice, where the advice is represented as a convex uncertainty set for the demand vector rather than a single point estimate. We introduce a parameterized class of Pareto-optimal online algorithms that balance consistency and robustness. The consistent ratio measures performance when the advice is accurate, while the robust ratio measures performance under adversarial demand when the advice is inaccurate. For a target consistency level C, our algorithms maximize robustness subject to achieving at least consistency level C. Our approach extends classical protection-level algorithms by introducing adaptive protection levels that dynamically respond to uncertainty in the advice. We also provide a method for computing the maximum achievable consistency level. Numerical experiments demonstrate that our algorithms outperform benchmark methods, including approaches based solely on point forecasts, by effectively balancing worst-case and average-case performance.
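
For readers unfamiliar with the classical protection-level idea that this work extends, here is a minimal sketch for two request classes with a fixed protection level. It is a textbook-style baseline under simplifying assumptions (unit-size requests, two reward classes, a hypothetical function name). It does not reproduce the adaptive, advice-driven protection levels proposed in the paper.

```python
from typing import Iterable, List


def protection_level_policy(
    requests: Iterable[str], capacity: int, protection_level: int
) -> List[bool]:
    """Accept or reject unit-size requests labelled 'high' or 'low'.

    Low-reward requests are accepted only while the remaining capacity
    exceeds the number of units protected for high-reward requests.
    """
    remaining = capacity
    decisions = []
    for kind in requests:
        if kind not in ("high", "low"):
            raise ValueError(f"unknown request class: {kind!r}")
        if remaining <= 0:
            accept = False
        elif kind == "high":
            accept = True
        else:
            accept = remaining > protection_level
        if accept:
            remaining -= 1
        decisions.append(accept)
    return decisions


decisions = protection_level_policy(
    ["low", "low", "low", "high", "high"], capacity=4, protection_level=2
)
```

With four units and two of them protected, the third low-reward request is rejected so that both later high-reward requests can still be served.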

2304.03427 2026-05-19 cs.CL cs.AI cs.CY cs.LG 版本更新

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

清除珠宝:基于谷歌OCR的藏文手稿的神经拼写纠正模型

Queenie Luo, Yung-Sung Chuang

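AI summary This paper presents a neural spelling correction model built on Google OCR-ed Tibetan manuscripts. It adds a Confidence Score mechanism to the Transformer architecture to auto-correct noisy OCR output and outperforms plain Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures on loss and character error rate.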

详情
Journal ref
Association for Computing Machinery 2024
AI中文摘要

人文学者依赖古代手稿来研究历史、宗教和社会政治结构。许多努力致力于使用OCR技术数字化这些珍贵的手稿,但大多数手稿因数世纪的污损,使得OCR程序无法准确捕捉褪色的字符和污渍。本文提出基于谷歌OCR处理的藏文手稿的神经拼写纠正模型,用于自动纠正OCR输出中的噪声。本文分为四个部分:数据集、模型架构、训练和分析。首先,我们将原始藏文电子文本语料库特征工程为两个结构化数据框——一组配对玩具数据和一组配对真实数据。然后,我们在Transformer架构中实现了置信度评分机制,用于拼写纠正任务。根据损失和字符错误率,我们的Transformer加置信度评分机制架构证明优于Transformer、LSTM-2-LSTM和GRU-2-GRU架构。最后,为了检验模型的鲁棒性,我们分析了错误的标记,可视化了模型中的注意力和自我注意力热图。

英文摘要

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using Optical Character Recognition (OCR) technology, but most manuscripts were blemished over the centuries, so an OCR program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized Attention and Self-Attention heatmaps in our model.
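
The Character Error Rate mentioned in this abstract is commonly computed as the Levenshtein edit distance between the model output and the reference, divided by the reference length. The sketch below follows that common definition. The abstract does not state the exact variant or the unit of comparison the authors used, so treating each Python string element (a Unicode code point) as one character is an assumption, as are the function names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For Tibetan text, comparing syllables or grapheme clusters instead of code points may give a different picture, so the unit should be chosen to match the evaluation being reproduced.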

2010.15538 2026-05-19 stat.ML cs.LG 版本更新

Matérn Gaussian Processes on Graphs

图上的Matérn高斯过程

Viacheslav Borovitskiy, Iskander Azangulov, Alexander Terenin, Peter Mostowsky, Marc Peter Deisenroth, Nicolas Durrande

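AI summary This paper studies Matérn Gaussian processes on undirected graphs via their stochastic partial differential equation characterization. The resulting processes inherit attractive properties of their Euclidean and Riemannian analogs and can be trained with standard methods such as inducing points, which enables their use in mini-batch and non-conjugate settings.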

详情
Journal ref
Artificial Intelligence and Statistics, 2021
AI中文摘要

高斯过程是一种用于学习未知函数的灵活框架,允许利用对函数性质的先验信息。尽管许多不同的高斯过程模型在欧几里得输入空间中 readily available,但对于输入空间为无向图的高斯过程,选择则更加有限。在本文中,我们利用Matérn高斯过程的随机偏微分方程特性——在欧几里得设置中广泛使用的模型类——来研究其在无向图上的类比。我们证明,所得到的高斯过程继承了其欧几里得和黎曼流形类比的各种吸引特性,并提供了允许使用标准方法(如诱导点)进行训练的技术。这使得图Matérn高斯过程能够应用于小批量和非共轭设置,从而使其更易于从业者使用,并更容易在更大的学习框架中部署。

英文摘要

Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes, a widely-used model class in the Euclidean setting, to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks.

1908.05387 2026-05-19 cs.LG stat.ML 版本更新

HONEM: Learning Embedding for Higher Order Networks

HONEM:用于高阶网络的嵌入学习

Mandana Saebi, Giovanni Luca Ciampaglia, Lance M Kaplan, Nitesh V Chawla

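AI summary This paper proposes HONEM, an embedding method designed for higher-order networks that captures non-Markovian higher-order dependencies and outperforms other state-of-the-art methods in node classification, network reconstruction, link prediction, and visualization on networks with such dependencies.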

详情
Journal ref
Big Data 8, no. 4 (2020): 255-269
AI中文摘要

图网络上的表示学习为手动特征工程往往繁琐的过程提供了一个强大的替代方案,因此近年来取得了显著的成功。然而,现有的所有表示学习方法都是基于一阶网络(FON),即只捕捉节点之间成对相互作用的网络。因此,这些方法可能无法纳入非马尔可夫高阶依赖性。因此,生成的嵌入可能无法准确表示网络中的底层现象,导致在不同的归纳或传递学习任务中表现不佳。为了解决这一挑战,本文提出了HONEM,一种能够捕捉网络中非马尔可夫高阶依赖性的高阶网络嵌入方法。HONEM专门针对高阶网络结构(HON)设计,并在包含非马尔可夫高阶依赖性的网络中,在节点分类、网络重建、链接预测和可视化任务中优于其他最先进的方法。

英文摘要

Representation learning on networks offers a powerful alternative to the oft painstaking process of manual feature engineering, and as a result, has enjoyed considerable success in recent years. However, all the existing representation learning methods are based on the first-order network (FON), that is, the network that only captures the pairwise interactions between the nodes. As a result, these methods may fail to incorporate non-Markovian higher-order dependencies in the network. Thus, the embeddings that are generated may not accurately represent the underlying phenomena in a network, resulting in inferior performance in different inductive or transductive learning tasks. To address this challenge, this paper presents HONEM, a higher-order network embedding method that captures the non-Markovian higher-order dependencies in a network. HONEM is specifically designed for the higher-order network structure (HON) and outperforms other state-of-the-art methods in node classification, network reconstruction, link prediction, and visualization for networks that contain non-Markovian higher-order dependencies.