arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.10718 2026-05-21 cs.LG

Riemannian MeanFlow for One-Step Generation on Manifolds

Zichen Zhong, Haoliang Sun, Yukun Zhao, Yongshun Gong, Yilong Yin

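AI summary This paper proposes Riemannian MeanFlow (RMF), which defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity linking average and instantaneous velocities. This enables one-step generation on manifolds whose velocities lie in location-dependent tangent spaces, improving the quality-efficiency trade-off and reducing sampling cost.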

Comments International Conference on Machine Learning

English abstract

Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, SO(3), and SE(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.
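
As a rough illustration of what a single sampling step on a manifold involves, the sketch below implements the exponential and logarithm maps on the unit sphere and takes one step x1 = exp_{x0}(u) along a tangent vector u at x0. The function names are illustrative assumptions, and u is computed here from a known target via the log map as a stand-in for a learned average velocity; this is not the paper's implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def sphere_exp(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    n = norm(v)
    if n < 1e-12:
        return list(x)
    return [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]

def sphere_log(x, y):
    """Logarithm map on the unit sphere: tangent vector at x pointing toward y."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)  # geodesic distance between x and y
    proj = [yi - c * xi for xi, yi in zip(x, y)]  # component of y tangent to x
    n = norm(proj)
    if n < 1e-12:
        # x == y, or x and y are antipodal (log map not unique); return zero
        return [0.0] * len(x)
    return [theta * p / n for p in proj]

x0 = [1.0, 0.0, 0.0]
target = [0.0, 1.0, 0.0]
u = sphere_log(x0, target)   # stand-in for a predicted average velocity
x1 = sphere_exp(x0, u)       # one step, no ODE integration
```

The point of the sketch is only that a single exponential-map step replaces numerical integration once a suitable tangent vector is available; how RMF trains a network to predict that vector is described in the paper, not here.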

2603.09024 2026-05-21 cs.LG

When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

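AI summary This paper proposes CALIPER, a data-only test that estimates the post-drift data size needed for stable retraining. It exploits state dependence in data generated by dynamical systems, running a single-pass weighted local regression and tracking a one-step proxy error as a function of the locality parameter θ. When the effective sample size gate is satisfied, an error that is monotonically non-increasing as the locality parameter increases indicates that the data are sufficient for retraining.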

Comments Accepted by ICLR 2026

English abstract

Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $\theta$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error as the locality parameter increases indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has a low per-update time and memory. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
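
The gate-then-trend logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the effective sample size is taken to be Kish's formula, and the function names and the `min_ess` threshold are hypothetical rather than taken from the paper.

```python
def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2 (assumed definition)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0

def is_non_increasing(errors, tol=1e-12):
    """True if each error is no larger than the one before it (up to tol)."""
    return all(b <= a + tol for a, b in zip(errors, errors[1:]))

def data_sufficient(errors_by_theta, weights, min_ess):
    """Hypothetical decision rule: check the ESS gate, then the error trend.

    errors_by_theta: proxy errors ordered by increasing locality parameter.
    """
    if effective_sample_size(weights) < min_ess:
        return False  # gate not satisfied: too few effective samples to judge
    return is_non_increasing(errors_by_theta)
```

For example, `data_sufficient([0.9, 0.7, 0.7, 0.5], [1.0] * 50, min_ess=30)` passes the gate and the trend check, while a rising error sequence or heavily concentrated weights would not.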

2603.08235 2026-05-21 cs.CV cs.AI

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

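AI summary This paper studies deep learning and ultra-widefield imaging for detecting diabetic retinopathy and macular edema, evaluating several deep learning models on a public dataset and exploring the potential of feature-level fusion and frequency-domain representations.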

Comments 6 pages, 4 figures, 2 tables

English abstract

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

2603.08155 2026-05-21 cs.LG

C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang, Jinwei Chen, Peng-Tao Jiang, Bo Li, Jia Wang

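AI summary This paper proposes C$^2$FG, a method for controlling classifier-free guidance based on score discrepancy analysis. A rigorous theoretical analysis establishes upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps, providing a theoretical foundation for time-dependent guidance, and experiments validate its effectiveness across diverse generative tasks.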

Comments Accepted to CVPR 2026 (Highlight)

English abstract

Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on a fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of CFG. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C$^2$FG), a novel, training-free, plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
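
To make the idea of a time-dependent guidance weight concrete, the sketch below combines unconditional and conditional predictions with a weight that decays exponentially in a normalized variable s in [0, 1]. The functional form, the constants `w0` and `decay`, and the way s would map onto diffusion timesteps are all assumptions for illustration; the paper's actual control function may differ.

```python
import math

def guidance_weight(s, w0=7.5, decay=2.0):
    """Hypothetical exponential-decay schedule: w(s) = w0 * exp(-decay * s)."""
    return w0 * math.exp(-decay * s)

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance in the convention where w = 1 returns the
    conditional prediction and w = 0 returns the unconditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Guided prediction at a few points along the schedule
eps_u, eps_c = [0.0, 1.0], [1.0, 3.0]
guided = [cfg_combine(eps_u, eps_c, guidance_weight(s)) for s in (0.0, 0.5, 1.0)]
```

A fixed-weight strategy corresponds to `decay=0.0`, which makes the contrast with a time-dependent schedule easy to test in isolation.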

2603.06007 2026-05-21 cs.CL cs.AI cs.MA

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

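AI summary This paper proposes MASFactory, a graph-centric framework for orchestrating LLM-based multi-agent systems with Vibe Graphing. It addresses the substantial manual effort, limited reuse, and difficulty of integrating heterogeneous external context sources that existing frameworks face when implementing complex graph workflows.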

Comments Accepted to the ACL 2026 Demo Track. Camera-ready version. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory

English abstract

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.
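
The view of a workflow as a directed computation graph can be sketched with the standard library alone. In the example below, plain functions stand in for agents, edges are expressed as a mapping from each node to its predecessors, and messages are passed as a dictionary of upstream outputs. None of these names come from MASFactory; its actual API is documented in the linked repository.

```python
from graphlib import TopologicalSorter

def run_workflow(nodes, deps):
    """Execute nodes in dependency order.

    nodes: name -> callable taking a dict of upstream outputs
    deps:  name -> set of predecessor names (edges of the graph)
    """
    outputs = {}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        inputs = {p: outputs[p] for p in deps.get(name, ())}
        outputs[name] = nodes[name](inputs)  # message passing along edges
    return order, outputs

# Hypothetical three-role pipeline; each "agent" is just a function here
nodes = {
    "planner": lambda inp: "plan",
    "coder": lambda inp: inp["planner"] + "->code",
    "reviewer": lambda inp: inp["coder"] + "->review",
}
deps = {"planner": set(), "coder": {"planner"}, "reviewer": {"coder"}}
order, outputs = run_workflow(nodes, deps)
```

`TopologicalSorter` raises `CycleError` on cyclic graphs, so workflows with loops would need an explicit iteration construct rather than this one-pass execution.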

2603.01712 2026-05-21 cs.AI cs.LG

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

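AI summary This paper proposes FT-Dojo, an interactive benchmark environment for studying autonomous LLM fine-tuning that standardizes a task interface, shared data repository, sandboxed execution environment, and feedback protocol. It also develops the FT-Agent framework, which performs structured iteration planning and multi-level feedback analysis. Experiments show that FT-Agent achieves the best performance on 10 of 13 tasks, and case studies show that agents can recover from failures while still exposing limitations in causal diagnosis and long-horizon planning.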

Comments 26 pages, 6 figures, 11 tables

English abstract

Fine-tuning large language models for vertical domains remains labor-intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end-to-end LLM fine-tuning has not been systematically studied as an interactive agent task. We introduce FT-Dojo, an interactive benchmark environment for autonomous LLM fine-tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT-Dojo standardizes a task interface, shared raw-data repository, sandboxed execution environment, structured feedback protocol, and held-out evaluation procedure. We further develop FT-Agent, a fine-tuning-oriented autonomous framework that uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine data and training strategies. Experiments show that FT-Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open-source planning backbones, and multi-run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long-horizon planning. The implementation is available at https://github.com/microsoft/rd-agent.

2603.01406 2026-05-21 cs.LG cs.AI cs.NA math.NA

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Lennon J. Shikhman

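AI summary This paper examines boundary-indexed operator families in neural PDE solvers, pointing out that the conventional approach suffers from non-identifiability when boundary conditions vary, and uses experiments to demonstrate the limitations of solvers under different boundary conditions.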

Comments Published in the ICLR 2026 Workshop on AI & PDEs. 10 pages, 5 figures

English abstract

Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
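
The claim about convergence to conditional expectations follows from a standard property of squared-error risk, sketched below. The notation (f for the forcing term, b for the boundary data, u for the solution, G for the learned map) and the one-dimensional example are introduced here for illustration and may differ from the paper's.

```latex
% Assumed notation: f = forcing term, b = boundary condition, u(f, b) = PDE solution.
% Training risk, averaged over the boundary distribution seen in training:
\mathcal{R}(G) = \mathbb{E}_{b \sim p_{\mathrm{train}}}\,\mathbb{E}_{f}\,
  \bigl\| G(f, b) - u(f, b) \bigr\|^2 .
% The risk constrains G(f, b) only for b in the support of p_train,
% so G is not identified outside that support.
%
% If the network is given f but not b, the squared-error minimizer is
G^\star(f) = \mathbb{E}_{b \sim p_{\mathrm{train}}}\bigl[\, u(f, b) \mid f \,\bigr].
% Example: -u'' = 0 on [0, 1] with u(0) = 0, u(1) = b has solution u(x) = b\,x,
% so a boundary-blind model converges to G^\star(x) = \mathbb{E}[b]\, x .
```

In the example, changing the training distribution of b changes the learned map even though the differential operator is unchanged, which is the sense in which the learned operator is indexed by the boundary distribution.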

2602.24138 2026-05-21 cs.CV cs.AI

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

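AI summary This paper proposes TASOT, an annotation-free framework for surgical temporal segmentation that combines temporally aligned text descriptions with visual information, fusing visual and semantic cues under a unified unbalanced Gromov-Wasserstein optimal transport objective and achieving substantial improvements on several public surgical datasets.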

English abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

2602.20399 2026-05-21 cs.LG

GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training

Haixu Wu, Minghao Guo, Zongyi Li, Zhiyang Dou, Mingsheng Long, Kaiming He, Wojciech Matusik

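AI summary This paper proposes GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. By augmenting geometry with synthetic dynamics, it enables dynamics-aware self-supervised learning and improves the efficiency and effectiveness of physics simulation.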

Comments Project Page: https://physics-scaling.github.io/GeoPT/

English abstract

Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by 2$\times$. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at https://github.com/Physics-Scaling/GeoPT.

2602.19320 2026-05-21 cs.CL cs.AI

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

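AI summary Through a taxonomy and empirical analysis, this paper examines the architectures and system limitations of agentic memory systems, identifying problems with benchmark saturation, metric validity, backbone-model dependence, and memory maintenance overhead, and outlines directions for more reliable evaluation and scalable system design.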

English abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

2602.18532 2026-05-21 cs.CV cs.AI cs.RO

VLANeXt: Recipes for Building Strong VLA Models

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

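AI summary This paper reexamines the VLA design space under a unified framework and evaluation setup, systematically analyzing foundational components, perception essentials, and action modelling perspectives. It distills 12 key findings, presents VLANeXt, a simple yet effective VLA model that outperforms existing methods on the LIBERO and LIBERO-plus benchmarks, and provides an easy-to-use codebase.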

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

English abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.17062 2026-05-21 cs.AI

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo, Sunwoo Lee, Seungyul Han

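AI summary This paper proposes the S2Q algorithm, which learns multiple sub-value functions to retain alternative high-value actions, addressing the problem of optima that shift as the value function changes in multi-agent reinforcement learning. Experiments show strong performance on MARL benchmarks.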

Comments 10 technical pages followed by references and appendix. Accepted to ICLR 2026

Journal ref
International Conference on Learning Representations (ICLR), 2026

English abstract

Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
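
One way to picture a Softmax-based behavior policy over several sub-value functions is sketched below: each sub-value table nominates its own greedy action, and the behavior policy samples among the nominated actions with probabilities given by a softmax over their values. This combination rule is an assumption made for illustration; the exact construction in S2Q is given in the paper and the linked code.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def candidate_policy(sub_q_tables, temperature=1.0):
    """Hypothetical behavior policy: softmax over each table's greedy action.

    sub_q_tables: list of per-action value lists, one per sub-value function.
    Returns a dict mapping action index -> probability.
    """
    candidates = []
    for q in sub_q_tables:
        a = max(range(len(q)), key=q.__getitem__)  # greedy action of this table
        candidates.append((a, q[a]))
    probs = softmax([v for _, v in candidates], temperature)
    policy = {}
    for (a, _), p in zip(candidates, probs):
        policy[a] = policy.get(a, 0.0) + p  # merge tables that agree
    return policy

# Two sub-value functions that prefer different actions
policy = candidate_policy([[1.0, 0.2, 0.1], [0.0, 0.9, 0.3]])
```

Here action 1 keeps nonzero probability even though action 0 currently has the highest value, so the agent can still follow the optimum if it later shifts to action 1.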

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Melkamu Abay Mersha, Jugal Kalita

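AI summary This paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework for explaining the decisions of Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that capture supporting and opposing evidence and trace the hierarchical flow of relevance through the Transformer layers.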

English abstract

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
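
For readers unfamiliar with the building block, the sketch below shows plain Integrated Gradients on a toy two-input function, approximating the path integral from a baseline to the input with a midpoint Riemann sum. It covers only the standard input-level method; the layer-wise computation and the fusion with attention gradients that define CA-LIG are not reproduced, and the toy function is an assumption chosen so the result can be checked by hand.

```python
def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients.

    grad_fn: function returning the gradient of the model at a point.
    """
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            total[i] += g[i]
    # Scale the averaged gradient by the input-baseline difference
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

def f(p):
    """Toy model: f(x, y) = x*y + x^2."""
    return p[0] * p[1] + p[0] ** 2

def grad_f(p):
    """Analytic gradient of f."""
    return [p[1] + 2 * p[0], p[0]]

x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to f(x) - f(baseline), the completeness property that makes Integrated Gradients attractive as a base attribution method. Because the values are signed, they can represent both supporting and opposing evidence.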

2602.11675 2026-05-21 cs.AI

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward

Edward Y. Chang, Longling Geng

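AI summary This paper proposes Epistemic Regret Minimization (ERM), a framework that critiques the causal structure of a model's reasoning trace rather than its answer, which allows label-free operation without a correct answer. Experiments on several frontier LLMs show that ERM outperforms conventional methods on causal reasoning tasks.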

Comments 43 pages, 22 tables, 18 figures

English abstract

Large language models can answer causal questions correctly for the wrong reasons. Current RL methods reward what a model concludes but ignore why, reinforcing correlational shortcuts, a failure we call Reward Entrenchment. We introduce Epistemic Regret Minimization (ERM), a framework that critiques the causal structure of a model's reasoning trace rather than its answer. Applying established causal principles, ERM flags unexamined confounders, correlation-intervention conflation, and unchecked back-door paths from exposed reasoning traces. The framework admits label-free operation, without the true causal graph or correct answer, and we separately distinguish favorable benchmark-derived critique, error-direction cues, and fully label-free judge-generated critique in the experiments. Within a single episode, ERM detects and repairs causal reasoning errors; across episodes, it accumulates interventional evidence into a reward signal applicable where no answer key exists. Experiments on 1,360 scenarios across six frontier LLMs show that reasoning-heavy models (GPT-4 Turbo, GPT-5.2) resist outcome-only correction (25-31% recovery) yet respond to causal critique (78-91%), gaining +53 to +59 pp. Standard test-time methods (self-consistency, Best-of-$N$, Self-Refine) underperform outcome-only reprompting on causal tasks, while ERM reduces residual Rung Collapse from 55-70% to 4%. A separation theorem proves outcome-only reward cannot close this gap; a controlled simulation confirms epistemic feedback does, outperforming outcome-only baselines 38-fold.

2602.11499 2026-05-21 cs.CV

What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan, Yue Wang, Dapeng Zhang, Kejin Cui, Rui Chen, Jing Tang, Lei Sun, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou

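AI summary This paper proposes ImagineAgent, a framework that uses generative world modeling and tool-augmented reinforcement learning to address cross-modal hallucinations and limited viewpoints in open-vocabulary human-object interaction understanding, achieving efficient and robust reasoning.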

English abstract

Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose an innovative CoT dataset named hicodet-6K for supervised fine-tuning (SFT), which effectively bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment reasoning with domain-specific tools to resolve visual-semantic ambiguities and hallucinations during inference. Moreover, we incorporate a generative model to reconstruct alternative viewpoints, enabling the agent to 'imagine' under limited viewpoints. Finally, we propose a composite reward mechanism to jointly optimize prediction accuracy and tool efficiency. Evaluations on both SWIG-HOI and HICO-DET datasets demonstrate that our method achieves state-of-the-art performance while requiring merely 36.7% of the training data compared to existing methods, validating our robustness, empirical effectiveness and efficiency.

2602.08819 2026-05-21 cs.LG cs.CL

Bayesian Preference Learning for Test-Time Steerable Reward Models

基于测试时间可调节的贝叶斯偏好学习的奖励模型

Jiwoo Hong, Shao Tang, Zhipeng Wang

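AI Summary This paper proposes Variational In-Context Reward Modeling (ICRM), a new Bayesian reward modeling objective that achieves test-time steerability through in-context preference demonstrations, allowing the reward model to adapt to unseen preference distributions with improved accuracy and robustness.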

Comments Preprint

详情
AI中文摘要

奖励模型在通过强化学习(RL)对语言模型与人类偏好对齐中起核心作用。随着RL越来越多地应用于可验证奖励和多目标对齐等场景,RMs被期望编码更复杂和多维的偏好分布。然而,分类RMs一旦训练完成就保持静态,限制了测试时间的适应性。我们提出变分上下文奖励建模(ICRM),一种新颖的贝叶斯奖励建模目标,通过上下文偏好演示实现测试时间可调节性。ICRM将奖励建模视为在Bradley-Terry模型下对潜在偏好概率的变分推断,使用共轭Beta先验。我们证明ICRM能够适应单目标和多目标设置中的未见过的偏好分布。随着更多演示,ICRM在RM-Bench上的准确性从60.5提高到70.8,在道德困境偏好上比生成判断者具有更低的校准误差,并在冲突偏好下扩展了可达到的帕累托前沿。我们进一步研究了ICRM在RL训练中的实际适用性,证明其可以通过在数学推理中优于传统RM来有效编码可验证奖励。最后,我们提供了理论保证,变分目标在有限置信度下具有全局内部最优解,并分析了KL正则化如何缓解奖励过度优化。

英文摘要

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
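
To make the conjugate Beta prior under the Bradley-Terry model concrete, here is a minimal sketch of an exact Beta-Bernoulli update driven by preference demonstrations. It is a textbook closed-form update, not ICRM's amortized variational objective; the function names and the demonstration counts are hypothetical.

```python
import math

def beta_posterior(alpha, beta, wins, losses):
    """Conjugate update: Beta(alpha, beta) prior plus observed preference outcomes."""
    return alpha + wins, beta + losses

def posterior_mean(alpha, beta):
    """Posterior mean of the latent preference probability P(A preferred over B)."""
    return alpha / (alpha + beta)

def bt_reward_gap(p):
    """Bradley-Terry: P(A > B) = sigmoid(r_A - r_B), so r_A - r_B = logit(p)."""
    return math.log(p) - math.log1p(-p)

# Hypothetical demonstrations: A is preferred in 3 of 4 in-context examples,
# starting from a uniform Beta(1, 1) prior.
alpha, beta = beta_posterior(1.0, 1.0, wins=3, losses=1)
p = posterior_mean(alpha, beta)
gap = bt_reward_gap(p)
print(f"posterior Beta({alpha}, {beta}), mean {p:.3f}, implied reward gap {gap:.3f}")
```

Adding demonstrations concentrates the posterior, which is one simple way to read "steerability" at test time: the implied reward gap moves with the evidence placed in context.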

2602.08028 2026-05-21 cs.CL cs.AI

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

偏离以诱导提示:多理性诱导用于零样本推理

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

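AI Summary This study proposes the DIP framework, which generates multiple diverse high-level rationales and induces a final plan from them to improve zero-shot reasoning accuracy, addressing the instability of reasoning paths in conventional Chain-of-Thought prompting.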

Comments Accepted to Findings of IJCNLP-AACL 2025

详情
AI中文摘要

为了解决标准链式思考提示中无引导推理路径的不稳定性,最近的方法通过首先引导大型语言模型(LLMs)生成单一推理策略来指导模型。然而,仅依赖一个策略来回答每个问题仍然限制了在多样化任务中的性能。我们提出了偏离以诱导提示(DIP),一个框架,首先提示LLM为每个问题生成多个多样化的高层理由。每个理由随后被扩展成详细的、分步骤的草案计划。最后,这些草案计划被诱导成最终计划。DIP在不依赖资源密集型采样的情况下增强了零样本推理的准确性。实验表明,DIP优于单一策略提示,证明了多计划诱导对基于提示的推理的有效性。

英文摘要

To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
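
The three stages described above (diverge into rationales, elaborate each into a draft plan, induce a final plan) can be sketched as a small pipeline. The prompt wording, the `llm` callable, and the choice to request rationales one at a time are assumptions made for illustration, not the authors' prompts or implementation.

```python
from typing import Callable, List

def diverge_to_induce(question: str, llm: Callable[[str], str], n_rationales: int = 3) -> str:
    """Return a final plan induced from several draft plans (hypothetical prompts)."""
    # Stage 1: diverge. Ask for distinct high-level rationales, showing earlier ones.
    rationales: List[str] = []
    for _ in range(n_rationales):
        seen = "\n".join(rationales) or "(none yet)"
        rationales.append(llm(
            f"Question: {question}\nRationales so far:\n{seen}\n"
            "Give one new, different high-level rationale."
        ))
    # Stage 2: elaborate each rationale into a step-by-step draft plan.
    drafts = [
        llm(f"Question: {question}\nRationale: {r}\nExpand this into a step-by-step draft plan.")
        for r in rationales
    ]
    # Stage 3: induce a single final plan from all drafts.
    joined = "\n---\n".join(drafts)
    return llm(f"Question: {question}\nDraft plans:\n{joined}\nInduce one final plan from these drafts.")

# Usage with a stub model that only records prompts, so the sketch runs offline.
calls: List[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"out{len(calls)}"

plan = diverge_to_induce("What is 17 * 24?", fake_llm, n_rationales=3)
```

With three rationales the sketch issues seven model calls in total, which is fixed in advance rather than depending on repeated sampling.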

2602.06862 2026-05-21 cs.CV

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

参数作为专家:通过动态参数路由适应视觉模型

Meng Lou, Stanley Yu, Yizhou Yu

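AI Summary This paper proposes ParaX, which fine-tunes vision models efficiently through a dynamic parameter routing mechanism that produces more customized and powerful feature representations, achieving superior performance across diverse visual recognition tasks.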

Comments Accepted by ICML 2026

详情
AI中文摘要

利用参数高效微调(PEFT)来适应预训练视觉模型仍然具有挑战性,因为其目标是在少量可训练参数的情况下实现与完整微调相当的性能。当应用于复杂的密集预测任务时,现有方法存在局限性,包括输入无关的建模和冗余的跨层表示。为此,我们提出了ParaX,一种新的适配器式方法,其特征是简单的混合专家(MoE)架构。具体而言,我们引入了共享专家中心,其中每个专家都是可训练的参数矩阵。在前向传递过程中,网络中的每个ParaX模块通过简单的动态参数路由机制动态生成针对当前模块的权重矩阵,该机制选择性地聚合相应专家中心的参数矩阵。ParaX模块中的动态权重矩阵通过输入依赖的方式实现低秩适应,从而生成更加定制化和强大的特征表示。此外,由于多个网络层的ParaX模块共享相同的专家中心,它们通过促进隐含的跨层特征交互来提高特征多样性。广泛的实验结果表明,ParaX在多种视觉识别任务中均表现出色。代码已公开发布:https://github.com/LMMMEng/ParaX。

英文摘要

Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since ParaX modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experimental results demonstrate the superiority of ParaX across diverse visual recognition tasks. Code is publicly released at: https://github.com/LMMMEng/ParaX.

2602.06500 2026-05-21 cs.LG

Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

微正则朗之万动力学能否利用小批量梯度噪声?

Emanuel Sommer, Kangning Diao, Jakob Robnik, Uros Seljak, David Rügamer

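AI Summary This paper studies whether microcanonical Langevin dynamics can effectively leverage mini-batch gradient noise. It proposes a gradient noise preconditioning scheme and an energy-variance-based adaptive tuner, yielding a robust and scalable microcanonical sampler with state-of-the-art performance on high-dimensional inference tasks.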

Comments In Proceedings of the 43rd International Conference on Machine Learning

详情
AI中文摘要

将马尔可夫链蒙特卡罗等推断方法扩展到高维模型仍然是贝叶斯深度学习中的核心挑战。近期提出的微正则朗之万蒙特卡罗方法颇具前景,已在广泛的问题上展示了最先进的性能。然而,其对完整数据集梯度的依赖使其在大规模问题中成本过高。本文探讨一个根本性问题:微正则动力学能否有效利用小批量梯度噪声?我们对该问题进行了首次系统研究,建立了随机梯度微正则动力学的新型连续时间理论分析。我们揭示了两种关键的失败模式:由各向异性梯度噪声导致的、经理论推导得出的偏差,以及复杂高维后验中的数值不稳定性。为解决这些问题,我们提出了一种有原则的梯度噪声预条件化方案,该方案被证明能显著减小这种偏差,并开发了一种新的基于能量方差的自适应调节器,可自动选择步长并动态指导数值保护措施。所得到的算法是一种鲁棒且可扩展的微正则蒙特卡罗采样器,能够在贝叶斯神经网络等具有挑战性的高维推断任务中实现最先进的性能。结合最近的集成技术,我们的工作开启了一类新的随机微正则朗之万集成(SMILE)采样器,用于大规模贝叶斯推断。

英文摘要

Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

2602.04907 2026-05-21 cs.LG cs.AI stat.ME

Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

从不完美物理模型下的异方差随机动力系统中进行因果发现

Jianhong Chen, Naichen Shi, Xubo Yue

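AI Summary This paper proposes an integrative causal discovery framework that uses partial physical knowledge encoded in stochastic differential equations to improve causal graph recovery in dynamical systems, and analyzes robustness under imperfect physical models.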

Comments 101 pages

详情
AI中文摘要

因果发现是一种数据驱动的复杂系统分析范式,而基于物理的模型,如常微分方程(ODEs),为现实世界的动力学过程提供了机理结构。整合这些范式可以提高可识别性、稳定性和鲁棒性。然而,真实动力系统往往表现出循环交互和非平稳性,而许多因果发现方法依赖于无循环、平稳或平衡假设。我们提出了一种整合因果发现框架,利用随机微分方程(SDEs)中的部分物理知识。漂移项编码已知的ODE动力学,而扩散项捕捉超出规定物理的未知因果耦合。我们开发了一种可扩展的稀疏诱导最大准似然估计器,并通过理论上合理的稳定技术来改善优化景观。在温和条件下,我们为稳定和不稳定SDEs建立了因果图恢复保证。我们还分析了我们的因果图估计在ODE不准确情况下的鲁棒性,并澄清了引入的稳定技术如何平衡数值稳定性和统计恢复能力。在线性SDEs和非线性基准测试,包括具有无循环和循环结构的Lotka-Volterra和Lorenz动力学上,实验显示了比数据驱动基线更好的图恢复和鲁棒性。我们还通过在我们的因果发现框架内重建随机SIR动力学来展示实际应用,以在现实世界流行病数据中进行因果图重建。

英文摘要

Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.

2602.04876 2026-05-21 cs.CV

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

PerpetualWonder:长时程动作条件4D场景生成

Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu

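AI Summary This paper proposes PerpetualWonder, a hybrid generative simulator that produces long-horizon, action-conditioned 4D scenes from a single image. By introducing a true closed-loop system, it addresses the failure of existing methods caused by decoupling the physical state from the visual representation, so that generative refinements correct both dynamics and appearance.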

Comments Project website: https://johnzhan2023.github.io/PerpetualWonder/

详情
AI中文摘要

我们介绍了PerpetualWonder,一种混合生成模拟器,能够从单张图像生成长时间地平线动作条件的4D场景。当前工作无法完成此任务,因为其物理状态与其视觉表示相互分离,这阻止了生成性改进更新底层物理以供后续交互。PerpetualWonder通过引入首个真正的闭环系统来解决这一问题。它具有一个新颖的统一表示,创建了物理状态与视觉原语之间的双向链接,使生成性改进能够同时修正动态和外观。它还引入了一种稳健的更新机制,通过多个视角收集监督以解决优化模糊性。实验表明,从单张图像出发,PerpetualWonder能够成功模拟复杂、多步骤的长时间动作交互,保持物理合理性和视觉一致性。

英文摘要

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

2602.03209 2026-05-21 cs.RO

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

在未见过的田间机器人环境中使用极稀疏深度测量进行深度补全

Marco Job, Thomas Stastny, Eleni Kelasidi, Roland Siegwart, Michael Pantic

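AI Summary This study proposes a depth completion model that is trained on synthetic data and uses extremely sparse depth sensor measurements to predict dense metric depth in unseen field robotics environments, addressing the limited adoption of low-cost cameras in field robotics.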

Comments Accepted to ICRA 2026

详情
AI中文摘要

在无结构环境中自主运行的田间机器人需要可靠的感知以确保安全和可靠的运行。最近的单目深度估计进展展示了低成本相机作为深度传感器的潜力;然而,由于缺乏可靠的尺度线索、模糊或低纹理条件以及大规模数据集的稀缺,其在田间机器人中的应用仍然有限。为了解决这些挑战,我们提出了一种深度补全模型,该模型在合成数据上训练,并利用深度传感器的极稀疏测量来预测未见过的田间机器人环境中的密集度量深度。一个针对田间机器人的合成数据集生成流程能够创建多个逼真的数据集用于训练。该数据集生成方法利用结构从运动的纹理3D网格和具有新视角合成的逼真渲染来模拟多样的田间机器人场景。我们的方法在Nvidia Jetson AGX Orin上实现了每帧53毫秒的端到端延迟,使嵌入式平台上的实时部署成为可能。广泛的评估表明,在多样化的现实世界田间机器人场景中具有竞争性的性能。

英文摘要

Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an NVIDIA Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

2602.03004 2026-05-21 cs.LG cs.AI

Graph Autoencoder for Process Monitoring

用于过程监控的图自编码器

Xiangrui Zhang

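AI Summary This paper proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE) that combines a correlation graph structure learning module based on a spatial self-attention mechanism with a spatial-temporal encoder-decoder module built on graph convolutional long-short term memory (GCLSTM), improving the reliability and interpretability of industrial process monitoring.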

详情
AI中文摘要

为提高工业过程监控的可靠性和可解释性,本文提出了一种因果图时空自编码器(CGSTAE)。CGSTAE的网络架构结合了两个组件:基于空间自注意力机制的空间相关图结构学习模块(SSAM)和利用图卷积长短期记忆(GCLSTM)的空间-时间编码器-解码器模块。SSAM通过捕捉变量之间的动态关系来学习相关图,而一种新的三步因果图结构学习算法被引入,以从这些相关图中推导出因果图。该算法利用因果不变性原理的反向视角来揭示从变化相关性中得到的不变因果图。空间-时间编码器-解码器由GCLSTM单元构建,在序列到序列框架内重建时间序列过程数据。所提出的CGSTAE通过特征空间和残差空间中的两个统计量实现有效的过程监控和故障检测。最后,我们通过田纳西东部过程和一个现实世界的空气分离过程验证了CGSTAE在过程监控中的有效性。

英文摘要

To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

2602.02660 2026-05-21 cs.AI

MARS: Modular Agent with Reflective Search for Automated AI Research

MARS:模块化代理与反思搜索用于自动化AI研究

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

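AI Summary This paper proposes the MARS framework, which uses budget-aware planning, modular construction, and comparative reflective memory to address execution cost and performance attribution in complex machine learning engineering tasks, achieving state-of-the-art performance among open-source frameworks on MLE-Bench.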

Comments Paper published at International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

自动化AI研究的关键瓶颈在于执行复杂的机器学习工程(MLE)任务。MLE不同于一般软件工程,因其计算成本高昂(例如模型训练)和性能归因不透明。当前基于LLM的代理在此方面表现不佳,常生成忽视执行成本和因果因素的单体脚本。我们引入MARS(模块化代理与反思搜索),一种优化于自主AI研究的框架。MARS依赖三个支柱:(1)通过成本受限的蒙特卡洛树搜索(MCTS)进行预算感知规划,以显式平衡性能与执行成本;(2)模块化构建,采用“设计-分解-实现”流程来管理复杂的研究存储库;(3)比较反思记忆,通过分析解决方案差异来解决信用分配问题,从而提炼出高信号的洞察。MARS在可比条件下,在开放源代码框架中实现了MLE-Bench上的最佳性能,与全球排行榜上顶尖方法竞争性相当。此外,系统表现出定性“啊哈!”时刻,其中所有使用的63%的教训源自跨分支转移,表明代理能有效在搜索路径间泛化洞察。

英文摘要

A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

2602.02304 2026-05-21 cs.AI cs.LG

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

比较解释并不足够,解释变化:需要新的标准来解释大型语言模型中的行为转变

Martino Ciaperoni, Marzio Di Vece, Roberto Pellungrini, Luca Pappalardo, Fosca Giannotti, Francesco Giannini

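AI Summary This paper proposes a new XAI paradigm for explaining why and how the behavior of large language models shifts after an intervention, addressing the inability of existing explanation methods to account for such shifts.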

详情
AI中文摘要

大规模基础模型在受到缩放、微调、人类反馈强化学习或上下文学习等干预时会表现出行为转变。当前的可解释性方法结构上不适用于解释这些转变,因为它们要么将模型视为静态对象,如传统可解释AI(XAI)方法所做的,要么仅仅比较不同模型检查点的独立解释。因此,这些方法无法解释两个模型实例之间的功能转变,其中某种行为在干预后发生了变化。这种差距在欧盟人工智能法案、美国州立法和中国人工智能法规等司法管辖区中带来了重大治理风险,这些法规要求记录重大系统修改的因果链。本文主张,解释大型语言模型的行为转变需要一种系统的方法,将转变本身作为解释的主要对象:即解释干预如何和为何将参考模型转变为具有不同行为的更新模型。为了支持这一主张,我们引入了称为比较XAI(XAI_Δ)的新XAI范式,旨在解释两个模型检查点之间的差异,其中行为发生了变化,以及一组规范,规定XAI_Δ解释器和解释必须满足的条件,包括可比性、有效性、可操作性和监控,目标是将模型审计 grounded 在明确、可测量的要求中。最后,我们通过示例实验提供初步证据,表明在实践中需要XAI_Δ,将结果汇总成一份转换报告,直接可用于治理和事件记录。

英文摘要

Large-scale foundation models exhibit behavioral shifts when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are structurally ill-suited to explain these shifts, because they either treat models as static objects, as traditional eXplainable AI (XAI) approaches do, or merely compare independent explanations across different checkpoints of a model. As a result, these approaches fail to explain the functional transition between two model instances in which a certain behavior has shifted following an intervention. This gap creates significant governance risks across jurisdictions including the EU AI Act, US state legislation, and Chinese AI regulations, which require documenting causal chains for substantial system modifications. This position paper argues that explaining behavioral shifts in large language models requires a principled approach that treats the shift itself as the primary object of explanation: namely, one that explains how and why an intervention transforms a reference model into an updated model with different behavior. To support this claim, we introduce Comparative XAI (XAI$_\Delta$), a novel XAI paradigm aimed at explaining the difference between two model checkpoints where a behavior has shifted, together with a set of desiderata specifying what XAI$_\Delta$ explainers and explanations must satisfy, including comparability, validity, actionability, and monitoring, with the goal of grounding model auditing in explicit, measurable requirements. Finally, we provide preliminary evidence suggesting the need for XAI$_\Delta$ in practice through illustrative experiments, compiling the resulting findings into a transition report directly usable for governance and incident documentation.

2602.01273 2026-05-21 cs.CV

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Q-DiT4SR: 探索细节保留的扩散变换器量化以实现实景图像超分辨率

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

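AI Summary This paper proposes Q-DiT4SR, a post-training quantization framework tailored to Diffusion Transformer-based real-world image super-resolution. It introduces a hierarchical SVD and Variance-aware Spatio-Temporal Mixed Precision to compress and accelerate the model while preserving detail.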

Comments Accepted to ICML 2026. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR

详情
AI中文摘要

近年来,扩散变换器(DiTs)在实现实景图像超分辨率(Real-ISR)中崭露头角,能够生成高质量的纹理,但其沉重的推理负担阻碍了实际应用。尽管后训练量化(PTQ)是加速的有希望的解决方案,但现有超分辨率方法大多集中在U-Net架构上,而通用的DiT量化通常针对文本到图像任务设计。直接将这些方法应用于基于DiT的超分辨率模型会导致局部纹理严重退化。因此,我们提出了Q-DiT4SR,这是首个专门针对基于DiT的Real-ISR的PTQ框架。我们提出了H-SVD,一种层次化SVD,它在匹配的参数预算下集成了一个全局低秩分支和一个局部块状秩-1分支。我们进一步提出了变异性感知时空混合精度:VaSMP在无数据的情况下基于率-失真理论分配跨层权重位宽,而VaTMP通过动态规划(DP)在最小校准下调度跨扩散时间步的层内激活精度。在多个实现实景数据集上的实验表明,我们的Q-DiT4SR在W4A6和W4A4设置下均实现了SOTA性能。值得注意的是,W4A4量化配置将模型大小减少了5.8倍,并将计算操作减少了6.14倍。我们的代码和模型将在https://github.com/xunzhang1128/Q-DiT4SR上提供。

英文摘要

Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by 6.14$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

2601.23086 2026-05-21 cs.AI

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

从输出监督学习的链式思维混淆可以泛化到未见过的任务

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

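AI Summary This paper studies how obfuscation in chain-of-thought (CoT) reasoning generalizes, finding that models that learn to obfuscate their reasoning traces carry that obfuscation over to unseen tasks, which reduces model monitorability.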

详情
AI中文摘要

链式思维(CoT)推理通过使大型语言模型(LLM)能够规划、探索和反思其行动,显著提升了性能。CoT也是监控这些代理行为的强大工具:当忠实时,它们提供模型决策过程的解释,并为危险行为发出早期警告。然而,优化压力可能会导致模型混淆推理轨迹,失去这一有益属性。我们证明混淆可以跨任务泛化;学习混淆涉及奖励黑客(例如访问和利用泄露信息)的推理的模型,不仅在未见过的奖励黑客设置中泛化了奖励黑客行为及其混淆。最令人担忧的是,我们显示当仅惩罚模型关闭CoT后的最终动作时,CoT推理的混淆及其跨任务泛化也随之发生。我们的发现表明,当前对有害生成的惩罚实践可能会无意中以不可预测的方式减少LLM的广泛可监控性。

英文摘要

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

2601.22932 2026-05-21 cs.LG

DC-LA: Difference-of-Convex Langevin Algorithm

DC-LA:凸差朗之万算法

Hoang Phuc Hau Luu, Zhongjian Wang

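AI Summary This paper studies a sampling problem with target distribution $π\propto \exp(-f-r)$, where the data fidelity term $f$ is Lipschitz smooth and the regularizer $r=r_1-r_2$ is a non-smooth difference-of-convex (DC) function. Exploiting the DC structure, it smooths $r$ by applying Moreau envelopes to $r_1$ and $r_2$ separately, redistributes the concave part of the regularizer to the data fidelity term, and studies the resulting proximal Langevin algorithm (DC-LA). Under the assumption that $V$ is distant dissipative, DC-LA is shown to converge to $π$ in $q$-Wasserstein distance for all $q \in \mathbb{N}^*$, up to discretization and smoothing errors, improving on previous results for non-log-concave sampling.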

详情
AI中文摘要

我们研究一个采样问题,其目标分布为 $π\propto \exp(-f-r)$,其中数据保真项 $f$ 是Lipschitz光滑的,而正则化项 $r=r_1-r_2$ 是一个非光滑的凸差(DC)函数,即 $r_1,r_2$ 均为凸函数。利用 $r$ 的DC结构,我们分别对 $r_1$ 和 $r_2$ 应用Moreau包络来平滑 $r$。依照DC规划的思路,我们将正则化项的凹部分重新分配给数据保真项,并研究相应的近端朗之万算法(称为DC-LA)。在 $V$ 满足远距离耗散性的假设下,我们证明了DC-LA在 $q$-Wasserstein距离下对所有 $q \in \mathbb{N}^*$ 收敛到目标分布 $π$,误差不超过离散化误差与平滑误差。我们的结果以更一般的框架和假设改进了此前关于非对数凹采样的工作。数值实验表明,DC-LA在合成设置中能够生成准确的分布,并在真实的计算机断层扫描应用中提供定性上合理的不确定性量化。

英文摘要

We study a sampling problem whose target distribution is $π\propto \exp(-f-r)$ where the data fidelity term $f$ is Lipschitz smooth while the regularizer term $r=r_1-r_2$ is a non-smooth difference-of-convex (DC) function, i.e., $r_1,r_2$ are convex. By leveraging the DC structure of $r$, we can smooth out $r$ by applying Moreau envelopes to $r_1$ and $r_2$ separately. In line with DC programming, we then redistribute the concave part of the regularizer to the data fidelity and study its corresponding proximal Langevin algorithm (termed DC-LA). We establish convergence of DC-LA to the target distribution $π$, up to discretization and smoothing errors, in the $q$-Wasserstein distance for all $q \in \mathbb{N}^*$, under the assumption that $V$ is distant dissipative. Our results improve previous work on non-log-concave sampling in terms of a more general framework and assumptions. Numerical experiments show that DC-LA produces accurate distributions in synthetic settings and provides qualitatively reasonable uncertainty quantification in a real-world Computed Tomography application.
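
To illustrate the smoothing step, the sketch below computes the Moreau envelope of a scaled absolute value in closed form and combines two such envelopes into a smoothed DC regularizer. The 1-D choice $r_1 = c_1|x|$, $r_2 = c_2|x|$, the function names, and the plain (unadjusted) Langevin step on the smoothed potential are all assumptions for illustration; this is not the paper's proximal scheme, which moves the concave part into the data fidelity term.

```python
import math
import random

def prox_scaled_abs(x, lam, c=1.0):
    """Proximal map of c*|x| with parameter lam (soft-thresholding at c*lam)."""
    return math.copysign(max(abs(x) - c * lam, 0.0), x)

def moreau_env(x, lam, c=1.0):
    """Moreau envelope of c*|x|: min over y of c*|y| + (x - y)^2 / (2*lam)."""
    p = prox_scaled_abs(x, lam, c)
    return c * abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_env(x, lam, c=1.0):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_scaled_abs(x, lam, c)) / lam

def grad_smoothed_dc(x, lam, c1, c2):
    """Gradient of the smoothed DC regularizer env(r1) - env(r2)."""
    return grad_moreau_env(x, lam, c1) - grad_moreau_env(x, lam, c2)

def smoothed_langevin_step(x, grad_f, lam, c1, c2, step, rng):
    """One unadjusted Langevin step on f + env(r1) - env(r2) (illustrative only)."""
    drift = grad_f(x) + grad_smoothed_dc(x, lam, c1, c2)
    return x - step * drift + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)

# Toy run with a hypothetical quadratic data fidelity f(x) = x^2 / 2.
rng = random.Random(0)
x = 3.0
for _ in range(100):
    x = smoothed_langevin_step(x, lambda z: z, lam=0.1, c1=2.0, c2=1.0, step=0.01, rng=rng)
```

For $|x| \ge c\lambda$ the envelope gradient equals $c\,\mathrm{sign}(x)$, and for smaller $|x|$ it is $x/\lambda$, so the smoothed regularizer is differentiable everywhere even though $r$ is not.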

2601.21662 2026-05-21 cs.LG

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

通过黎曼流匹配对预训练视觉语言模型进行知识不确定性量化

Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh

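AI Summary This paper proposes REPVLM, which uses Riemannian Flow Matching to compute probability density on the hyperspherical manifold of vision-language model embeddings in order to quantify epistemic uncertainty, with strong results in classification and out-of-distribution detection.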

详情
Journal ref Forty-Third International Conference on Machine Learning, 2026
AI中文摘要

视觉语言模型(VLMs)通常具有确定性性质,并缺乏内在机制来量化知识不确定性,这反映了模型对知识的缺乏或对其自身表示的无知。我们理论上提出嵌入的负对数密度作为知识不确定性的代理,低密度区域表示模型的无知。所提出的方法REPVLM通过黎曼流匹配在VLM嵌入的超球面流形上计算概率密度。我们实证表明,REPVLM在不确定性与预测误差之间实现了接近完美的相关性,显著优于现有基线。除了分类之外,我们还证明该模型还提供了一种可扩展的度量标准,用于异常检测和自动化数据整理。

英文摘要

Vision-Language Models (VLMs) are typically deterministic and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method, REPVLM, computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.
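
As a rough illustration of negative log-density on a hypersphere acting as an uncertainty proxy, the sketch below swaps in a closed-form von Mises-Fisher (vMF) density on the 2-sphere for the learned Riemannian Flow Matching density. The vMF stand-in, the function names, the concentration value, and the toy vectors are assumptions for illustration and are not part of REPVLM.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def vmf_neg_log_density(x, mu, kappa):
    """-log f(x) for a vMF density on S^2:
    f(x) = kappa / (4*pi*sinh(kappa)) * exp(kappa * <mu, x>)."""
    x, mu = normalize(x), normalize(mu)
    dot = sum(a * b for a, b in zip(x, mu))
    log_norm = math.log(kappa) - math.log(4.0 * math.pi * math.sinh(kappa))
    return -(log_norm + kappa * dot)

# Hypothetical embeddings: one at the density mode, one orthogonal, one antipodal.
mu = [0.0, 0.0, 1.0]
u_mode = vmf_neg_log_density([0.0, 0.0, 1.0], mu, kappa=5.0)
u_side = vmf_neg_log_density([1.0, 0.0, 0.0], mu, kappa=5.0)
u_anti = vmf_neg_log_density([0.0, 0.0, -1.0], mu, kappa=5.0)
print(u_mode, u_side, u_anti)  # the score grows as the embedding leaves the dense region
```

Embeddings far from where the density concentrates receive a larger score, which is the sense in which low-density regions signal model ignorance.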

2601.18696 2026-05-21 cs.LG

Explainability Methods for Hardware Trojan Detection: A Systematic Comparison

用于硬件木马检测的可解释性方法:系统性比较

Paul Whitten, Francis Wolff, Chris Papachristou

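AI Summary This paper systematically compares explainability methods for hardware trojan detection, examining how domain-aware property analysis, case-based reasoning, and feature attribution techniques differ in performance for hardware security applications.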

详情
AI中文摘要

硬件木马是恶意电路,会破坏集成电路(IC)的功能和安全性。这些电路直接制造在硅片上,无法像软件一样通过安全补丁修复。解决方案需要通过更换IC进行昂贵的产品召回,因此在设计过程中早期检测至关重要。最佳的硬件检测仅能提供基于统计的解决方案,存在大量假阳性和假阴性。这些检测方法需要更深入的可解释性分析来过滤假指标。现有为通用领域(如图像分类)开发的可解释性方法可能无法提供硬件工程师所需的操作洞察。问题在于:领域感知属性分析、基于案例的推理和特征归因技术在硬件安全应用中如何比较?本文比较了三种可解释性方法用于门级硬件木马检测,在Trust-Hub基准数据集上:(1)基于31个电路特定特征的领域感知属性分析,这些特征来自门扇入模式、触发器距离和主输入/输出(I/O)连接;(2)使用k-最近邻进行基于案例的推理以获得基于先例的解释;(3)基于模型无关的特征归因方法(局部可解释模型无关解释(LIME)、SHapley Additive exPlanations(SHAP)、梯度)提供通用的重要性评分,而无需电路级上下文。

英文摘要

Hardware trojans are malicious circuits which compromise the functionality and security of an integrated circuit (IC). These circuits are manufactured directly into the silicon and cannot be fixed by security patches like software. The solution would require a costly product recall by replacing the IC and hence, early detection in the design process is essential. Hardware detection at best provides statistically based solutions with many false positives and false negatives. These detection methods require more thorough explainable analysis to filter out false indicators. Existing explainability methods developed for general domains like image classification may not provide the actionable insights that hardware engineers need. A question remains: How do domain-aware property analysis, model-agnostic case-based reasoning, and model-agnostic feature attribution techniques compare for hardware security applications? This work compares three categories of explainability for gate-level hardware trojan detection on the Trust-Hub benchmark dataset: (1) domain-aware property-based analysis of 31 circuit-specific features derived from gate fanin patterns, flip-flop distances, and primary Input/Output (I/O) connectivity; (2) model-agnostic case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution methods (Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), gradient) that provide generic importance scores without circuit-level context.
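
Category (2), case-based reasoning with k-nearest neighbors, can be sketched as returning the closest labeled training cases as precedents alongside the prediction. The two-feature vectors, gate names, and labels below are made-up toy data, and the function is a hypothetical illustration rather than the paper's pipeline or its 31-feature representation.

```python
import math
from collections import Counter

def knn_explain(query, cases, k=3):
    """Predict by majority vote over the k nearest cases and return them as precedents.

    cases: list of (feature_vector, label, name) tuples.
    """
    ranked = sorted(cases, key=lambda c: math.dist(query, c[0]))
    neighbors = ranked[:k]
    votes = Counter(label for _, label, _ in neighbors)
    prediction = votes.most_common(1)[0][0]
    precedents = [
        (name, label, round(math.dist(query, feats), 3))
        for feats, label, name in neighbors
    ]
    return prediction, precedents

# Toy gate-level cases: [fan-in count, distance to nearest flip-flop] (hypothetical).
cases = [
    ([2.0, 1.0], "benign", "g1"),
    ([2.0, 2.0], "benign", "g2"),
    ([8.0, 6.0], "trojan", "g3"),
    ([9.0, 7.0], "trojan", "g4"),
    ([7.0, 6.0], "trojan", "g5"),
]
prediction, precedents = knn_explain([8.0, 7.0], cases, k=3)
for name, label, distance in precedents:
    print(f"similar to {name} ({label}), distance {distance}")
```

The explanation is the list of precedents itself: an engineer can inspect the named gates that drove the vote instead of reading an abstract importance score.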