arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.19856 2026-05-20 cs.LG cs.AI

StableGrad: Backward Scale Control without Batch Normalization

Jose I. Mestre, Alberto Fernández-Hernández, Cristian Pérez-Corral, Manuel F. Dolz, Enrique S. Quintana-Ortí

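AI summary: This paper proposes StableGrad, a method that stabilizes the training of deep neural networks without Batch Normalization by controlling weight-gradient scale at the optimizer level; it is particularly suited to settings such as physics-informed neural networks.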

Abstract

Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.
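
The abstract does not state the rescaling rule itself, so the following is only a minimal sketch of where such a step could sit in a training loop: after gradients are computed and before the optimizer update. The function name `rescale_layer_grads`, the `target_ratio` parameter, and the choice to set each layer's gradient norm to a fixed fraction of its weight norm are assumptions made for illustration, not the authors' method.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def rescale_layer_grads(weights, grads, target_ratio=1e-3, eps=1e-12):
    """Rescale each layer's gradient so that ||g|| = target_ratio * ||w||.

    Hypothetical rule for illustration only. The forward model is never
    touched; only the gradients handed to the optimizer change.
    """
    scaled = []
    for w, g in zip(weights, grads):
        g_norm = l2_norm(g)
        if g_norm < eps:
            scaled.append(list(g))  # leave (near-)zero gradients alone
            continue
        factor = target_ratio * l2_norm(w) / g_norm
        scaled.append([factor * gi for gi in g])
    return scaled

def sgd_step(weights, grads, lr=1.0):
    """Plain SGD update applied to the already rescaled gradients."""
    return [[wi - lr * gi for wi, gi in zip(w, g)] for w, g in zip(weights, grads)]

# Two toy "layers": one with a vanishing gradient, one with an exploding one.
weights = [[3.0, 4.0], [0.6, 0.8]]
grads = [[1e-9, 0.0], [0.0, 1e6]]
balanced = rescale_layer_grads(weights, grads, target_ratio=0.01)
new_weights = sgd_step(weights, balanced)
```

Because the rescaling acts on the gradient lists only, any quantity computed in the forward pass (outputs, input derivatives, a physics residual) is the same with or without it, which is the property the abstract emphasizes for PINNs.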

2605.19855 2026-05-20 cs.CV cs.AI

A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

Giacomo Astolfi, Matteo Bianchi, Riccardo Campi, Antonio De Santis, Marco Brambilla

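AI summary: This paper proposes a framework for evaluating zero-shot image generation in concept-based explainable AI. It generates synthetic concept datasets to evaluate concept-based XAI methods and examines the challenges and open questions that arise when zero-shot text-to-image models are used for model analysis.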

Comments G. Astolfi, M. Bianchi, and R. Campi contributed equally

Abstract

Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts.

2605.19842 2026-05-20 cs.LG

Fast Tensorization of Neural Networks via Slice-wise Feature Distillation

Safa Hamreras, Sukhbinder Singh, Román Orús

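AI summary: This paper proposes a scalable tensorization framework for neural network compression based on slice-wise feature distillation. The network is decomposed into slices (such as individual layers or blocks), and each slice is tensorized independently to reproduce the intermediate representations of the original pretrained model, which improves accuracy recovery, reduces data requirements, and enables efficient parallel optimization.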

Abstract

We propose a scalable tensorization framework for neural network compression based on slice-wise feature distillation. Unlike conventional tensor decomposition methods that rely on costly global finetuning, our approach decomposes the network into slices consisting of either individual layers or blocks (e.g., convolutional layers or MLPs), or small groups of consecutive layers, and tensorizes each slice independently to reproduce the intermediate representations of the original pretrained model. This modular strategy improves accuracy recovery, reduces data requirements, and enables efficient parallel optimization. Experiments on ResNet-34 show significant gains over conventional global tensorization, achieving near-lossless compression at moderate compression rates with faster optimization. Results on GPT-2 XL further demonstrate the scalability of the method and its applicability to large-scale models, particularly in distributed settings.

2605.19840 2026-05-20 cs.RO

Justifying bio-inspired robotics research: A taxonomy of strategies

Margaret J. Zhang, Justin Ting, Talia Y. Moore

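AI summary: This paper proposes a taxonomy of motivations for bio-inspired design to help robotics researchers justify their specific bio-inspired approach and to help funding program managers assess the value of different bio-inspired approaches.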

Abstract

For most of human history, we have not thought systematically about how and why we incorporate aspects of the natural world into our designs. The lack of a systematic approach has resulted in inconsistencies in motivations and methods that make it difficult to predict or evaluate the success of bio-inspired design. This mismatch between expectations and results can lead to disappointment when a reader considers a bio-inspired design to be superficial, weak, or incomplete. This is especially true in the field of Robotics, in which similarity to a biological system might be the driving motivation for construction. To help robotics researchers justify their specific bio-inspired approach and to help funding program managers discern the value of different bio-inspired approaches, here we propose a taxonomy of motivations for bio-inspired design and describe the potential significant contributions that are likely to result from different approaches.

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

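AI summary: This paper proposes CADENet, a training-free three-thread system that uses condition-adaptive enhancement and entropy-guided NMS fusion for object detection in adverse weather for autonomous driving, with no retraining or additional hardware.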

Abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.
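
The evaluation-ceiling argument can be made concrete with a toy count. The sketch below is not from the paper: the object ids, the set-based matching (a stand-in for IoU matching), and all counts are invented for illustration. It scores the same detections once against incomplete annotations and once against a hypothetical complete ground truth.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def score(detected, ground_truth):
    """Count TP/FP/FN by object id (stand-in for IoU matching) and return F1."""
    tp = len(detected & ground_truth)
    fp = len(detected - ground_truth)
    fn = len(ground_truth - detected)
    return f1_score(tp, fp, fn)

all_objects = set(range(10))          # objects really present in the scene
annotated = set(range(6))             # annotators could only see 6 of them

baseline = {0, 1, 2, 3, 4}            # detections on the degraded image
enhanced = {0, 1, 2, 3, 4, 6, 7, 8}   # enhancement recovers 3 unlabeled objects

measured_gain = score(enhanced, annotated) - score(baseline, annotated)
true_gain = score(enhanced, all_objects) - score(baseline, all_objects)
```

In this toy case the three recovered objects are counted as false positives under the incomplete annotations, so the measured F1 change understates the change against the complete ground truth, while the count of missed annotated objects (and therefore recall on the annotated set) is unaffected by them.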

2605.19834 2026-05-20 cs.LG cs.AI cs.SY eess.SY

A Closed-loop, State-centric, Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams

Yiyao Xu, Hao Zhou, Yuhang Wang, Jingran Sun

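AI summary: This paper proposes a closed-loop, state-centric, multi-agent framework for estimating passenger load from heterogeneous data streams, improving robustness through dynamic trust allocation and physical constraints.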

Comments Preprint version of a paper accepted by the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC). 7 pages, 4 figures

Abstract

To support operations and passenger-facing services, transit agencies need reliable passenger load trajectories. Currently, load estimates are typically inferred from imperfect sensing systems rather than fully observed, and the accuracy of modern automatic passenger counting (APC) systems still varies with station layout, flow intensity, and operating conditions. To address the challenges of robust passenger load estimation from heterogeneous data streams, including incremental count errors, evidence conflicts, and context-dependent sensor reliability, we propose a closed-loop, state-centric, multi-agent framework. This method enforces physical feasibility at every step, allocates trust dynamically among evidence sources, and feeds physics-derived violation residuals back into training for robustness improvement. The architecture consists of a unified stop-event backbone, a coupled Perception–Physical–Fusion loop for stop-by-stop inference, and optional trip-level macro-correction and closed-loop calibration modules.
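
As a rough illustration of what enforcing physical feasibility at every step and producing a violation residual might look like, here is a minimal sketch. The function names, the capacity bound, and the simple clamping rule are assumptions for illustration; the abstract does not describe the paper's actual constraint set or fusion logic.

```python
def project_load(prev_load, boardings, alightings, capacity):
    """Advance the onboard load by one stop and project it onto [0, capacity].

    Returns the feasible load and the violation residual: how far the raw
    count-based update fell outside the feasible range (0.0 if feasible).
    """
    raw = prev_load + boardings - alightings
    feasible = min(max(raw, 0.0), capacity)
    return feasible, raw - feasible

def estimate_trip(stop_counts, capacity):
    """Run the stop-by-stop update over (boardings, alightings) pairs."""
    load, loads, residuals = 0.0, [], []
    for boardings, alightings in stop_counts:
        load, residual = project_load(load, boardings, alightings, capacity)
        loads.append(load)
        residuals.append(residual)
    return loads, residuals

# Noisy counts: stop 2 reports more alightings than passengers on board,
# and stop 3 reports more boardings than the vehicle can hold.
counts = [(12, 0), (3, 20), (50, 0), (0, 10)]
loads, residuals = estimate_trip(counts, capacity=40.0)
```

The nonzero residuals mark the stops where the sensed counts contradicted the conservation and capacity constraints; in a closed-loop design, signals of this kind are what could be fed back into training.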

2605.19833 2026-05-20 cs.SD cs.AI cs.CL cs.MM eess.AS

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

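AI summary: This paper proposes Mega-ASR, a unified in-the-wild speech recognition framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. Training data covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and the model significantly improves recognition under adverse conditions.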

Comments Project page: https://xzf-thu.github.io/Mega-ASR/. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
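
For readers who want to relate the reported numbers to a relative reduction, the sketch below computes word error rate with a standard word-level edit distance and then the relative reduction for the two reported pairs. It assumes that the percentages in the abstract are WER values and that the first number in each pair is Mega-ASR's; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

wer = word_error_rate("turn the lights off", "turn lights of")
voices = relative_reduction(54.01, 45.69)    # VOiCES R4-B-F pair
noizeus = relative_reduction(29.34, 21.49)   # NOIZEUS Sta-0 pair
```

Under those assumptions, the two benchmark pairs correspond to relative reductions of roughly 15% and 27%, which is a separate figure from the "over 30%" reported for the compositional scenarios.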

2605.19830 2026-05-20 cs.LG math.ST stat.TH

Set-Valued Policy Learning

Laura Fuentes-Vicente, Mathieu Even, Gaëlle Dormion, Antoine Chambaz, Uri Shalit, Julie Josse

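AI summary: This paper proposes a set-valued policy learning approach for the multiple-treatment setting that outputs a set of plausible treatments rather than a single recommendation, which provides intrinsic uncertainty quantification. It extends the learning-to-defer framework via a new greatest Lower Bound method and introduces conformal policy learning to bridge unobserved ground-truth optimal treatments and estimated optimal treatment rules.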

Abstract

Conventional treatment policies map patient covariates to a single recommended intervention in order to maximize expected clinical outcomes. Although a rich body of causal inference methods has been developed to estimate such policies, point-valued recommendations can be highly sensitive to estimation uncertainty, model specification, and finite-sample variability, while typically providing little guidance about how confident one should be in the recommended action. In this work, we propose a set-valued policy learning paradigm for the multiple-treatment setting, in which policies output a set of plausible treatments rather than a single recommendation. This formulation enables intrinsic uncertainty quantification, with the size of the predicted set reflecting the degree of decision ambiguity. We extend the learning-to-defer framework to multiple treatments via a novel greatest Lower Bound method, and introduce conformal policy learning, which bridges the gap between unobserved ground-truth optimal treatments and estimated optimal treatment rules. Drawing on insights from the noisy-label literature, we develop a randomness-injection approach that guarantees marginal coverage without requiring assumptions on underlying black-box optimal treatment rules. Through experiments on synthetic data and a real-world application to In-Vitro Fertilization (IVF), we demonstrate that our methods produce robust and actionable policies that naturally incorporate clinical considerations while effectively balancing performance and reliability.
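
To show how a set-valued output can reflect ambiguity, the sketch below uses ordinary split conformal prediction for classification. This is a generic textbook construction, not the paper's conformal policy learning: it assumes calibration labels for the best treatment are available, whereas the abstract stresses that the true optimal treatment is unobserved. All probabilities and scores are invented.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include everything
    return sorted(cal_scores)[k - 1]

def treatment_set(probs, threshold):
    """Keep every treatment whose nonconformity score 1 - p is within threshold."""
    return {t for t, p in probs.items() if 1.0 - p <= threshold}

# Calibration scores: 1 - (model probability of the treatment labeled best).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60]
q = conformal_threshold(cal_scores, alpha=0.25)

confident = treatment_set({"A": 0.90, "B": 0.07, "C": 0.03}, q)
ambiguous = treatment_set({"A": 0.48, "B": 0.47, "C": 0.05}, q)
```

A patient for whom the model clearly favors one treatment receives a singleton set, while a patient with two near-tied treatments receives both, so the set size acts as the uncertainty signal described in the abstract.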

2605.19826 2026-05-20 cs.AI

Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

Gary Simethy, Daniel Ortiz Arroyo, Petar Durdevic

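AI summary: This paper proposes an explainable wastewater digital twin built on CCSS-IX, an adaptive context-conditioned structured simulator, combined with a self-falsifying decision rule. It addresses the safety-efficiency trade-off in wastewater treatment plants and provides an end-to-end decision-support pipeline.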

Comments 17 pages, 7 figures, 6 tables, 2 algorithms. Supplementary material (7 pages) included as ancillary file

Abstract

Operators of safety-critical industrial processes increasingly rely on digital twins to screen control interventions, but such simulators rarely carry certified safety guarantees. Wastewater treatment plants exemplify the gap: operators face a daily safety-efficiency trade-off where aerating too little risks effluent violations and nitrous-oxide (N2O) spikes, and aerating too much wastes energy. We develop an explainable digital twin for aeration and dosing setpoints. CCSS-IX, the simulator, is a bank of interpretable locally linear state-space "experts" adaptively mixed by a context-aware gating network, building on a continuous-time regime-switching scaffold. A runtime decision layer applies conformal risk control to abstain, reopen, or return a falsifying temporal witness for any operator-proposed action that cannot be statistically certified. The artificial-intelligence contribution is twofold: an identifiable, context-conditioned structured surrogate that retains operator-readable dynamics, and a self-falsifying decision rule with finite-sample coverage guarantees. The engineering contribution is a validated, end-to-end decision-support pipeline, tested on a 1000-step slice of the Avedøre full-scale plant (42.6% sensor missingness, 2-minute sampling), the Agtrup/BlueKolding full-scale plant in Denmark, and the Benchmark Simulation Model No. 2 (BSM2) international benchmark, under a matched ten-seed protocol. The static structured ensemble lies within 0.78% root-mean-square error of an unconstrained black-box reference, and the adaptive variant within 1.08%. The calibrated reopen rule cuts aggregate two-plant regret by 43.6% at an unsafe-action cost weight of 4 and eliminates unsafe chosen actions on the BSM2 main slice. Event-aligned temporal witnesses prevent 93 of 187 false-safe N2O approvals, about 4.65x the dyadic baseline (paired McNemar p < 1e-21).
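
The phrase "a bank of interpretable locally linear state-space experts adaptively mixed by a context-aware gating network" can be pictured with a scalar, discrete-time toy. Everything specific here is an assumption for illustration: the scalar state, the two experts, the linear gating logits, and the parameter values. The paper's simulator builds on a continuous-time regime-switching scaffold that this sketch does not reproduce.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_step(x, u, context, experts, gate_weights):
    """One step of x_next = sum_k g_k(context) * (a_k * x + b_k * u).

    experts: list of (a_k, b_k) pairs, each a readable scalar linear model.
    gate_weights: one hypothetical gating weight per expert; the gating
    logits are w_k * context.
    """
    gates = softmax([w * context for w in gate_weights])
    return sum(g * (a * x + b * u) for g, (a, b) in zip(gates, experts))

experts = [(0.9, 0.1), (0.5, 0.5)]   # two hypothetical operating regimes
gate_weights = [2.0, -2.0]

neutral = gated_step(2.0, 0.0, context=0.0, experts=experts, gate_weights=gate_weights)
regime_1 = gated_step(2.0, 0.0, context=10.0, experts=experts, gate_weights=gate_weights)
```

Each expert stays individually readable (a decay rate and an input gain), and the context only changes how much each one contributes, which is one way to keep dynamics operator-readable while still adapting to conditions.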

2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

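AI summary: This study examines whether temporal conditioning within inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, evaluating three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Temporal conditioning reshapes reasoning style but yields no statistically significant improvement in standard NLP correctness metrics, while qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.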

Abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

2605.19823 2026-05-20 cs.LG cs.AI math.AP math.DS stat.ML

Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions

Ha Dang, Sebastian Schmidt, Juergen Hesser

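AI summary: This paper proposes Cut-DeepONet, a two-stage training framework that models discontinuities as boundaries in a higher-dimensional space, reducing learning complexity so that solution operators of partial differential equations capture discontinuities and sharp transitions more effectively.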

Abstract

Neural operators have achieved strong performance in learning solution operators of partial differential equations (PDEs), but their inherently continuous representations struggle to capture discontinuities and sharp transitions. Existing approaches typically approximate such features within continuous function spaces, often requiring increased model capacity and high-resolution data. In this work, we propose Cut-DeepONet, a two-stage training framework that explicitly models discontinuities while reducing learning complexity. Our approach reformulates the problem via a lifting strategy, partitioning the domain into smooth subregions while representing discontinuities as boundaries in a higher-dimensional space. This separation aligns the operator learning task with the inductive bias of neural networks and avoids directly approximating discontinuities. An additional network predicts input-dependent discontinuity locations for unseen inputs, which are then used to guide the neural operator in generating smooth components within each region. Experiments on benchmark PDEs show that Cut-DeepONet outperforms state-of-the-art methods, even when trained on low-resolution datasets. The method excels on problems with discontinuities and sharp transitions, while using fewer trainable parameters. Our results highlight the benefits of changing the representation of operator learning rather than increasing model complexity.

2605.19822 2026-05-20 cs.LG cs.AI

ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability

Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan

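AI summary: This paper proposes ST-TGExplainer, a self-explainable temporal GNN that disentangles stability and transition patterns in temporal graphs to improve interpretability.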

Abstract

Temporal graph neural networks (TGNNs) have gained significant traction for solving real-world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first-time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST-TGExplainer, a self-explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST-TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label-conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST-TGExplainer achieves strong predictive performance and yields more faithful explanations. Code is available at https://github.com/hjchen-hdu/ST-TGExplainer.
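
The two pattern types defined in the abstract can be illustrated by labeling a stream of timestamped interactions. This sketch only encodes the definitions (previously seen versus first-time interactions); it is not the ST-TGExplainer model, and treating interactions as undirected pairs is an assumption.

```python
def split_patterns(events):
    """Label each timestamped interaction as a stability or transition pattern.

    events: iterable of (timestamp, source, target) tuples.
    A pair already seen earlier in time is 'stability'; a first-time pair
    is 'transition'.
    """
    seen = set()
    labeled = []
    for ts, src, dst in sorted(events):
        pair = frozenset((src, dst))  # assumption: interactions are undirected
        kind = "stability" if pair in seen else "transition"
        labeled.append((ts, src, dst, kind))
        seen.add(pair)
    return labeled

events = [(3, "u1", "v1"), (1, "u1", "v1"), (2, "u2", "v1"), (4, "v1", "u2")]
labels = [kind for *_, kind in split_patterns(events)]
```

An explainer that looks only at repeated pairs would never surface the events at timestamps 1 and 2, which is the gap the abstract attributes to existing methods.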

2605.19821 2026-05-20 cs.CV

LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

Jiaxin Wang, Muwei Jian, Hui Yu, Junyu Dong, Yifan Xia

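AI summary: This paper proposes LaCoVL-FER, a landmark-guided contrastive learning network with vision-language enhancement. It introduces geometric priors from facial landmarks and semantic priors from a vision-language model to address facial expression recognition in the wild and to improve robustness and generalization.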

Abstract

Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.

2605.19815 2026-05-20 cs.CL cs.AI

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

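AI summary: This paper proposes LP-Eval, a three-step rubric co-designed with legal experts for assessing the quality of legal propositions. Using an expert-annotated dataset of 100 LLM-generated legal propositions, it shows that LLM-generated propositions are generally of high quality, that experts rate propositions derived from well-established cases higher, and that rubric-guided LLM judgments align more closely with expert assessments but lack sensitivity to fine-grained distinctions.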

Abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

2605.19813 2026-05-20 cs.LG math.ST stat.TH

General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions

Yicheng Li

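AI summary: This paper studies lower bounds for differentially private federated learning under arbitrary public-transcript interactions. It establishes a federated van Trees lower bound for parameter estimation under squared $\ell_2$ loss and illustrates it through applications to mean estimation, linear regression, and nonparametric regression.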

Abstract

We prove a general lower bound for differentially private federated learning protocols with arbitrary public-transcript interactions. The protocol may use any number of adaptive rounds, and each client's local samples may be reused across rounds. For parameter estimation under squared \(\ell_2\) loss, we establish a federated van Trees lower bound for every estimator satisfying a total clientwise sample-level zero-concentrated differential privacy (zCDP) constraint. The main technical ingredient is a privacy-information contraction inequality for complete public transcripts. We illustrate the bound through applications to mean estimation, linear regression, and nonparametric regression.
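
For orientation, the classical van Trees inequality that the federated bound is named after can be stated as follows for a scalar parameter. This is the standard non-private, single-machine form under the usual regularity conditions, given here only as background; the paper's federated, zCDP-constrained version is not reproduced in the abstract and is not shown here. The symbols $\pi$ (a prior density on $\theta$), $I(\theta)$ (Fisher information of the data), and $J(\pi)$ are notation chosen for this note.

```latex
% Classical van Trees inequality, scalar case (background only).
\mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{\theta}\!\left[ (\hat{\theta} - \theta)^2 \right]
  \;\ge\; \frac{1}{\mathbb{E}_{\theta \sim \pi}\!\left[ I(\theta) \right] + J(\pi)},
\qquad
J(\pi) = \int \frac{\pi'(\theta)^2}{\pi(\theta)} \, d\theta .
```

The bound tightens as the data carry less Fisher information, which is why a contraction inequality limiting the information that private transcripts can reveal translates into a lower bound on estimation error.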

2605.19812 2026-05-20 cs.LG cs.AI stat.AP stat.ML

FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes

FLUXtrapolation:一个用于外推生态系统通量的基准测试

Anya Fries, Jacob A Nelson, Martin Jung, Markus Reichstein, Jonas Peters

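AI summary: This study introduces FLUXtrapolation, a benchmark for extrapolating ecosystem fluxes. It analyzes the challenges that distribution shift poses for flux upscaling and evaluates machine learning methods under such shift, in support of the scientific goal of improving flux upscaling.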

AI中文摘要

我们介绍了FLUXtrapolation,一个用于在外推生态系统通量时应对逐渐加剧的分布偏移的基准测试。生态系统通量是理解碳、水和能量循环的关键,但只能通过稀疏分布的测量塔直接测量。因此,生成全球通量估计需要在可用的全球协变量上训练模型,并在未观测区域进行预测,即上推。通量上推是一个具有挑战性的领域泛化问题,受气候、生态系统类型和环境条件之间协变量分布偏移的影响,以及条件偏移的影响:重要的驱动因素在全局尺度上未被观测。我们对这两种偏移在P_X和P_{Y|X}中的定量分析。FLUXtrapolation基于对通量上推的领域专业知识设计:它定义了基于时间、空间和温度的外推场景,并在未观测的领域、时间聚合和尾部误差上评估性能。在试点研究中,我们发现基线方法在中位小时RMSE下表现相似,但在提出的尾部聚焦和多尺度评估下则有所不同。因此,FLUXtrapolation为机器学习方法在分布偏移下的现实挑战提出了相关挑战;同时,该基准测试的进步将直接支持科学目标,即改进通量上推。

英文摘要

We introduce FLUXtrapolation, a benchmark for extrapolating ecosystem fluxes under progressively harder distribution shifts. Ecosystem fluxes are central to understanding the carbon, water, and energy cycles, yet they can only be measured directly at sparsely located measurement towers. Producing global flux estimates therefore requires training models on observed sites using globally available covariates and predicting in unobserved regions, that is, upscaling. Flux upscaling is a challenging domain generalization problem that is affected by a shift in covariate distribution across climates, ecosystem types, and environmental conditions, as well as by conditional shift: important drivers remain unobserved at global scale. We provide a quantitative analysis of both these shifts in $P_X$ and $P_{Y\mid X}$. FLUXtrapolation is designed based on domain expertise on flux upscaling: it defines temporal, spatial, and temperature-based extrapolation scenarios and evaluates performance across held-out domains, temporal aggregations, and tail errors. In a pilot study, we find that baselines perform similarly under median hourly RMSE, but separate under the proposed tail-focused and multi-scale evaluation. FLUXtrapolation therefore poses a realistic and thus relevant challenge for machine learning methods under distribution shift; at the same time, progress on this benchmark would directly support the scientific goal of improving flux upscaling.

2605.19811 2026-05-20 cs.LG

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

LionMuon: 交替频谱和符号下降用于高效训练

Arman Bolatov, Artem Riabinin, Nikita Kornilov, Andrey Veprikov, Samuel Horváth, Martin Takáč, Aleksandr Beznosikov

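AI summary: This paper proposes the LionMuon optimizer, which alternates Lion and Muon update steps to retain effectiveness while substantially lowering the average iteration cost. It also proves complexity bounds under heavy-tailed noise and shows advantages across model scales.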

Comments 38 pages, 13 figures, 4 tables

AI中文摘要

在大规模优化中,更新步骤的廉价性和有效性是成功优化器的关键因素。基于符号的优化器如Lion或Signum产生廉价的每步更新,而Muon的谱矩阵-符号更新则在显著更高的每步成本下提供更强的方向。在本文中,我们提出LionMuon,它保留了Muon步骤的有效性,同时显著降低了平均迭代成本,类似于基于符号的方法。它在固定周期P内交替使用Lion和Muon的更新,共享一个单一的双EMA动量缓冲区。因此,优化器状态内存与Lion相同,恰好是AdamW的一半。一个更简单的单EMA变体SignMuon本身已经优于纯Muon。在P=2时,LionMuon在我们测试的124M模型大小的每个数据集和架构上都优于Muon、Lion、Signum和AdamW,在更低的计算下达到更低的验证损失,这一优势在355M和720M规模上仍然存在。在理论方面,我们证明了在重尾噪声下的严格复杂性界限,这些界限由周期平均平滑度和介于Muon和Lion之间的噪声所决定。这些界限预测了计算最优的周期以及LionMuon超越Muon和Lion的条件。代码:https://github.com/brain-lab-research/lion-muon

英文摘要

In large-scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign-based optimizers like Lion or Signum produce cheap per-step updates, whereas Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign-based methods. It alternates between Lion's and Muon's updates on a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW's. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at 355M and 720M scale. On the theory side, we prove sharp complexity bounds under heavy-tailed noise which are governed by period-averaged smoothness and noise that interpolate between Muon's and Lion's constants. These bounds predict the compute-optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: https://github.com/brain-lab-research/lion-muon
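The alternation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). Several details are assumptions made for illustration: the function names `matrix_sign` and `alternating_step`, the choice to take the spectral step when `step % period == 0`, the hyperparameter defaults, and the use of an exact SVD in place of the Newton-Schulz iteration that Muon implementations normally use.

```python
import numpy as np

def matrix_sign(m):
    """Orthogonal polar factor U @ Vt of m (exact SVD stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def alternating_step(w, g, buf, step, period=2, lr=1e-3, beta1=0.9, beta2=0.99):
    """One update of a 2-D weight w with gradient g; returns (new_w, new_buf, direction)."""
    # Lion-style interpolation between the momentum buffer and the fresh gradient.
    c = beta1 * buf + (1.0 - beta1) * g
    if step % period == 0:
        # Assumed schedule: one spectral (Muon-like) step per period.
        direction = matrix_sign(c)
    else:
        # Cheap element-wise sign step (Lion-like) on the remaining steps.
        direction = np.sign(c)
    # A single momentum buffer is shared by both kinds of step.
    new_buf = beta2 * buf + (1.0 - beta2) * g
    return w - lr * direction, new_buf, direction
```

With `period=2`, every other step pays for the matrix-sign computation, and the optimizer state stays at one buffer per weight matrix.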

2605.19804 2026-05-20 cs.CV cs.AI cs.LG

Stitched Value Model for Diffusion Alignment

用于扩散对齐的拼接价值模型

Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

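AI summary: This paper proposes StitchVM, a stitching framework that transfers reward models pretrained on clean images to the noisy latent space. Efficient transfer and finetuning improve the efficiency and effectiveness of diffusion alignment.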

Comments Project page: https://gohyojun15.github.io/StitchVM/

AI中文摘要

为了实际应用,基于扩散或流的生成模型必须与任务特定的奖励对齐,例如提示保真度或审美偏好。这种对齐具有挑战性,因为奖励是为干净的输出图像定义的,但对齐过程需要在噪声中间潜在空间中估计价值函数。现有方法倾向于Tweedie风格或蒙特卡洛近似,权衡估计器偏差与计算成本:Tweedie估计高效但有偏差,而蒙特卡洛估计更准确但需要昂贵的回放。一个自然的替代方法是学习的价值函数,但如何有效训练一个强大的、通用的价值模型专门用于噪声潜在空间仍然是一个开放问题。本文提出了StitchVM,一种模型拼接框架,该框架高效地将预训练用于干净图像的奖励模型转移到噪声潜在空间。StitchVM从一个现有的、截断的像素空间奖励模型开始,并将其冻结的扩散骨干作为其头部。从像素空间模型中,所得到的混合模型保留了精心预训练、稳健的奖励能力;从扩散骨干中,它继承了其处理噪声潜在空间的原生能力。拼接过程异常轻量,例如拼接和微调CLIP ViT-L和SD 3.5 Medium仅需10个GPU小时。通过将强大的像素空间奖励模型提升到潜在空间,StitchVM打开了一种新的扩散对齐风格:而不是对价值函数的粗糙但昂贵的每样本近似,正确的函数对于实际的噪声潜在空间一次构建,然后在许多样本和迭代中进行抵消。我们显示,这种方法在广泛下游引导和后训练方法中带来了改进:DPS变得比原来快3.2倍,同时将峰值GPU内存减半,DiffusionNFT变得比原来快2.3倍。

英文摘要

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of a rough yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.

2605.19799 2026-05-20 cs.CV cs.AI

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

协同基础模型用于半监督胎儿心脏超声分析:SAM-Med2D边界细化与DINOv3语义增强

Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, Yu Li

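AI summary: This paper proposes a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. It combines SAM-Med2D for boundary refinement with DINOv3 for semantic enhancement, which improves performance on fetal congenital heart disease screening.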

Comments Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

AI中文摘要

我们提出了一种半监督框架,用于胎儿心脏超声图像的联合分割和分类。基于EchoCare多任务主干网络,我们的方法整合了SAM-Med2D用于边界细化,并利用DINOv3提升伪标签质量。我们引入了视图特定的硬掩膜,并结合一种两阶段优化策略:一个EMA阶段用于巩固分割能力,随后是一个分类微调阶段,该阶段冻结分割参数并重置分类头以恢复分类性能,而不影响分割效果。在FETUS 2026排行榜上评估,我们的方法在Dice相似系数上达到79.99%,归一化表面距离为61.62%,F1分数为41.20%,验证了我们方法在产前先天性心脏病筛查中的有效性。源代码可在https://github.com/2826056177/zcst_fetus2026公开获取。

英文摘要

We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient of 79.99%, a Normalized Surface Distance of 61.62%, and an F1-score of 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: https://github.com/2826056177/zcst_fetus2026.
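As a reminder of what the headline segmentation metric measures, the Dice Similarity Coefficient for a pair of binary masks can be sketched as follows. This is the standard definition, 2|A ∩ B| / (|A| + |B|). The function name and the convention of returning 1.0 for two empty masks are assumptions of this sketch, and the challenge's own evaluation script may aggregate over classes or cases differently.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom
```

For example, a prediction covering two pixels of which one overlaps a one-pixel target gives 2 · 1 / (2 + 1) ≈ 0.667.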

2605.19798 2026-05-20 cs.CL

Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

迈向社交互动代理的信任校准:探究LLMs生成性别化的多模态行为生成

Lucie Galland, Chloé Clavel, Magalie Ochs

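AI summary: This paper studies how LLMs generate multimodal behaviors that reflect different levels of ability and benevolence, examines the influence of gender on the generated behaviors, and validates the method with a user study.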

AI中文摘要

随着社交互动代理(SIAs)日益融入日常生活,能够将用户信任校准到代理实际能力的能力将有助于确保这些代理的适当使用。本文探讨了大型语言模型(LLMs)生成多模态行为(言语、语音、手势和面部表情模态)以反映不同层次的能力和善意的能力。我们提出了一种新的方法,用于自动生成与特定层次的这些特征相匹配的行为,这是实现细腻和信任校准互动的第一步。通过分析由LLMs生成的大量多模态转录数据,我们证明GPT-5.4能够跨不同模态(文本、语调、面部表情和手势)生成连贯的行为。使用随机森林特征重要性分析,我们显示生成的行为符合能力与善意的理论期望。然而,我们还发现当提示中指定性别时,LLMs倾向于重现社会性别刻板印象,将男性代理的行为与高能力联系起来,女性代理的行为与高善意联系起来。为了验证我们的方法,我们使用Prolific进行了一项用户研究,采用被试内设计。参与者对生成的行为感知到的不同层次的能力和善意与预期指令一致。

英文摘要

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.

2605.19797 2026-05-20 cs.CV

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

Depth2Pose: 一种用于单目深度估计的基于姿态的基准,无需真实深度

Viktor Kocur, Sithu Aung, Gabrielle Flood, Yaqing Ding, Lukas Bujnak, Torsten Sattler, Zuzana Kukelova

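AI summary: This paper proposes the Depth2Pose benchmark for evaluating monocular depth estimators on downstream tasks. It combines depth predictions with feature matching and uses relative camera pose estimation accuracy as a proxy for depth quality, avoiding the high cost of traditional benchmarks that depend on per-pixel ground-truth depth.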

AI中文摘要

单目深度估计近年来有了显著进步,这得益于越来越强大的模型和大规模训练数据。预测的深度越来越多地被用作下游任务(如结构从运动SfM、视觉定位和SLAM)的输入信号。然而,单目深度估计器(MDEs)仍然主要基于深度准确性进行评估。标准度量方法对误差进行全局汇总,可能无法反映深度对下游几何任务的有用性。因此,我们提出Depth2Pose,一种用于评估MDEs在下游任务中的框架。通过将深度预测与深度感知几何求解器中的特征匹配相结合,我们使用相对相机姿态估计精度作为任务驱动的深度质量代理。传统基准要求以像素级深度形式提供密集的真实数据,这获取成本高昂。相反,我们的方法仅需要相机姿态,这可以高效地估计,例如使用结构从运动(SfM)流水线。因此,我们的框架可以应用于难以获取真实深度的场景,例如由于场景规模大或重叠(如植被环境)。利用这一点,我们引入了D2P数据集,其中包含挑战性场景,这些场景不在常用训练数据分布中。我们展示了在现有基准上表现良好的方法在相同数据集上也表现良好,但在我们的更具挑战性的数据集上未必能推广。最后,我们提供了一个简单且可扩展的评估框架。数据集和代码可在kocurvik.github.io/depth2pose获取。

英文摘要

Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at kocurvik.github.io/depth2pose.
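Pose-based evaluation ultimately compares an estimated relative pose with a reference pose. One common ingredient is the angular error between two rotation matrices, sketched below. This is the standard geodesic rotation error, offered as an illustration of the kind of quantity involved. The function name is hypothetical, and the exact metrics and thresholds used by Depth2Pose are not stated in the abstract.

```python
import numpy as np

def rotation_error_deg(r_gt, r_est):
    """Angle in degrees of the rotation that takes r_gt to r_est (both 3x3 rotation matrices)."""
    r_gt = np.asarray(r_gt, dtype=float)
    r_est = np.asarray(r_est, dtype=float)
    # trace(R) = 1 + 2 cos(theta) for a rotation by angle theta.
    cos_angle = (np.trace(r_gt.T @ r_est) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

A depth estimator that leads the solver to a smaller rotation (and translation-direction) error on the same correspondences would rank higher under a metric of this kind.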

2605.19792 2026-05-20 cs.CV

Mechanisms of Object Localization in Vision-Language Models

视觉-语言模型中物体定位的机制

Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig

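AI summary: This paper studies the core mechanisms of object localization in vision-language models. Analyzing models such as LLaVA-1.5 and InternVL-3.5, it finds that localization relies on a containerization mechanism and that only a small number of attention heads are involved in classification and localization, which offers guidance for future model design.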

Comments Accepted at CVPR 2026

AI中文摘要

视觉引导的语言模型(VLMs)在连接视觉和文本信息方面非常有效,但它们在基本的分类和定位任务上常常遇到困难。尽管分类机制已被广泛研究,但支持物体定位的过程仍不明确。在本工作中,我们使用一系列机械可解释性工具,包括token消融、注意力剔除和因果中介分析,研究了LLaVA-1.5和InternVL-3.5两个代表性家族。我们发现,定位由一种容器化机制驱动,其中对齐对象的token定义了物体的空间范围,而这些边界内token的语义排列与预测框关系不大。只有非常小的注意力头集介导了分类和定位的因果效应,对于LLaVA集中在早期-中期层,而对于InternVL集中在中期-后期层。这两个任务共享一些早期处理,但最终依赖于大量不同的专用头。总体而言,我们提供了VLMs中定位的首个层和头级账户,揭示了狭窄的计算路径,可以指导未来模型设计和基础目标。

英文摘要

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

2605.19786 2026-05-20 cs.CV

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

通过时空注意力链实现快速的4D网格生成

Dvir Samuel, Yuval Atzmon, Gal Chechik, Yoni Kasten

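AI summary: This paper proposes a training-free 4D mesh generation method that uses Spatio-Temporal Attention Chains to speed up generation and improve temporal correspondence quality. It generates a 4D mesh in 9 seconds, a 13x speedup, handles videos up to 16x longer, performs well on 2D object tracking and 4D tracking, and supports reliable camera estimation.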

Comments https://research.nvidia.com/labs/par/fast4dmesh/

AI中文摘要

4D网格生成最近已成为从视频中恢复动态3D结构的强大范式,但现有方法仍然缓慢、计算成本高且难以扩展到更长的序列。我们介绍了一种无需训练的方法,以加速4D网格生成并提高时间对应质量。我们的关键观察是,时间对应关系在4D骨干网络生成视觉准确的网格之前就已经在其中出现。我们利用这一发现,提出了一种通用框架,称为时空注意力链,该框架在空间和时间上传播信息。从锚定网格的顶点开始,链将顶点映射到潜在令牌,然后在潜在空间中跟随时间对应关系,并通过潜在到顶点的注意力恢复帧特定的顶点。这种设计避免了昂贵的显式匹配,同时保留了锚定网格的细节,从而改进了动态网格几何和时间一致性。与最先进的方法相比,我们的方法在9秒内生成4D网格,实现13倍的速度提升,同时产生更高质量的结果。此外,我们的方法可扩展到长达16倍的视频序列,而不降低网格质量。除了生成外,改进的对应关系使方法在两个下游任务上表现出色:2D物体跟踪和4D跟踪。我们进一步表明,我们的框架能够实现可靠的相机估计,这是先前4D网格生成方法所不支持的能力。

英文摘要

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

2605.19782 2026-05-20 cs.AI cs.LG cs.SE

Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

先验知识还是搜索?LLM代理在硬件感知代码优化中的研究

Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova, Evgeny Burnaev, Egor Shvetsov

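AI summary: This study examines whether LLM agents in hardware-aware code optimization rely on prior knowledge or on search. Three controlled experiments find that LLMs act as greedy optimizers in pure black-box optimization, that input-size information has no measurable effect in zero-shot kernel generation, and that in feedback-loop kernel optimization CUDA improves monotonically while TVM IR actively degrades. This indicates that LLMs in code optimization depend heavily on pretrained priors rather than on feedback or agentic structure.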

AI中文摘要

LLM发现和优化系统在各个领域中被越来越多地应用,实现了一个常见的提出-评估-修订循环。此类优化或发现过程通过上下文条件在接收到环境反馈后进行。然而,随着现代LLM代理在结构上日益复杂,难以评估哪些组件贡献最大,以及何时以及如何探索可能失败。我们通过三个受控实验回答这些问题。我们的发现:(1) 在纯黑盒优化中,LLM表现为贪婪优化器。(2) 在零样本内核生成中,提供显式输入大小信息没有可测量的影响,模型无论大小或温度都会收敛到相同的内核参数,仿佛大小指令是不可见的。此外,当被要求为不常见的内核大小进行内核优化时,性能会急剧下降,无论使用的语言如何。(3) 在反馈循环内核优化中,CUDA在迭代反馈下单调改进,而TVM IR则主动退化,这表明当模型以低密度语言操作时,内核优化会退化。我们的结果得出结论:在代码优化任务中,LLM高度依赖于预训练的先验而非提供的反馈或代理结构。

英文摘要

LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on received feedback from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect: models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate in a low-density language. We conclude that LLMs in code optimization tasks depend heavily on pretrained priors rather than on provided feedback or agentic structure.

2605.19781 2026-05-20 cs.AI

From SGD to Muon: Adaptive Optimization via Schatten-p Norms

从SGD到穆恩:通过Schatten-p范数实现自适应优化

Thomas Massena, Corentin Friedrich, Mathieu Serrurier

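AI summary: This paper proposes an adaptive optimization method based on Schatten-p norms. By dynamically selecting proxy-optimal LMO geometries for the update, it spans update rules from SGD to Muon, and it is validated in several training scenarios.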

AI中文摘要

现代优化器,如Muon,对其更新施加了矩阵级几何约束。这些矩阵级约束可以统一在线性最小化Oracle(LMO)理论下。然而,所有当前方法都对更新规则施加固定的LMO几何结构,这些结构是根据设计或经验选择的,不一定符合问题的几何特性。我们引入了一种新颖且高效的数据驱动标准,用于动态选择单个深度神经网络层的代理最优更新LMO几何结构。该标准通过使用单步随机特征回归替代模型,从梯度和激活统计信息中推导出闭合形式,从而在SGD到Muon的更新之间进行插值。此外,通过整合参数级预条件化,我们的框架能够恢复SGD、Muon、Adam和MuAdam作为特定极值。为了使这种自适应方法可扩展,我们将其与高效的计算策略相结合,仅在高度优化的基线模型上带来约3%的运行时间开销。作为概念验证,我们证明这种数据驱动的优化器在三个不同的训练场景中优于或至少与Muon和AdamW中表现最好的优化器相竞争。最终,这项工作提供了证据,证明LMO几何可以成功且高效地从运行时数据进行适应,为超越静态几何的优化器设计开辟了新的途径。

英文摘要

Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by design or empirically, which are not necessarily optimal according to the problem's geometry. We introduce a novel, efficient, data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the performance of the best performing optimizer between Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.
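To make the "SGD to Muon" design space concrete, the LMO over a unit Schatten-p ball has a closed form in terms of the SVD of the gradient, sketched below. This follows from Hölder duality between Schatten norms and is general background, not the paper's selection criterion or its efficient implementation. The function name, the restriction to p in (1, ∞], and the assumption of a nonzero gradient are choices made for this sketch.

```python
import numpy as np

def schatten_lmo(grad, p):
    """Return argmin over ||X||_{S_p} <= 1 of <grad, X>, for p in (1, inf] and grad != 0."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if np.isinf(p):
        # Conjugate exponent q = 1: every singular value becomes 1 (Muon-like direction).
        scaled = np.ones_like(s)
    else:
        q = p / (p - 1.0)  # Hölder conjugate, 1/p + 1/q = 1
        scaled = s ** (q - 1.0) / np.linalg.norm(s, ord=q) ** (q - 1.0)
    # u * scaled multiplies column i of u by scaled[i], i.e. U @ diag(scaled).
    return -(u * scaled) @ vt
```

At p = 2 the result is the negative gradient divided by its Frobenius norm (normalized SGD), and at p = ∞ it is the negative orthogonal polar factor, so intermediate values of p interpolate between the two geometries.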

2605.19779 2026-05-20 cs.AI cs.LG

Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

无分布不确定性量化用于连续AI代理评估

Yuxuan Gao, Megan Wang, Yi Ling Yu

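AI summary: This paper proposes a distribution-free uncertainty quantification method for continuous AI agent evaluation. It uses adaptive conformal inference (ACI) to provide coverage guarantees for forecasted quality scores and develops compositional uncertainty bounds for multi-agent pipelines, a conformal abstention rule for pairwise rankings, and FDR-corrected abstention for leaderboard-scale multiple testing.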

Comments 6 pages, 7 figures, 2 tables. Accepted at the ICML 2026 Workshop on Agentic Uncertainty Quantification (AgenticUQ) - Poster

AI中文摘要

我们适应了分割符合预测和适应性符合推断(ACI)用于连续AI代理评估,提供预测质量分数的无分布覆盖保证。符合区间在24小时范围内所有名义水平上实现了校准误差低于0.02,而ACI在代理发布后正确扩大了区间35%然后重新收敛。我们进一步开发了多代理管道的组合不确定性界限(通过模拟验证了不同阶段相关性rho在[-0.5, 0.9]范围内),一种用于成对排名的符合回避规则(具有受控的假排名率),以及领奖台规模多重检验的FDR校正回避方法。通过18个实时信号每小时收集的数据评估50个代理,我们显示每个代理的条件覆盖集中在名义水平(均值80.4%,90%的代理在[72%, 90%]范围内),并且跨源情感分歧预测排名不稳定性(r=0.64,p<0.01)。一个循环控制的验证确认了框架能够捕捉超过基准的信号(rho_s=0.52,p<0.01,n=35)。代码和数据在CC BY 4.0下发布。

英文摘要

We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations rho in [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross-source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity-controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.
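The two building blocks named in the abstract can be sketched in a few lines. The code below shows textbook split conformal prediction with absolute-residual scores and the standard ACI update of the miscoverage level. It is a generic illustration, not the paper's released code, and the function names and the default step size `gamma` are assumptions.

```python
import math
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha):
    """Half-width q so that [pred - q, pred + q] has at least 1 - alpha marginal coverage."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    # Rank of the finite-sample conformal quantile among the n calibration scores.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this level: only the trivial interval is valid.
        return math.inf
    return float(scores[k - 1])

def aci_update(alpha_t, target_alpha, covered, gamma=0.01):
    """ACI step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)
```

After a miss, `aci_update` lowers the working miscoverage level, so the next interval is wider. This is the mechanism behind intervals widening after a distribution change and narrowing again once coverage recovers.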

2605.18870 2026-05-20 cs.LG math.AP math.FA

Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows

多头变压器架构作为时间依赖的Wasserstein梯度流

Alex Massucco, Leonardo Del Grande, Marcello Carioni, Christoph Brune, Carola-Bibiane Schönlieb

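AI summary: This paper models the data flow in multi-headed transformer architectures as time-dependent Wasserstein gradient flows that capture the design of the attention mechanism. Under a suitable integrability assumption, it proves that elements of the ω-limit set of the gradient flows are stationary points of the interaction energy, analyzes the stability of the flows, and uses numerical experiments to confirm the predicted energy-dissipation identity and the asymptotic behavior of the dynamics.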

AI中文摘要

近年来,变压器架构已彻底改变了语言处理领域,开辟了前所未有的可能性。然而,从理论角度来看,文献中提出的数学模型往往缺乏与实际架构的直接联系,并依赖于强简化的假设。在本文中,我们通过将多头变压器架构中的数据流建模为时间依赖的梯度流,以捕捉注意力机制的设计,从而缩小这一差距。显式的时间依赖性使我们能够为每个头和每个层分配不同的权重,而无需对初始化方法施加限制。此外,我们证明,在合适积分性假设下,每个梯度流的ω-极限集元素都是交互能量在极限权重分布下的稳态点。最后,我们分析了梯度流的稳定性,考虑了初始数据和权重的扰动。一方面,我们研究了所提出模型对噪声输入的鲁棒性,建立了梯度流对初始数据的连续依赖性和流的唯一性。另一方面,我们证明了扰动的交互能量对未扰动能量的Γ收敛性,导致相应的梯度流收敛。我们通过数值实验补充了这些理论结果,验证了预测的能量耗散身份,并澄清了动力学在自主型(Ornstein-Uhlenbeck)和真正非自主型(振荡权重)两种情况下的渐近行为。

英文摘要

In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. However, from a theoretical point of view, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions. In this paper, we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. The explicit dependence on time allows us to consider different weights for each head and for each layer, without imposing constraints on the initialization method. Moreover, we prove that, under a suitable integrability assumption on the evolution of the weights, each element of the $ω$-limit set of the gradient flows is a stationary point of the interaction energy at a limiting weight distribution. Finally, we analyse the stability of the gradient flows considering perturbations of both the initial data and the weights. Specifically, on the one hand, we study the robustness of the proposed models with respect to noisy inputs, establishing a continuous dependence of the gradient flows on the initial data and uniqueness of the flows. On the other hand, we prove the $Γ$-convergence of the perturbed interaction energy to the unperturbed one, leading to the convergence of the corresponding gradient flows. We complement these theoretical results with numerical experiments that confirm the predicted energy-dissipation identity and clarify the asymptotic behavior of the dynamics in both the autonomous-like (Ornstein--Uhlenbeck) and the genuinely non-autonomous (oscillating-weights) regimes.

2605.18739 2026-05-20 cs.CV cs.DC

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

LongLive-2.0: 一个基于NVFP4的长视频生成并行基础设施

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han

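AI summary: This paper proposes LongLive-2.0, an NVFP4-based parallel infrastructure covering the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. Sequence-parallel autoregressive training combined with NVFP4 precision reduces GPU memory cost and accelerates training. The system also tunes a diffusion model into an autoregressive diffusion model for long video generation and achieves efficient inference and training across GPU architectures.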

Comments Code, model, and demos are available at https://github.com/NVlabs/LongLive

AI中文摘要

我们提出了LongLive-2.0,一个基于NVFP4的并行基础设施,贯穿长视频生成的整个训练和推理流程,以解决速度和内存瓶颈问题。在训练过程中,我们引入了序列并行自回归(AR)训练,具体实现为平衡SP,通过在每个rank上配对干净历史和噪声目标时间片段,共同设计高效的教师强制布局与SP执行,从而实现自然的教师强制掩码和SP-aware分块VAE编码。结合NVFP4精度,它减少了GPU内存成本并加速了GEMM计算,随着视频长度的增加,其比例增加。此外,我们表明高质量的基础设施和数据集能够实现显著清洁的训练流程。与现有Self-Forcing系列方法不同,LongLive-2.0直接调节扩散模型,使其成为长、多镜头、交互式自回归(AR)扩散模型。它可以进一步转换为实时生成(4到2去噪步骤)通过独立LoRA权重。在Blackwell GPU上进行推理时,我们启用了W4A4 NVFP4推理,将KV缓存量化为NVFP4以节省内存,并通过异步流式VAE解码提高端到端吞吐量。在非Blackwell GPU架构上,我们部署SP推理以匹配Blackwell GPU的速度,同时量化后的KV缓存可以降低SP的跨GPU通信。实验显示训练速度提高了2.15倍,推理速度提高了1.84倍。LongLive-2.0-5B在45.7 FPS的推理速度下实现了在基准测试中的强大性能。据我们所知,LongLive-2.0是首个针对长视频生成的NVFP4训练和推理系统。

英文摘要

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.

2605.18618 2026-05-20 cs.LG cs.AI

Stochastic Penalty-Barrier Methods for Constrained Machine Learning

随机罚函数-障碍方法用于约束机器学习

Adam Bosák, Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček

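AI summary: This paper proposes the Stochastic Penalty-Barrier Method (SPBM) for constrained optimization in the non-convex, non-smooth, stochastic setting of deep learning. The method combines exponential dual averaging, a stabilized penalty schedule, and the Moreau envelope to handle non-smoothness, and it is evaluated across multiple settings.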

AI中文摘要

约束机器学习能够实现公平性感知训练、物理信息神经网络以及将符号领域知识整合到统计模型中。尽管其实际重要性,但目前尚无通用方法能够处理深度学习中自然出现的非凸、非光滑、随机环境。我们提出随机罚函数-障碍方法(SPBM),通过指数对偶平均、稳定罚函数调度和Moreau包络扩展经典罚函数和障碍方法,以处理非光滑性。在多个设置中的实验表明,SPBM在匹配或优于现有约束优化基线的同时,仅比无约束Adam方法多出线性运行时间开销,最多可处理10,000个约束。

英文摘要

Constrained machine learning enables fairness-aware training, physics-informed neural networks, and integration of symbolic domain knowledge into statistical models. Despite its practical importance, no general method exists for the non-convex, non-smooth, stochastic setting that arises naturally in deep learning. We propose the Stochastic Penalty-Barrier Method (SPBM), which extends classical penalty and barrier methods to this setting via exponential dual averaging, a stabilized penalty schedule, and the Moreau envelope to handle non-smoothness. Experiments across multiple settings show that SPBM matches or outperforms existing constrained optimization baselines while incurring only linear runtime overhead compared to unconstrained Adam for up to 10,000 constraints.
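For readers unfamiliar with the smoothing device mentioned above, the Moreau envelope of a function $f$ with parameter $\lambda > 0$ is defined as below. This is the standard definition, given as background. How SPBM applies it to the penalty and barrier terms is not specified in the abstract, and the gradient formula holds in the convex case (and for weakly convex $f$ when $\lambda$ is small enough).

```latex
M_{\lambda} f(x) \;=\; \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\,\lVert x - y \rVert^{2} \Big\},
\qquad
\nabla M_{\lambda} f(x) \;=\; \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big).
```

The envelope is differentiable even where $f$ is not, and it has the same minimizers as $f$ in the convex case, which is what makes it a convenient surrogate for non-smooth terms.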

2605.18565 2026-05-20 cs.CL cs.AI

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

MINTEval: 评估长时间跨度智能体系统中的多目标干扰下的记忆

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

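AI summary: This paper proposes the MINTEval benchmark for evaluating agent memory under long horizons and multi-target interference, using long interconnected contexts, diverse domains, and diverse question types to test the robustness and generalization of memory-augmented agents.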

Comments Equal contribution; order decided by a coin flip. Code and data: https://github.com/amy-hyunji/MINTEval

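Details
AI-generated abstract (translated from Chinese)

Real-world agents operate over long, evolving horizons in which information is continually updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. Existing benchmarks, however, focus mainly on static, independent recall and cannot capture the interactions among these dynamically evolving memories. In this paper, we study how current memory-augmented agents perform in long-horizon, interference-heavy settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts containing frequently updated information, which produces substantial interference; (2) multiple domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization; and (3) multiple question types that assess robustness to interference, including (i) single-target recall tasks, which require retrieving a specific target from a long context, and (ii) multi-target aggregation tasks, which require reasoning over multiple relevant pieces of information. Overall, MINTEval contains 15.6k question-answer pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (27.9% accuracy on average), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is limited mainly by retrieval and memory construction. In addition, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by later context, with accuracy dropping as the number of intervening updates grows.

English abstract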

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.
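
The distinction between the two question types, and the breakdown of accuracy by intervening updates, can be sketched with a toy state-tracking example in Python. Everything here is hypothetical: the update log, the function names, and the scoring helper are illustrative assumptions and do not reflect the actual MINTEval data format or evaluation code.

```python
from collections import Counter

# Hypothetical update log: (entity, value) pairs in chronological order.
# Later entries for the same entity overwrite, and so interfere with, earlier ones.
LOG = [("alice", 3), ("bob", 5), ("alice", 7), ("carol", 2), ("bob", 1)]


def latest_state(log):
    """Replay the log and keep only the most recent value per entity."""
    state = {}
    for key, value in log:
        state[key] = value
    return state


def single_target(log, key):
    """Single-target recall: the current value of one entity."""
    return latest_state(log)[key]


def multi_target_sum(log, keys):
    """Multi-target aggregation: combine the current values of several entities."""
    state = latest_state(log)
    return sum(state[k] for k in keys)


def accuracy_by_updates(records):
    """Group (num_intervening_updates, is_correct) records and score each group."""
    totals, hits = Counter(), Counter()
    for n_updates, is_correct in records:
        totals[n_updates] += 1
        hits[n_updates] += int(is_correct)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


print(single_target(LOG, "alice"))
print(multi_target_sum(LOG, ["alice", "bob"]))
print(accuracy_by_updates([(0, True), (0, False), (2, False)]))
```

A system that answers from a stale entry (for example, returning 3 for "alice") gets the single-target question wrong, and any aggregate that depends on that entity inherits the error.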