arXivDaily: daily arXiv digest, updated Monday through Friday
2512.14937 2026-06-12 cs.CV cs.AI (updated version)

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

Affiliations: Sheikh Zayed Institute for Pediatric Surgical Innovation; Children's National Hospital; University of Madrid; CIBER-BBN ISCIII; School of Medicine and Health Sciences; George Washington University

AI summary: To address systematic errors made by pre-trained models in glioma segmentation, the paper proposes adaptive post-processing techniques that improve the ranking metric in the BraTS 2025 challenges by 14.9% (sub-Saharan Africa) and 0.9% (adult glioma), encouraging a shift toward efficient, equitable, and sustainable post-processing strategies.

Abstract

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrate the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
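One of the systematic errors named in this abstract, isolated false positives, is often handled by discarding small connected components of the predicted mask. The sketch below illustrates that generic idea only; it is not the authors' adaptive method. The function name `remove_small_components`, the 2D list-of-lists mask format, the 4-connectivity, and the `min_size` threshold are all assumptions made for illustration.

```python
from collections import deque


def remove_small_components(mask, min_size):
    """Zero out 4-connected foreground components smaller than min_size.

    mask: list of lists of 0/1 for a single 2D slice (hypothetical input format).
    Returns a new mask; the input is left unchanged.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [row[:] for row in mask]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the threshold are treated as false positives.
            if len(component) < min_size:
                for y, x in component:
                    out[y][x] = 0
    return out


# Toy slice: a 2x2 blob plus one isolated pixel.
example = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
cleaned = remove_small_components(example, min_size=2)
```

In practice a 3D volume and a library routine for connected-component labeling would be the usual choice; the pure-Python version is kept here only so the sketch has no dependencies.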

2512.14648 2026-06-12 cs.CV eess.IV (updated version)

Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-Guided Subtyping and Lesion-Wise Model Ensemble

Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

Affiliations: Sheikh Zayed Institute for Pediatric Surgical Innovation; Children's National Hospital; University of Washington; Universidad Politécnica de Madrid; CIBER-BBN ISCIII; School of Medicine and Health Sciences

AI summary: Proposes a flexible, modular, and adaptable segmentation pipeline that uses radiomic features to detect tumor subtypes and balance training, and uses lesion-level performance metrics to optimize the model ensemble and post-processing. It achieves performance comparable to top-ranked algorithms in the BraTS 2025 challenges and supports quantitative tumor measurement in clinical practice.

Comments
12 pages, 5 figures, 3 tables. Algorithm presented at MICCAI BraTS 2025

Abstract

Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtypes, enabling more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.

2506.18438 2026-06-12 cs.CV (updated version)

CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Affiliations: Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam; Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia; Department of Computer Science, University of Dayton, Dayton, Ohio, US

AI summary: Proposes CPAM, a zero-shot framework that uses context-preserving adaptive manipulation and mask guidance to edit complex, non-rigid real images while preserving texture and identity, without fine-tuning.

Comments
Accepted to IEEE Transactions on Multimedia. Project page: this https URL

Abstract

Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: this https URL

2604.08983 2026-06-12 cs.RO (updated version)

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

Zhi Jing, Jinbin Qiao, Ouyang Lu, Jicong Ao, Shuang Qiu, Huazhe Xu, Yu-Gang Jiang, Chenjia Bai

Affiliations: Fudan University; Institute of Artificial Intelligence (TeleAI), China Telecom; Tianjin University; Northwestern Polytechnical University; Tsinghua University; City University of Hong Kong

AI summary: Proposes AssemLM, a multimodal large language model that integrates assembly manuals, point clouds, and textual instructions. A specialized point cloud encoder extracts geometric and rotational features for accurate 6D assembly pose reasoning. The paper also introduces AssemBench, a benchmark with over 900K samples, and reports state-of-the-art 6D pose reasoning along with real-robot assembly evaluations.

Comments
Project Page: this https URL

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.

2604.07590 2026-06-12 cs.IR cs.AI (updated version)

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

Valerii Kovalskii, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Maksim Maksimov

Affiliations: red_mad_robot

AI summary: Proposes DCD (Domain-Collection-Document), a hierarchical design that uses structured knowledge representation and multi-stage routing to restrict retrieval and generation scope without modifying the language model, improving the robustness and accuracy of RAG on heterogeneous corpora and multi-step queries.

Comments
14 pages, 4 figures, 2 links, link to HF this https URL, link to GIT this https URL

Abstract

Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

2503.02178 2026-06-12 stat.ML cs.LG (updated version)

Central Limit Theorems for Stochastic Gradient Descent Quantile Estimators

Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu

Affiliations: Department of Statistics, University of Chicago; Department of Statistics and Data Science, Washington University in St. Louis

AI summary: For quantile estimation by SGD with a constant learning rate, the paper uses Markov chain theory to prove that the centered and standardized stationary distribution converges to a Gaussian as the learning rate tends to zero, giving the first CLT-type theoretical guarantee, and proposes a recursive algorithm for constructing confidence intervals.

Abstract

This paper develops asymptotic theory for quantile estimation via stochastic gradient descent (SGD) with a constant learning rate. The quantile loss function is neither smooth nor strongly convex. Beyond conventional perspectives and techniques, we view quantile SGD iteration as an irreducible, periodic, and positive recurrent Markov chain, which cyclically converges to its unique stationary distribution regardless of the arbitrarily fixed initialization. To derive the exact form of the stationary distribution, we analyze the structure of its characteristic function by exploiting the stationary equation. We also derive tight bounds for its moment generating function (MGF) and tail probabilities. Synthesizing the aforementioned approaches, we prove that the centered and standardized stationary distribution converges to a Gaussian distribution as the learning rate $\eta\rightarrow0$. This finding provides the first central limit theorem (CLT)-type theoretical guarantees for the quantile SGD estimator with constant learning rates. We further propose a recursive algorithm to construct confidence intervals of the estimators with statistical guarantees. Numerical studies demonstrate the effective finite-sample performance of the online estimator and inference procedure. The theoretical tools developed in this study are of independent interest for investigating general SGD algorithms formulated as Markov chains, particularly in non-strongly convex and non-smooth settings.
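For readers unfamiliar with the estimator being analyzed, the following is a minimal sketch of quantile estimation by SGD with a constant learning rate, using the standard subgradient step on the quantile (pinball) loss, theta <- theta + eta * (tau - 1{x <= theta}). The function name `quantile_sgd`, the starting point, and the synthetic standard normal data are assumptions for illustration; the confidence interval procedure from the paper is not shown.

```python
import random


def quantile_sgd(samples, tau, eta, theta0=0.0):
    """Estimate the tau-quantile with constant-step SGD on the pinball loss."""
    theta = theta0
    for x in samples:
        # Subgradient step: move up by eta*tau if x > theta,
        # move down by eta*(1 - tau) if x <= theta.
        indicator = 1.0 if x <= theta else 0.0
        theta += eta * (tau - indicator)
    return theta


# Synthetic data (assumption): standard normal draws, whose true median is 0.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
median_estimate = quantile_sgd(data, tau=0.5, eta=0.01)
```

Because the step size never shrinks, the iterate does not settle at a point; it keeps fluctuating around the true quantile, and the distribution of those fluctuations is the stationary distribution the abstract refers to.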

2305.08175 2026-06-12 cs.DB cs.CR cs.LG (updated version)

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

Guanlin He, Yingtai Xiao, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer

Affiliations: The Pennsylvania State University; Binghamton University; Duke University; TikTok Inc.

AI summary: Proposes two scalable matrix mechanisms, ResidualPlanner and ResidualPlanner+, which respectively optimize the accuracy of marginal queries and support more complex workloads (such as range queries), substantially outperforming existing methods in speed and memory use.

Abstract

Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal, but it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in both accuracy and speed.

2603.29515 2026-06-12 cs.LG (updated version)

Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems

David Gonzalez, Alba Muixi, Beatriz Moya, Elias Cueto

Affiliations: Keysight-UZ Chair of the Spanish National Strategy on AI; Aragon Institute of Engineering Research (I3A); Universidad de Zaragoza; Laboratori de Càlcul Numèric (LaCàN); Universitat Politècnica de Catalunya - BarcelonaTech (UPC); Centre Internacional de Mètodes Numèrics en Enginyeria (CIMNE); PIMM Lab. Arts et Métiers Institute of Technology

AI summary: Proposes a variational graph neural network (VGNN) that introduces variational layers in the decoder to quantify epistemic (cognitive) and statistical uncertainty at relatively low cost, validated on solid-mechanics inverse problems with high-accuracy parameter recovery and confidence interval estimation.

Abstract

The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal because they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable, for instance due to noise. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate epistemic (cognitive) uncertainty and aleatoric (statistical) uncertainty at a relatively low cost. We validate the proposed methodology in two solid mechanics cases: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision but also provides confidence intervals consistent with the physics of the problem. It also locates the position of the applied load and estimates its value, giving a confidence interval for that experiment.

2601.06572 2026-06-12 cs.LG cs.AI (updated version)

Hellinger Multimodal Variational Autoencoders

Huyen Vo, Isabel Valera

Affiliations: Department of Computer Science, Saarland University; MPI-SWS, Saarland Informatics Campus

AI summary: Proposes a moment-matching approximation based on the Hellinger distance and HELVAE, a multimodal variational autoencoder that avoids sub-sampling and achieves a better trade-off between generative coherence and quality.

Comments
Accepted at AISTATS 2026. Camera-ready version

Abstract

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha\text{-divergence}$ family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
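To make the two baseline aggregation rules mentioned in this abstract concrete, the sketch below combines one-dimensional Gaussian unimodal posteriors with a product of experts (PoE) and computes the first two moments of a uniform mixture of experts (MoE). It uses textbook formulas only; the Hellinger approximation proposed in the paper is not reproduced here. The function names and the toy numbers are assumptions.

```python
def product_of_experts(means, variances):
    """PoE of 1D Gaussians: precisions add, the mean is precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


def mixture_moments(means, variances):
    """Mean and variance of a uniform mixture of 1D Gaussians."""
    n = len(means)
    mean = sum(means) / n
    # E[z^2] of the mixture is the average of each component's second moment.
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    return mean, second_moment - mean * mean


# Two toy experts that disagree about the latent mean (made-up numbers).
poe_mean, poe_var = product_of_experts([0.0, 2.0], [1.0, 1.0])
moe_mean, moe_var = mixture_moments([0.0, 2.0], [1.0, 1.0])
```

The contrast is the usual one: the PoE result is sharper than either expert, while the mixture is broader than either because it keeps both modes.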

2603.25450 2026-06-12 cs.AI (updated version)

Cross-Model Disagreement as a Label-Free Correctness Signal

Matt Gorbett, Suman Jana

Affiliations: Independent Researcher; Department of Computer Science, Columbia University

AI summary: Proposes cross-model disagreement as a label-free correctness indicator: errors are detected from a verifier model's perplexity or entropy on the generating model's answer, with no training or labels required, outperforming within-model uncertainty methods on multiple benchmarks.

Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
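The two signals described in this abstract can be sketched from quantities a single verifier forward pass would provide. The code below assumes those quantities are already available as plain Python lists; the function names, input formats, and toy numbers are hypothetical, and the paper's exact definitions (for example, any normalization or prompt handling) may differ.

```python
import math


def cross_model_perplexity(token_probs):
    """Perplexity of the verifier on the generator's answer tokens.

    token_probs: probability the verifier assigns to each answer token
    (hypothetical input, one value per answer position).
    """
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def cross_model_entropy(distributions):
    """Mean entropy of the verifier's next-token distribution over answer positions.

    distributions: one full probability distribution per answer position
    (hypothetical input format).
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in distributions
    ]
    return sum(entropies) / len(entropies)


# Toy values (made up): a two-token answer over a two-word vocabulary.
cmp_score = cross_model_perplexity([0.5, 0.25])
cme_score = cross_model_entropy([[0.5, 0.5], [1.0, 0.0]])
```

Higher values of either score mean the verifier is more surprised by, or less certain about, the answer, which is the disagreement signal the abstract describes.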

2601.19072 2026-06-12 cs.SE cs.AI (updated version)

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

Affiliations: Monash University, Australia; The University of Melbourne, Australia; Atlassian, USA

AI summary: Proposes HalluJudge, a reference-free hallucination detection method that assesses the grounding of generated review comments through context alignment, using strategies that range up to multi-branch reasoning. It reaches an F1 score of 0.85 at an average cost of $0.009, and 67% of its assessments align with developer preference.

Comments
Accepted at FSE'26: Industry Track, Full-Length, Peer-Reviewed

Abstract

Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for hallucination detection in LLM-generated code review comments without a reference. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

2603.21563 2026-06-12 cs.AI (updated version)

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Zhongyi Li, Wan Tian, Jinju Chen, Huiming Zhang, Yang Liu, Yikun Ban, Fuzhen Zhuang

Affiliations: Beihang University; Peking University; Beijing University of Posts and Telecommunications

AI summary: To address credit assignment in collaborative multi-agent LLM systems, the paper proposes two credit signals, counterfactual credit estimation (CCPO) and verifier-anchored self- and peer-evaluation (SEPO), which turn team rewards into agent-specific learning signals and improve performance on mathematical reasoning tasks.

Abstract

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment methods for converting joint outcomes into agent-specific learning signals. Counterfactual Credit for Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Credit for Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant. Both operate at the reward-construction layer rather than as policy optimizers, producing role-specific rewards or advantages for GRPO, GSPO, or REINFORCE++. We instantiate these credit signals in a sequential Think--Solve setting and evaluate them on mathematical reasoning benchmarks. Results show that explicit credit assignment often improves dual-agent reasoning, especially on MATH500 and several out-of-distribution settings, while gains vary across models and datasets. Our code is available at: this https URL.
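The counterfactual comparison described in this abstract, realized joint outcome minus the outcome with one agent removed, can be sketched as follows. How an agent is "removed" in the paper's Think-Solve setting is not specified in the abstract, so the `evaluate` callable, the agent names, and the reward table below are all hypothetical.

```python
def counterfactual_credits(agents, evaluate):
    """Return each agent's marginal contribution to the joint outcome.

    agents: list of agent names.
    evaluate: hypothetical callable mapping a tuple of active agents
    to a scalar team reward.
    """
    joint_outcome = evaluate(tuple(agents))
    credits = {}
    for agent in agents:
        # Counterfactual team: everyone except the agent being scored.
        others = tuple(a for a in agents if a != agent)
        credits[agent] = joint_outcome - evaluate(others)
    return credits


# Toy reward table (made-up numbers): the solver alone scores 0.4,
# the thinker alone never yields a final answer, together they score 0.7.
toy_rewards = {
    ("thinker", "solver"): 0.7,
    ("solver",): 0.4,
    ("thinker",): 0.0,
}
credits = counterfactual_credits(["thinker", "solver"], lambda team: toy_rewards[team])
```

Under a shared terminal reward both agents would receive 0.7; the counterfactual signal instead separates what each one added, which is the free-riding concern the abstract raises.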

2603.16013 2026-06-12 cs.RO cs.SE (updated version)

Safety Case Patterns for VLA-based driving systems: Insights from SimLingo

Gerhard Yu, Fuyuki Ishikawa, Oluwafemi Odu, Alvine Boaye Belle

Affiliations: York University; National Institute of Informatics

AI summary: Proposes RAISE, a safety case design approach for VLA-based driving systems that extends HARA and introduces tailored patterns, and uses a SimLingo case study to illustrate how it supports evidence-based safety claims.

Abstract

Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving as well as understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.

2603.17527 2026-06-12 stat.ML cs.LG math.OC 版本更新

Mirror Descent on Riemannian Manifolds

黎曼流形上的镜像下降

Jiaxin Jiang, Lei Shi, Jiyuan Tan

发表机构 * School of Mathematical Sciences, Fudan University, Shanghai 200433, China(复旦大学数学学院,上海200433,中国) Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China(上海当代应用数学重点实验室,复旦大学,上海200433,中国)

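AI summary: Generalizes mirror descent to Riemannian manifolds, proposing Riemannian Mirror Descent (RMD) via reparameterization together with a stochastic variant, and establishes non-asymptotic convergence guarantees; on the Stiefel manifold, RMD reduces to Curvilinear Gradient Descent (CGD).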

详情
AI中文摘要

镜像下降(MD)是一种可扩展的一阶方法,广泛应用于大规模优化,包括图像处理、策略优化和神经网络训练。本文将MD推广到黎曼流形上的优化。具体地,我们通过重参数化开发了一个黎曼镜像下降(RMD)框架,并进一步提出了RMD的随机变体。我们还为RMD和随机RMD建立了非渐近收敛保证。作为在Stiefel流形上的应用,我们的RMD框架退化为[26]中提出的曲线梯度下降(CGD)方法。此外,当将随机RMD框架特化到Stiefel设置时,我们得到了CGD的随机扩展,这有效地解决了大规模流形优化问题。

英文摘要

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.
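For readers unfamiliar with the classical method being generalized, the following is a minimal sketch of Euclidean mirror descent on the probability simplex with the negative-entropy mirror map (the exponentiated-gradient update). It is not the Riemannian variant from the paper; the objective, step size, and function name are illustrative assumptions.

```python
import math

def entropic_mirror_descent(grad, x0, step, iters):
    """Classical mirror descent on the probability simplex using the
    negative-entropy mirror map (exponentiated-gradient update)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Mirror step: a gradient step taken in the dual (log) coordinates.
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        # Normalising is the Bregman (KL) projection back onto the simplex.
        x = [wi / total for wi in w]
    return x

# Hypothetical linear objective f(x) = <c, x>; its minimiser on the simplex
# is the vertex with the smallest cost coefficient (index 1 here).
c = [0.9, 0.2, 0.5]
x_final = entropic_mirror_descent(lambda x: c, [1 / 3, 1 / 3, 1 / 3], step=0.5, iters=200)
```

The iterate stays on the simplex at every step without an explicit Euclidean projection, which is the property that makes mirror maps attractive for constrained problems.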

2603.14482 2026-06-12 cs.CV 版本更新

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

V-JEPA 2.1: 解锁视频自监督学习中的密集特征

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

发表机构 * FAIR at Meta(Meta的FAIR) Universidad de Zaragoza(萨拉戈萨大学)

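AI summary: Presents V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for images and videos through a dense predictive loss, deep self-supervision, multi-modal tokenizers, and effective scaling, reaching state-of-the-art performance on several benchmarks.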

详情
AI中文摘要

我们提出V-JEPA 2.1,一系列自监督模型,能够学习图像和视频的密集、高质量视觉表示,同时保持强大的全局场景理解。该方法结合了四个关键组件。首先,密集预测损失使用基于掩码的目标,其中可见和掩码令牌都贡献于训练信号,鼓励显式的空间和时间接地。其次,深度自监督在多个中间编码器层上分层应用自监督目标,以提高表示质量。第三,多模态分词器实现了图像和视频的统一训练。最后,该模型受益于模型容量和训练数据的有效缩放。这些设计选择共同产生了空间结构、语义一致和时间连贯的表示。实验上,V-JEPA 2.1在几个具有挑战性的基准上取得了最先进的性能,包括在Ego4D上短期物体交互预测的7.71 mAP,在EPIC-KITCHENS上高级动作预测的40.8 Recall@5,以及在实际机器人抓取成功率上比V-JEPA-2 AC提高了20个百分点。该模型还在机器人导航(TartanDrive上5.687 ATE)、深度估计(NYUv2上线性探针0.307 RMSE)和全局识别(Something-Something-V2上77.7)方面表现出强大的性能。这些结果表明,V-JEPA 2.1显著推进了密集视觉理解和世界建模的最新技术。

英文摘要

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

2603.15158 2026-06-12 cs.LG 版本更新

Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

在不完美代理下潜在偏移中鲁棒预测器的点识别

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

发表机构 * Department of Computer Science, Aalto University(阿尔托大学计算机科学系) Department of Computer Science, University of Helsinki(赫尔辛基大学计算机科学系) ELLIS Institute Finland(芬兰埃利斯研究所) Department of Computer Science, Manchester University(曼彻斯特大学计算机科学系)

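AI summary: For domain adaptation under shift caused by latent confounders, proposes a point-identification approach based on latent equivalent classes that replaces the strong completeness assumption with a cross-domain rank condition, and designs the active learning framework PQAL for robust prediction.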

详情
AI中文摘要

当跨域的分布偏移源于同时影响协变量和结果的潜在混淆变量时,域适应问题变得更加具有挑战性。现有的基于代理的方法通过强完备性假设来唯一确定(点识别)鲁棒预测器。完备性要求代理具有关于潜在混淆变量变化的足够信息。对于不完美代理,从混淆变量到代理分布空间的映射是非单射的,多个潜在混淆变量值可能生成相同的代理分布。这破坏了完备性假设,观测数据与多个潜在预测器(集识别)一致。为了解决这个问题,我们引入了潜在等价类(LECs)。LECs定义为诱导相同条件代理分布的潜在混淆变量组。我们证明,只要多个域在如何混合代理诱导的LECs以形成鲁棒预测器方面有足够差异,鲁棒预测器的点识别仍然可以实现。这种域多样性条件被形式化为混合权重的跨域秩条件,该条件比完备性假设弱得多。我们提出了近端准贝叶斯主动学习(PQAL)框架,该框架主动查询满足该秩条件的小型、有针对性的多样化域集合。PQAL可以恢复点识别的预测器,展示了对不同程度偏移的鲁棒性,并在合成数据、半合成dSprites、IHDP、ACS Folktables数据集上优于先前方法。

英文摘要

Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active Learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL can recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and on the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.
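To make the idea of a cross-domain rank condition concrete, here is a small sketch that checks whether a matrix of per-domain mixture weights has full column rank. The abstract does not state the exact form of the condition, so "full column rank over the LECs" is an assumption, as are the function names and the example weights.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank

def satisfies_rank_condition(weights):
    """weights[d][k]: mixture weight of LEC k in domain d (assumed known here).
    Assumption: the condition holds when the matrix has full column rank."""
    return matrix_rank(weights) == len(weights[0])

# Hypothetical examples: three domains, three latent equivalence classes.
diverse = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
redundant = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.5, 0.3, 0.2]]
```

In this toy setting the `diverse` domains mix the classes differently enough to pass the check, while the `redundant` domains carry no additional information beyond the first one.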

2603.11249 2026-06-12 cs.LG 版本更新

Differentiable Thermodynamic Phase-Equilibria for Machine Learning

可微热力学相平衡用于机器学习

Karim K. Ben Hicham, Moreno Ascani, Jan G. Rittig, Alexander Mitsos

发表机构 * RWTH Aachen University(亚琛工业大学) Process Systems Engineering (AVT.SVT)(过程系统工程) Forschungszentrum Jülich GmbH(于利希研究中心) Institute of Climate and Energy Systems ICE-1(气候与能源系统研究所) Energy Systems Engineering(能源系统工程) JARA-ENERGY

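AI summary: Proposes DISCOMAX, a differentiable phase-equilibrium algorithm that combines discrete enumeration with a masked softmax to enable thermodynamically consistent end-to-end learning, outperforming existing methods on binary liquid-liquid equilibrium data.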

详情
Comments
45 pages, 27 figures, 5 tables
AI中文摘要

相平衡的准确预测仍是化学工程中的核心挑战。将热力学结构融入神经网络的物理一致性机器学习方法最近在活度系数建模中表现出色。然而,将此类方法扩展到源于极值原理的平衡数据(如液液平衡)仍然困难。本文提出DISCOMAX,一种用于相平衡计算的可微算法,在训练和推理时均保证热力学一致性,仅受用户指定的离散化影响。该方法将可行相态的离散枚举与反向传播中的掩码softmax聚合相结合,在前向传播中传播真实平衡态,使用直通梯度估计器实现神经gE模型的物理一致性端到端学习。我们展示了该方法与统计热力学的类比,并在二元液液平衡数据上评估,其优于现有基于代理的方法,同时为从不同种类的平衡数据中学习提供了通用框架。

英文摘要

Accurate prediction of phase equilibria remains a central challenge in chemical engineering. Physics-consistent machine learning methods that incorporate thermodynamic structure into neural networks have recently shown strong performance for activity-coefficient modeling. However, extending such approaches to equilibrium data arising from an extremum principle, such as liquid-liquid equilibria, remains difficult. Here we present DISCOMAX, a differentiable algorithm for phase-equilibrium calculation that guarantees thermodynamic consistency at both training and inference, only subject to a user-specified discretization. The method combines discrete enumeration of feasible phase states with masked softmax aggregation in the backward pass and propagation of the true equilibrium state in the forward pass, using a straight-through gradient estimator to enable physics-consistent end-to-end learning of neural $g^E$-models. We show that this approach bears analogy to statistical thermodynamics, and we evaluate it on binary liquid-liquid equilibrium data where it outperforms existing surrogate-based methods, while offering a general framework for learning from different kinds of equilibrium data.
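The forward/backward split described above can be illustrated with a hand-written straight-through pattern: the forward value comes from the hard minimum over feasible candidates, while the gradient comes from a masked softmax relaxation. This is only a sketch of the general pattern, not the DISCOMAX algorithm itself; the candidate values, energies, temperature, and function names are hypothetical.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over entries where mask is True; masked entries get weight 0."""
    top = max(s for s, keep in zip(scores, mask) if keep)  # for numerical stability
    exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def straight_through_select(values, energies, mask, temperature=0.1):
    """Forward: value of the feasible candidate with the lowest energy.
    Backward (surrogate): gradient of the softmax-weighted mean w.r.t. energies."""
    feasible = [i for i, keep in enumerate(mask) if keep]
    best = min(feasible, key=lambda i: energies[i])
    forward = values[best]
    probs = masked_softmax([-e / temperature for e in energies], mask)
    soft_mean = sum(p * v for p, v in zip(probs, values))
    # d(soft_mean)/d(energy_i) = -(1/T) * p_i * (v_i - soft_mean)
    surrogate_grad = [-(p / temperature) * (v - soft_mean) for p, v in zip(probs, values)]
    return forward, surrogate_grad

# Hypothetical discretised candidates; the last one is marked infeasible.
values = [0.1, 0.3, 0.5, 0.7]
energies = [0.4, -0.2, -0.5, -0.9]
mask = [True, True, True, False]
forward, grad = straight_through_select(values, energies, mask)
```

The masked candidate has the lowest energy but contributes neither to the forward value nor to the gradient, which is the role the mask plays in this sketch.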

2603.14483 2026-06-12 cs.LG 版本更新

Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention

解耦动力系统:因果表示学习遇见局部稀疏注意力

Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner

发表机构 * Applied Artificial Intelligence Lab, Oxford Robotics Institute, Oxford, UK(应用人工智能实验室,牛津机器人研究所,英国牛津)

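AI summary: Proposes a method combining causal representation learning with local sparse attention to disentangle system parameters from raw trajectory data without structural assumptions, with identifiability guaranteed by a graphical criterion.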

详情
Comments
Presented as an Oral at the 5th Conference on Causal Learning and Reasoning
AI中文摘要

参数化系统辨识方法从数据中估计显式定义的物理系统的参数。然而,它们仍然受限于需要提供显式函数空间,通常通过基于可用领域知识预定义的候选函数库。相比之下,深度学习能够以高保真度对广泛复杂性的系统进行建模,但黑箱函数逼近通常无法产生揭示系统结构的显式描述性或解耦表示。我们开发了一种新的可辨识性定理,利用因果表示学习,在没有结构假设的情况下发现系统参数的解耦表示。我们推导了一个图论准则,指定何时系统参数可以从原始轨迹数据中唯一解耦,直至置换和微分同胚。关键的是,我们的分析表明,全局因果结构为考虑局部状态依赖因果结构时可实现的解耦保证提供了下界。我们将系统参数识别实例化为变分推断问题,利用稀疏正则化变换器来发现状态依赖的因果结构。我们在四个合成领域上实证验证了我们的方法,证明了其恢复基线方法无法恢复的高度解耦表示的能力。与我们的理论分析一致,我们的结果证实了强制局部因果结构通常对于完全可辨识性是必要的。

英文摘要

Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.

2603.14407 2026-06-12 cs.LG 版本更新

Towards One-for-All Anomaly Detection for Tabular Data

面向表格数据的通用异常检测

Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

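AI summary: Proposes OFA-TAD, a framework that uses multi-view neighbor-distance representations and a Mixture-of-Experts scoring network for generalist cross-domain tabular anomaly detection, generalizing to unseen datasets after one-time training.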

详情
Comments
Accepted by ICML 2026
AI中文摘要

表格异常检测(TAD)旨在识别表格数据中偏离大多数样本的样本,在许多实际应用中至关重要。然而,现有方法遵循“一个数据集一个模型(OFO)”范式,依赖于数据集特定的训练,导致计算成本高且对未见领域的泛化能力有限。为解决这些局限性,我们提出OFA-TAD,一个通用的“一个模型适用所有数据集(OFA)”TAD框架,只需在多个源数据集上进行一次训练,即可即时泛化到来自不同领域的未见数据集。为实现通用表格异常检测,OFA-TAD提取邻居距离模式作为可迁移线索,并引入来自多个变换诱导度量空间的多视图邻居距离表示,以减轻距离分布对变换的敏感性。为自适应组合多视图距离证据,采用混合专家(MoE)评分网络进行视图特定异常评分和熵正则化门控融合,并采用多策略异常合成机制以支持单类约束下的训练。在来自14个领域的34个数据集上的大量实验表明,OFA-TAD在严格的OFA设置下实现了优越的异常检测性能和强大的跨领域泛化能力。源代码见此 https URL。

英文摘要

Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many real-world applications. However, existing methods follow a "one model for one dataset (OFO)" paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on-the-fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting. The source code is available at this https URL.
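As a rough illustration of why neighbor-distance patterns can act as dataset-agnostic cues, the sketch below computes each sample's sorted distances to its k nearest neighbours and uses their mean as a naive anomaly score. This is a single-view toy example under assumed names and data; it omits the multi-view transformations and the MoE scoring network described in the abstract.

```python
import math

def neighbor_distance_profile(points, k):
    """For each point, return the sorted distances to its k nearest neighbours."""
    profiles = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        profiles.append(dists[:k])
    return profiles

# Hypothetical toy table: four clustered rows and one isolated row.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
profiles = neighbor_distance_profile(data, k=2)
# Naive score: mean neighbour distance (larger means more isolated).
scores = [sum(p) / len(p) for p in profiles]
```

The profile has the same length k regardless of how many columns the table has, which hints at why such features can transfer across datasets with different schemas.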

2603.10834 2026-06-12 cs.CV cs.AI 版本更新

On the Reliability of Cue Conflict and Beyond

论线索冲突的可靠性及其超越

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology(蔚山科学技术院) College of Medicine, Hanyang University(汉阳大学医学院) NAVER AI Lab(NAVER AI实验室)

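AI summary: To address unstable and ambiguous shape-texture bias estimates from existing cue-conflict benchmarks, proposes REFINED-BIAS, a dataset and evaluation framework that uses explicit definitions of shape and texture, balanced cue pairs, and a ranking-based metric for more reliable and interpretable bias diagnosis.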

详情
Comments
Shape-Texture Bias, Cue Conflict Benchmark
AI中文摘要

理解神经网络如何依赖视觉线索提供了其内部决策过程的人类可解释视角。线索冲突基准在探究形状-纹理偏好以及激发更强、类人形状偏差通常与改进的域内性能相关的见解方面具有影响力。然而,我们发现当前基于风格化的实例化可能产生不稳定和模糊的偏差估计。具体来说,风格化可能无法可靠地实例化感知上有效且可分离的线索,也无法控制其相对信息量;基于比率的偏差可能掩盖绝对线索敏感性;将评估限制在预选类别可能忽略完整决策空间而扭曲模型预测。这些因素共同可能将偏好与线索有效性、线索平衡和可识别性伪影混淆。我们引入了REFINED-BIAS,一个用于可靠和可解释的形状-纹理偏差诊断的集成数据集和评估框架。REFINED-BIAS使用形状和纹理的显式定义构建平衡的、人类和模型可识别的线索对,并通过基于排序的度量测量完整标签空间上的线索特定敏感性,从而实现更公平的跨模型比较。在不同的训练范式和架构中,REFINED-BIAS实现了更公平的跨模型比较、更忠实的形状和纹理偏差诊断以及更清晰的实证结论,解决了先前线索冲突评估无法可靠区分的矛盾。

英文摘要

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
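The point that a ratio-based bias can hide absolute cue sensitivity is easy to see with numbers. The sketch below uses the usual cue-conflict ratio (shape decisions divided by shape plus texture decisions) on two hypothetical models; the counts are made up for illustration.

```python
def shape_bias(shape_hits, texture_hits):
    """Ratio-based bias: share of shape decisions among cue-consistent decisions."""
    return shape_hits / (shape_hits + texture_hits)

n_images = 1000  # hypothetical number of cue-conflict images

# Model A follows one of the two cues on 80% of images, model B on only 8%.
bias_a = shape_bias(shape_hits=400, texture_hits=400)
bias_b = shape_bias(shape_hits=40, texture_hits=40)

# Absolute sensitivity to the shape cue differs by a factor of ten.
shape_sensitivity_a = 400 / n_images
shape_sensitivity_b = 40 / n_images
```

Both models report the same ratio, yet the second one rarely responds to either cue, which a ratio alone cannot reveal.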

2603.11479 2026-06-12 cs.LG cs.AI cs.MA 版本更新

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

波的语法:通过神经符号VLM智能体实现可解释的多变量时间序列事件检测

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan

发表机构 * AI Lab, SLB(SLB人工智能实验室) Télécom Paris, Institut Polytechnique de Paris, France(巴黎电信学院,巴黎高等理工学院,法国)

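AI summary: Introduces the language-guided Time Series Event Detection (TSED) task, converts textual descriptions into structured temporal logic with the Event Logic Tree (ELT), and builds SELA, a neuro-symbolic VLM agent for zero/few-shot event detection with explainable reasoning.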

详情
Comments
8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables
AI中文摘要

时间序列事件检测(TSED)旨在定位时间序列数据中具有语义意义的事件,在高风险领域具有关键应用。与统计异常不同,事件通常由自然语言描述定义,且跨多个物理通道具有内部时序逻辑结构。然而,在现实场景中,密集的事件标注成本高昂,使得纯监督学习困难。我们引入了语言引导的TSED,该设置中模型被赋予文本事件描述,并必须在几乎没有标注数据的情况下将其映射到多变量信号中的区间。为了解决这个问题,我们提出了事件逻辑树(ELT),一种知识表示框架,将语言描述转化为信号基元上的结构化时序逻辑。基于ELT,我们提出了SELA,一种神经符号VLM智能体框架,它从信号可视化中迭代地接地基元,并在ELT约束下组合它们,产生事件区间和忠实的树状结构解释。我们进一步发布了跨能源和气候领域的真实世界基准,包含专家知识和标注。实验表明,SELA优于监督微调和现有的零/少样本时间序列推理基线。

英文摘要

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

2603.11242 2026-06-12 stat.ML cs.LG 版本更新

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

统一潜在空间解缠的VAE框架及鲁棒的解缠效果评估

Xiaoan Lang, Md Mostafizer Rahman, Fang Liu

发表机构 * Department of Applied and Computational Mathematics and Statistics(应用与计算数学与统计系) Lucy Family Institute for Data & Society(数据与社会学院)

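AI summary: Proposes bfVAE, a unified framework integrating several disentangled VAE approaches, develops FVH-LT and DBSR-LS to evaluate disentanglement, and introduces the LSSI index to quantify latent structural separation without ground-truth generative factors.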

详情
AI中文摘要

评估和解释潜在表示(如变分自编码器VAE)对于多样数据类型仍然是一个重大挑战,尤其是当真实生成因子未知时。为此,我们将几种最先进的用于潜在空间解缠的VAE方法统一到一个框架——bfVAE中。为了评估解缠VAE模型的有效性并增强潜在空间可解释性,我们提出了通过潜在遍历的特征方差异质性(FVH-LT)和潜在空间中的脏块稀疏回归(DBSR-LS)。为了确保学习到的潜在空间的鲁棒可解释性,我们开发了一种贪婪对齐策略(GAS),该策略减轻了标签切换问题,并对齐不同运行中的潜在维度,为结果聚合奠定基础。我们还引入了一个方便的标量潜在空间分离指数(LSSI),该指数基于FVH-LT和DBSR-LS的GAS对齐输出,在不知道真实生成因子的情况下总结整体潜在结构分离。我们将bfVAE与五个VAE模型进行比较,并在七个表格和图像数据集上验证了FVH-LT、DBSR-LS和LSSI的有效性。在我们检查的实验设置下,bfVAE提供了一个更灵活的解缠框架,在解缠和重构之间实现了比基准VAE模型更有利的整体权衡;FVH-LT和DBSR-LS可靠地揭示了语义上有意义且与领域相关的潜在结构,并且通常产生一致的结果;LSSI对潜在结构分离做出了有效的定量总结。

英文摘要

Evaluating and interpreting latent representations, such as those learned by variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework -- bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of the learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs, laying the foundation for result aggregation. We also introduce a convenient scalar latent space separation index (LSSI), based on the GAS-aligned outputs of FVH-LT and DBSR-LS, to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness of FVH-LT, DBSR-LS, and LSSI on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework and achieves a more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI provides an effective quantitative summary of latent structural separation.
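Aligning latent dimensions across runs is a matching problem, and a greedy heuristic is one simple way to approach it. The abstract names a greedy alignment strategy (GAS) without giving details, so the following is a generic greedy matching on a similarity matrix, not necessarily the authors' procedure; the matrix values and function name are hypothetical.

```python
def greedy_align(similarity):
    """Greedily match reference dimensions (rows) to another run's dimensions (columns).
    similarity[i][j]: e.g. absolute correlation between dims i and j (assumed given)."""
    pairs = sorted(
        ((similarity[i][j], i, j)
         for i in range(len(similarity))
         for j in range(len(similarity[0]))),
        reverse=True,
    )
    used_rows, used_cols, mapping = set(), set(), {}
    for _, i, j in pairs:
        # Take the strongest remaining pair whose row and column are both free.
        if i not in used_rows and j not in used_cols:
            mapping[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return mapping

# Hypothetical similarities between 3 latent dims of run A and run B.
sim = [[0.1, 0.9, 0.2],
       [0.8, 0.3, 0.1],
       [0.2, 0.4, 0.7]]
mapping = greedy_align(sim)
```

Once a mapping like this is available, per-dimension results from different runs can be permuted into a common order before they are averaged.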

2603.08505 2026-06-12 cs.LG cs.AI 版本更新

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Echo2ECG:利用多视角超声心动图的心脏形态增强心电图表示

Michelle Espranita Liman, Özgün Turgut, Alexander Müller, Eimo Martens, Daniel Rueckert, Philip Müller

发表机构 * Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital(医疗与医学人工智能教席,慕尼黑工业大学(TUM)和TUM大学医院) Department of Cardiology, TUM University Hospital(心脏病科,TUM大学医院) Department of Computing, Imperial College London(计算系,伦敦帝国理工学院) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

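AI summary: Proposes Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with multi-view echocardiograms, outperforming existing methods on structural phenotype classification and Echo retrieval while being 18x smaller than the largest baseline.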

详情
Comments
Accepted at MICCAI 2026
AI中文摘要

心电图(ECG)是一种低成本、广泛使用的模态,通过捕捉心脏电活动来诊断电异常(如房颤)。然而,它无法直接测量心脏形态表型,如左心室射血分数(LVEF),这通常需要超声心动图(Echo)。从ECG预测这些表型将实现早期、可及的健康筛查。现有的自监督方法通过将ECG与单视角Echo对齐而遭受表示不匹配,单视角Echo仅捕捉局部、空间受限的解剖快照。为解决此问题,我们提出Echo2ECG,一种多模态自监督学习框架,利用多视角Echo中捕捉的心脏形态结构丰富ECG表示。我们在两个根本上需要形态信息的临床相关任务上评估Echo2ECG作为ECG特征提取器:(1)跨三个数据集的结构性心脏表型分类,以及(2)使用ECG查询检索具有相似形态特征的Echo研究。我们的提取的ECG表示在两个任务上始终优于最先进的单模态和多模态基线,尽管模型大小仅为最大基线的1/18。这些结果表明Echo2ECG是一个鲁棒、强大的ECG特征提取器。我们的代码可从此https URL获取。

英文摘要

Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at this https URL.

2603.06652 2026-06-12 cs.CV cs.AI 版本更新

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR: 通过多模态过程对齐实现忠实视觉推理

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Data Science & Artificial Intelligence Research Institute, China Unicom(中国联通数据科学与人工智能研究院) Unicom Data Intelligence, China Unicom(中国联通数据智能)

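AI summary: Proposes PaLMR, a framework with a perception-aligned data layer and a process-aligned optimisation layer that reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while remaining strong on MMMU, MathVista, and MathVerse.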

详情
AI中文摘要

强化学习近期提升了大语言模型和多模态大语言模型的推理能力,但现有的奖励设计强调最终答案的正确性,因此容忍过程幻觉——即模型在得到正确答案的同时错误感知视觉证据的情况。我们通过PaLMR框架解决这种过程层面的不对齐,该框架不仅对齐结果,还对齐推理过程本身。PaLMR包含两个互补组件:一个感知对齐数据层,构建具有结构化伪真值和可验证视觉事实的过程感知推理数据;以及一个过程对齐优化层,构建具有过程感知评分函数的分层奖励融合方案,以鼓励视觉上可信的思维链并提高训练稳定性。在Qwen2.5-VL-7B上的实验表明,我们的方法显著减少了推理幻觉并提高了视觉推理忠实度,在HallusionBench上取得了最先进的结果,同时在MMMU、MathVista和MathVerse上保持了强劲性能。这些发现表明,PaLMR为过程对齐的多模态推理提供了一条原则性且实用的路径,推进了MLLM的可靠性和可解释性。

英文摘要

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

2603.05965 2026-06-12 cs.RO cs.CV 版本更新

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE: 具有解析平移鲁棒性的概率占用BEV编码用于3D地点识别

Jinseop Lee, Byoungho Lee, Gichul Yoo

发表机构 * SK Intellix

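AI summary: Proposes PROBE, a learning-free LiDAR place recognition descriptor that analytically marginalizes over continuous translations via the polar Jacobian to obtain a distance-adaptive angular uncertainty, achieving high accuracy across sensor types.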

详情
Comments
8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
AI中文摘要

我们提出PROBE(概率占用BEV编码),一种无学习的LiDAR地点识别描述符,将每个BEV单元的占用建模为伯努利随机变量。PROBE不依赖于离散点云扰动,而是通过极坐标雅可比解析边缘化连续笛卡尔平移,在O(R·S)时间内得到距离自适应角度不确定性σ_θ = σ_t / r。主要参数σ_t表示以米为单位的预期平移不确定性,这是一种与传感器无关的物理量,增强了跨传感器泛化能力,同时减少了对每个数据集大量调参的需求。成对相似性结合了伯努利-KL Jaccard与指数不确定性门控以及基于FFT的高度余弦相似性用于旋转对齐。在涵盖四种不同LiDAR类型的四个数据集上评估,PROBE在多会话评估中实现了手工描述符中最高的精度,并且在单会话性能上与手工和监督基线相比具有竞争力。源代码和补充材料可在该https URL获取。

英文摘要

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at this https URL.
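The relation $\sigma_\theta = \sigma_t / r$ says that the same translational uncertainty produces a larger angular uncertainty for cells close to the sensor than for cells far away. A small numeric sketch follows; the function names, the example ranges, and the choice of 60 angular sectors are assumptions for illustration and are not taken from the paper.

```python
import math

def angular_sigma(sigma_t, r):
    """Angular uncertainty (radians) from translational uncertainty sigma_t (m) at range r (m)."""
    if r <= 0:
        raise ValueError("range must be positive")
    return sigma_t / r

def sigma_in_sectors(sigma_t, r, num_sectors):
    """Express the angular uncertainty in units of angular bins of a polar BEV grid."""
    sector_width = 2 * math.pi / num_sectors
    return angular_sigma(sigma_t, r) / sector_width

near = angular_sigma(0.5, 5.0)    # 0.5 m uncertainty seen at 5 m range
far = angular_sigma(0.5, 50.0)    # the same uncertainty seen at 50 m range
near_bins = sigma_in_sectors(0.5, 5.0, num_sectors=60)
```

With these assumed values, the near cell's uncertainty spans almost one full angular bin while the far cell's spans about a tenth of one, which is the distance-adaptive behaviour the abstract refers to.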

2509.14210 2026-06-12 cs.RO 版本更新

GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

GLIDE:未知环境下的空地协同搜索与救援框架

Seth Farrell, Chenghao Li, Henrik I. Christensen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

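AI summary: Proposes GLIDE, a framework coordinating two UAVs with one UGV for rapid victim localization and obstacle-aware navigation in unknown environments, using role separation and terrain scouting to improve rescue efficiency.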

详情
AI中文摘要

我们提出了一种空地协同搜索与救援(SAR)框架,将两架无人机(UAV)与一辆无人地面车(UGV)配对,以在未知环境中实现快速受害者定位和障碍物感知导航。我们将该框架命名为引导式长视距集成无人机护航(GLIDE),强调UGV在长视距规划中对UAV引导的依赖。在我们的框架中,目标搜索UAV执行实时机载受害者检测和地理参考,为地面平台提名目标,而地形侦察UAV则在UGV计划路径前方飞行,提供中程可通行性更新。UGV融合空中线索与本地感知,执行时间高效的A*规划,并在信息到达时持续重新规划。此外,我们进行了硬件演示(使用GEM e6高尔夫球车作为UGV和两架X500 UAV),以评估端到端SAR任务性能,并包括模拟消融实验,以独立于检测评估规划栈。实证结果表明,UAV之间的明确角色分离,结合地形侦察和引导规划,在时间关键的SAR任务中改善了到达时间和导航安全性。

英文摘要

We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
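The plan-then-replan loop mentioned for the UGV can be sketched with a textbook A* search on an occupancy grid that is rerun when new traversability information arrives. This is a generic illustration, not the GLIDE planning stack; the grid, the 4-connected motion model, and the simulated scouting update are assumptions.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(p):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    came_from, cost = {start: None}, {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if g > cost[node]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                new_cost = g + 1
                if new_cost < cost.get(nxt, float("inf")):
                    cost[nxt] = new_cost
                    came_from[nxt] = node
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
first_plan = astar(grid, (0, 0), (2, 2))

# Hypothetical traversability update from a scouting UAV: two cells are blocked.
grid[1][1] = 1
grid[0][1] = 1
replanned = astar(grid, (0, 0), (2, 2))
```

Rerunning the search from scratch is the simplest form of replanning; incremental planners avoid repeating work but are beyond the scope of this sketch.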

2603.00610 2026-06-12 cs.SD cs.AI cs.LG cs.MM eess.AS 版本更新

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

CMI-RewardBench: 基于组合多模态指令评估音乐奖励模型

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

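AI summary: To address the lack of effective evaluation mechanisms for music generation models, proposes the CMI-RewardBench benchmark, accompanied by large-scale preference datasets and parameter-efficient reward models, for evaluating music generated under compositional multimodal instructions.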

详情
Comments
Accepted by ICML 2026
AI中文摘要

虽然音乐生成模型已经发展到能够处理混合文本、歌词和参考音频的复杂多模态输入,但评估机制却滞后了。在本文中,我们通过为组合多模态指令(CMI)下的音乐奖励建模建立了一个全面的生态系统来弥补这一关键差距,其中生成的音乐可能以文本描述、歌词和音频提示为条件。我们首先引入了CMI-Pref-Pseudo,一个包含11万个伪标签样本的大规模偏好数据集,以及CMI-Pref,一个针对细粒度对齐任务量身定制的高质量人工标注语料库。为了统一评估格局,我们提出了CMI-RewardBench,一个统一的基准,用于评估音乐奖励模型在音乐性、文本-音乐对齐和组合指令对齐方面的异质样本。利用这些资源,我们开发了CMI奖励模型(CMI-RMs),一个能够处理异质输入的参数高效奖励模型家族。我们评估了它们与人类判断分数在音乐性和对齐方面的相关性,使用了CMI-Pref以及之前的数据集。进一步的实验表明,CMI-RM不仅与人类判断高度相关,而且通过top-k过滤实现了有效的推理时扩展。代码可在GitHub(此 https URL )获取。模型权重:CMI-RM(此 https URL )。数据集:CMI-Pref-Pseudo(此 https URL )和CMI-Pref(此 https URL )。

英文摘要

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub ( this https URL ). Model weights: CMI-RM ( this https URL ). Datasets: CMI-Pref-Pseudo ( this https URL ) and CMI-Pref ( this https URL ).
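Inference-time scaling via top-k filtering amounts to generating several candidates, scoring each with a reward model, and keeping the best k. The sketch below shows that selection step only; the candidate names and the score lookup standing in for a reward model are hypothetical and unrelated to the released CMI-RM weights.

```python
def top_k_filter(candidates, reward_fn, k):
    """Score every candidate with a reward function and keep the k best."""
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[:k]

# Hypothetical stand-in for a reward model: a lookup of precomputed scores.
fake_scores = {"clip_a": 0.31, "clip_b": 0.87, "clip_c": 0.55, "clip_d": 0.12}
best = top_k_filter(list(fake_scores), fake_scores.get, k=2)
```

Generating more candidates before filtering trades extra inference compute for a higher expected reward of the kept samples, which is the sense in which the procedure scales at inference time.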

2603.00167 2026-06-12 cs.RO Version update

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano

Affiliations: University of Turku, Finland; Centre for Artificial Intelligence, Zürich University of Applied Sciences, Winterthur, Switzerland; Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Spain

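AI summary: Proposes EgoMoD, which uses short egocentric video clips and a pose-conditioned architecture to learn to predict global maps of motion dynamics from local observations, replacing conventional global sensing infrastructure and enabling zero-shot transfer.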

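Details
AI abstract (translated from Chinese)

Navigating efficiently in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, so that planning in crowded scenes can be preemptive rather than purely reactive. Maps of Dynamics (MoDs) provide a structured representation of motion tendencies in space that supports long-term global planning, but building them has traditionally required observing the whole environment over long periods. We propose EgoMoD, the first method that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. The method uses a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, so it learns to infer environment-wide motion tendencies from local dynamic cues, and local observations become predictive signals of global motion structure. As a result, it can forecast future motion dynamics for the entire environment rather than merely extending past patterns within the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces, at inference time, the external global sensing infrastructure required by earlier MoD methods with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, and evaluation on real images demonstrates zero-shot transfer to real systems.

English abstract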

Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.
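
To make the idea of a Map of Dynamics concrete, the sketch below builds one simple kind of MoD: a grid in which each cell stores a normalized histogram of observed motion directions. This grid-of-histograms form, the function name `build_mod`, and its parameters are assumptions chosen for illustration. The abstract does not specify which MoD representation EgoMoD predicts, and this is the traditional "observe trajectories, then aggregate" construction, not the learned predictor.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def build_mod(
    trajectories: List[List[Point]],
    cell_size: float = 1.0,
    n_bins: int = 8,
) -> Dict[Cell, List[float]]:
    """Aggregate trajectory steps into per-cell motion-direction histograms."""
    counts: Dict[Cell, List[float]] = defaultdict(lambda: [0.0] * n_bins)
    for traj in trajectories:
        for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
            dx, dy = x1 - x0, y1 - y0
            if dx == 0 and dy == 0:
                continue  # a stationary step carries no direction
            angle = math.atan2(dy, dx) % (2 * math.pi)
            # Bin 0 starts at direction +x; bins advance counter-clockwise.
            b = int(angle / (2 * math.pi) * n_bins) % n_bins
            # Attribute the step to the cell it starts in.
            cell = (math.floor(x0 / cell_size), math.floor(y0 / cell_size))
            counts[cell][b] += 1.0
    # Normalize each cell's counts into a probability distribution.
    return {cell: [c / sum(h) for c in h] for cell, h in counts.items()}


trajectories = [
    [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5)],  # moving in +x
    [(0.5, 0.2), (0.5, 1.2)],              # moving in +y
]
mod = build_mod(trajectories)
print(mod[(0, 0)])
```

Building such a map requires seeing every cell for long enough to fill its histogram, which is the global-observation requirement the abstract says EgoMoD avoids at inference time.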

2603.00025 2026-06-12 cs.CL Version update

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

Affiliations: Yale University; Texas State University

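AI summary: In structured prediction, preferred and rejected objects often differ in only a few tokens, which causes gradient dilution and token erosion. To address this, the paper proposes TAB-PO, which combines confusion-aware preference construction with a token-level adaptive barrier, and reports substantial gains on key metrics on the SciERC task.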

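Details
AI abstract (translated from Chinese)

Direct Preference Optimization (DPO) is an effective and widely adopted offline alignment method, but it is poorly suited to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance setting, sequence-level DPO spreads gradient mass over non-critical serialization tokens (gradient dilution) and can lower the likelihood of rare, low-confidence preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from SFT predictions on the validation set, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to low-confidence schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics by 11.59% on average over SFT, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and outperforms leading frontier models by 14.71%, while also delivering strong gains in textual grounding.

English abstract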

Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
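
The interaction between a sequence-level DPO term and a confidence-gated token-level barrier can be sketched as follows. The DPO loss is the standard one computed from summed token log-probabilities. The barrier is an assumed form: for each schema-critical token of the preferred output whose policy probability falls below a threshold `tau`, it adds that token's negative log-likelihood, scaled by `lam`. The abstract does not give the TAB-PO objective, so `gated_barrier`, `tau`, `lam`, and the toy numbers are hypothetical illustrations rather than the paper's definition.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    pol_w: Sequence[float], ref_w: Sequence[float],
    pol_l: Sequence[float], ref_l: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Standard sequence-level DPO loss from per-token log-probabilities."""
    margin = beta * ((sum(pol_w) - sum(ref_w)) - (sum(pol_l) - sum(ref_l)))
    return softplus(-margin)  # equals -log(sigmoid(margin))


def gated_barrier(
    pol_w: Sequence[float],
    critical_idx: Sequence[int],
    tau: float = 0.5,
    lam: float = 1.0,
) -> float:
    """Assumed barrier: NLL on critical preferred tokens with probability < tau."""
    total = 0.0
    for i in critical_idx:
        if math.exp(pol_w[i]) < tau:  # gate: only under-confident tokens
            total += -pol_w[i]        # supervised anchoring term
    return lam * total


# Toy pair that differs only at token 1, the schema-critical position.
pol_w = [math.log(0.9), math.log(0.3), math.log(0.8)]
ref_w = list(pol_w)
pol_l = [math.log(0.9), math.log(0.6), math.log(0.8)]
ref_l = list(pol_l)

base = dpo_loss(pol_w, ref_w, pol_l, ref_l)
barrier = gated_barrier(pol_w, critical_idx=[1])
print(base, barrier, base + barrier)
```

In this toy pair the sequence-level term averages over all three tokens, while the barrier acts only on the single under-confident schema token, which is the kind of targeted signal the abstract describes.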

2510.02524 2026-06-12 cs.CL cs.FL cs.LG Version update

Unraveling Syntax: Language Modeling and the Substructure of Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Affiliations: Massachusetts Institute of Technology

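AI summary: This paper studies how language models learn with respect to the substructure of context-free grammars. It proves that the loss recurses linearly over top-level subgrammars, finds that parametrized models learn subgrammars in parallel, and shows that subgrammar pretraining improves the performance of small models and yields better internal representations.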

Details
Comments
Equal contribution by LYS and DM. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
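AI abstract (translated from Chinese)

Although language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest, such as natural language syntax, programming languages, and arithmetic, are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a new direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that the language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children, who master simple substructures first. We find that subgrammar pretraining can improve final performance, but only for models that are tiny relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

English abstract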

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.
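
One simple setting in which a language modeling loss splits linearly over top-level parts can be checked numerically. Assume a start rule S -> A | B whose two branches generate disjoint sets of strings, a true distribution that picks branch i with weight w[i], and a model that picks it with weight v[i]. The chain rule for KL divergence then gives KL(p || q) = KL(w || v) + sum_i w[i] * KL(p_i || q_i). This is a standard identity used here as an illustration. The paper's own definitions of subgrammars and the exact statement of its theorem are not given in the abstract and may differ, and all distributions below are made-up toy values.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def kl(p: Dist, q: Dist) -> float:
    """KL divergence between two distributions over the same finite support."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)


def mix(weights: Dist, parts: Dict[str, Dist]) -> Dist:
    """Distribution over strings when branches have disjoint supports."""
    out: Dist = {}
    for name, w in weights.items():
        for s, pr in parts[name].items():
            out[s] = w * pr
    return out


# Toy "grammar": branch A yields {a, aa}, branch B yields {b, bb}.
w = {"A": 0.6, "B": 0.4}                                   # true branch weights
p_parts = {"A": {"a": 0.5, "aa": 0.5}, "B": {"b": 0.9, "bb": 0.1}}
v = {"A": 0.5, "B": 0.5}                                   # model branch weights
q_parts = {"A": {"a": 0.7, "aa": 0.3}, "B": {"b": 0.8, "bb": 0.2}}

total = kl(mix(w, p_parts), mix(v, q_parts))
decomposed = kl(w, v) + sum(w[n] * kl(p_parts[n], q_parts[n]) for n in w)
print(total, decomposed)
```

The two printed values agree up to floating-point error: the overall loss is the branch-choice loss plus a weighted sum of per-branch losses.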