arXivDaily arXiv daily academic digest, updated Monday through Friday
2603.15158 2026-06-12 cs.LG Version update

Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

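Affiliations Department of Computer Science, Aalto University; Department of Computer Science, University of Helsinki; ELLIS Institute Finland; Department of Computer Science, Manchester University

AI summary For domain adaptation under latent confounders, the paper proposes a point-identification approach based on latent equivalent classes, replaces the strong completeness assumption with a cross-domain rank condition, and designs the active-learning framework PQAL for robust prediction.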

Abstract

Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies carry sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors (the predictor is only set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification of the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL can recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and on the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.
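
One way to picture the cross-domain rank condition is as a full-column-rank check on a matrix whose rows are domains and whose columns are LECs. The sketch below is only an assumed reading: the abstract does not state the exact form of the condition, and the names `matrix_rank` and `domains_are_diverse` as well as the toy weights are hypothetical.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix, computed by exact Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def domains_are_diverse(weights):
    """Assumed criterion: the domain-by-LEC weight matrix has full column rank."""
    return matrix_rank(weights) == len(weights[0])


# Hypothetical mixture weights: one row per domain, one column per LEC.
diverse = [[0.75, 0.25], [0.25, 0.75]]
degenerate = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(domains_are_diverse(diverse), domains_are_diverse(degenerate))
```

In this toy setting, adding more domains that all mix the LECs in the same proportions never raises the rank, which matches the intuition that the queried domains need to differ in how they mix LECs, not merely be numerous.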

2603.11249 2026-06-12 cs.LG Version update

Differentiable Thermodynamic Phase-Equilibria for Machine Learning

Karim K. Ben Hicham, Moreno Ascani, Jan G. Rittig, Alexander Mitsos

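Affiliations RWTH Aachen University; Process Systems Engineering (AVT.SVT); Forschungszentrum Jülich GmbH; Institute of Climate and Energy Systems ICE-1; Energy Systems Engineering; JARA-ENERGY

AI summary Proposes DISCOMAX, an algorithm that combines differentiable phase-equilibrium calculation with discrete enumeration and a masked softmax to enable thermodynamically consistent end-to-end learning; it outperforms existing methods on binary liquid-liquid equilibrium data.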

Comments 45 pages, 27 figures, 5 tables

Abstract

Accurate prediction of phase equilibria remains a central challenge in chemical engineering. Physics-consistent machine learning methods that incorporate thermodynamic structure into neural networks have recently shown strong performance for activity-coefficient modeling. However, extending such approaches to equilibrium data arising from an extremum principle, such as liquid-liquid equilibria, remains difficult. Here we present DISCOMAX, a differentiable algorithm for phase-equilibrium calculation that guarantees thermodynamic consistency at both training and inference, subject only to a user-specified discretization. The method combines discrete enumeration of feasible phase states with masked softmax aggregation in the backward pass and propagation of the true equilibrium state in the forward pass, using a straight-through gradient estimator to enable physics-consistent end-to-end learning of neural gE-models. We show that this approach bears analogy to statistical thermodynamics, and we evaluate it on binary liquid-liquid equilibrium data, where it outperforms existing surrogate-based methods while offering a general framework for learning from different kinds of equilibrium data.
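
The two quantities a straight-through scheme of this kind pairs up can be sketched in plain Python: a hard selection over feasible candidates (what a forward pass would propagate) and a masked softmax over the same candidates (what a backward pass would differentiate). This is a minimal sketch under assumptions, not the DISCOMAX implementation: the scores, the mask, and the function names are hypothetical, and a real version would need an autodiff framework to route gradients.

```python
import math


def masked_softmax(scores, mask):
    """Softmax restricted to entries whose mask is True; masked entries get weight 0."""
    valid = [s for s, keep in zip(scores, mask) if keep]
    if not valid:
        raise ValueError("at least one candidate must be feasible")
    top = max(valid)  # subtract the largest feasible score for numerical stability
    exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def hard_select(scores, mask):
    """Index of the best feasible candidate."""
    feasible = [i for i, keep in enumerate(mask) if keep]
    if not feasible:
        raise ValueError("at least one candidate must be feasible")
    return max(feasible, key=lambda i: scores[i])


# Hypothetical scores for three enumerated phase states (higher is better,
# e.g. a negated energy); the second state is marked infeasible.
scores = [1.0, 5.0, 2.0]
mask = [True, False, True]
print(hard_select(scores, mask))      # index used in the forward direction
print(masked_softmax(scores, mask))   # smooth weights used for gradients
```

The mask keeps infeasible states out of both the selection and the weights, so the infeasible candidate with the highest raw score never influences the result.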

2603.14483 2026-06-12 cs.LG Version update

Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention

Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner

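Affiliations Applied Artificial Intelligence Lab, Oxford Robotics Institute, Oxford, UK

AI summary Proposes a method combining causal representation learning with local sparse attention to disentangle system parameters from raw trajectory data without structural assumptions, with identifiability guaranteed by a graphical criterion.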

Comments Presented as an Oral at the 5th Conference on Causal Learning and Reasoning

Journal ref Proceedings of Machine Learning Research 323, 2026

Abstract

Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.

2603.14407 2026-06-12 cs.LG Version update

Towards One-for-All Anomaly Detection for Tabular Data

Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen

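Affiliations University of Science and Technology of China

AI summary Proposes the OFA-TAD framework, which uses multi-view neighbor-distance representations and a Mixture-of-Experts scoring network to generalize tabular anomaly detection across domains; a single training run generalizes to unseen datasets.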

Comments Accepted by ICML 2026

Abstract

Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many real-world applications. However, existing methods follow a "one model for one dataset (OFO)" paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on-the-fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting. The source code is available at https://github.com/Shiy-Li/OFA-TAD.

2603.12530 2026-06-12 cs.LG Version update

Mixing Makes Markovian Contexts Cheap for Linear Bandits

Kaan Buyukkalayci, Osama Hanna, Christina Fragouli

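Affiliations University of California, Los Angeles; Meta, Superintelligence Lab

AI summary For linear bandits with Markovian contexts, proposes a reduction based on uniform geometric ergodicity that builds a stationary surrogate action set and uses a delayed-update scheme, achieving worst-case regret bounds comparable to those of standard linear bandits.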

Abstract

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.

2603.10834 2026-06-12 cs.CV cs.AI Version update

On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

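Affiliations Ulsan National Institute of Science and Technology; College of Medicine, Hanyang University; NAVER AI Lab

AI summary To address unstable and ambiguous estimates of shape-texture preference in existing cue-conflict benchmarks, proposes the REFINED-BIAS dataset and evaluation framework, which uses explicit definitions of shape and texture, balanced cue pairs, and a ranking-based metric for more reliable and interpretable bias diagnosis.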

Comments Shape-Texture Bias, Cue Conflict Benchmark

Abstract

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

2603.11479 2026-06-12 cs.LG cs.AI cs.MA Version update

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan

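Affiliations AI Lab, SLB; Télécom Paris, Institut Polytechnique de Paris, France

AI summary Introduces the language-guided time series event detection (TSED) task, converts textual descriptions into structured temporal logic via the Event Logic Tree (ELT), and builds SELA, a neuro-symbolic VLM agent, for zero/few-shot event detection with explainable reasoning.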

Comments 8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables

Abstract

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

2603.08505 2026-06-12 cs.LG cs.AI Version update

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Michelle Espranita Liman, Özgün Turgut, Alexander Müller, Eimo Martens, Daniel Rueckert, Philip Müller

Affiliations Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital; Department of Cardiology, TUM University Hospital; Department of Computing, Imperial College London; Munich Center for Machine Learning (MCML)

AI summary Proposes Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with multi-view echocardiograms; it outperforms existing methods on structural phenotype classification and Echo retrieval while being 18x smaller than the largest baseline.

Comments Accepted at MICCAI 2026

Abstract

Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

2603.06652 2026-06-12 cs.CV cs.AI Version update

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

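Affiliations National Key Laboratory for Novel Software Technology, Nanjing University; Data Science & Artificial Intelligence Research Institute, China Unicom; Unicom Data Intelligence, China Unicom

AI summary Proposes the PaLMR framework, whose perception-aligned data layer and process-aligned optimization layer reduce reasoning hallucinations and improve visual reasoning fidelity, achieving state-of-the-art results on HallusionBench and strong performance on MMMU, MathVista, and MathVerse.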

Journal ref CVPR 2026 Findings

Abstract

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

2603.05965 2026-06-12 cs.RO cs.CV Version update

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

Jinseop Lee, Byoungho Lee, Gichul Yoo

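Affiliations SK Intellix

AI summary Proposes PROBE, a learning-free LiDAR place descriptor that analytically marginalizes over continuous translations via the polar Jacobian, yielding distance-adaptive angular uncertainty and high accuracy in cross-sensor generalization.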

Comments 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

Abstract

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \cdot S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
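
The relation between translational and angular uncertainty can be checked numerically. By the small-angle approximation, a sideways shift of size sigma_t seen from range r subtends roughly sigma_t / r radians, so nearby cells get a wide angular spread and distant cells a narrow one. The sketch below only evaluates that formula; the function name, the value of sigma_t, and the ring radii are illustrative assumptions, not values from the paper.

```python
import math


def angular_sigma(sigma_t, r):
    """Angular uncertainty in radians for translational uncertainty sigma_t (m) at range r (m)."""
    if r <= 0:
        raise ValueError("range must be positive")
    return sigma_t / r


sigma_t = 0.5  # assumed translational uncertainty in meters
for r in (5.0, 20.0, 50.0):  # assumed ring radii in meters
    s = angular_sigma(sigma_t, r)
    print(f"r = {r:5.1f} m  sigma_theta = {s:.4f} rad = {math.degrees(s):.2f} deg")
```

Because sigma_t is expressed in meters, the same value can in principle be reused across sensors with different angular resolutions, which is the sensor-independence the abstract points to.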

2509.14210 2026-06-12 cs.RO Version update

GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

Seth Farrell, Chenghao Li, Henrik I. Christensen

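Affiliations University of California, Berkeley; Stanford University

AI summary Proposes the GLIDE framework, in which two UAVs cooperate with one UGV for rapid victim localization and obstacle-aware navigation in unknown environments, using role separation and terrain scouting to improve rescue efficiency.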

Abstract

We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
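
The plan-then-replan loop described for the UGV can be illustrated with a textbook A* search on an occupancy grid. This is a minimal sketch under assumptions, not the GLIDE planning stack: the 4-connected grid, unit step costs, Manhattan heuristic, and the `astar` function are all illustrative choices.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid; grid[r][c] == 1 means blocked.

    Returns the list of cells from start to goal, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # initial plan around the wall
grid[1][2] = 1                      # a scout reports the gap as blocked
print(astar(grid, (0, 0), (2, 0)))  # replanning now finds no route
```

Replanning here simply reruns the search on the updated grid; an incremental planner would reuse earlier search effort, but the abstract does not say which variant is used.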

2603.02234 2026-06-12 cs.LG cs.AI Version update

Structured vs. Unstructured Pruning: An Exponential Gap

Davide Ferre', Frédéric Giroire, Frederik Mallmann-Trenn, Emanuele Natale

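Affiliations Department of Informatics, King’s College London

AI summary Studies the limits of pruning in randomly initialized networks and proves that neuron pruning needs an exponentially larger network than unstructured pruning to reach the same approximation accuracy.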

Abstract

The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as either unstructured, where individual weights can be removed from the network, or structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.
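
To see why the separation is called exponential, it helps to tabulate the two growth rates. The sketch below drops all constant factors, which the abstract does not give, so the numbers show scaling only and are not actual network sizes; the function names are hypothetical.

```python
import math


def neuron_pruning_scale(eps):
    """Growth rate 1/eps of the lower bound for neuron pruning (constants dropped)."""
    return 1.0 / eps


def weight_pruning_scale(eps):
    """Growth rate log(1/eps) of the upper bound for weight pruning (constants dropped)."""
    return math.log(1.0 / eps)


for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"eps = {eps:g}: neuron ~ {neuron_pruning_scale(eps):g}, "
          f"weight ~ {weight_pruning_scale(eps):.2f}")
```

Since 1/eps equals exp(log(1/eps)), the first rate is the exponential of the second, which is the sense in which the gap is exponential.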

2603.00610 2026-06-12 cs.SD cs.AI cs.LG cs.MM eess.AS Version update

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

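Affiliations National University of Singapore; University of Science and Technology of China; University of Cambridge; University of Toronto

AI summary To address the lack of effective evaluation for music generation models, proposes the CMI-RewardBench benchmark together with large-scale preference datasets and parameter-efficient reward models for assessing music quality under multimodal instructions.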

Comments Accepted by ICML 2026

Abstract

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref).
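
Inference-time scaling via top-k filtering is commonly done by generating several candidates, scoring each with a reward model, and keeping the k highest-scoring ones. The sketch below shows that selection step only. It assumes a generic `reward_fn` callable standing in for a CMI-RM; the real model interface is not described in the abstract, and the toy clips and scores are made up for illustration.

```python
def top_k_filter(candidates, reward_fn, k):
    """Return the k candidates with the highest reward, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sorted(candidates, key=reward_fn, reverse=True)[:k]


# Hypothetical stand-in: each candidate is a (clip_id, score) pair and the
# "reward model" just reads the precomputed score.
clips = [("clip_a", 0.31), ("clip_b", 0.87), ("clip_c", 0.54), ("clip_d", 0.12)]
best = top_k_filter(clips, reward_fn=lambda clip: clip[1], k=2)
print([clip_id for clip_id, _ in best])
```

Generating more candidates before filtering spends more inference compute in exchange for a better chance that the kept samples score highly, which is the trade-off the term "inference-time scaling" refers to.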

2603.00167 2026-06-12 cs.RO Version update

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano

Affiliations University of Turku, Finland; Centre for Artificial Intelligence, Zürich University of Applied Sciences, Winterthur, Switzerland; Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Spain

AI summary Proposes EgoMoD, which uses short egocentric video clips and a pose-conditioned architecture to predict global maps of dynamics from local observations, replacing external global sensing infrastructure and enabling zero-shot transfer.

Abstract

Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space that is useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. As a result, EgoMoD can forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.

2603.00025 2026-06-12 cs.CL 版本更新

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

TAB-PO:面向Token关键结构化生成的具有Token级自适应障碍的偏好优化

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

发表机构 * Yale University(耶鲁大学) Texas State University(德克萨斯州立大学)

AI总结 针对结构化预测中偏好与拒绝对象仅少数token不同导致的梯度稀释和token侵蚀问题,提出基于混淆感知偏好构建和Token级自适应障碍的TAB-PO方法,在SciERC任务上显著提升关键指标。

详情
AI中文摘要

直接偏好优化(DPO)是一种有效且广泛采用的离线对齐方法,但难以适应本体驱动的结构化预测,其中偏好和拒绝的JSON对象通常仅在少数模式定义token上存在差异。在这种低编辑距离场景下,序列级DPO将梯度质量分散到非关键的序列化token上(梯度稀释),并可能降低罕见、低置信度的偏好模式token的似然(token侵蚀)。为解决这些限制,我们首先开发了一种混淆感知的偏好构建策略,该策略用从验证集SFT预测中估计的经验结构化错误模式来增强专家策划的歧义模式,合成最小扰动的、模式有效的负样本,将偏好学习聚焦于现实的本体级决策错误。然后,我们引入了Token自适应障碍偏好优化(TAB-PO),这是一种用于token关键结构化生成的SFT后目标。TAB-PO添加了一个置信门控的token级障碍,对低置信度的模式token施加监督锚定。在公开的SciERC科学信息抽取任务上,使用1.5B到70B的Llama/Qwen模型评估,TAB-PO在本体关键的语义标签和关系链接指标上平均比SFT提升11.59%,在这些指标上与最强的token级和序列级DPO变体的比较中胜率达100%,并超越领先的前沿模型14.71%,同时在文本基础方面取得了强劲的增益。

英文摘要

Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
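For readers unfamiliar with the baseline that TAB-PO modifies, the following is a minimal sketch of the standard sequence-level DPO loss for a single preference pair. It is not the TAB-PO objective (the confidence-gated token-level barrier is not reproduced here), and the function names, the `beta` value, and the toy log-probabilities are illustrative assumptions.

```python
import math

def sequence_logp(token_logps):
    """Sequence log-probability as the sum of per-token log-probabilities."""
    return sum(token_logps)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard sequence-level DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical toy pair: two serialized objects that differ in one token only.
chosen_tokens = [-0.1, -0.2, -2.5, -0.1]    # third token: rare preferred label
rejected_tokens = [-0.1, -0.2, -0.4, -0.1]  # third token: frequent wrong label
loss = dpo_loss(
    sequence_logp(chosen_tokens), sequence_logp(rejected_tokens),
    ref_logp_chosen=sequence_logp(chosen_tokens),
    ref_logp_rejected=sequence_logp(rejected_tokens),
)
```

When the policy equals the reference model, the margin is zero and the loss equals log 2. Because the loss sees only whole-sequence sums, every token of the pair enters the objective, not just the few that differ, which is the setting the abstract calls gradient dilution.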

2510.02524 2026-06-12 cs.CL cs.FL cs.LG 版本更新

Unraveling Syntax: Language Modeling and the Substructure of Grammars

解析句法:语言建模与语法的子结构

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究语言模型在上下文无关语法子结构上的学习行为,证明损失函数在顶层子语法上线性递归,并发现参数化模型并行学习子语法,子语法预训练能提升小模型性能并改善内部表征。

Comments Equal contribution by LYS and DM. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

尽管语言模型取得了令人印象深刻的结果,但其学习动态远未被理解。许多感兴趣的领域——如自然语言句法、编程语言、算术——都由上下文无关语法(CFG)捕获。在这项工作中,我们将先前关于CFG神经语言建模的工作扩展到一个新的方向:语言建模如何相对于CFG子结构(即子语法)表现。我们定义了子语法,并证明了一组连接语言建模和子语法的基本定理。我们表明,语言建模损失在其顶层子语法上线性递归;递归应用,损失分解为“不可约”子语法的损失。在额外假设下,并且经验上,参数化模型并行学习子语法,不同于首先掌握简单子结构的儿童。我们发现,子语法预训练可以提高最终性能,但仅对于相对于语法而言微小的模型,而对齐分析表明,预训练一致地导致内部表征更好地反映语法的子结构。

英文摘要

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

2602.22629 2026-06-12 cs.CV 版本更新

CRAG: Can 3D Generative Models Help 3D Assembly?

CRAG: 3D生成模型能否辅助3D装配?

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CRAG方法,将3D装配与形状生成联合优化,通过生成完整形状和预测部件姿态实现相互增强,在多种几何、部件数和缺失情况下达到最优性能。

Comments 15 pages, 8 figures

详情
AI中文摘要

大多数现有的3D装配方法将问题视为纯姿态估计,通过刚性变换重新排列观察到的部件。相比之下,人类装配自然地将结构推理与整体形状推断相结合。受此直觉启发,我们将3D装配重新表述为装配和生成的联合问题。我们表明这两个过程相互增强:装配为生成提供部件级结构先验,而生成注入整体形状上下文,解决装配中的歧义。与无法合成缺失几何形状的先前方法不同,我们提出了CRAG,它同时生成合理的完整形状并预测输入部件的姿态。大量实验表明,在具有不同几何形状、不同部件数量和缺失部件的野外物体上,该方法达到了最先进的性能。项目页面:https://ai4ce.github.io/CRAG/

英文摘要

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: https://ai4ce.github.io/CRAG/

2602.00462 2026-06-12 cs.CV cs.AI 版本更新

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

LatentLens: 揭示大语言模型中高度可解释的视觉标记

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出 LatentLens 方法,通过将视觉标记与文本语料库中的上下文标记表示进行最近邻匹配,实现视觉标记的可解释性,发现大多数视觉标记在各层均具有可解释性。

Comments ICML 2026 (Camera Ready)

详情
AI中文摘要

将大型语言模型(LLM)转换为视觉语言模型(VLM)可以通过将视觉编码器输出的视觉标记映射到LLM的嵌入空间来实现。有趣的是,这种映射可以简单到浅层MLP变换。为了理解LLM为何能如此容易地处理视觉标记,我们需要可解释性方法来揭示在LLM处理的每一层中视觉标记表示所编码的内容。在这项工作中,我们引入了LatentLens,一种将潜在表示映射到自然语言描述的新方法。LatentLens编码一个大型文本语料库,并存储该语料库中每个标记的上下文化标记表示。然后将视觉标记表示与这些上下文化表示进行比较,并将最邻近的表示作为视觉标记的描述。我们在15个不同的VLM上评估了该方法,结果表明,常用的方法(如LogitLens)大大低估了视觉标记的可解释性。相反,使用LatentLens,大多数视觉标记在所有研究的模型和所有层中都是可解释的。定性上,我们展示了LatentLens产生的描述在语义上有意义,并且与单个标记相比,为人类提供了更细粒度的解释。更广泛地说,我们的发现为视觉和语言表示之间的对齐提供了新的证据,并为分析LLM的潜在表示开辟了新的方向。

英文摘要

Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
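The retrieval step described above can be illustrated with a small nearest-neighbour sketch. It assumes cosine similarity as the comparison metric and uses hand-written three-dimensional vectors in place of real contextualized representations. The abstract does not specify either, so both are assumptions, as are the names `cosine` and `nearest_descriptions`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_descriptions(visual_vec, bank, k=2):
    """Return the k text contexts whose stored vectors are closest to visual_vec.

    bank is a list of (text, vector) pairs, standing in for contextualized
    token representations stored from a text corpus.
    """
    ranked = sorted(bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical toy bank and visual token representation.
bank = [
    ("a dog running on grass", [0.9, 0.1, 0.0]),
    ("stock market chart", [0.0, 0.2, 0.9]),
    ("a puppy in a park", [0.8, 0.3, 0.1]),
]
visual_token = [1.0, 0.2, 0.0]
descriptions = nearest_descriptions(visual_token, bank, k=2)
```

At corpus scale, a brute-force sort like this would normally be replaced by an approximate nearest-neighbour index. The sketch only shows the lookup logic.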

2602.18154 2026-06-12 cs.CL cs.AI cs.DB 版本更新

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

FENCE:一个金融和多模态越狱检测数据集

Mirae Kim, Seonghun Jeong, Youngjun Kwak

发表机构 * arXiv

AI总结 针对金融领域多模态越狱检测资源匮乏的问题,提出FENCE数据集,包含韩英双语文本和图像,用于训练和评估检测器,实验表明基线检测器准确率达99%。

Comments Accepted at LREC 2026

详情
AI中文摘要

越狱对大型语言模型(LLM)和视觉语言模型(VLM)的部署构成重大风险。VLM尤其脆弱,因为它们处理文本和图像,创造了更广泛的攻击面。然而,可用于越狱检测的资源很少,特别是在金融领域。为填补这一空白,我们提出了FENCE,一个双语(韩语-英语)多模态数据集,用于训练和评估金融应用中的越狱检测器。FENCE通过金融相关查询与图像威胁配对,强调领域真实性。使用商业和开源VLM进行的实验揭示了持续的脆弱性,GPT-4o显示出可测量的攻击成功率,而开源模型则表现出更大的暴露。在FENCE上训练的基线检测器实现了99%的分布内准确率,并在外部基准测试中保持强劲性能,突显了该数据集在训练可靠检测模型方面的鲁棒性。FENCE为推进金融领域的多模态越狱检测以及支持敏感领域中更安全、更可靠的AI系统提供了重点资源。警告:本文包含可能具有冒犯性的示例数据。

英文摘要

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

2602.15424 2026-06-12 cs.RO 版本更新

Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation

基于李雅普诺夫的PI类控制用于四轮独立驱动与转向机器人的鲁棒轨迹跟踪:设计与实验验证

Branimir Ćaran, Vladimir Milić, Marko Švaco, Bojan Jerbić

发表机构 * Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb(萨格勒布大学机械工程与造船学院) Regional Centre of Excellence for Robotic Technology (CRTA)(机器人技术区域卓越中心) Croatian Academy of Sciences and Arts(克罗地亚科学与艺术学院)

AI总结 提出一种基于李雅普诺夫的PI类控制器,结合模型前馈补偿,实现四轮独立驱动与转向机器人的鲁棒轨迹跟踪,并通过实验验证其优于PI和滑模控制器。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本文提出了一种基于李雅普诺夫综合的PI类控制器,用于独立驱动和转向的四轮移动机器人的鲁棒轨迹跟踪。对于本文所考虑的机器人,使用了一个明确的结构验证数学模型,以实现系统化的控制器设计,并具有严格的稳定性保证,适用于实时实现。针对内环的速度误差和积分误差联合动力学,开发了基于李雅普诺夫的实用稳定性分析,得出了速度误差和积分误差联合状态的实用稳定性和一致最终有界性的显式界和充分条件。所得控制律保留了PI类结构,并具有基于模型的前馈补偿,使其适用于标准嵌入式平台上的实现,同时提高了对构型依赖的残余动力学和未建模效应的鲁棒性。所提设计的有效性和鲁棒性在四轮独立转向和独立驱动的移动机器人平台上进行了实验验证,包括水平和垂直操作条件,并与PI控制器和滑模控制器进行了对比。

英文摘要

In this paper, a Lyapunov-based synthesis of a PI-like controller is proposed for robust trajectory tracking of an independently driven and steered four-wheel mobile robot. For the robot considered in this work, an explicit structurally verified mathematical model is used to enable systematic controller design with rigorous stability guarantees suitable for real-time implementation. An augmented Lyapunov-based practical stability analysis is developed for the combined velocity-error and integral-error dynamics of the inner loop, yielding explicit bounds and sufficient conditions for practical stability and uniform ultimate boundedness of the combined velocity-error and integral-error state. The resulting control law retains a PI-like structure with model-based feedforward compensation, making it suitable for implementation on standard embedded platforms while improving robustness against configuration-dependent residual dynamics and unmodelled effects. The effectiveness and robustness of the proposed design are demonstrated experimentally on a four-wheel independently steered and independently driven mobile robot platform, under both horizontal and vertical operating conditions, and benchmarked against a PI controller and a sliding-mode controller.
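As a rough illustration of what "PI-like structure with model-based feedforward compensation" means in general, the sketch below simulates a scalar velocity loop. It is a textbook toy and not the controller from the paper: the first-order plant, the gains, the constant disturbance, and the function name `simulate` are all assumptions chosen for illustration.

```python
def simulate(kp=5.0, ki=10.0, mass=1.0, damping=0.5,
             disturbance=-1.0, v_ref=1.0, dt=0.01, steps=1000):
    """Track a constant velocity reference on the toy plant
    mass * dv/dt = -damping * v + u + disturbance,
    and return the final tracking error v_ref - v."""
    v = 0.0  # plant velocity
    z = 0.0  # integral of the velocity error
    for _ in range(steps):
        e = v_ref - v
        z += e * dt
        # Feedforward from the nominal model (reference acceleration is zero
        # for a constant reference), plus proportional and integral feedback.
        u = damping * v_ref + kp * e + ki * z
        # Explicit Euler step; the disturbance is unknown to the controller.
        v += dt * (-damping * v + u + disturbance) / mass
    return v_ref - v

error_with_integral = simulate()           # integral term rejects the disturbance
error_without_integral = simulate(ki=0.0)  # a constant offset remains
```

In this toy setting, the integral term drives the error to zero despite the unmodelled constant disturbance. With `ki=0.0` the error settles at `-disturbance / (damping + kp)`.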

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG 版本更新

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval:将研究思路评估视为基于知识的多视角推理问题

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InnoEval框架,通过异构深度知识检索和多视角评审委员会,实现基于知识的多维度解耦评估,在点对点、成对和分组评估任务中优于基线方法。

Comments ICML 2026

详情
AI中文摘要

大型语言模型的快速发展催生了科学思路的激增,但这一飞跃并未伴随思路评估的相应进步。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而,现有的思路评估方法往往存在知识视野狭窄、评估维度扁平化以及LLM作为评判者的固有偏见。为解决这些问题,我们将思路评估视为一个基于知识的多视角推理问题,并引入InnoEval,一个深度创新评估框架,旨在模拟人类水平的思路评估。我们应用了一个异构深度知识搜索引擎,从多样化的在线来源中检索和获取动态证据。我们进一步通过一个包含不同学术背景的评审员的创新评审委员会实现评审共识,从而在多个指标上进行多维解耦评估。我们构建了来自权威同行评审提交的全面数据集,以基准测试InnoEval。实验表明,InnoEval在点对点、成对和分组评估任务中始终优于基线方法,展现出与人类专家高度一致的判断模式和共识。

英文摘要

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2602.12753 2026-06-12 cs.LG 版本更新

Hierarchical Successor Representation for Robust Transfer

层次化后继表示用于鲁棒迁移

Changmin Yu, Máté Lengyel

发表机构 * University of Cambridge(剑桥大学) DeepMind(深度思维)

AI总结 提出层次化后继表示(HSR),通过时间抽象构建鲁棒的状态特征,结合非负矩阵分解实现稀疏低秩表示,支持多隔间环境下的高效任务迁移与探索。

详情
AI中文摘要

后继表示(SR)为将预测动态与奖励解耦提供了强大框架,能够实现跨奖励配置的快速泛化。然而,经典SR受其固有的策略依赖性限制:由于持续学习、环境非平稳性和任务需求变化,策略会发生变化,使得已建立的预测表示过时。此外,在拓扑复杂的环境中,SR遭受谱扩散,导致特征密集重叠且扩展性差。本文提出层次化后继表示(HSR)以克服这些限制。通过将时间抽象纳入预测表示的构建,HSR学习到对任务引起的策略变化鲁棒的稳定状态特征。将非负矩阵分解(NMF)应用于HSR,得到稀疏低秩的状态表示,有助于在多隔间环境中实现向新任务的高样本效率迁移。进一步分析表明,HSR-NMF发现了可解释的拓扑结构,提供了策略无关的层次化地图,有效桥接了无模型最优性和基于模型的灵活性。除了为任务迁移提供有用基础外,我们还展示了HSR的时间扩展预测结构也可用于驱动高效探索,有效扩展到大规模程序生成的环境。

英文摘要

The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
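To make the classical SR and its policy dependence concrete, here is a minimal tabular sketch. It computes the standard SR, M = (I - γP)^{-1}, by fixed-point iteration for a given policy-induced transition matrix P. It does not implement HSR or the NMF step, and the two-state chain and the function names are illustrative assumptions.

```python
def successor_representation(P, gamma, iters=200):
    """Classical tabular SR via the fixed-point iteration M <- I + gamma * P @ M.

    P is the state-transition matrix induced by a fixed policy; for
    0 <= gamma < 1 the iteration converges to (I - gamma * P)^{-1}.
    """
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [[identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M

def values(M, rewards):
    """State values for any reward vector: V = M @ r, with no relearning of M."""
    return [sum(m * r for m, r in zip(row, rewards)) for row in M]

# Hypothetical two-state world under two different policies.
P_alternate = [[0.0, 1.0], [1.0, 0.0]]  # policy that always switches state
P_stay = [[1.0, 0.0], [0.0, 1.0]]       # policy that always stays put
M_alternate = successor_representation(P_alternate, gamma=0.5)
M_stay = successor_representation(P_stay, gamma=0.5)
V = values(M_alternate, [0.0, 1.0])
```

Changing the reward vector only requires recomputing `values`, which is the decoupling the abstract refers to. `M_alternate` and `M_stay` differ, which shows the policy dependence that motivates HSR. Each row of an SR sums to 1 / (1 - γ).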

2505.11846 2026-06-12 cs.LG math.AG 版本更新

Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks

刀刃上的学习:多项式神经网络的可辨识性与奇异性

Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn

发表机构 * Department of Mathematics, KTH Royal Institute of Technology(数学系,皇家理工学院)

AI总结 研究以多项式为激活函数的MLP和CNN的函数空间(神经流形),证明MLP参数化几乎处处有限对一,CNN参数化一一对应,并刻画奇异性源于稀疏子网络,解释MLP的稀疏偏好。

Comments Published at ICLR 2026

详情
AI中文摘要

我们研究由神经网络参数化的函数空间,称为神经流形。具体地,我们关注具有充分一般多项式激活函数的深度多层感知机(MLP)和卷积神经网络(CNN)。首先,我们解决可辨识性问题,表明对于MLP神经流形中的几乎所有函数,只有有限多个参数选择产生该函数。对于CNN,参数化通常是一一对应的。作为推论,我们计算了神经流形的维数。其次,我们描述神经流形的奇异点。我们完全刻画了CNN的奇异性,部分刻画了MLP的奇异性。在这两种情况下,奇异性都源于稀疏子网络。对于MLP,我们证明这些奇异性通常对应于均方误差损失的临界点,而这对CNN不成立。这为MLP的稀疏偏好提供了几何解释。我们的所有结果都利用了代数几何的工具。

英文摘要

We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.

2602.12024 2026-06-12 cs.RO 版本更新

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

自适应视界冲突搜索用于闭环多智能体路径规划

Jiarui Li, Federico Pecora, Runyu Zhang, Gioele Zardini

发表机构 * Laboratory for Information and Decision Systems, Massachusetts Institute of Technology(信息与决策系统实验室,麻省理工学院) Schwarzman College of Computing(施瓦茨曼计算学院)

AI总结 提出ACCBS算法,通过动态调整规划视界和重用约束树,在有限计算预算下快速生成高质量可行解,兼具渐近最优性和扰动适应性。

详情
AI中文摘要

MAPF是自动化仓库和物流中大型机器人编队的核心协调问题。现有方法要么是开环规划器,生成固定轨迹并难以处理扰动,要么是闭环启发式方法,没有可靠性能保证,限制了其在安全关键部署中的使用。本文提出ACCBS,一种基于有限视界CBS变体的闭环算法,具有受MPC中迭代加深启发的视界变化机制。ACCBS根据可用计算预算动态调整规划视界,并重用单个约束树以实现视界之间的无缝过渡。因此,它能在预算增加时快速产生高质量可行解,同时渐近最优,表现出任意时间行为。大量案例研究表明,ACCBS结合了对扰动的灵活性和强性能保证,有效弥合了大规模机器人部署中理论最优性与实际鲁棒性之间的差距。

英文摘要

MAPF is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of CBS with a horizon-changing mechanism inspired by iterative deepening in MPC. ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.

2602.09730 2026-06-12 cs.CV cs.LG cs.NA math.NA 版本更新

Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

龟裂的魅力:一种变分-生成式绘画裂纹检测方法

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith

发表机构 * Dept. of Mathematics, LMU Munich(慕尼黑大学数学系) Munich Center for Machine Learning(慕尼黑机器学习中心) Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY(亥姆霍兹影像,德国电子同步加速器研究所 DESY) Fachbereich Mathematik, University of Hamburg(汉堡大学数学系) CIT School, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院)

AI总结 提出混合方法,将裂纹检测建模为逆问题,用深度生成模型作为画作先验,结合Mumford-Shah变分泛函和裂纹先验,通过联合优化获得像素级裂纹定位图。

详情
AI中文摘要

近期成像技术、深度学习与数值性能的进步使得对艺术品的非侵入性详细分析成为可能,支持其记录与保护。特别是,数字化绘画中龟裂的自动检测对于评估退化和指导修复至关重要,但由于可能复杂的场景以及裂纹与类似裂纹的艺术特征(如笔触或毛发)之间的视觉相似性,这仍然具有挑战性。我们提出一种混合方法,将裂纹检测建模为一个逆问题,将观测图像分解为无裂纹绘画和裂纹分量。采用深度生成模型作为底层艺术品的有力先验,同时使用Mumford-Shah型变分泛函结合裂纹先验来捕捉裂纹结构。联合优化得到绘画中裂纹定位的像素级图。

英文摘要

Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford-Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.

2602.08913 2026-06-12 cs.LG stat.ML 版本更新

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

GEMSS: 一种用于在分类和回归问题中发现多个稀疏解的变分贝叶斯方法

Kateřina Henclová, Václav Šmídl

发表机构 * Faculty of Electrical Engineering, Czech Technical University(捷克技术大学电子工程系)

AI总结 提出GEMSS算法,利用结构化spike-and-slab先验、高斯混合近似后验和Jaccard惩罚,通过变分推断同时发现多个多样化的稀疏特征组合,在128个实验和3个真实数据集上优于对比方法。

详情
AI中文摘要

高维、欠定且高度相关的系统在数据科学实践中很常见,尤其是在分析物理测量时。在这种情况下,特征选择面临根本性挑战,因为多个不同的稀疏子集可能同样好地解释响应。识别这些子集不仅对预测建模至关重要,而且对生成关于潜在机制的领域特定见解也至关重要。然而,传统方法通常只隔离单个解,掩盖了全部合理的解释。本文介绍了GEMSS(高斯集成多稀疏解),一种变分算法,旨在同时发现多个多样化的稀疏特征组合。该方法采用结构化spike-and-slab先验实现稀疏性,使用高斯混合近似难以处理的多模态后验,并引入基于Jaccard的惩罚进一步控制解的多样性。通过随机梯度下降优化单个目标函数。该方法通过一个新的基准测试框架在128个综合实验上进行测试,该框架旨在生成具有相同预测属性的多个稀疏解的人工问题。这使我们能够测量真实特征的检索,而不仅仅是评估预测性能——这些特征更符合我们的实际需求。比较分析表明,GEMSS始终优于通过ALFESE框架适配的五种著名特征选择方法。最后,我们通过来自代谢组学和物理化学的3个具有挑战性的真实世界数据集展示了其实用性:GEMSS成功分离出多个不同但质量高的解。GEMSS作为PyPI包'gemss'提供。相应的存储库 github.com/kat-er-ina/gemss/ 包含完整的代码库和免费的无代码应用程序GEMSS Explorer。

英文摘要

High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge because multiple distinct sparse subsets may explain the response equally well. Their identification is crucial not only for predictive modeling but also for generating domain-specific insights into the underlying mechanisms. Yet, conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. This work introduces GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational algorithm designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. A single objective function is optimized via stochastic gradient descent. The method is tested on 128 comprehensive experiments by a novel benchmarking framework designed to generate artificial problems with multiple sparse solutions of equal predictive properties. This allows us to measure the retrieval of ground truth features rather than only evaluating predictive performance -- characteristics more fitting to our practical needs. A comparative analysis shows that GEMSS consistently outperforms five prominent feature selection methods adapted through the ALFESE framework. Finally, we demonstrate practical usability through 3 challenging real-world datasets from metabolomics and physical chemistry: GEMSS successfully isolates multiple distinct yet high-quality solutions. GEMSS is available as a PyPI package 'gemss'. The corresponding repository github.com/kat-er-ina/gemss/ includes the full codebase and a free, no-code application GEMSS Explorer.
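The notion of overlap behind a Jaccard-based diversity penalty can be sketched with plain sets of selected feature indices. This is the discrete set definition only. How GEMSS turns it into a term in its objective is not described in the abstract, and the helper names below are hypothetical and do not belong to the `gemss` package.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature-index collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

def mean_pairwise_overlap(solutions):
    """Average Jaccard similarity over all pairs of sparse solutions.

    Lower values mean the candidate feature subsets are more diverse.
    """
    n = len(solutions)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(jaccard(solutions[i], solutions[j]) for i, j in pairs) / len(pairs)

# Hypothetical supports (selected feature indices) of three sparse solutions.
supports = [{1, 2}, {3, 4}, {1, 2}]
overlap = mean_pairwise_overlap(supports)
```

A quantity like `overlap` could be added to a loss to discourage several mixture components from collapsing onto the same feature subset.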

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Ex-Omni:为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) LIGHTSPEED Independent Researcher(独立研究员)

AI总结 提出Ex-Omni模型,通过混合形状感知语音单元生成器和解码器解耦语义推理与时间生成,并引入统一令牌查询门控融合机制,实现全模态大语言模型同步生成语音和3D面部动画。

详情
AI中文摘要

全模态大语言模型(OLLM)旨在统一多模态理解和生成,然而,尽管对自然的人机交互十分重要,将其扩展到联合生成语音和3D面部动画仍基本未被探索。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni),一个开源模型,通过原生语音伴随的3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦,其中语音单元提供时间支架,隐藏语音表示携带面部相关线索。我们进一步引入统一的令牌查询门控融合(TQGF)机制用于受控语义注入,以及InstructS2SF-1200K,一个包含1200K样本的预训练数据集。大量实验表明,Ex-Omni在保持竞争性语音理解和生成能力的同时,实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2602.04675 2026-06-12 cs.LG 版本更新

Generalized Schrödinger Bridge on Graphs

图上的广义薛定谔桥

Panagiotis Theodoropoulos, Juno Nam, Evangelos Theodorou, Jaemoo Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GSBoG框架,通过似然优化学习图上可控连续时间马尔可夫链策略,满足端点边际分布并优化中间状态成本,实现可扩展的拓扑感知运输。

详情
AI中文摘要

图上的运输是许多领域中的一个基本挑战,决策必须尊重拓扑和操作约束。尽管需要可执行的策略,现有的图运输方法缺乏这种表达能力。它们依赖于限制性假设,无法在稀疏拓扑上泛化,并且随着图大小和时间范围的增加而扩展性差。为了解决这些问题,我们引入了图上的广义薛定谔桥(GSBoG),这是一种新颖的可扩展数据驱动框架,用于在状态成本增强动力学下学习任意图上的可执行受控连续时间马尔可夫链(CTMC)策略。值得注意的是,GSBoG学习轨迹级策略,避免了密集的全局求解器,从而增强了可扩展性。这是通过似然优化方法实现的,满足端点边际分布,同时优化状态依赖运行成本下的中间行为。在具有挑战性的真实世界图拓扑上的大量实验表明,GSBoG能够可靠地学习准确、尊重拓扑的策略,同时优化特定应用的中间状态成本,突出了其广泛的适用性,并为一般图上的成本感知动态运输开辟了新途径。

英文摘要

Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graph-transport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.
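For orientation, the sketch below simulates an ordinary continuous-time Markov chain on a small graph with fixed jump rates, using the standard Gillespie procedure. It shows only why a CTMC respects topology by construction: jumps can occur only along edges that carry a rate. It does not learn rates or match endpoint marginals as GSBoG does, and the graph, the rates, and the function name are assumptions.

```python
import random

def simulate_ctmc(rates, start, t_end, rng):
    """Sample one CTMC trajectory as a list of (time, node) pairs.

    rates maps each node to a dict {neighbor: jump_rate}. Only existing
    edges appear, so every sampled jump follows the graph topology.
    """
    t, state = 0.0, start
    path = [(0.0, start)]
    while True:
        out = rates.get(state, {})
        total = sum(out.values())
        if total <= 0.0:
            break  # absorbing node: no outgoing rate
        t += rng.expovariate(total)  # exponential holding time
        if t >= t_end:
            break
        # Pick the next node with probability proportional to its rate.
        threshold = rng.random() * total
        acc = 0.0
        for nxt, rate in out.items():
            acc += rate
            if threshold < acc:
                break
        state = nxt
        path.append((t, state))
    return path

# Hypothetical line graph a - b - c with hand-picked rates.
rates = {"a": {"b": 1.0}, "b": {"a": 0.5, "c": 2.0}, "c": {"b": 1.0}}
path = simulate_ctmc(rates, "a", t_end=5.0, rng=random.Random(0))
```

In a controlled setting, the entries of `rates` would be state- and time-dependent outputs of a learned policy instead of constants.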

2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出SCALE推理策略,利用自不确定性联合调节视觉感知和动作,无需额外训练或验证器,仅单次前向传播,提升VLA模型在模拟和真实环境中的鲁棒性。

Comments ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式,测试时缩放(TTS)在增强训练外鲁棒性方面受到关注。然而,现有的VLA TTS方法需要额外训练、验证器和多次前向传播,使其部署不切实际。此外,它们仅干预动作解码,而保持视觉表示固定——在感知模糊的情况下不足,此时重新考虑如何感知与决定做什么同样重要。为解决这些限制,我们提出SCALE,一种简单的推理策略,基于“自不确定性”联合调节视觉感知和动作,受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器,且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索,而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明,SCALE改进了最先进的VLA模型,并优于现有TTS方法,同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory, requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

2601.09693 2026-06-12 cs.LG stat.ML 版本更新

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

对比几何学习实现统一的结构与配体药物设计

Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

发表机构 * DeepMind Ltd(DeepMind有限公司)

AI总结 提出对比几何模型ConGLUDe,统一结构与配体训练,实现虚拟筛选、靶标钓鱼和配体条件口袋预测,在多项基准测试中表现优异。

Comments Forty-Third International Conference on Machine Learning

详情
AI中文摘要

基于结构和基于配体的计算药物设计传统上依赖于不相关的数据源和建模假设,限制了它们在大规模上的联合使用。在这项工作中,我们引入了用于统一计算药物设计的对比几何学习(ConGLUDe),这是一个单一的对比几何模型,统一了基于结构和基于配体的训练。ConGLUDe将产生全蛋白质表示和预测结合位点的隐式嵌入的几何蛋白质编码器与快速配体编码器耦合,消除了对预定义口袋的需求。通过对比学习将配体与全局蛋白质表示和多个候选结合位点对齐,ConGLUDe除了支持虚拟筛选和靶标钓鱼外,还支持配体条件口袋预测,同时在蛋白质-配体复合物和大规模生物活性数据上联合训练。在多种基准测试中,ConGLUDe实现了具有竞争力的零样本虚拟筛选性能,在具有挑战性的靶标钓鱼任务上显著优于现有方法,并展示了最先进的配体条件口袋选择。这些结果突显了统一结构-配体训练的优势,并将ConGLUDe定位为迈向药物发现通用基础模型的一步。

英文摘要

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for predefined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.