arXivDaily: arXiv daily academic digest, updated Monday through Friday
2511.11406 2026-06-01 cs.CV

Robust Low-Rank Sparse Framework for Video-Based Affective Computing

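Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Xinyu Li, Xin Yan, Ziyu Jia, Xiaokang Zhou

AI summary: Proposes the Low-Rank Sparse Emotion Understanding Framework (LSEF), which uses hierarchical low-rank sparse decomposition to split affective dynamics into emotional bases and transient fluctuations, and adopts a rank-aware optimization strategy to improve robustness and dynamic discrimination.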

Abstract

Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of an emotional fluctuation may differ across emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone) and transient fluctuations (short-term emotional changes). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules: the Stability Encoding Module (SEM), which captures low-rank emotional bases; the Dynamic Decoupling Module (DDM), which isolates sparse transient signals; and the Consistency Integration Module (CIM), which reconstructs multi-scale stability and reactivity coherence. The framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, further validating the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.
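
Illustrative sketch (not from the paper): the low-rank-plus-sparse idea behind the abstract can be shown with a generic alternating decomposition, where a truncated SVD stands in for the slowly varying "emotional base" and a thresholded residual stands in for "transient fluctuations". The function name `low_rank_sparse_split`, the hard-threshold rule, and the toy data are all assumptions; this is not the SEM, DDM, or CIM modules.

```python
import numpy as np

def low_rank_sparse_split(X, rank=1, threshold=1.5, iters=20):
    """Alternate between a rank-`rank` fit and a hard-thresholded residual."""
    S = np.zeros_like(X)
    for _ in range(iters):
        # Low-rank part: truncated SVD of the data with the sparse part removed.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse part: keep only residual entries larger than the threshold.
        R = X - L
        S = np.where(np.abs(R) > threshold, R, 0.0)
    return L, S

# Toy signal: 50 frames x 8 features, a rank-1 "tone" plus two isolated spikes.
tone = np.outer(np.linspace(1.0, 2.0, 50), np.ones(8))
spikes = np.zeros_like(tone)
spikes[10, 3] = 5.0
spikes[30, 6] = -4.0
L, S = low_rank_sparse_split(tone + spikes)
```

On this toy input the two spikes end up in `S` and the smooth tone in `L`; real video features would need a more careful choice of rank and threshold.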

2601.22985 2026-06-01 cs.LG

dgMARK: Decoding-Guided Watermarking for Diffusion Language Models

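Pyo Min Hong, Albert No

AI summary: Proposes dgMARK, which steers the unmasking order of discrete diffusion language models to satisfy a parity constraint, embedding a text watermark without explicitly reweighting probabilities; a sliding-window detector provides robustness to editing operations.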

Comments Accepted at ICML 2026. Project page: https://dgmark-watermarking.github.io

Abstract

We propose dgMARK, a decoding-guided watermarking method for discrete diffusion language models (dLLMs). Unlike autoregressive models, dLLMs can generate tokens in arbitrary order. While an ideal conditional predictor would be invariant to this order, practical dLLMs exhibit strong sensitivity to the unmasking order, creating a new channel for watermarking. dgMARK steers the unmasking order toward positions whose high-reward candidate tokens satisfy a simple parity constraint induced by a binary hash, without explicitly reweighting the model's learned probabilities. The method is plug-and-play with common decoding strategies (e.g., confidence, entropy, and margin-based ordering) and can be strengthened with a one-step lookahead variant. Watermarks are detected via elevated parity-matching statistics, and a sliding-window detector ensures robustness under post-editing operations including insertion, deletion, substitution, and paraphrasing. Project website: https://dgmark-watermarking.github.io
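
Illustrative sketch (not from the paper): detection via "elevated parity-matching statistics" can be pictured as a z-test on how many tokens satisfy a keyed parity rule, evaluated over sliding windows. The SHA-256 based `parity_bit`, the simplified rule "bit equals 1", the key string, and the window size are assumptions made for this example; the paper's actual hash and constraint may differ.

```python
import hashlib
import math

def parity_bit(token_id: int, key: str = "demo-key") -> int:
    """Hypothetical keyed binary hash: map a token id to 0 or 1."""
    digest = hashlib.sha256(f"{key}:{token_id}".encode()).digest()
    return digest[0] & 1

def z_score(tokens, key: str = "demo-key") -> float:
    """z-score of the count of parity-1 tokens against a fair-coin null."""
    n = len(tokens)
    matches = sum(parity_bit(t, key) for t in tokens)
    return (matches - 0.5 * n) / math.sqrt(0.25 * n)

def sliding_window_score(tokens, window: int = 32, key: str = "demo-key") -> float:
    """Maximum z-score over contiguous windows, so a surviving fragment still counts."""
    if len(tokens) <= window:
        return z_score(tokens, key)
    return max(z_score(tokens[i:i + window], key)
               for i in range(len(tokens) - window + 1))

# Toy data: the "watermarked" sequence uses only token ids whose parity bit is 1.
marked = [t for t in range(1000) if parity_bit(t) == 1][:64]
```

A fully matching sequence of n tokens scores sqrt(n), so longer surviving fragments give stronger evidence.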

2601.22943 2026-06-01 cs.LG

Scalable Topology-Preserving Graph Coarsening: Concepts and Algorithms

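Xiang Wu, Rong-Hua Li, Xunkai Li, Kangfei Zhao, Hongchao Qin, Guoren Wang

AI summary: To address the high time complexity of existing topology-preserving graph coarsening methods, proposes Scalable Topology-Preserving Graph Coarsening (STPGC), built on the concepts of graph strong collapse and graph edge collapse from algebraic topology. Three new algorithms eliminate dominated nodes and edges while rigorously preserving topological features; STPGC is proven to preserve the GNN receptive field and accelerates GNN training.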

Abstract

Graph coarsening reduces the size of a graph while preserving certain properties. Most existing methods preserve either spectral or spatial characteristics. Recent research shows that topology-preserving coarsening methods maintain GNN performance on coarsened graphs but suffer from exponential time complexity. To address these problems, we propose Scalable Topology-Preserving Graph Coarsening (STPGC) by introducing the concepts of graph strong collapse and graph edge collapse extended from algebraic topology. STPGC comprises three new algorithms, GStrongCollapse, GEdgeCollapse, and NeighborhoodConing based on these two concepts, which eliminate dominated nodes and edges while rigorously preserving topological features. We further prove that STPGC preserves the GNN receptive field and develop approximate algorithms to accelerate GNN training. Experiments on node classification with GNNs demonstrate the efficiency and effectiveness of STPGC.
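
Illustrative sketch (not from the paper): in the classical algebraic-topology setting, a node v is dominated by a neighbor u when the closed neighborhood of v is contained in that of u, and deleting dominated nodes is a strong collapse. The code below implements only that textbook rule on an adjacency dictionary; it is not GStrongCollapse, GEdgeCollapse, or NeighborhoodConing, and the function names are hypothetical.

```python
def closed_neighborhood(adj, v):
    """Neighbors of v plus v itself."""
    return adj[v] | {v}

def find_dominated(adj):
    """Return (v, u) with N[v] a subset of N[u] for some neighbor u, or None."""
    for v in sorted(adj):
        for u in sorted(adj[v]):
            if closed_neighborhood(adj, v) <= closed_neighborhood(adj, u):
                return v, u
    return None

def strong_collapse(adj):
    """Repeatedly delete dominated nodes until none remain."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while (pair := find_dominated(adj)) is not None:
        v, _ = pair
        for w in adj.pop(v):
            adj[w].discard(v)
    return adj

path = {0: {1}, 1: {0, 2}, 2: {1}}                      # contractible
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}    # has one loop
```

The path shrinks to a single node, while the 4-cycle has no dominated node and is left intact, so its loop survives coarsening.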

2601.22787 2026-06-01 cs.LG

Float8@2bits: Entropy Coding Enables Data-Free Model Compression

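Patrick Putzky, Martin Genzel, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel

AI summary: Proposes the EntQuant framework, which decouples numerical precision from storage cost via entropy coding, achieving extreme 2-bit compression without data or fine-tuning and compressing a 70B-parameter model in under 10 minutes while retaining performance.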

Comments ICML 2026. Code available at https://github.com/merantix-momentum/entquant

Abstract

Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme bit-rates below 4 bits. On the other hand, techniques leveraging calibration data or extensive recovery training achieve superior fidelity but impose high computational constraints and face uncertain robustness under data distribution shifts. We introduce EntQuant, a framework that unites the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant enables practical utility in the extreme compression regime. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 10 minutes. We demonstrate that EntQuant not only achieves state-of-the-art results on standard evaluation sets and models but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.
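
Illustrative sketch (not from the paper): the statement that entropy coding decouples numerical precision from storage cost rests on a standard fact, namely that an ideal entropy coder needs about as many bits per symbol as the empirical Shannon entropy, regardless of the nominal symbol width. The toy code list and the helper name `entropy_bits_per_symbol` are assumptions; this is not the EntQuant algorithm.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Toy "quantized weights": stored as 8-bit codes, but concentrated on few values.
codes = [0] * 4 + [1] * 2 + [2] + [255]
bits = entropy_bits_per_symbol(codes)  # 1.75 bits per symbol instead of 8
```

Here the codes occupy an 8-bit format, yet their distribution is skewed enough that an ideal coder would spend under 2 bits per weight.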

2508.09925 2026-06-01 cs.LG cs.AI

Residual Reservoir Memory Networks

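Matteo Pinna, Andrea Ceni, Claudio Gallicchio

AI summary: Proposes ResRMN, a new class of untrained recurrent neural networks that combines a linear memory reservoir with a non-linear reservoir based on temporal residual orthogonal connections to enhance long-term input propagation; it outperforms conventional reservoir computing models on time-series and pixel-level 1-D classification tasks.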

Comments IJCNN 2025

Abstract

We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.

2506.01318 2026-06-01 cs.LG cs.AI

Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack

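SeungBum Ha, Saerom Park, Sung Whan Yoon

AI summary: For class-level machine unlearning, proposes the over-unlearning metric OU@epsilon and exposes the Prototypical Relearning Attack; the Spotter method mitigates both blind spots by combining masked knowledge distillation with an intra-class dispersion loss.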

Comments 9 pages, ICML 2026

Abstract

Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet the existing techniques overlook two critical blind spots: "over-unlearning" that deteriorates retained data near the forget set, and post-hoc "relearning" attacks that aim to resurrect the forgotten knowledge. Focusing on class-level unlearning, we first derive an over-unlearning metric, OU@epsilon, which quantifies collateral damage in regions proximal to the forget set, where over-unlearning mainly occurs. Next, we expose an unforeseen relearning threat on MU, i.e., the Prototypical Relearning Attack, which exploits the per-class prototype of the forget class with just a few samples, and easily restores the pre-unlearning performance. To counter both blind spots in class-level unlearning, we introduce Spotter, a plug-and-play objective that combines (i) a masked knowledge-distillation penalty on the nearby region of forget classes to suppress OU@epsilon, and (ii) an intra-class dispersion loss that scatters forget-class embeddings, neutralizing Prototypical Relearning Attacks. Spotter achieves state-of-the-art results across CIFAR, TinyImageNet, and CASIA-WebFace datasets, offering a practical remedy to unlearning's blind spots.

2601.22412 2026-06-01 cs.CV

Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

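Seth Donahue, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, R. James Cotton

AI summary: Proposes a probabilistic multi-view markerless motion capture method that estimates joint angle posterior distributions via variational inference and assesses confidence-interval calibration with the Expected Calibration Error, enabling reliable gait analysis that identifies unreliable outputs without ground-truth instrumentation.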

Comments 9 pages, 5 figures, EMBS Special Issue

Abstract

Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally < 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model's predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
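
Illustrative sketch (not from the paper): one common way to score the calibration of confidence intervals is to compare nominal coverage with empirical coverage over several levels and average the absolute gaps. The function `interval_ece`, the Gaussian interval assumption, the chosen levels, and the synthetic errors are all assumptions; the paper's exact ECE definition may differ.

```python
from statistics import NormalDist
import numpy as np

def interval_ece(errors, sigmas, levels=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95)):
    """Mean absolute gap between nominal and empirical coverage of central Gaussian intervals."""
    errors = np.asarray(errors, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    gaps = []
    for p in levels:
        # Half-width of the central p-interval, in units of the predicted sigma.
        z = NormalDist().inv_cdf(0.5 + p / 2.0)
        covered = np.abs(errors) <= z * sigmas
        gaps.append(abs(covered.mean() - p))
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 1.0, size=20000)          # synthetic prediction errors
calibrated = interval_ece(errors, np.ones_like(errors))
overconfident = interval_ece(errors, 0.2 * np.ones_like(errors))
```

When the predicted sigma matches the true error spread the score is close to zero; predicting a sigma five times too small yields a much larger value.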

2601.22296 2026-06-01 cs.LG cs.AI

ParalESN: Enabling parallel information processing in Reservoir Computing

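Matteo Pinna, Giacomo Lagomarsini, Andrea Ceni, Claudio Gallicchio

AI summary: Proposes ParalESN, which uses diagonal linear recurrence in the complex domain to parallelize reservoir computing, greatly improving computational efficiency while preserving the Echo State Property and universality guarantees.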

Comments ICML 2026

Abstract

Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely constrained by the need to process temporal data sequentially and the prohibitive memory footprint of high-dimensional reservoirs. To address these limitations, we revisit RC through the lens of structured operators and state space modeling, introducing Parallel Echo State Network (ParalESN). Leveraging diagonal linear recurrence in the complex domain, ParalESN enables parallel processing of temporal data and the construction of efficient, high-dimensional reservoirs. A thorough theoretical analysis demonstrates that the Echo State Property and the universality guarantees of traditional Echo State Networks are preserved, while also admitting an equivalent representation of arbitrary linear reservoirs in the complex diagonal form. Empirically, ParalESN achieves competitive predictive accuracy with traditional RC and with fully trainable sequence models, while delivering computational savings by orders of magnitude. Overall, ParalESN offers a scalable and principled pathway for integrating RC within the deep learning landscape.
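
Illustrative sketch (not from the paper): a diagonal linear recurrence h_t = lam * h_{t-1} + u_t has the closed form h_t = sum over k <= t of lam^(t-k) u_k, which can be evaluated for all time steps at once. The parameter ranges, variable names, and the cumulative-sum trick below are assumptions chosen for a short demo; this is not the ParalESN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_res = 16, 3, 8

# Untrained, randomly drawn parameters (hypothetical names and ranges).
radius = rng.uniform(0.8, 0.95, size=d_res)
phase = rng.uniform(0.0, 2 * np.pi, size=d_res)
lam = radius * np.exp(1j * phase)      # diagonal complex recurrence, |lam| < 1
B = rng.normal(size=(d_res, d_in))     # input weights
x = rng.normal(size=(T, d_in))         # input sequence
u = x @ B.T                            # driven input, shape (T, d_res)

# Sequential reference: h_t = lam * h_{t-1} + u_t, starting from zero.
h = np.zeros(d_res, dtype=complex)
states = []
for t in range(T):
    h = lam * h + u[t]
    states.append(h)
sequential = np.stack(states)

# Parallel form: h_t = lam^t * cumsum_k(lam^(-k) * u_k), no loop over time.
powers = lam[None, :] ** np.arange(T)[:, None]   # shape (T, d_res)
parallel = powers * np.cumsum(u / powers, axis=0)
```

The division by powers of `lam` becomes numerically fragile for long sequences, so a practical implementation would more likely use an associative scan; the short sequence here keeps the demo stable.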

2509.24319 2026-06-01 cs.CL cs.AI

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

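Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

AI summary: Using value vectors and value neurons, this paper shows that intrinsic and prompted value expression in large language models partly share core components mechanistically, while each has unique functions: the intrinsic mechanism promotes diversity and the prompted mechanism strengthens instruction compliance.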

Comments Accepted at ICML 2026. Project page: https://holi-lab.github.io/ValueMechanism/

Abstract

Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms. We analyze this largely understudied problem at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical inter-value correlations in the model's internal representations. Yet, each mechanism also possesses unique components that fulfill distinct roles. In particular, the intrinsic mechanism activates in more diverse value-related scenarios and promotes response diversity, whereas the prompted mechanism strengthens instruction compliance, taking effect even in distant tasks like jailbreaking.

2509.20784 2026-06-01 cs.CL cs.AI

Towards Atoms of Large Language Models

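Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

AI summary: Proposes Atom Theory, which defines, evaluates, and identifies the fundamental representational units (atoms) of large language models via the atomic inner product (AIP), and proves that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Experiments find that neurons and features do not meet the criteria for ideal atoms, whereas near-perfect atoms can be identified by matching TSAE capacity to the data scale.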

Comments To be published in ICML 2026

Abstract

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.

2505.17595 2026-06-01 cs.LG cs.CL

NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs

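Li Lin, Xinyu Hu, Xiaojun Wan

AI summary: To address the reliance on the Min-Max formula for parameter initialization in uniform quantization of low-bit LLMs, proposes NeUQI, which derives the zero-point so that only the scale needs to be optimized, efficiently yielding a near-optimal initialization. It outperforms existing methods on the LLaMA and Qwen families and, combined with lightweight distillation, surpasses the resource-intensive PV-tuning.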

Comments accepted by ICML 2026

Abstract

Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored due to its efficiency and ease of deployment, as uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on low-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they mainly focus on quantization methodologies, while the initialization of quantization parameters remains underexplored and still relies on the conventional Min-Max formula. In this work, we identify the limitations of the Min-Max formula, move beyond its constraints, and propose NeUQI, a method that efficiently determines near-optimal initialization for uniform quantization. Our NeUQI simplifies the joint optimization of the scale and zero-point by deriving the zero-point for a given scale, thereby reducing the problem to a scale-only optimization. Benefiting from the improved quantization parameters, our NeUQI consistently outperforms existing methods in the experiments with the LLaMA and Qwen families on various settings and tasks. Furthermore, when combined with a lightweight distillation strategy, NeUQI even achieves superior performance to PV-tuning, a considerably more resource-intensive method.
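
Illustrative sketch (not from the paper): the conventional Min-Max initialization that the abstract takes as its starting point is usually written as scale = (max - min) / (2^bits - 1) with zero-point = round(-min / scale). The code below shows only that baseline with asymmetric uniform quantization; the helper names are hypothetical, and NeUQI's derivation of the zero-point for a given scale is not reproduced here.

```python
import numpy as np

def min_max_params(w, bits):
    """Baseline Min-Max scale and zero-point for asymmetric uniform quantization."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    zero_point = np.round(-w.min() / scale)
    return scale, zero_point

def quantize(w, scale, zero_point, bits):
    """Map real weights to integer codes in [0, 2^bits - 1]."""
    q = np.round(w / scale + zero_point)
    return np.clip(q, 0, 2 ** bits - 1)

def dequantize(q, scale, zero_point):
    """Map integer codes back to real values."""
    return scale * (q - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(size=256)                      # toy weight group
scale, zero_point = min_max_params(w, bits=4)
w_hat = dequantize(quantize(w, scale, zero_point, bits=4), scale, zero_point)
max_error = np.abs(w - w_hat).max()           # bounded by half a quantization step
```

Because Min-Max stretches the grid to cover the extreme values, a few outliers can widen the step for every weight, which is one reason better initializations are worth searching for.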

2601.22068 2026-06-01 cs.LG

Quantifying the Uncertainty of Foundation Models with Singular Value Ensembles

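Mehmet Ozgur Turkoglu, Dominik J. Mühlematter, Alexander Becker, Konrad Schindler, Helge Aasen

AI summary: Proposes Singular Value Ensemble (SVE), which freezes the singular vectors and trains only per-member singular values, achieving implicit ensembling with minimal parameter overhead and effectively quantifying the uncertainty of foundation models.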

Comments Accepted at ICML 2026 (camera-ready version)

Abstract

Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, they often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty is to use an ensemble of multiple independently trained models, but the computational cost scales linearly with ensemble size, making ensembles impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensembling method. SVE builds on a simple but powerful core assumption: the singular vectors of the weight matrices correspond to meaningful directions in the representation space. If the singular vectors are indeed meaningful (orthogonal) "knowledge directions", then a model ensemble can be obtained by modulating only how strongly each direction contributes to the output. Rather than learning new parameters for each ensemble member, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally during joint training as stochastic initialization and random batch sampling cause different members to converge to different combinations of the same underlying knowledge. SVE performs comparably to an explicit ensemble while increasing the parameter count of the base model by <1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various backbones and show that it improves calibration while maintaining predictive accuracy.
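
Illustrative sketch (not from the paper): the core construction, shared frozen singular vectors with per-member singular values, can be written for a single weight matrix as W_m = U diag(s_m) V^T. In the code below the per-member values are randomly perturbed rather than trained, and the names `member_forward` and `member_s` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # stand-in for a pretrained weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # U and Vt stay frozen and shared

n_members = 3
# Per-member singular values; perturbed here, trained in an actual method.
member_s = s * (1.0 + 0.1 * rng.normal(size=(n_members, s.size)))

def member_forward(x, s_m):
    """Apply W_m = U diag(s_m) V^T to a batch of row vectors x."""
    return ((x @ Vt.T) * s_m) @ U.T

x = rng.normal(size=(5, 4))
outputs = np.stack([member_forward(x, s_m) for s_m in member_s])
mean_prediction = outputs.mean(axis=0)
disagreement = outputs.std(axis=0)     # spread across members as an uncertainty proxy
extra_params = n_members * s.size      # only the singular values are member-specific
```

Each extra member costs one vector of singular values per matrix instead of a full copy of the weights, which is where the small parameter overhead comes from.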

2601.18537 2026-06-01 cs.RO cs.AI

SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction

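Linyong Gan, Zimo Li, Wenxin Xu, Xingjian Li, Jianhua Z. Huang, Enmei Tu, Shuhang Chen

AI summary: To address directional drift in long-horizon trajectory prediction, proposes a trajectory modeling framework conditioned on a semantic Next Key Point (NKP), decomposing prediction into global semantic decision-making and local motion modeling. A pretrain-finetune strategy estimates the NKP prior, and on real-world AIS data the method markedly improves long-horizon, directional, and fine-grained prediction performance.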

Abstract

Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

2509.02970 2026-06-01 cs.LG math.OC

Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

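Kaoru Otsuka, Yuki Takezawa, Makoto Yamada

AI summary: To address the problem that Byzantine clients may form a majority under partial participation, proposes the delayed momentum aggregation principle, which aggregates cached momentum from non-sampled clients and fresh momentum from sampled clients so that Byzantine clients remain a minority from the server's perspective. The principle is instantiated as the DeMoA optimizer, whose robustness and efficiency are supported by theoretical analysis and experiments.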

Comments camera-ready version for ICML 2026

Abstract

Partial participation is essential for communication-efficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. In the partial participation setting, a majority of the sampled clients may be Byzantine; once Byzantine clients dominate, existing methods break down immediately. We introduce delayed momentum aggregation, a principle where the central server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients. This principle ensures Byzantine clients remain a minority from the server's perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA. We analyze the convergence rate of DeMoA, showing that DeMoA is Byzantine-robust under partial participation. Experiments show that, with a 20% Byzantine ratio and only a 10% partial participation rate, DeMoA achieves the best accuracy even when existing methods fail empirically.
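
Illustrative sketch (not from the paper): the principle of aggregating cached momentum from non-sampled clients together with fresh momentum from sampled clients can be shown with a server-side cache and a robust aggregator. The coordinate-wise median, the function `aggregate_with_cache`, and the toy numbers are assumptions; the aggregator and momentum update used by DeMoA are not stated in the abstract.

```python
import numpy as np

def aggregate_with_cache(cache, fresh):
    """Refresh cached momenta of sampled clients, then take a coordinate-wise median over all clients."""
    for client_id, momentum in fresh.items():
        cache[client_id] = momentum   # updates the server cache in place
    return np.median(cache, axis=0)

n_clients, dim = 10, 4
honest = np.ones(dim)
cache = np.tile(honest, (n_clients, 1))   # every client reported an honest momentum earlier

# This round only clients 0 and 1 are sampled, and both are Byzantine (20% of clients).
fresh = {0: np.full(dim, 100.0), 1: np.full(dim, 100.0)}
delayed = aggregate_with_cache(cache, fresh)
naive = np.median(np.stack(list(fresh.values())), axis=0)   # sampled clients only
```

Aggregating only the sampled clients returns the attackers' value, whereas the cache keeps the eight honest entries in the majority and the median stays at the honest value.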

2601.21686 2026-06-01 cs.LG

Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello

AI summary: Proposes StiefAttention, which compresses the KV cache by learning orthonormal projection bases on the Stiefel manifold that minimize decoder-layer output reconstruction error, outperforming existing SVD-based methods.

Abstract

Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally constructs layer-wise error-rank profiles over candidate ranks, enabling sequential rank allocation under a user-specified KV cache budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $4.2$ points on C4 perplexity and $8.9$ points on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
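
Illustrative sketch (not from the paper): the SVD-style baseline that the abstract contrasts against can be written in a few lines, projecting cached keys onto the top right-singular vectors and storing only the projection. The basis `P` has orthonormal columns, so it is a point on the Stiefel manifold; StiefAttention instead learns such a basis against decoder-layer output error, which is not shown. Shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_head, rank = 32, 16, 4
K = rng.normal(size=(T, d_head))       # cached keys for one head (toy data)

# SVD-style baseline: the top right-singular vectors give an orthonormal basis.
_, s, Vt = np.linalg.svd(K, full_matrices=False)
P = Vt[:rank].T                        # shape (d_head, rank), with P^T P = I

K_small = K @ P                        # what would be stored: shape (T, rank)
K_hat = K_small @ P.T                  # reconstruction at attention time
error = np.linalg.norm(K - K_hat)      # Frobenius reconstruction error
```

This choice minimizes the reconstruction error of the key matrix itself, which is exactly the kind of proxy objective that may not track the error after softmax and later layers.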

2601.21645 2026-06-01 cs.LG math.CT math.RT

Identifiable Equivariant Networks are Layerwise Equivariant

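Vahid Shahverdi, Giovanni Luca Marchetti, Georg Bökman, Kathlén Kohn

AI summary: This paper proves that, under suitable identifiability conditions, the parameters of an end-to-end equivariant network can be chosen so that every layer is equivariant on the latent spaces, giving a mathematical explanation for the emergence of equivariant structure in weights during training.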

Comments Accepted at ICML 2026

Abstract

We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training -- a phenomenon that is consistently observed in practice.
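
Illustrative sketch (not from the paper): layerwise equivariance can be checked numerically on a familiar special case, a layer whose weight matrix a*I + b*11^T commutes with every permutation matrix, followed by a pointwise ReLU. The permutation group, the constants, and the function name `layer` are assumptions chosen for the demo; the paper's result concerns general group actions and is not tied to this architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a, b = 0.7, -0.2
W = a * np.eye(n) + b * np.ones((n, n))   # commutes with every permutation matrix

def layer(x):
    """Linear map followed by a pointwise ReLU."""
    return np.maximum(W @ x, 0.0)

perm = rng.permutation(n)
P = np.eye(n)[perm]                        # permutation matrix acting on coordinates
x = rng.normal(size=n)

lhs = layer(P @ x)                         # act on the input, then apply the layer
rhs = P @ layer(x)                         # apply the layer, then act on the output
```

Stacking such layers gives a network that is equivariant end to end because each layer is; the paper addresses the converse direction under identifiability assumptions.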

2601.21372 2026-06-01 cs.AI

NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

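Yang Song, Anoushka Vyas, Zirui Wei, Sina Khoshfetrat Pakazad, Henrik Ohlsson, Graham Neubig

AI summary: Proposes NEMO, a system that uses autonomous coding agents to translate natural-language descriptions of decision problems into executable mathematical optimization implementations, reaching state-of-the-art performance through an execution-aware architecture and coordination patterns.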

Comments Accepted at ICML 2026

英文摘要

We present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations using autonomous coding agents (ACAs). Existing approaches rely on specialized large language models (LLMs) or bespoke task-specific agents that are often brittle and frequently generate syntactically invalid or non-executable code. NEMO instead treats ACAs as a first-class abstraction analogous to API-based interaction with LLMs; their sandboxed execution guarantees code is executable by construction and supports automated validation and repair. We introduce novel coordination patterns including asymmetric validation loops between independently generated optimizer and simulator implementations, external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. Across nine established optimization benchmarks, NEMO achieves state-of-the-art performance on the majority of tasks with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.
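The abstract mentions minimum Bayes risk (MBR) decoding and self-consistency without giving details. A minimal sketch of generic MBR selection over candidate results follows; the function names, the `agreement` utility, and the sample objective values are hypothetical and do not come from NEMO.

```python
from typing import Callable, Sequence


def mbr_select(candidates: Sequence[float],
               utility: Callable[[float, float], float]) -> int:
    """Return the index of the candidate with the highest average utility
    measured against all other candidates (generic MBR selection)."""
    best_idx, best_score = 0, float("-inf")
    for i, ci in enumerate(candidates):
        others = [utility(ci, cj) for j, cj in enumerate(candidates) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


def agreement(a: float, b: float, tol: float = 1e-6) -> float:
    """Hypothetical utility: 1 if two objective values agree within a
    relative tolerance, else 0."""
    return 1.0 if abs(a - b) <= tol * max(1.0, abs(a), abs(b)) else 0.0


# Hypothetical objective values obtained by executing five generated models.
objective_values = [42.0, 42.0, 17.5, 42.0, 40.0]
chosen = mbr_select(objective_values, agreement)
```

With a 0/1 agreement utility, MBR selection reduces to picking a member of the largest group of agreeing candidates, which is the self-consistency vote.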

2601.20774 2026-06-01 cs.LG

When More Data Doesn't Help: Limits of Adaptation in Multitask Learning

当更多数据无济于事:多任务学习中适应的极限

Steve Hanneke, Mingyue Xu

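AI summary: By establishing a stronger impossibility result for adaptation, the paper shows that multitask learning has statistical limits that aggregating samples cannot overcome, even with arbitrarily large sample sizes per task.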

详情
AI中文摘要

多任务学习及相关框架在现代应用中取得了巨大成功。在多任务学习问题中,我们有一组从相关源任务收集的异构数据集,并希望提高性能,超过单独解决每个任务所能达到的效果。arXiv:2006.15785 的最新工作表明,在无法访问分布信息的情况下,只要每个任务的样本量有界,任何基于聚合样本的算法都无法保证最优风险。在本文中,我们专注于理解多任务学习的统计极限。我们超越了 arXiv:2006.15785 中的无免费午餐定理,建立了一个更强的适应性不可能性结果,该结果对每个任务的任意大样本量都成立。这一改进传达了一个重要信息:多任务学习的困难无法通过每个任务拥有大量数据来克服。我们还讨论了可能对未来研究感兴趣的最优适应性的概念。

英文摘要

Multitask learning and related frameworks have achieved tremendous success in modern applications. In a multitask learning problem, we are given a set of heterogeneous datasets collected from related source tasks and hope to enhance the performance above what we could hope to achieve by solving each of them individually. The recent work of arXiv:2006.15785 has shown that, without access to distributional information, no algorithm based on aggregating samples alone can guarantee optimal risk as long as the sample size per task is bounded. In this paper, we focus on understanding the statistical limits of multitask learning. We go beyond the no-free-lunch theorem in arXiv:2006.15785 by establishing a stronger impossibility result of adaptation that holds for arbitrarily large sample size per task. This improvement conveys an important message that the hardness of multitask learning cannot be overcome by having abundant data per task. We also discuss the notion of optimal adaptivity that may be of future interest.

2601.19936 2026-06-01 cs.LG cs.AI cs.CL

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Gap-K%: 通过测量 Top-1 预测差距检测预训练数据

Minseo Kwak, Jaehyung Kim

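AI summary: Proposes Gap-K%, which uses the log-probability gap between an LLM's top-1 prediction and the target token together with a sliding-window strategy, achieving state-of-the-art pretraining data detection on the WikiMIA and MIMIR benchmarks.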

Comments ACL 2026 Main Conference; 15 pages

详情
AI中文摘要

大型语言模型(LLM)中大规模预训练语料库的不透明性引发了严重的隐私和版权问题,使得预训练数据检测成为一项关键挑战。现有的最先进方法通常依赖于 token 似然,但它们往往忽略了目标 token 与模型 top-1 预测之间的差距,以及相邻 token 之间的局部相关性。在这项工作中,我们提出了 Gap-K%,一种基于 LLM 预训练优化动态的新型预训练数据检测方法。通过分析下一个 token 预测目标,我们观察到模型 top-1 预测与目标 token 之间的差异会引发强烈的梯度信号,这些信号在训练过程中被明确惩罚。受此启发,Gap-K% 利用 top-1 预测 token 与目标 token 之间的对数概率差距,并结合滑动窗口策略来捕获局部相关性并缓解 token 级别的波动。在 WikiMIA 和 MIMIR 基准上的大量实验表明,Gap-K% 实现了最先进的性能,在各种模型大小和输入长度上始终优于先前的基线方法。

英文摘要

The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
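The scoring idea described above can be sketched as follows. The abstract specifies only the top-1/target log-probability gap and the sliding window; the aggregation over the largest K% of windowed gaps (modeled on Min-K%-style scores), the sign convention, and all names and default values here are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def gap_score(logits_per_pos: Sequence[Sequence[float]],
              target_ids: Sequence[int],
              window: int = 2,
              k: float = 0.5) -> float:
    """Hypothetical Gap-K%-style score; values closer to 0 suggest seen text."""
    if not target_ids:
        raise ValueError("need at least one token")
    # Per-token gap between the top-1 log-probability and the target's (>= 0).
    gaps = []
    for logits, t in zip(logits_per_pos, target_ids):
        lp = log_softmax(logits)
        gaps.append(max(lp) - lp[t])
    # Sliding-window mean to smooth token-level fluctuations.
    w = min(window, len(gaps))
    smoothed = [sum(gaps[i:i + w]) / w for i in range(len(gaps) - w + 1)]
    # Average the largest k fraction of windowed gaps (assumed aggregation).
    m = max(1, int(len(smoothed) * k))
    worst = sorted(smoothed, reverse=True)[:m]
    return -sum(worst) / m


toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0]]
score_match = gap_score(toy_logits, [0, 1, 0])     # targets equal top-1
score_mismatch = gap_score(toy_logits, [1, 0, 2])  # targets never top-1
```

In practice the logits would come from a forward pass of the model under audit, and a threshold on the score would separate likely members from non-members.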

2504.04430 2026-06-01 cs.AI

Foundational Requirements for Artificial General Intelligence: A Falsifiable Framework Based on Signal Prediction

人工通用智能的基础要求:一个基于信号预测的可证伪框架

Matej Šprogar

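AI summary: Proposes a falsifiable framework based on signal prediction that defines low-level requirements (such as learning from an uninformed state and real-time liveness) and provides reusable tests for examining artificial general intelligence.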

Comments 9 pages, 2 figures

详情
AI中文摘要

基于高级智能可以从低级信号处理中涌现的前提,我们提出了关于人工通用智能所需的低级要求的假设。所提出的要求刻画了通过预测具有初始未知语义内容的时空结构化信号进行学习的系统的核心属性。它们包括从认知神经科学中观察到的基本原理,从从无知状态学习到实时活性。为了进行实证检验和假设拒绝,我们引入了一个由透明且可重复的测试组成的操作测试平台,每个要求对应一个测试。迄今为止,尚未发现或报告有任何非智能系统成功通过该测试平台。在出现这样的反例之前,该测试平台可作为通向通用智能的候选实证里程碑。该测试平台的参考实现已公开可用。

英文摘要

Grounded in the premise that high-level intelligence can emerge from low-level signal processing, we advance a hypothesis regarding low-level requirements necessary for artificial general intelligence. The proposed requirements characterise core properties of systems that learn through prediction over spatially and temporally structured signals with initially unknown semantic content. They include a selection of basic principles observed in cognitive neuroscience, from learning from an uninformed state to real-time liveness. To enable empirical testing and hypothesis rejection, we introduce an operational testbed composed of transparent and reusable tests, one per requirement. To date, no non-intelligent system has been identified or reported as successfully passing the testbed. Pending such a counterexample, the testbed serves as a candidate empirical milestone toward general intelligence. The reference implementation of the testbed is publicly available.

2601.19448 2026-06-01 cs.LG cs.CR

From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Data-Free Online Backdoor Defense

从内部诊断到外部审计:一种VLM驱动的数据自由在线后门防御范式

Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang

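AI summary: Proposes a paradigm shift from internal diagnosis to external semantic auditing, using a general-purpose vision-language model (VLM) as an external semantic gatekeeper; the PRISM framework (Prototype Refinement & Inspection via Statistical Monitoring) delivers data-free online backdoor defense with state-of-the-art performance across 17 datasets and 11 attack types.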

Comments 25 pages, 10 figures, 19 tables. To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML '26)

详情
AI中文摘要

深度神经网络本质上仍然容易受到后门攻击。传统的测试时防御主要在内部诊断方法的范式下运作,如模型修复或输入鲁棒性,但这些方法在高级攻击下往往脆弱,因为它们仍然与受害模型的受损参数纠缠在一起。我们提出从内部诊断到外部语义审计的范式转变,认为有效的防御需要通过一个独立的、基于语义的审计器将安全性与受害模型解耦。为此,我们提出了一个框架,利用通用视觉语言模型(VLM)作为不断演化的语义门控。我们引入了PRISM(原型精炼与统计监控检查),通过两个关键机制克服通用VLM的领域差距:一个混合VLM教师动态在线精炼视觉原型,以及一个由统计边界监控驱动的自适应路由器实时校准门控阈值。在17个数据集和11种攻击类型上的广泛评估表明,PRISM实现了最先进的性能,在CIFAR-10上将攻击成功率抑制到<1%,同时提高了干净准确率,为模型无关的外部化安全建立了新标准。

英文摘要

Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely operate under the paradigm of internal diagnosis methods like model repairing or input robustness, yet these approaches are often fragile under advanced attacks as they remain entangled with the victim model's corrupted parameters. We propose a paradigm shift from Internal Diagnosis to External Semantic Auditing, arguing that effective defense requires decoupling safety from the victim model via an independent, semantically grounded auditor. To this end, we present a framework harnessing Universal Vision-Language Models (VLMs) as evolving semantic gatekeepers. We introduce PRISM (Prototype Refinement & Inspection via Statistical Monitoring), which overcomes the domain gap of general VLMs through two key mechanisms: a Hybrid VLM Teacher that dynamically refines visual prototypes online, and an Adaptive Router powered by statistical margin monitoring to calibrate gating thresholds in real-time. Extensive evaluation across 17 datasets and 11 attack types demonstrates that PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to <1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.

2601.19220 2026-06-01 cs.LG

Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization

加速多Wasserstein梯度流用于多目标分布优化

Dai Hai Nguyen, Duc Dung Nguyen, Atsuyoshi Nakamura, Hiroshi Mamitsuka

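AI summary: Proposes an accelerated Multiple Wasserstein Gradient Descent algorithm (A-MWGraD) that applies Nesterov acceleration to multi-objective distributional optimization, achieving convergence rates of $O(1/t^2)$ for geodesically convex objectives and $O(e^{-\sqrt{\beta}\,t})$ for $\beta$-strongly geodesically convex objectives, compared with the $O(1/t)$ rate of MWGraD.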

Comments ICML 2026

详情
AI中文摘要

我们研究了Wasserstein空间中概率分布的多目标优化。最近,Nguyen等人(2025)提出了多Wasserstein梯度下降(MWGraD)算法,该算法利用Wasserstein空间的几何结构来联合优化多个目标。基于这种方法,我们提出了一种加速变体A-MWGraD,其灵感来自Nesterov加速。我们分析了连续时间动力学,并建立了在概率空间中收敛到弱帕累托最优点。我们的理论结果表明,对于测地凸目标,A-MWGraD达到O(1/t^2)的收敛速度;对于β-强测地凸目标,达到O(e^{-√βt})的收敛速度,在测地凸设置下优于MWGraD的O(1/t)速率。我们进一步引入了A-MWGraD的实用基于核的离散化,并通过数值实验证明,在多目标采样任务中,它在收敛速度和采样效率上始终优于MWGraD。

英文摘要

We study multi-objective optimization over probability distributions in Wasserstein space. Recently, Nguyen et al. (2025) introduced the Multiple Wasserstein Gradient Descent (MWGraD) algorithm, which exploits the geometric structure of Wasserstein space to jointly optimize multiple objectives. Building on this approach, we propose an accelerated variant, A-MWGraD, inspired by Nesterov's acceleration. We analyze the continuous-time dynamics and establish convergence to weakly Pareto optimal points in probability space. Our theoretical results show that A-MWGraD achieves a convergence rate of $O(1/t^2)$ for geodesically convex objectives and $O(e^{-\sqrt{\beta}\,t})$ for $\beta$-strongly geodesically convex objectives, improving upon the $O(1/t)$ rate of MWGraD in the geodesically convex setting. We further introduce a practical kernel-based discretization for A-MWGraD and demonstrate through numerical experiments that it consistently outperforms MWGraD in convergence speed and sampling efficiency on multi-target sampling tasks.
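To give a feel for the gap between the two rates, the sketch below compares plain gradient descent with Nesterov's accelerated scheme on a single ill-conditioned convex quadratic in Euclidean space. This is only a Euclidean, single-objective analogue chosen for illustration; it is not A-MWGraD, and the test function, step size, and iteration count are assumptions.

```python
from typing import List, Sequence


def f(x: Sequence[float], a: Sequence[float]) -> float:
    """Convex quadratic f(x) = 0.5 * sum_i a_i * x_i^2 with minimum value 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x: Sequence[float], a: Sequence[float]) -> List[float]:
    return [ai * xi for ai, xi in zip(a, x)]


def gradient_descent(x0, a, step, iters):
    x = list(x0)
    for _ in range(iters):
        x = [xi - step * gi for xi, gi in zip(x, grad(x, a))]
    return x


def nesterov(x0, a, step, iters):
    x_prev, x = list(x0), list(x0)
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + 2)  # standard momentum schedule for convex f
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, x = x, [yi - step * gi for yi, gi in zip(y, grad(y, a))]
    return x


a = [1.0, 0.01]   # curvatures; smoothness constant is max(a) = 1
x0 = [1.0, 1.0]
f_gd = f(gradient_descent(x0, a, step=1.0, iters=50), a)
f_nag = f(nesterov(x0, a, step=1.0, iters=50), a)
```

After the same number of iterations the accelerated iterate reaches a lower objective value, in line with the $O(1/t^2)$ versus $O(1/t)$ guarantees for smooth convex functions.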

2510.05115 2026-06-01 cs.AI cs.CL cs.PL

SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling

SAC-Opt:优化建模中用于迭代修正的语义锚点

Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma

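AI summary: Proposes the SAC-Opt framework, which uses semantic anchor alignment and selective correction to improve the semantic fidelity of LLM-generated optimization modeling code without additional training, raising average modeling accuracy by 7.7%.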

Comments ICML 2026 accepted

详情
AI中文摘要

大语言模型(LLMs)通过从自然语言描述生成可执行的求解器代码,为优化建模开辟了新范式。尽管前景广阔,现有方法通常仍以求解器驱动:它们依赖单次前向生成,并基于求解器错误信息进行有限的事后修正,这留下了未被检测到的语义错误,这些错误会静默地产生语法正确但逻辑有缺陷的模型。为应对这一挑战,我们提出SAC-Opt,一种反向引导的修正框架,将优化建模建立在问题语义而非求解器反馈之上。在每一步中,SAC-Opt将原始语义锚点与从生成代码中重建的锚点对齐,并仅选择性修正不匹配的组件,从而驱动模型收敛到语义忠实的模型。这种锚点驱动的修正能够对约束和目标逻辑进行细粒度改进,在无需额外训练或监督的情况下增强忠实性和鲁棒性。在七个公共数据集上的实验结果表明,SAC-Opt将平均建模准确率提升了7.7%,在ComplexLP数据集上提升高达21.9%。这些发现强调了在基于LLM的优化工作流中,语义锚点修正对于确保从问题意图到求解器可执行代码的忠实翻译的重要性。

英文摘要

Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.
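The alignment step described above can be pictured as a diff between two sets of anchors. The sketch below assumes anchors are stored as name-to-expression mappings, which is a hypothetical representation; the abstract does not say how SAC-Opt encodes or compares anchors.

```python
from typing import Dict, List


def mismatched_anchors(original: Dict[str, str],
                       reconstructed: Dict[str, str]) -> List[str]:
    """Return names of anchors that are missing or differ between the
    problem description and the anchors rebuilt from generated code."""
    names = original.keys() | reconstructed.keys()
    return sorted(n for n in names if original.get(n) != reconstructed.get(n))


# Hypothetical anchors extracted from the problem text.
original = {
    "objective": "maximize 3x + 2y",
    "capacity": "x + y <= 10",
    "nonneg": "x, y >= 0",
}
# Hypothetical anchors reconstructed from a generated model.
reconstructed = {
    "objective": "maximize 3x + 2y",
    "capacity": "x + y <= 12",
}
to_fix = mismatched_anchors(original, reconstructed)
```

Only the components listed in `to_fix` would be regenerated in the next correction step; matching components are left untouched.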

2601.16426 2026-06-01 cs.LG

Safe Multitask Molecular Graph Networks for Vapor Pressure and Odor Threshold Prediction

用于蒸气压和气味阈值预测的安全多任务分子图网络

Shuang Wu, Meijie Wang, Lun Yu

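AI summary: Proposes a safe multitask approach with vapor pressure as the primary task and odor threshold as the auxiliary task, combining A20/E17 molecular graph features with a PNA backbone to obtain the best vapor pressure generalization under a Bemis-Murcko scaffold split.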

详情
AI中文摘要

我们研究了气味相关属性建模中的两个重要任务:蒸气压(VP)和气味阈值(OP)。为了评估模型的分布外(OOD)能力,我们采用了Bemis-Murcko骨架划分。在特征方面,我们引入了丰富的A20/E17分子图特征(20维原子特征+17维键特征),并系统比较了GINE和PNA骨干网络。结果表明:对于VP,使用简单回归头的PNA实现了验证MSE≈0.21(归一化空间);对于相同骨架划分下的OP单任务,使用A20/E17和鲁棒训练(Huber/winsor)实现了验证MSE≈0.60-0.61。对于多任务训练,我们提出了一种**“安全多任务”**方法:以VP为主任务,OP为辅助任务,使用延迟激活+梯度裁剪+小权重,这避免了对主任务的损害,同时获得了最佳的VP泛化性能。本文提供了完整的可重复实验、消融研究和误差相似性分析,同时讨论了数据噪声的影响和方法的局限性。

英文摘要

We investigate two important tasks in odor-related property modeling: Vapor Pressure (VP) and Odor Threshold (OP). To evaluate the model's out-of-distribution (OOD) capability, we adopt the Bemis-Murcko scaffold split. In terms of features, we introduce the rich A20/E17 molecular graph features (20-dimensional atom features + 17-dimensional bond features) and systematically compare GINE and PNA backbones. The results show: for VP, PNA with a simple regression head achieves Val MSE $\approx$ 0.21 (normalized space); for the OP single task under the same scaffold split, using A20/E17 with robust training (Huber/winsor) achieves Val MSE $\approx$ 0.60-0.61. For multitask training, we propose a **"safe multitask"** approach: VP as the primary task and OP as the auxiliary task, using delayed activation + gradient clipping + small weight, which avoids harming the primary task and simultaneously yields the best VP generalization performance. This paper provides complete reproducible experiments, ablation studies, and error-similarity analysis while discussing the impact of data noise and method limitations.
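The three ingredients named for the safe multitask recipe (delayed activation, gradient clipping, small auxiliary weight) can be sketched as below. The start epoch, weight, and clipping norm are hypothetical values; the abstract does not report the settings actually used.

```python
import math
from typing import List, Sequence


def aux_weight(epoch: int, start_epoch: int = 10, weight: float = 0.1) -> float:
    """Delayed activation: the auxiliary (OP) loss is off until start_epoch,
    then enters with a small fixed weight."""
    return weight if epoch >= start_epoch else 0.0


def combined_loss(vp_loss: float, op_loss: float, epoch: int) -> float:
    """Primary VP loss plus the weighted auxiliary OP loss."""
    return vp_loss + aux_weight(epoch) * op_loss


def clip_by_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


early = combined_loss(0.2, 0.6, epoch=0)    # auxiliary task still inactive
late = combined_loss(0.2, 0.6, epoch=10)    # auxiliary task active
clipped = clip_by_norm([3.0, 4.0])
```

Keeping the auxiliary weight at zero early on lets the primary task settle before the noisier odor-threshold signal can influence the shared backbone.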

2601.16366 2026-06-01 cs.LG cs.SC

Post-Training Neural Network Pruning using Graph Curvature

使用图曲率的训练后神经网络剪枝

Shuhang Tan, Jayson Sia, Paul Bogdan, Radoslav Ivanov

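AI summary: Introduces neural curvature (NC), based on Ollivier-Ricci curvature (ORC), which computes edge curvature under activation patterns to identify unimportant connections in a neural network, enabling efficient pruning.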

详情
AI中文摘要

本文通过图论的视角为神经网络(NN)剪枝问题提供了新的视角。为了实现有效的剪枝,我们旨在识别主要的NN数据流以及相应的NN连接,这些连接对于完整模型的性能最重要和最不重要。与基于信息论的NN数据分析标准方法不同,我们采用了图曲率的概念,特别是Ollivier-Ricci曲率(ORC)。ORC已成功用于识别各种领域中的重要图边,如道路交通分析、生物网络和社交网络。特别是,具有负ORC的边被认为是瓶颈,因此对图的整体连通性至关重要,而正ORC的边则不那么重要。我们将这种直觉用于NN:(1)构建由NN结构诱导的图,并基于ORC引入神经曲率(NC)的概念;(2)根据一组输入示例的激活模式计算曲率;(3)证明NC可用于根据边对整体NN功能的重要性对边进行排序。我们通过在三个图像数据集(MNIST、CIFAR-10和CIFAR-100)上训练的各种中小型模型上进行剪枝实验来评估我们的方法。结果表明,与现有剪枝方法相比,我们的方法可以识别出更多不重要的边。

英文摘要

This paper provides a fresh view of the neural network (NN) pruning problem through the lens of graph theory. To achieve effective pruning, we aim to identify the main NN data flows and the corresponding NN connections that are most and least important for the performance of the full model. Unlike the standard approach to NN data flow analysis, which is based on information theory, we employ the notion of graph curvature, specifically Ollivier-Ricci curvature (ORC). ORC has been successfully used to identify important graph edges in various domains such as road traffic analysis, biological networks, and social networks. In particular, edges with negative ORC are considered bottlenecks and are therefore critical to the graph's overall connectivity, whereas positive-ORC edges are less essential. We use this intuition for NNs to (1) construct a graph induced by the NN structure and introduce the notion of neural curvature (NC) based on ORC; (2) calculate curvatures based on activation patterns for a set of input examples; and (3) demonstrate that NC can be used to rank edges according to their importance for overall NN functionality. We evaluate our method through pruning experiments on a variety of small and medium size models trained on three image datasets: MNIST, CIFAR-10, and CIFAR-100. The results indicate that our method can identify a larger number of unimportant edges compared to existing pruning methods.

2601.13704 2026-06-01 cs.SD cs.AI cs.LG eess.AS

Performance and Complexity Trade-off Optimization of Speech Models During Training

训练过程中语音模型的性能与复杂度权衡优化

Esteban Gómez, Tom Backström

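AI summary: Proposes a reparameterization technique based on feature noise injection that uses SGD-based methods to jointly optimize the performance and computational complexity of speech models during training, allowing model size to be adjusted dynamically.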

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

在语音机器学习中,神经网络模型通常通过选择具有固定层大小和结构的架构来设计。这些模型随后被训练以最大化与任务目标相关的性能指标。虽然整体架构通常由任务的先验知识指导,但各层的大小往往是启发式选择的。然而,这种方法并不能保证性能与计算复杂度之间的最优权衡;因此,通常采用权重量化或模型剪枝等后处理方法以降低计算成本。这是因为随机梯度下降(SGD)方法只能优化可微函数,而影响计算复杂度的因素(如层大小和每秒浮点运算次数(FLOP/s))是不可微的,需要在训练过程中修改模型结构。我们提出了一种基于特征噪声注入的重新参数化技术,使得在训练过程中能够使用基于SGD的方法联合优化性能和计算复杂度。与传统的剪枝方法不同,我们的方法允许模型大小针对目标性能-复杂度权衡进行动态优化,而无需依赖启发式标准来选择要移除的权重或结构。我们通过三个案例研究证明了我们方法的有效性,包括一个合成示例和两个实际应用:语音活动检测和音频反欺骗。与我们的工作相关的代码已公开,以鼓励进一步研究。

英文摘要

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.

2411.04073 2026-06-01 cs.RO cs.CC cs.MA

A Two-Stage Reactive Auction Framework for the Multi-Depot Rural Postman Problem with Dynamic Vehicle Failures

面向动态车辆故障的多仓库农村邮差问题的两阶段反应式拍卖框架

Eashwar Sathyamurthy, Jeffrey W. Herrmann, Shapour Azarm

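AI summary: To handle task disruptions caused by vehicle failures in the multi-depot rural postman problem, proposes a two-stage real-time rescheduling framework that combines a centralized auction with a peer auction, cutting rescheduling time from hours to seconds while preserving solution quality.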

详情
AI中文摘要

尽管无人车车队在运输、物流和巡检中提供了效率,但它们对故障的敏感性对任务连续性构成了重大挑战。我们研究了带有可充电和可重复使用车辆的多仓库农村邮差问题(MD-RPP-RRV),其中放置在多个仓库、具有容量约束的无人充电车辆在为基于弧的需求服务时可能发生故障。为了解决运行中意外的车辆故障,我们提出了一种两阶段实时重调度框架。首先,集中式拍卖快速生成可行的重调度方案;对于此阶段,我们推导了一个理论加性界,为最坏情况下的重调度惩罚提供了分析保证。其次,对等拍卖通过一个针对问题的磁场路由器对局部调度进行修复,以细化基线方案,该路由器利用通过敏感性分析校准的参数,确保计算增长可控。我们将该方法与模拟退火元启发式算法进行基准比较,以评估解质量和执行速度。在257个不同故障场景上的实验结果表明,与元启发式基线相比,该框架实现了平均运行时间减少超过95%,将重调度时间从小时级缩短到秒级,同时保持高质量的解。两阶段框架在大规模实例上表现出色,在近80%的场景中优于集中式拍卖,平均解改进超过12%。此外,它在59%和28%的场景中分别优于模拟退火的平均结果和最佳结果,为实时任务连续性提供了所需的鲁棒速度-质量权衡。

英文摘要

Although unmanned vehicle fleets offer efficiency in transportation, logistics and inspection, their susceptibility to failures poses a significant challenge to mission continuity. We study the Multi-Depot Rural Postman Problem with Rechargeable and Reusable Vehicles (MD-RPP-RRV) with vehicle failures, where unmanned rechargeable vehicles placed at multiple depots with capacity constraints may fail while serving arc-based demands. To address unexpected vehicle breakdowns during operation, we propose a two-stage real-time rescheduling framework. First, a centralized auction quickly generates a feasible rescheduling solution; for this stage, we derive a theoretical additive bound that establishes an analytical guarantee on the worst-case rescheduling penalty. Second, a peer auction refines this baseline through a problem-specific magnetic field router for local schedule repair, utilizing parameters calibrated via sensitivity analysis to ensure controlled computational growth. We benchmark this approach against a simulated annealing metaheuristic to evaluate solution quality and execution speed. Experimental results on 257 diverse failure scenarios demonstrate that the framework achieves an average runtime reduction of over 95% relative to the metaheuristic baseline, cutting rescheduling times from hours to seconds while maintaining high solution quality. The two-stage framework excels on large-scale instances, surpassing the centralized auction in nearly 80% of scenarios with an average solution improvement exceeding 12%. Moreover, it outperforms the simulated annealing mean and best results in 59% and 28% of scenarios, respectively, offering the robust speed-quality trade-off required for real-time mission continuity.

2601.09633 2026-06-01 cs.CL

TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion

TaxoBell: 用于自监督分类体系扩展的高斯盒嵌入

Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty

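AI summary: Proposes TaxoBell, a Gaussian box embedding framework that maps between box geometries and multivariate Gaussian distributions to address asymmetric relation modeling, unstable gradients, semantic uncertainty, and polysemy, improving MRR by 19% and Recall@k by about 25% on five benchmark datasets.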

Comments Accepted in The Web Conference (WWW) 2026

详情
AI中文摘要

分类体系构成了跨领域结构化知识表示的骨干,支持电子商务和语义搜索等应用。然而,手动分类体系扩展既费力又缓慢。现有方法依赖于基于点的向量嵌入,这些嵌入建模对称相似性,因此难以处理分类体系中基本的非对称关系。盒嵌入通过支持包含和不相交性提供了一种有前景的替代方案,但它们面临关键问题:(i) 在交集边界处梯度不稳定,(ii) 没有语义不确定性的概念,(iii) 表示多义性或歧义的能力有限。我们通过TaxoBell解决了这些缺点,这是一个高斯盒嵌入框架,在盒几何与多元高斯分布之间进行转换,其中均值编码语义位置,协方差编码不确定性。基于能量的优化实现了稳定优化、对模糊概念的鲁棒建模以及可解释的层次推理。在五个基准数据集上的大量实验表明,TaxoBell在MRR上显著优于八种最先进的分类体系扩展基线19%,在Recall@k上约25%。我们进一步通过错误分析和消融研究展示了TaxoBell的优势和缺陷。

英文摘要

Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce and semantic search. Yet, manual taxonomy expansion is labor-intensive and slow. Existing methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experiments on five benchmark datasets demonstrate that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
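To illustrate why Gaussian representations suit asymmetric parent-child relations, the sketch below computes the KL divergence between two diagonal Gaussians. The abstract says only that the optimization is energy-based; using KL divergence as the energy, the diagonal covariance, and the toy concepts are assumptions made here.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu0: Sequence[float], var0: Sequence[float],
                      mu1: Sequence[float], var1: Sequence[float]) -> float:
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) )."""
    total = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        total += v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0)
    return 0.5 * total


# Hypothetical concepts: a specific child (narrow) and a general parent (broad).
child_mu, child_var = [0.0], [0.1]
parent_mu, parent_var = [0.0], [1.0]

child_in_parent = kl_diag_gaussians(child_mu, child_var, parent_mu, parent_var)
parent_in_child = kl_diag_gaussians(parent_mu, parent_var, child_mu, child_var)
```

The narrow child distribution fits inside the broad parent at low cost, while the reverse direction is penalized heavily, so the score is directional in a way that a symmetric point-embedding similarity is not.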

2502.12119 2026-06-01 cs.CV cs.AI cs.CL

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

PRISM:免训练多模态数据选择的自剪枝内在选择方法

Jinhe Bi, Aniri, Zengjie Jin, Yifan Wang, Danqi Yan, Wenke Huang, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

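AI summary: To address redundancy in visual instruction data for multimodal large language models, proposes PRISM, a training-free framework that uses implicit re-centering to remove the global semantic drift caused by anisotropic visual features, enabling efficient data selection that lowers computational cost while improving model performance.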

Comments Accepted to ACL 2026 and selected for the Best Paper list; later desk-rejected due to an inadvertent manual bibliography-editing error. Previous versions are withdrawn due to an inadvertent manual bibliography-editing error; please refer to the latest corrected version

详情
AI中文摘要

视觉指令微调使预训练的多模态大语言模型(MLLMs)能够遵循人类指令以应用于现实场景。然而,这些数据集的快速增长引入了显著的冗余,导致计算成本增加。现有的指令数据选择方法旨在修剪这种冗余,但主要依赖于计算密集型技术,如基于代理的推理或基于训练的指标。因此,这些选择过程产生的巨大计算成本往往加剧了它们本应解决的效率瓶颈,对MLLMs的可扩展和有效微调构成了重大挑战。为了解决这一挑战,我们首先发现了一个关键但先前被忽视的因素:视觉特征分布中固有的各向异性。我们发现这种各向异性引发了全局语义漂移,而忽视这一现象是限制当前数据选择方法效率的关键因素。受此启发,我们设计了PRISM,这是第一个用于高效视觉指令选择的免训练框架。PRISM通过隐式重中心化建模内在视觉语义,精确移除全局背景特征的干扰影响。实验表明,PRISM将数据选择和模型微调的端到端时间减少到传统流程的30%。更值得注意的是,它在实现这一效率的同时提升了性能,在八个多模态和三个语言理解基准上超越了在全数据集上微调的模型,最终相对于基线实现了101.7%的相对改进。代码可在 https://github.com/bibisbar/PRISM 获取。

英文摘要

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline. The code is available at https://github.com/bibisbar/PRISM.
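The effect of anisotropy on similarity scores, and what re-centering changes, can be seen in a small example. The sketch uses explicit mean subtraction as a stand-in; the abstract does not describe how PRISM's implicit re-centering is computed, and the toy feature vectors are invented for illustration.

```python
import math
from typing import List, Sequence


def center(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the mean feature vector from every feature vector."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy features sharing one dominant direction (the first coordinate),
# a hypothetical stand-in for an anisotropic embedding space.
features = [[10.0, 1.0, 0.0], [10.0, 0.0, 1.0], [10.0, -1.0, 0.0]]
raw_sim = cosine(features[0], features[1])

centered = center(features)
centered_sim = cosine(centered[0], centered[1])
```

Before centering, the shared component makes every pair look nearly identical; after centering, the similarity reflects only the sample-specific directions, which is the kind of signal a selection criterion would need.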

2601.06453 2026-06-01 cs.AI

ConSensus: Multi-Agent Collaboration for Multimodal Sensing

ConSensus:面向多模态感知的多智能体协作

Hyungjun Yoon, Mohammad Malekzadeh, Sung-Ju Lee, Fahim Kawsar, Lorena Qendro

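AI summary: Proposes ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized agents and applies a hybrid fusion mechanism, improving average accuracy by 7.1% on five benchmarks and reducing fusion token cost by 12.7 times.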

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

大型语言模型(LLMs)越来越多地基于传感器数据来感知和推理人类生理及物理世界。然而,准确解释异构多模态传感器数据仍然是一个基本挑战。我们表明,单一的整体LLM通常无法跨模态进行连贯推理,导致解释不完整和先验知识偏差。我们引入了ConSensus,一种无需训练的多智能体协作框架,将多模态感知任务分解为专门的、模态感知的智能体。为了聚合智能体级别的解释,我们提出了一种混合融合机制,该机制平衡了语义聚合(实现跨模态推理和上下文理解)与统计共识(通过跨模态一致性提供鲁棒性)。虽然每种方法都有互补的失败模式,但它们的组合能够在传感器噪声和缺失数据下实现可靠推理。我们在五个不同的多模态感知基准上评估了ConSensus,与单智能体基线相比,平均准确率提高了7.1%。此外,ConSensus匹配或超过了迭代多智能体辩论方法的性能,同时通过单轮混合融合协议将平均融合token成本降低了12.7倍,为现实世界的多模态感知任务提供了鲁棒且高效的解决方案。源代码可在https://github.com/nokia/multi-agent-collaboration-for-multimodal-sensing获取。

英文摘要

Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world. However, accurately interpreting heterogeneous multimodal sensor data remains a fundamental challenge. We show that a single monolithic LLM often fails to reason coherently across modalities, leading to incomplete interpretations and prior-knowledge bias. We introduce ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality-aware agents. To aggregate agent-level interpretations, we propose a hybrid fusion mechanism that balances semantic aggregation, which enables cross-modal reasoning and contextual understanding, with statistical consensus, which provides robustness through agreement across modalities. While each approach has complementary failure modes, their combination enables reliable inference under sensor noise and missing data. We evaluate ConSensus on five diverse multimodal sensing benchmarks, demonstrating an average accuracy improvement of 7.1% over the single-agent baseline. Furthermore, ConSensus matches or exceeds the performance of iterative multi-agent debate methods while achieving a 12.7 times reduction in average fusion token cost through a single-round hybrid fusion protocol, yielding a robust and efficient solution for real-world multimodal sensing tasks. The source code is available at https://github.com/nokia/multi-agent-collaboration-for-multimodal-sensing.
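One simple way to combine statistical consensus with semantic aggregation is sketched below: accept the majority label from the modality agents when agreement is strong, otherwise fall back to the label produced by a semantic aggregator. The gating rule, the `min_agreement` threshold, and the example labels are assumptions; the abstract does not specify how ConSensus balances the two signals.

```python
from collections import Counter
from typing import Sequence


def hybrid_fuse(agent_labels: Sequence[str],
                semantic_label: str,
                min_agreement: float = 0.6) -> str:
    """Return the majority label if enough agents agree, else the label
    from the semantic aggregator (for example, an LLM summarizing agents)."""
    if not agent_labels:
        return semantic_label
    label, count = Counter(agent_labels).most_common(1)[0]
    if count / len(agent_labels) >= min_agreement:
        return label
    return semantic_label


# Hypothetical per-modality predictions (e.g. accelerometer, gyroscope, audio).
strong = hybrid_fuse(["walking", "walking", "sitting"], semantic_label="sitting")
weak = hybrid_fuse(["walking", "sitting", "running"], semantic_label="sitting")
```

Because the fusion runs once over the agents' outputs, it needs a single round of aggregation rather than repeated debate turns.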