arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.19858 2026-05-25 cs.CL

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou

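Affiliations: Singapore Management University; Tongyi Speech Lab; Dalian University of Technology

AI summary: As Speech Large Language Models (Speech LLMs) are widely deployed in voice applications, ensuring their robustness to manipulative or adversarial input has become especially important. This paper introduces a new class of adversarial attack, the gaslighting attack, which uses carefully crafted prompts to mislead model reasoning and thereby evaluate the vulnerability of Speech LLMs, and it proposes five manipulation strategies for testing robustness across tasks. Experiments show that the five strategies reduce model accuracy by 24.3% on average, exposing significant behavioral vulnerabilities in current speech AI systems and the need for better robustness and reliability.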

Comments 5 pages, 2 figures, 3 tables

Abstract

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

2509.06896 2026-05-25 cs.LG stat.ML

Are Targeted Data Poisoning Attacks as Effective as We Think?

William Xu, Chenyu Zhang, Yihan Wang, Matthew Y. R. Yang, Zuoqiu Liu, Gautam Kamath, Yaoliang Yu, Yiwei Lu

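Affiliations: Waabi AI; University of Waterloo; Carnegie Mellon University; Google; Vector Institute; University of Ottawa

AI summary: This paper examines how effective targeted data poisoning attacks really are, noting that existing evaluations rely on randomly selected target samples and so fail to reflect worst-case attack effectiveness. The authors argue that evaluation should focus on the hardest samples to poison and, using only clean-model information, propose a way to identify both the easiest and the hardest samples to poison, enabling more rigorous worst-case evaluation and proactive defense.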

Abstract

Targeted data poisoning attacks manipulate model predictions on specific test samples by injecting malicious data into training. Yet existing evaluations report average attack success rates over randomly selected targets, obscuring true worst-case effectiveness. We argue that the right evaluation focuses on the hardest samples to poison. The same reasoning applies to defense: since targeted attacks leave no footprint at the distribution level, defenders should proactively identify the most vulnerable samples and apply targeted countermeasures. Given a test dataset, this paper identifies both the easiest and hardest to poison examples based on only clean model information. Specifically, we offer coarse evaluations using clean training dynamics, and fine-grained classification on poison class using poison distances and budgets. Our experiments show these metrics reliably stratify samples by poisoning vulnerability, enabling both rigorous worst-case evaluation and proactive vulnerability-aware defense.

2508.13663 2026-05-25 cs.AI cs.LG

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

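Affiliations: Translational AI Laboratory, Department of Laboratory Medicine; Amsterdam University Medical Center, Vrije Universiteit Amsterdam; Accenture Labs; Delft University of Technology; ELLIS Institute Finland & Abo Akademi University, Turku, Finland & Elsevier Discovery Lab, Amsterdam

AI summary: This paper studies interactive query answering on knowledge graphs with soft entity constraints, targeting real-world queries whose constraints are vague or context-dependent. The authors propose two efficient methods that learn soft constraints by tuning a small number of parameters or training a small neural network, improving the relevance of query results without disrupting the original ranking structure. Experiments show that the methods incorporate user preferences while maintaining the original query answering performance, offering a more flexible way to interact with knowledge graphs.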

Comments Accepted in Transactions on Machine Learning Research (2026)

Abstract

Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order-logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.

2508.12247 2026-05-25 cs.LG cs.AI

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu

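Affiliations: Shenzhen Loop Area Institute

AI summary: This paper proposes STM3, a deep learning model for long-term spatio-temporal time-series prediction that addresses the difficulties of extracting multiscale information and modeling spatial dependencies. STM3 combines a Multiscale Mamba architecture with a Disentangled Mixture-of-Experts (DMoE) framework and adds an adaptive graph causal network to capture complex spatio-temporal dependencies efficiently. A stable routing strategy and a causal contrastive learning strategy support robust representation learning and keep the scales distinguishable; experiments show superior predictive performance on multiple real-world datasets.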

Comments Accepted by KDD 2026

Abstract

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

2506.20537 2026-05-25 cs.LG

Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Melt Pool Dynamics in Laser Powder Bed Fusion

R. Sharma, Y. B. Guo

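Affiliations: Dept. of Mechanical and Aerospace Engineering, Rutgers University-New Brunswick; New Jersey Advanced Manufacturing Institute, Rutgers University-New Brunswick

AI summary: To address the high computational cost of simulating melt pool dynamics in Laser Powder Bed Fusion (LPBF), this study proposes an FEA-regulated physics-informed neural network (FEA-PINN) framework that speeds up simulation while preserving accuracy. By introducing a strategy for capturing dynamic phase change and a correction mechanism that enforces physical consistency, the method mitigates the accuracy degradation that conventional physics-informed neural networks suffer in time-dependent problems. The comparison shows that FEA-PINN significantly reduces computational cost while achieving accuracy comparable to FEA.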

Comments Further investigation revealed that the current version reflects an incomplete formulation and limited validation of the proposed method. We have since developed a substantially revised and extended study with updated assumptions and results, and therefore withdraw this version to prevent citation of superseded findings

Abstract

Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process prediction because traditional numerical methods such as finite element analysis (FEA) remain computationally expensive. While a Physics-Informed Neural Network (PINN) can predict solution fields from little training data and generalizes to new process parameters via transfer learning, it suffers from accuracy degradation in time-dependent problems due to the accumulation of residuals and the difficulty of capturing the steep spatial and temporal gradients inherent in the LPBF process. To overcome this issue, this study develops an efficient modeling framework, FEA-Regulated Physics-Informed Neural Network (FEA-PINN), to accelerate the prediction of melt pool dynamics in an LPBF process while maintaining FEA accuracy. The innovation of FEA-PINN lies in two aspects. First, a novel strategy has been developed within the PINN model to capture the dynamic powder-liquid-solid phase change, enabling the tracking of material status during laser melting. The model further incorporates temperature-dependent material properties, phase change behavior of the powder bed, Marangoni convection, and natural convection within the melt pool. Second, the FEA-PINN framework integrates corrective FEA simulations during inference to enforce physical consistency, reduce error drift, and capture the steep gradients. A comparative analysis shows that FEA-PINN achieves accuracy comparable to FEA while significantly reducing computational cost. The framework has been validated against benchmark FEA data for single-track scanning in LPBF.

2506.14135 2026-05-25 cs.RO cs.CV

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

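Affiliations: Tsinghua University; Beijing Normal University; Shadow AI

AI summary: This paper proposes a 4D representation based on the Gaussian Action Field (GAF) for dynamic world modeling in robotic manipulation. GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, enabling 4D modeling of dynamic scenes and manipulation actions. The method reasons about actions directly from motion-aware 4D representations and improves manipulation accuracy through three related outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action. Experiments show that GAF outperforms existing methods in both reconstruction quality and manipulation success rate.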

Comments https://ChaiYing1.github.io/projects/GAF/

Abstract

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements: GAF improves reconstruction quality by +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, and raises the average success rate in robotic manipulation tasks by 7.3% over state-of-the-art methods.

2506.05438 2026-05-25 cs.LG cs.AI

An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

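Affiliations: School of Data Science; Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong; College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China

AI summary: This paper proposes an unsupervised framework that requires no expert knowledge for constructing a dynamic health indicator (HI), with the aim of improving degradation trend modeling and remaining life prediction for rolling bearings. The method extracts degradation features automatically with a skip-connection-based autoencoder and, in the resulting feature space, introduces an HI-generating module with an embedded inner prediction block that explicitly models the temporal dependence between HI states, capturing the dynamic information in the degradation process. Experiments on two bearing lifecycle datasets show that the proposed dynamic HI outperforms existing methods and markedly improves prognostic performance.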

Abstract

Health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation and prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. The experiment results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.

2502.13731 2026-05-25 cs.AI

Robust Counterfactual Inference in Markov Decision Processes

Jessica Lally, Milad Kazemi, Nicola Paoletti

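Affiliations: King's College London

AI summary: This paper addresses a key limitation of existing counterfactual inference methods for Markov Decision Processes (MDPs) and proposes a new non-parametric approach. Conventional methods rely on a specific causal model to identify counterfactuals, yet many causal models are consistent with the observational and interventional distributions, and they yield different counterfactual distributions. By computing tight bounds on counterfactual transition probabilities across all compatible causal models, the paper provides an efficient and scalable method for counterfactual inference and, on that basis, derives robust counterfactual policies that optimise the worst-case reward. Experiments on several case studies show improved robustness.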

Abstract

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
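
The last step of the abstract, optimising the worst-case reward over interval transition probabilities, can be illustrated with a minimal Python sketch of robust value iteration on an interval MDP. This is a generic textbook-style construction, not the authors' implementation: the interval bounds are assumed to be given (the paper's closed-form bounds are not reproduced here), and the names `worst_case_expectation` and `robust_value_iteration`, the discount factor, and the reward layout are all illustrative assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on one transition probability


def worst_case_expectation(intervals: List[Interval], values: List[float]) -> float:
    """Minimise sum_i p_i * values[i] with lower_i <= p_i <= upper_i and sum_i p_i = 1.

    Assumes the intervals are consistent: sum(lower) <= 1 <= sum(upper).
    """
    probs = [lo for lo, _ in intervals]
    slack = 1.0 - sum(probs)
    # Give the remaining probability mass to the lowest-value successors first.
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        add = min(intervals[i][1] - probs[i], slack)
        probs[i] += add
        slack -= add
    return sum(p * v for p, v in zip(probs, values))


def robust_value_iteration(
    bounds: List[List[List[Interval]]],  # bounds[s][a][s_next] (hypothetical layout)
    rewards: List[List[float]],          # rewards[s][a]
    gamma: float = 0.9,
    iters: int = 200,
) -> List[float]:
    """Value of the policy that maximises the worst-case discounted reward."""
    values = [0.0] * len(bounds)
    for _ in range(iters):
        values = [
            max(
                rewards[s][a] + gamma * worst_case_expectation(bounds[s][a], values)
                for a in range(len(bounds[s]))
            )
            for s in range(len(bounds))
        ]
    return values
```

The inner minimisation is solved greedily, which is why each sweep stays cheap: sorting successors by value costs O(n log n) per state-action pair, with no linear program required.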

2501.08222 2026-05-25 cs.RO

Data-driven Spatial Classification using Multi-Arm Bandits for Monitoring with Energy-Constrained Mobile Robots

Xiaoshan Lin, Siddharth Nayak, Stefano Di Cairano, Abraham P. Vinod

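Affiliations: Aerospace Engineering and Mechanics department, University of Minnesota; Aeronautics and Astronautics department, Massachusetts Institute of Technology; Mitsubishi Electric Research Laboratories (MERL)

AI summary: This paper studies spatial classification for environmental monitoring with cooperating mobile robots, where the goal is to quickly divide a search area into interesting and uninteresting regions. It proposes a bi-level strategy built on a multi-armed bandit framework: a high-level planner selects the regions to visit from data collected online, and a low-level planner coordinates paths via integer programming, while accounting for sensor noise and energy constraints. The approach shows good classification efficiency and task completion performance in both simulation and physical robot experiments.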

Comments 8 pages, 6 figures. See https://www.youtube.com/watch?v=gzulpOcVYzg for an overview of the approach along with videos of the hardware experiments

Abstract

We consider the spatial classification problem for monitoring using data collected by a coordinated team of mobile robots. Such classification problems arise in several applications including search-and-rescue and precision agriculture. Specifically, we want to classify the regions of a search environment into interesting and uninteresting as quickly as possible using a team of mobile sensors and mobile charging stations. We develop a data-driven strategy that accommodates the noise in sensed data and the limited energy capacity of the sensors, and generates collision-free motion plans for the team. We propose a bi-level approach, where a high-level planner leverages a multi-armed bandit framework to determine the potential regions of interest for the drones to visit next based on the data collected online. Then, a low-level path planner based on integer programming coordinates the paths for the team to visit the determined regions subject to the physical constraints. We characterize several theoretical properties of the proposed approach, including anytime guarantees and task completion time. We show the efficacy of our approach in simulation, and further validate these observations in physical experiments using mobile robots.
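
As a rough illustration of how a bandit-style high-level planner might pick the next region from noisy measurements, the Python sketch below uses the classic UCB1 rule. The abstract does not state which bandit algorithm the authors use, so UCB1, the exploration constant `c`, and the function names `ucb_select` and `update` are assumptions made for illustration; energy constraints and path planning are left out entirely.

```python
import math
from typing import List


def ucb_select(counts: List[int], means: List[float], t: int, c: float = 2.0) -> int:
    """Pick the region (arm) with the highest upper confidence bound.

    counts[i]: number of measurements taken in region i
    means[i]:  running mean of the noisy measurements in region i
    t:         total number of measurements so far (t >= 1)
    """
    # Visit every region once before trusting any estimate.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update(counts: List[int], means: List[float], region: int, observation: float) -> None:
    """Fold one new noisy observation into the running mean of a region."""
    counts[region] += 1
    means[region] += (observation - means[region]) / counts[region]
```

In a monitoring loop, `ucb_select` would propose the next region, the low-level planner would route a robot there, and `update` would record the measurement it brings back.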

2412.19098 2026-05-25 cs.LG

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Aecheon Jung, Seunghwan Lee, Dongyoon Han, Sungeun Hong

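Affiliations: Sungkyunkwan University; NAVER AI Lab

AI summary: SyMerge is a lightweight model merging framework that aims to achieve synergy between tasks through single-layer adaptation, rather than merely avoiding task interference. It jointly optimizes the merging coefficients and a single task-specific layer and introduces an expert-guided self-labeling objective, which improves the stability and performance of merging. The study shows that SyMerge can successfully merge models trained from different initializations and achieves state-of-the-art results on vision, dense prediction, and NLP benchmarks.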

Comments Accepted at ICML 2026

Abstract

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. Our code is available at https://aim-skku.github.io/SyMerge
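
To show what "merging coefficients" refers to in practice, here is a minimal Python sketch of coefficient-weighted merging in the task-arithmetic style, where each expert contributes its difference from a shared base model. Whether SyMerge parameterises the merge exactly this way is an assumption; the function name `merge`, the dictionary-of-lists weight layout, and the fixed coefficients are hypothetical, and the learned task-specific layer and self-labeling objective from the abstract are not modelled.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (hypothetical layout)


def merge(base: Weights, experts: List[Weights], coeffs: List[float]) -> Weights:
    """Return base + sum_k coeffs[k] * (experts[k] - base), parameter by parameter."""
    merged: Weights = {}
    for name, base_values in base.items():
        merged[name] = [
            w + sum(lam * (expert[name][i] - w) for lam, expert in zip(coeffs, experts))
            for i, w in enumerate(base_values)
        ]
    return merged


base = {"w": [1.0, 2.0]}
experts = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
print(merge(base, experts, [0.5, 0.5]))  # {'w': [1.5, 3.0]}
```

In a learned setting the entries of `coeffs` would be trainable, often one per layer or per task, rather than constants chosen by hand.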

2412.14642 2026-05-25 cs.CL

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Yatao Bian, Dongzhan Zhou, Xiao-yong Wei, Qing Li

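Affiliations: Hong Kong Polytechnic University; Shanghai Jiao Tong University; Shanghai AI Lab; National University of Singapore

AI summary: Large Language Models (LLMs) have recently shown great potential in natural language-driven molecule discovery, but existing datasets and benchmarks are built mainly on one-to-one mappings: they measure only a model's ability to retrieve a single predefined answer and overlook its creative capacity to generate diverse, valid molecular candidates. The authors therefore propose S²-Bench, the first benchmark for evaluating LLMs in open-domain natural language-driven molecule generation; it is designed for one-to-many relationships and challenges models to demonstrate genuine molecular understanding and open-ended generation. The study also introduces OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass strong LLMs such as GPT-4o and Claude-3.5 on the benchmark, and its comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design.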

Comments Accepted by KDD 2026. Our codes and datasets are fully accessible through the https://github.com/phenixace/S2-TOMG-Bench and https://huggingface.co/datasets/phenixace/S2-TOMG-Bench

Abstract

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our codes and datasets are fully accessible through the Github Repository: https://github.com/phenixace/S2-TOMG-Bench and Huggingface Datasets: https://huggingface.co/datasets/phenixace/S2-TOMG-Bench.

2406.02883 2026-05-25 cs.LG cs.CR

Nonlinear Transformations Against Unlearnable Datasets

Thushari Hapuarachchi, Jing Lin, Kaiqi Xiong, Mohamed Rahouti, Gitte Ost

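Affiliations: University of South Florida; Fordham University

AI summary: This paper studies how nonlinear transformations enable deep learning models to learn from "unlearnable" datasets that were conventionally considered impossible to learn from. The authors propose an effective nonlinear transformation framework and show through extensive experiments that deep neural networks can learn effectively from unlearnable data generated by a range of data protection methods, clearly outperforming a recently proposed linear separable technique. The results show improved model performance across the unlearnable CIFAR10 datasets and reveal that existing protection methods are inadequate for preventing unauthorized use of data, so more robust protection mechanisms are urgently needed.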

Abstract

Automated scraping stands out as a common method for collecting data for deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. Notable approaches include Deepconfuse, error-minimizing, error-maximizing (also known as adversarial poisoning), Neural Tangent Generalization Attack, synthetic, autoregressive, One-Pixel Shortcut, Self-Ensemble Protection, Entangled Features, Robust Error-Minimizing, Hypocritical, and TensorClog. The data generated by those approaches, called "unlearnable" examples, are meant to prevent deep learning models from "learning". In this research, we investigate and devise an effective nonlinear transformation framework and conduct extensive experiments to demonstrate that a deep neural network can effectively learn from the data/examples traditionally considered unlearnable produced by the above twelve approaches. The resulting approach improves the ability to break unlearnable data compared to the linear separable technique recently proposed by researchers. Specifically, our extensive experiments show that the improvement ranges from 0.34% to 249.59% for the unlearnable CIFAR10 datasets generated by those twelve data protection approaches, except for One-Pixel Shortcut. Moreover, the proposed framework achieves over 100% improvement in test accuracy for the Autoregressive and REM approaches compared to the linear separable technique. Our findings suggest that these approaches are inadequate for preventing unauthorized uses of data in machine learning models. There is an urgent need to develop more robust protection mechanisms that effectively prevent an attacker from using data without proper authorization from the owners.

2403.12401 2026-05-25 cs.CV

RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

Yunjie Xu, Xiang Feng, Chengkai Wang, Alan Wee-Chung Liew, Xuefei Yin, Yanming Zhu

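Affiliations: Ningbo University; Hangzhou Dianzi University; Griffith University

AI summary: This paper proposes RT-NeRV, a new hybrid neural video representation that addresses the difficulty existing methods have in preserving fine details at low bitrates. The core idea is residual tokenization: shallow residual features and inter-frame residual information are discretized into compact residual tokens, so that this information can be transmitted efficiently and used for reconstruction. The method designs a residual tokenizer and a residual-aware codebook learning strategy, which improve reconstruction quality and training stability, and it outperforms existing hybrid NeRV methods on video regression and restoration tasks.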

Comments Under Review

Abstract

Neural Representations for Videos (NeRV) have emerged as a promising paradigm for video compression by representing videos as compact neural networks with efficient decoding. Hybrid NeRV methods further improve reconstruction quality through content-adaptive embeddings, but still struggle to preserve fine details at low bitrates. A key limitation is that shallow residual support information, although highly beneficial for reconstruction, is costly to transmit in its continuous form and is therefore underutilized. In this paper, we rethink hybrid NeRV and present RT-NeRV, a residual tokenization framework for hybrid neural video representations. The core idea is to discretize shallow residual features and inter-frame residual cues into compact residual tokens, allowing informative reconstruction support to be transmitted efficiently and exploited by the decoder. To this end, we design a residual tokenizer together with a residual-aware codebook learning strategy that improves token utilization and stabilizes training. RT-NeRV can be readily integrated into modern hybrid NeRV hosts, consistently enhancing detail preservation, reconstruction quality, and bitrate-quality trade-offs. Extensive experiments on video regression and related restoration tasks show that RT-NeRV outperforms strong hybrid NeRV baselines and remains competitive with recent INR-based video compression methods. These results demonstrate that residual tokenization is an effective and complementary direction for advancing hybrid neural video representations.
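
The idea of discretizing continuous residual features into tokens can be sketched with plain nearest-neighbour vector quantization, shown below in Python. The abstract does not describe the actual residual tokenizer or its codebook learning strategy, so this is only a generic stand-in under that assumption; the function names `tokenize` and `detokenize` and the toy codebook are hypothetical.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(residuals: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each residual vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(r, codebook[k]))
        for r in residuals
    ]


def detokenize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Recover approximate residual vectors from token indices."""
    return [codebook[t] for t in tokens]
```

Only the integer tokens need to be transmitted; with a codebook of K entries each token costs about log2(K) bits, compared with a full floating-point vector for the continuous residual.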

2605.23673 2026-05-25 cs.LG

Relevant Walk Search for Explaining Graph Neural Networks

用于解释图神经网络的相关游走搜索

Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

发表机构 * BIFOLD - Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所) Google Research, Brain team, Berlin(谷歌研究院,柏林 Brain 团队) Department of Artificial Intelligence, Korea University, Seoul 136-713, Korea(高丽大学人工智能系,韩国首尔 136-713) RIKEN Center for AIP, Japan(日本理化学研究所 AIP 中心)

AI总结 本文研究了图神经网络(GNN)的可解释性问题,提出了一种高效寻找关键路径(walk)的方法,用于揭示网络中的重要信息流动。针对现有基于层间相关性传播(GNN-LRP)方法计算复杂度高、难以应用于大规模网络的问题,作者设计了多项式时间算法,能够在保证解释精度的同时大幅提升计算效率。实验表明,该方法在多个实际应用领域中表现良好,具有广泛的应用价值。

Comments Published in ICML 2023

Journal ref Proceedings of the 40th International Conference on Machine Learning, PMLR 202:38301-38324, 2023

详情
AI中文摘要

图神经网络(GNN)已成为图分析的重要机器学习工具,其可解释性对于安全性、公平性和鲁棒性至关重要。GNN的逐层相关性传播(GNN-LRP)评估游走的相关性以揭示网络中的重要信息流,并提供高阶解释,已被证明优于低阶(即节点/边级)解释。然而,通过GNN-LRP识别相关游走需要相对于网络深度的指数级计算复杂度,本文将对这一问题进行改进。具体来说,我们提出了多项式时间算法来寻找前K个相关游走,这大大减少了计算量,从而提高了GNN-LRP在大规模问题上的适用性。我们提出的算法基于最大积算法(一种在概率图模型中寻找最大似然配置的常用工具),并且可以在神经元级别精确地找到最相关的游走,在节点级别近似地找到。我们的实验展示了我们的算法在规模上的性能及其在应用领域(即流行病学、分子和自然语言基准)中的实用性。我们在 https://github.com/xiong-ping/rel_walk_gnnlrp 上提供代码。

英文摘要

Graph Neural Networks (GNNs) have become important machine learning tools for graph analysis, and their explainability is crucial for safety, fairness, and robustness. Layer-wise relevance propagation for GNNs (GNN-LRP) evaluates the relevance of walks to reveal important information flows in the network, and provides higher-order explanations, which have been shown to be superior to lower-order (i.e., node- or edge-level) explanations. However, identifying relevant walks with GNN-LRP requires computation that is exponential in the network depth, which we remedy in this paper. Specifically, we propose polynomial-time algorithms for finding the top-K relevant walks, which drastically reduce the computation and thus increase the applicability of GNN-LRP to large-scale problems. Our proposed algorithms are based on the max-product algorithm (a common tool for finding the maximum likelihood configurations in probabilistic graphical models) and can find the most relevant walks exactly at the neuron level and approximately at the node level. Our experiments demonstrate the performance of our algorithms at scale and their utility across application domains, namely epidemiology, molecular, and natural language benchmarks. Our code is available at https://github.com/xiong-ping/rel_walk_gnnlrp.
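To make the exponential-versus-polynomial contrast concrete, here is a minimal sketch of a max-product (Viterbi-style) search for the single best walk. It rests on a simplifying assumption that is not taken from the paper: the relevance of a walk is the product of nonnegative per-layer transition scores, stored in hypothetical matrices `scores[l][i, j]`. The real GNN-LRP relevances can be signed and the paper's algorithms also return the top-K walks, so this only illustrates the dynamic-programming idea.

```python
import itertools
import numpy as np

def best_walk_max_product(scores):
    """Best walk by dynamic programming; cost grows linearly with depth.

    scores: list of L arrays of shape (n, n) with nonnegative entries, where
    scores[l][i, j] is the (assumed) score of stepping from node i to node j
    at layer l. Returns (walk, value) with len(walk) == L + 1.
    """
    n = scores[0].shape[0]
    best = np.ones(n)   # best[j]: max product over partial walks ending at node j
    back = []           # back[l][j]: best predecessor of node j at step l
    for S in scores:
        cand = best[:, None] * S          # cand[i, j]: extend best walk at i by i -> j
        back.append(cand.argmax(axis=0))
        best = cand.max(axis=0)
    walk = [int(best.argmax())]
    for ptr in reversed(back):            # follow back-pointers from the last node
        walk.append(int(ptr[walk[-1]]))
    walk.reverse()
    return walk, float(best.max())

def best_walk_brute_force(scores):
    """Reference search over all n**(L+1) walks; exponential in depth."""
    n, depth = scores[0].shape[0], len(scores)
    best_walk, best_val = None, -1.0
    for walk in itertools.product(range(n), repeat=depth + 1):
        val = 1.0
        for l, S in enumerate(scores):
            val *= S[walk[l], walk[l + 1]]
        if val > best_val:
            best_walk, best_val = list(walk), val
    return best_walk, best_val
```

On small random inputs the two functions agree on the best value, while only the first remains practical as depth grows.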

2605.23672 2026-05-25 cs.CV

RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

RiGS: 从单目视频中的刚性感知4D高斯泼溅

Chenyu Wu, Wanhua Li, Zhu-Tian Chen, Hanspeter Pfister

发表机构 * Harvard University(哈佛大学) Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学) University of Minnesota - Twin Cities(明尼苏达大学-双城分校)

AI总结 从单目视频重建动态3D场景是一项基础但极具挑战性的任务,因为现实中的运动往往包含长期平滑变换和短期复杂形变。本文提出了一种名为RiGS的刚性感知四维高斯泼溅方法,能够同时捕捉多时间尺度的运动信息。该方法引入了三种高斯基元,分别用于表示静态背景、长期低频运动和短期高频动态,并通过对象级动态掩码聚合长距离时空运动信息,指导静态与动态区域的分解。实验表明,RiGS在新视角合成任务中取得了最先进的性能。

详情
AI中文摘要

从单目视频重建动态3D场景是一项基本但极具挑战性的任务,因为现实世界的运动通常涉及长期平滑变换和短期复杂变形。现有方法要么难以保持时间一致性,要么由于运动建模能力有限而无法捕捉高频动态。在这项工作中,我们提出了刚性感知4D高斯泼溅(RiGS),它同时捕捉多个时间尺度上的运动。具体来说,RiGS引入了三种类型的高斯原语:静态、刚性和瞬态,分别表示静态背景、长期低频运动和短期高频动态。提出了一种对象级动态掩码来聚合长距离时空运动信息,并指导静态和动态区域的分解。为了联合建模跨尺度的运动,允许刚性高斯根据其时间持续期转变为瞬态高斯,并且两者都在场景流引导下进行优化,提供密集的3D运动监督。大量实验表明,RiGS在新视角合成基准测试中达到了最先进的性能。代码可在 https://github.com/ladvu/RiGS 获取。

英文摘要

Reconstructing dynamic 3D scenes from monocular videos is a fundamental yet highly challenging task, as real-world motions often involve both long-term smooth transformations and short-term complex deformations. Existing methods either struggle to maintain temporal consistency or fail to capture high-frequency dynamics due to limited motion modeling capacity. In this work, we present Rigid-aware 4D Gaussian Splatting (RiGS), which simultaneously captures motions across multiple temporal scales. Specifically, RiGS introduces three types of Gaussian primitives: static, rigid, and transient, which represent static backgrounds, long-term low-frequency motions, and short-term high-frequency dynamics, respectively. An object-wise dynamic mask is proposed to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. To jointly model motion across scales, rigid Gaussians are allowed to transition into transient Gaussians based on their temporal duration, and both are optimized under scene flow guidance, providing dense 3D motion supervision. Extensive experiments demonstrate that RiGS achieves state-of-the-art performance on novel view synthesis benchmarks. Code is available at https://github.com/ladvu/RiGS.

2605.23668 2026-05-25 cs.CL cs.AI

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

OnePred: 多轮对话中基于递归意图记忆的下一个查询预测

Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang

发表机构 * Tsinghua University(清华大学) Qwen Applications Business Group of Alibaba(阿里巴巴Qwen应用业务组)

AI总结 该研究提出了 OnePred,一种用于多轮对话中预测用户下一条查询的模型,旨在使对话系统更具主动性。其核心方法是通过递归更新的意图记忆来捕捉用户意图的演变,从而在不依赖完整对话历史的情况下实现高效且准确的预测。该方法通过两阶段强化学习训练模型,既学习预测内容又优化信息压缩,显著降低了计算成本并提升了预测性能。研究还发布了 NQP-Bench 基准数据集,实验表明 OnePred 在保持预测质量的同时,相比传统方法减少了高达 22 倍的计算开销。

详情
AI中文摘要

尽管大语言模型(LLM)对话系统每天处理数百万次多轮对话,但它们本质上仍是被动的:仅在用户输入查询后才响应。迈向主动交互的关键一步是下一个查询预测,即仅根据之前的对话预测用户后续的查询。该任务的进展受到缺乏专用基准以及基本效率-质量权衡的阻碍:简单拼接完整对话历史会导致线性增长的token消耗,而截断至最新一轮则会丢弃关键的跨轮上下文。我们的关键见解是,准确预测不需要重新阅读原始历史;只需跟踪用户跨主题、未解决需求和兴趣转移的不断演变的意图轨迹即可。我们提出OnePred,它维护一个递归更新的记忆作为唯一的跨轮上下文,将每轮成本限制为与对话长度无关。我们通过两阶段强化学习流程训练模型,首先教导预测什么,然后教导压缩什么,将记忆塑造成面向预测的意图链。为了建立严格的测试平台,我们引入了NQP-Bench,涵盖三个不同的子集。实验表明,与完整历史输入相比,OnePred将每轮token消耗减少高达22倍,同时在预测质量上持续超过所有基线,在较长对话中增益更大。我们的代码可在https://github.com/ZBWpro/OnePred公开获取。

英文摘要

Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency-quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22× compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
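The efficiency argument above can be illustrated with simple token accounting. The sketch below is not OnePred itself: the function names, the fixed `memory_budget`, and the turn lengths are illustrative assumptions. It only contrasts input size per turn when the full history is concatenated with input size when a bounded memory plus the latest turn is read.

```python
def full_history_cost(turn_lengths):
    """Tokens read at each turn when the whole dialogue so far is concatenated."""
    costs, total = [], 0
    for n in turn_lengths:
        total += n              # history grows by one turn each step
        costs.append(total)
    return costs

def recursive_memory_cost(turn_lengths, memory_budget):
    """Tokens read at each turn when only a bounded memory plus the latest turn is input.

    memory_budget is an assumed fixed cap on the memory's token length.
    """
    return [memory_budget + n for n in turn_lengths]

if __name__ == "__main__":
    turns = [200] * 20          # hypothetical dialogue: 20 turns of 200 tokens
    print(full_history_cost(turns)[-1])           # cost at the last turn, full history
    print(recursive_memory_cost(turns, 150)[-1])  # cost at the last turn, bounded memory
```

The gap between the two widens with every additional turn, which matches the abstract's observation that gains are larger on longer conversations.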

2605.23656 2026-05-25 cs.CV

Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

递归块对角耦合用于视觉模型的资源高效训练

Maxim Henry, Adrien Deliège, Sébastien Piérard, Marc Van Droogenbroeck

发表机构 * Montefiore Institute, University of Liège(蒙费尔研究所,列日大学)

AI总结 本文提出了一种名为RBDC的高效训练方法,通过递归地以无参数的块对角方式耦合多个窄模型,从而构建出宽模型,实现了对训练资源的灵活分配。该方法在ImageNet数据集上与从头训练的标准方法相比,在保持相似测试精度的情况下减少了30%的计算量,并在相同计算量下取得了优于现有模型增长方法的性能。此外,RBDC训练的模型在下游目标检测和实例分割任务中也表现出更优的性能。

Comments 22 pages, 3 figures, 4 tables, and 34 references

详情
AI中文摘要

从头训练高容量视觉模型需要大量计算资源。为了提高宽目标模型的训练效率,现有的增长方法通常假设存在更窄的模型,从而掩盖了整个流程的真实计算成本。我们提出了一种高效的训练协议RBDC,该协议通过递归方式以无参数块对角耦合独立训练的窄模型来构建宽模型。这允许灵活分配所有涉及模型的可用训练预算。在ImageNet上使用视觉变换器(DeiT)和卷积网络(ResNet)进行评估,我们的RBDC训练协议显示出比标准协议从头训练的模型更好的效率,在相似测试精度下实现了30%的FLOPs减少。与模型增长文献中的训练协议相比,它在相同训练FLOPs下也实现了更高的性能。最后,我们展示了我们的模型可以作为比原始模型更好的下游目标检测和实例分割任务的主干网络。

英文摘要

Training high-capacity vision models from scratch requires substantial computational resources. To improve the training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by recursively coupling narrower, independently trained models in a parameter-free block-diagonal way. This allows a flexible allocation of the available training budget across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol is markedly more efficient than training from scratch with the standard protocol, yielding a 30% reduction in FLOPs at similar test accuracies. It also achieves higher performance at the same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
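As a rough illustration of what a parameter-free block-diagonal coupling means for a single linear layer, the sketch below places two narrow weight matrices on the diagonal of a wider one. This is an assumption-laden toy: the abstract does not say how RBDC treats input stems, attention, normalization, or classification heads, and the helper name `block_diag_couple` is hypothetical.

```python
import numpy as np

def block_diag_couple(w_a, w_b):
    """Couple two linear layers into one wider layer without adding parameters.

    w_a: (out_a, in_a) weights of the first narrow model's layer.
    w_b: (out_b, in_b) weights of the second narrow model's layer.
    The off-diagonal blocks start at zero, so the wide layer initially
    computes both narrow layers side by side.
    """
    out_a, in_a = w_a.shape
    out_b, in_b = w_b.shape
    wide = np.zeros((out_a + out_b, in_a + in_b), dtype=w_a.dtype)
    wide[:out_a, :in_a] = w_a
    wide[out_a:, in_a:] = w_b
    return wide

def recursive_couple(weights):
    """Couple a list of 2**m narrow layers pairwise until one wide layer remains."""
    while len(weights) > 1:
        weights = [block_diag_couple(weights[i], weights[i + 1])
                   for i in range(0, len(weights), 2)]
    return weights[0]
```

Right after coupling, applying the wide layer to the concatenated inputs gives the concatenation of the narrow outputs; further training is then free to fill in the zero off-diagonal blocks.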

2605.23655 2026-05-25 cs.CV cs.AI cs.LG cs.MM

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

CVSearch:赋予多模态大语言模型认知视觉搜索能力以感知高分辨率图像

Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(深圳先进技术研究院)

AI总结 高分辨率图像感知是多模态大语言模型面临的关键瓶颈。为解决视觉搜索中覆盖性与效率之间的矛盾,本文提出CVSearch,一种无需训练的自适应框架,通过“评估-搜索”流程动态调度搜索策略。该方法在全局信息不足时采用专家辅助搜索,失败时触发语义感知的扫描机制,有效减少物体碎片化,并通过动态自底向上搜索策略提升局部细节的探索效率。实验表明,CVSearch在高分辨率基准上实现了最先进的准确率和显著提升的搜索效率。

Comments Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

详情
AI中文摘要

高分辨率图像感知是多模态大语言模型的一个关键瓶颈。虽然视觉搜索提供了有希望的解决方案,但现有方法在覆盖率和效率之间难以权衡。视觉专家辅助搜索效率高,但当提议失败时容易出现盲点,而基于扫描的搜索以计算冗余和语义碎片化为代价保证了覆盖率。为了解决这一困境,我们引入了CVSearch,一种无需训练的自适应框架,通过评估-搜索工作流动态调度搜索策略。具体来说,CVSearch首先在全局信息不足时调用专家辅助搜索,仅在失败时触发一种新颖的语义感知扫描机制。与刚性网格划分不同,这种高效扫描范式结合了语义引导的自适应补丁,将图像分解为语义一致的区域,有效缓解了物体碎片化。此外,我们设计了一种由视觉复杂性先验驱动的动态自底向上搜索策略,以实现对局部细节的高效且精确的迭代探索。在高分辨率基准上的大量实验表明,CVSearch在显著提高搜索效率的同时实现了最先进的准确性。代码已发布在https://github.com/liliupeng28/ICML26-CVSearch。

英文摘要

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.

2605.23653 2026-05-25 cs.CV

ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction

ExpOS: 基于3D手部重建的可解释开放式手术技能评估

Roi Papo, Idan Smoller, Shlomi Laufer

发表机构 * Faculty of Data and Decision Sciences, Technion – Israel Institute of Technology, Haifa, 3200003, Israel(数据与决策科学学院,技术学院–以色列理工学院,海法,3200003,以色列)

AI总结 本文提出ExpOS,一种基于3D手部重建的可解释开放手术技能评估框架,旨在实现自动化的、以反馈为导向的手术训练评估。该方法通过从手术视频中提取手部姿态和工具检测信息,学习具有判别力的时间模式,并利用时空卷积网络和注意力机制生成帧级重要性图,从而预测技能水平并提供可解释的反馈。实验表明,ExpOS在多个手术任务中与专家评分具有高度相关性,尤其在筋膜闭合任务中表现优异,展示了其在可扩展性和实用性方面的潜力。

Comments 10 pages, 4 figures

详情
AI中文摘要

及时且透明的反馈对于有效的手术培训至关重要,但目前的评估仍然依赖于专家观察,限制了可扩展性和自主实践的机会。我们提出了ExpOS,一个用于数据驱动的开放式手术技能评估的可解释框架,旨在实现自动化的、面向反馈的评估。ExpOS不依赖于专家定义的指标,而是直接从运动数据中学习判别性时间模式,并识别出最能预测技能水平的片段和行为。我们在221名医学生执行三项开放式手术任务的视频上训练和评估了该方法。从每一帧中提取手部姿态和工具检测,以推导运动学描述符和全局运动统计。使用时间卷积骨干网络和基于注意力的池化对时空手-工具动态进行建模,生成帧级重要性图。这些表示与全局运动统计融合,以预测技能水平并提供可解释的反馈。ExpOS通过注意力权重识别信息事件发生的时间,并通过全局特征分析确定哪些运动特征对预测影响最大,从而提供多层级可解释性。在各项任务中,该框架与专家评分实现了强相关性,在筋膜闭合任务上表现最佳(r = 0.778, R2 = 0.74)。这些结果表明,将弱监督时间重要性学习与可解释运动统计相结合,能够实现可扩展且可操作的手术技能评估。

英文摘要

Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R² = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.

2605.23652 2026-05-25 cs.AI

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

一个策略,无限NPC:用于可扩展游戏智能体的可追溯共享RL策略

Yoosung Hong

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究提出了一种名为 pcsp 的共享强化学习策略,用于实现可扩展的游戏 NPC 控制,能够根据自由形式的人格描述生成具有个性特征且可控的行为。该方法基于冻结的 LLM 嵌入进行条件化,并结合了多种技术如低秩投影和一致性训练目标,以确保人格一致性与行为多样性。实验表明,pcsp 在零样本人格识别、语义行为对齐和推理速度等方面显著优于现有方法,并在实际游戏引擎中验证了其有效性与稳定性。

Comments 18 pages, 15 figures, 14 tables

详情
AI中文摘要

在300人生活模拟基准上,pcsp实现了组合零样本角色识别,准确率比随机高17倍,Spearman相关系数约0.73的语义-行为对齐,推理速度比LLM作为策略的基线快22倍。生活模拟游戏需要数百到数千个非玩家角色(NPC),这些角色具有一致的个性,同时通过设计师编写的自然语言保持可控。现有方法在个性一致性、可控性或实时推理等约束下失败。我们引入了pcsp(个性条件共享策略),这是一种单一的强化学习策略,以自由形式个性描述的冻结LLM嵌入为条件。pcsp结合了每个NPC一次的个性编码、低秩个性投影、神经个性调节以及PPO + InfoNCE一致性 + KL多样性训练目标。在三个实验设置中,消融实验表明InfoNCE轨迹一致性目标是关键:移除它会导致零样本角色识别降至随机水平。在Melting Pot 2.4.0子任务上的外部验证证实,我们的方法在多智能体战略环境中产生了基于个性的行为差异。我们区分了两种保留评估的含义:组合零样本和词汇扩展保留。最后,在UE5部署中,以64个智能体在引擎内重现了基于个性的消融实验,故障率低,表明子帧推理轮廓在商业游戏引擎中得以保留。这些结果证明,共享RL策略可以支持可扩展、实时、基于个性的NPC控制。

英文摘要

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman ρ ≈ 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona-Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.

2605.23645 2026-05-25 cs.LG cs.AI

Learning Through Noise: Why Subliminal Learning Works and When It Fails

通过噪声学习:为什么潜意识学习有效以及何时失败

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

发表机构 * Max Planck Institute for Dynamics and Self-Organization(马克斯·普朗克动态与自组织研究所) Faculty of Physics, Institute for the Dynamics of Complex Systems, University of Göttingen(哥廷根大学物理系,复杂系统动力学研究所)

AI总结 本文研究了人工神经网络中的“潜意识学习”现象,即通过任务无关的输入-输出对进行知识蒸馏时,学生模型从教师模型中隐式学习任务相关知识或偏差的机制。研究发现,这一过程并不依赖于教师与学生模型的初始化一致性,而是由输出头的兼容性所决定。通过控制实验,作者展示了即使在随机初始化、网络结构变化等情况下,学生模型仍能通过兼容的辅助输出头从教师模型中学习有用信息,并在特定条件下达到与教师相当的任务性能。该研究为潜意识学习提供了理论解释,并明确了其适用范围与失效条件。

详情
AI中文摘要

在人工神经网络的背景下,潜意识学习指的是通过任务无关的输入-输出对的蒸馏,将任务相关知识或意外偏差从教师模型传递到学生模型。先前的解释将这种效应归因于共享或紧密匹配的教师-学生初始化。我们表明,紧密匹配的初始化并非必要。相反,潜意识学习由兼容的输出头控制。使用受控的MNIST设置,我们将输出分为辅助头(用于辅助的、任务无关的噪声信号)和分类头(用于分类),以证明潜意识学习发生——即使我们随机初始化隐藏层并移除层、添加新层或更改架构(MLP到CNN)。兼容的辅助头能够传递可恢复的教师信号,使学生的表示更接近教师的表示。当分类头也保持兼容时,仅训练于任务无关噪声的学生可以接近,并且在有利情况下达到教师级别的任务性能。我们的设置使我们能够发展一种理论来解释潜意识学习的机制,并推导出潜意识学习失败时的上界。总之,我们的结果将潜意识学习从一种令人惊讶的迁移效应转变为具有可预测限制的理论基础机制。

英文摘要

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs. Prior explanations tie this effect to shared or closely matched teacher-student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers, remove layers, add new layers, or change the architecture (MLP to CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

2605.23634 2026-05-25 cs.CV cs.AI

DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

DualMem: 绕过目标性瓶颈以实现开放世界目标检测中校准的未知流过滤

Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

发表机构 * School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院) School of Computer Science and Cyber Engineering, Guangzhou University(广州大学计算机科学与网络工程学院) Institute of Computing Science and Technology, Guangzhou University(广州大学计算科学与技术研究院)

AI总结 开放世界目标检测(OWOD)需要检测器既能定位已知类别,又能识别未知对象以支持未来的增量学习。本文发现当前强OWOD检测器的未知预测流中背景误检比例过高,问题根源在于对象性头的信息瓶颈。为此,作者提出DualMem,一种基于冻结SigLIP特征空间的校准后处理过滤器,通过非参数似然比检验实现对未知对象的筛选,有效提升了未知对象识别的准确性,同时保持已知类别检测性能不变。

详情
AI中文摘要

开放世界目标检测(OWOD)要求检测器定位已知类别,同时识别未知对象以进行未来的增量学习。我们发现,强OWOD检测器的未知预测流受到严重污染:在M-OWODB上,对于PROB、OW-DETR和HypOW,未来任务的正未知样本仅占未知预测的不到10%,而背景假阳性则占46-71%。我们表明,这不是信息缺失问题,而是目标性头部的信息瓶颈。在PROB任务1上,对256维解码器查询的线性探针在正负未知区分上达到了0.908的AUROC,但最终的一维目标性标量降至0.642。一个冻结的SigLIP特征,无需访问检测器,在过滤阶段独立恢复了大部分这种提议级别的可分离性(AUROC = 0.871)。基于这一发现,我们提出DualMem,一种校准的后验过滤器,它假设一个小的、图像不相交的、标注了未来任务对象的校准分割,并在冻结的SigLIP特征空间中执行非参数似然比检验。DualMem使用k近邻正记忆来保护未来任务对象,并使用负记忆来抑制类似背景的提议。其决策阈值通过Neyman-Pearson校准选择,为用户提供了假未知抑制与新奇召回之间的显式权衡。在M-OWODB任务1上的PROB、OW-DETR和HypOW中,DualMem将每幅图像的背景型假未知提议减少了44.9%-66.3%,平均减少56.6%。在PROB任务1上,它使自然K-means原型基线的减少量翻倍以上,同时保持已知类别的mAP不变,因为已知检测绕过过滤器。

英文摘要

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
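The filtering step described above (a non-parametric likelihood ratio over positive and negative memories, with a threshold set by Neyman-Pearson calibration) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a textbook k-nearest-neighbour density-ratio estimate, random vectors stand in for frozen SigLIP features, and the function names are hypothetical.

```python
import numpy as np

def knn_distance(x, memory, k):
    """Distance from each row of x to its k-th nearest neighbour in memory."""
    d = np.linalg.norm(x[:, None, :] - memory[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k - 1]

def log_likelihood_ratio(x, pos_memory, neg_memory, k=5):
    """kNN density-ratio estimate of log p_pos(x) - log p_neg(x), up to a constant.

    A kNN density estimate scales as 1 / r_k**dim, so the log-ratio is
    dim * (log r_neg - log r_pos); the memory-size term is a constant offset.
    """
    dim, eps = x.shape[1], 1e-12
    r_pos = knn_distance(x, pos_memory, k)
    r_neg = knn_distance(x, neg_memory, k)
    return dim * (np.log(r_neg + eps) - np.log(r_pos + eps))

def calibrate_threshold(pos_scores, target_recall):
    """Threshold that keeps at least target_recall of held-out positive scores.

    This fixes the miss rate on positives first; suppression of negatives is
    then whatever that threshold achieves (the Neyman-Pearson viewpoint).
    """
    ordered = np.sort(np.asarray(pos_scores))
    n_drop = int(np.floor((1.0 - target_recall) * len(ordered)))
    n_drop = min(n_drop, len(ordered) - 1)
    return float(ordered[n_drop])

def keep_unknown(x, pos_memory, neg_memory, threshold, k=5):
    """Boolean mask of proposals that survive the filter."""
    return log_likelihood_ratio(x, pos_memory, neg_memory, k) >= threshold
```

Raising `target_recall` lowers the threshold, which protects more future-task objects at the cost of letting more background-like proposals through.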

2605.23632 2026-05-25 cs.LG

Valid and Expressive Copulas for Irregular Multivariate Time Series

不规则多元时间序列的有效且表达力强的Copula模型

Christian Klötergens, Tom Hanika, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi

发表机构 * Institute of Computer Science(计算机科学研究所) University of Hildesheim(希尔德斯海姆大学)

AI总结 本文提出了一种名为CopFITi的模型,用于对不规则多变量时间序列进行概率预测。该模型结合了归一化流在单变量边缘分布上的表达能力,以及高斯混合copula在联合依赖结构上的灵活性和一致性。研究首次构建了一个在边缘化上具有一致性的不规则多变量时间序列copula模型,并在联合密度建模方面达到了新的最先进水平。

详情
AI中文摘要

我们提出了CopFITi,一种用于不规则多元时间序列(IMTS)概率预测的copula模型。该模型将单变量边缘分布的归一化流的表达力与联合依赖结构的高斯混合Copula的一致性和灵活性相结合。我们的实验表明,将边缘分布与联合分布解耦的基于copula的方法,比直接拟合完整联合分布的架构能产生更好的边缘模型。通过CopFITi,我们提出了第一个通过构造实现边缘化一致性的IMTS copula,并在联合IMTS密度建模中建立了新的最优水平。

英文摘要

We introduce CopFITi, a copula model for probabilistic forecasting of irregular multivariate time series (IMTS). Our model combines the expressivity of normalizing flows for univariate marginals with the consistency and flexibility of a Gaussian Mixture Copula for the joint dependency structure. Our experiments show that copula-based approaches, which decouple the marginals from the joint, yield better marginal models than architectures that directly fit the full joint. With CopFITi, we propose the first IMTS copula that is marginalization-consistent by construction and establish a new state of the art in joint IMTS density modeling.

2605.23629 2026-05-25 cs.CV

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

DDX-TRACE: 视觉语言模型中医学诊断轨迹的基准

Jiazhen Pan, Weixiang Shen, Jun Li, Julian Canisius, Felix Bitzer, Paula Roßmüller, Jiancheng Yang, Virginie Kreutzinger, Daniel Rueckert, Benedikt Wiestler

发表机构 * Technical University of Munich(慕尼黑工业大学) TUM University Hospital(慕尼黑工业大学附属医院) Munich Center for Machine Learning(慕尼黑机器学习中心) LMU Munich(慕尼黑大学) Aalto University(阿尔托大学) Imperial College London(帝国理工学院)

AI总结 DDX-TRACE 是一个用于评估视觉语言模型在医学诊断过程中表现的基准,专注于神经放射学领域,包含211个复杂病例。该基准模拟了真实的诊断流程,模型需在有限的临床信息基础上逐步请求影像检查、更新诊断假设,并最终给出确诊结果。研究发现,传统仅评价最终答案的方法可能无法准确反映模型的诊断质量,而DDX-TRACE通过关注诊断轨迹,揭示了模型在证据获取、不确定性更新和推理能力方面的关键问题。

Comments 41 pages

详情
AI中文摘要

医学诊断并非来自完全指定的病例的单次预测。它是一个序贯工作流程:临床医生决定获取哪些证据,修订鉴别诊断,并在诊断得到充分支持时停止。大多数医学AI基准则提前揭示相关背景,仅对最终答案评分,使得无依据的正确猜测、过早闭合、低效工作流以及不良的不确定性更新变得不可见。我们引入了DDX-TRACE,一个由医生裁决的多模态神经放射学基准,在211个具有挑战性的病例中评估隐藏证据下的诊断轨迹。每个病例从有限的临床病史开始;模型以自由形式请求影像研究,在可用时接收匹配的图像包,每轮后更新概率性鉴别诊断,并以定位的最终诊断结束。评估最先进的VLM,我们发现最终诊断分数可能严重歪曲工作流质量:模型可能在没有必要证据的情况下猜测合理的诊断,请求有用的研究但误解原始图像,或者低效地获取证据同时更新不确定性不佳。受控证据变体隔离了规划、视觉证据提取和下游鉴别推理中的瓶颈。DDX-TRACE将医学AI评估从最终答案转向证据支持的诊断轨迹。

英文摘要

Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

2605.23628 2026-05-25 cs.LG

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

操纵基准测试有多难?排行榜鲁棒性的社会选择分析

Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Social Data Science Center, University of Maryland(马里兰大学社会数据科学中心) School of Computing & Communications, Lancaster University Leipzig(莱比锡兰卡斯特大学计算与通信学院)

AI总结 本文研究了在多任务基准测试中通过训练数据选择来操纵模型排名的难度问题,将其类比为社会选择理论中的选举操纵问题。作者将数据集视为选民、模型视为候选人,证明在Borda计数和平均胜率等评价指标下,基准特定训练问题属于NP难问题。此外,文章引入了实例级别的鲁棒性指标,用于衡量模型开发者需要包含多少数据集才能在排行榜上超越其他模型,并在多个基准测试中验证了不同指标下的鲁棒性差异,发现平均胜率最难被操纵。

详情
AI中文摘要

多任务基准测试已成为机器学习研究的核心支柱,但其日益增长的影响力激励了基准测试游戏——为提高特定模型的排行榜排名而采取的策略性行动。将数据集视为选民,模型视为候选人,我们将基准特定训练——在训练中包含基准数据——视为一种选举操纵形式。对于任何序数基准,选择训练数据集以使目标模型排名第一的问题对应于移位贿赂,这是计算社会选择中的一类操纵问题。利用这一识别,我们证明在Borda计数和平均胜率下,基准特定训练问题是NP难的。作为这种最坏情况视角的补充,我们引入了实例级鲁棒性,即模型开发者必须包含在训练中以使给定排行榜排名第一的最小数据集数量,并在算术平均、中位数、平均胜率和成对多数下推导出其表达式。我们在HELM下的MMLU和Open LLM排行榜下的BIG-Bench Hard(BBH)上评估了这些表达式。在两个套件中,平均胜率最难操纵:这一差距在BBH(24个任务,4507个模型)上很明显,其中位鲁棒性为22个任务(92%),而算术平均下为13个(54%),中位数和成对多数下为12个(50%)。

英文摘要

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming: strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training (the inclusion of benchmark data in training) as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.

2605.23618 2026-05-25 cs.CL

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Google Embeddings 2 与开源模型在多语言稠密检索和 RAG 系统中的基准测试

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando

发表机构 * University of Salerno(萨勒诺大学) University of Bari(巴里大学)

AI总结 本文对比了Google Embeddings 2(GE2)与五个开源模型在多语言密集检索和RAG系统中的性能,发现GE2在每个任务上均排名第一,但其延迟较高;相比之下,mE5-L在保持较高检索效果的同时具有更低的延迟,适合对响应时间有要求的应用;实验还表明,所有模型在32个token的分块长度时性能趋于饱和,而语义分块仅在16个token时带来明显提升。

Comments 9 pages, 2 figures, 5 tables. Text and evaluation code available at https://github.com/cciro94/GoogleEmbeddings2-benchmark

详情
AI中文摘要

我们将 Google Embeddings 2 (GE2)(一个由 Vertex-AI 托管的双编码器,具有 2048 令牌上下文和显式任务类型条件)与五个开源替代方案进行了基准对比:BGE-M3、E5-large、Multilingual-E5-large (mE5-L)、LaBSE 和 Paraphrase-Multilingual-MPNet (mMPNet)。评估涵盖四个 BEIR 子集、一个合成意大利语 RAG 语料库、覆盖 5 种分块大小和三种策略的分块消融实验,以及在商品 CPU 硬件上的每查询延迟。GE2 在每个任务上排名第一,达到 BEIR 平均 nDCG@10 = 0.638 和 IT-RAG-Bench nDCG@10 = 0.282,但中位延迟为 231.6 毫秒,比最快的本地模型慢约 14 倍。mE5-L 在意大利语上以 31 毫秒的延迟达到与 GE2 相差 0.003 nDCG 的性能,使其成为在亚 100 毫秒 SLA 下更优的选择。一个更惊人的发现涉及 LaBSE,尽管广泛部署于多语言场景,其在 BEIR 上的平均 nDCG@10 仅为 0.188,低于包括 mMPNet 在内的所有专用检索模型。分块实验表明,所有六个模型在我们的语料库上在 32 令牌分块时性能饱和,语义分块仅在 16 令牌时提供可衡量的增益。

英文摘要

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving a BEIR average nDCG@10 of 0.638 and an IT-RAG-Bench nDCG@10 of 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
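Since every headline number above is an nDCG@10 score, a minimal sketch of the metric may help when reading them. This is the standard linear-gain form; some toolkits use an exponential gain of 2^rel - 1 instead, and the abstract does not say which variant the authors' evaluation code uses. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)          # rank 0 gets discount log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """nDCG@k for one query.

    ranked_relevances: graded relevance of each retrieved document, in rank order.
    all_relevances: relevance of every judged document for the query; if omitted,
    the ideal ranking is built from the retrieved list only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A benchmark-level figure such as the BEIR average is then the mean of this per-query value over queries and, for the average, over datasets.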

2605.23610 2026-05-25 cs.CV cs.AI

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

EM-Vid:无需训练的以实体为中心的记忆,用于高效且一致的多镜头视频生成

Jente Vandersanden, Matheus Gadelha, Chun-Hao P. Huang, Hyeonho Jeong, Yulia Gryaditskaya

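Affiliations: Max Planck Institute for Informatics; Adobe Research

AI summary: This paper proposes EM-Vid, a training-free entity-centric memory for efficient and consistent multi-shot video generation. The method stores entity-related latent patches to separate persistent entity information from transient scene context, and combines sparse token conditioning with a structured script format to lower computational cost and improve consistency. A budgeted memory update strategy and a noise-injection mechanism additionally give fine-grained control over entity appearance and prevent leakage of irrelevant information.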

English abstract

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.
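The idea of an entity-indexed bank with a fixed budget can be sketched as follows. This is an illustration only: the abstract does not describe the update rule, so the class name `EntityMemory`, the per-entity budget, and the oldest-first eviction policy are all assumptions, and strings stand in for latent patches.

```python
from collections import defaultdict, deque

class EntityMemory:
    """Toy entity-indexed patch bank with a per-entity budget (hypothetical design)."""

    def __init__(self, budget_per_entity):
        if budget_per_entity < 1:
            raise ValueError("budget_per_entity must be at least 1")
        # A bounded deque drops its oldest item once the budget is exceeded.
        self._bank = defaultdict(lambda: deque(maxlen=budget_per_entity))

    def update(self, entity_id, patches):
        """Add patches for one entity, evicting the oldest ones if over budget."""
        self._bank[entity_id].extend(patches)

    def retrieve(self, entity_ids):
        """Return only the patches of the entities that appear in the next shot."""
        return [p for e in entity_ids for p in self._bank.get(e, ())]

memory = EntityMemory(budget_per_entity=2)
memory.update("cat", ["cat_patch_1", "cat_patch_2", "cat_patch_3"])
memory.update("dog", ["dog_patch_1"])
print(memory.retrieve(["cat"]))  # patches for "dog" are not returned
```

Indexing by entity means a shot that features only one subject conditions on that subject's patches alone, which is the separation from scene context that the abstract motivates.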

2605.23605 2026-05-25 cs.LG cs.AI cs.CL

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

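Affiliations: NVIDIA

AI summary: DiLaDiff is a variant of masked diffusion language models that targets the trade-off between sampling quality and throughput. It introduces a continuous semantic latent space learned by an auto-encoder, a latent diffusion prior over the encoder distribution, and consistency distillation of that prior into a few-step latent generator. Even without distillation, DiLaDiff is reported to outperform the masked diffusion baseline while accelerating inference, and distillation further reduces the cost of latent generation.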

English abstract

Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

2605.23603 2026-05-25 cs.LG cond-mat.dis-nn cs.AI cs.NE

Preisach Attention: A Hysteretic Model of Sequential Memory

Piotr Frydrych

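Affiliations: Faculty of Mechatronics, Warsaw University of Technology

AI summary: This paper proposes the Preisach Attention Layer (PAL), a sequence modelling architecture based on the classical Preisach hysteresis operator. PAL replaces softmax attention with binary relay operators parameterised by learned activation and deactivation thresholds and maintains a stack of local extrema as internal state. Under arbitrary-precision arithmetic, a single-layer PAL-Transformer of O(1) depth is Turing-complete, compared with the O(log n) depth required by standard hard-attention transformers. The function classes computable by PAL and by transformers are shown to be incomparable: PAL computes historical range statistics in fewer layers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing.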

Comments 24 pages, 2 tables, preprint

English abstract

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. First, a single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary-precision arithmetic, via simulation of a two-stack pushdown automaton, in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence: PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
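To make rate-independence concrete, here is a minimal sketch of a single classical relay (hysteron), the building block of the Preisach operator. It is not the PAL layer itself: the thresholds are fixed rather than learned, and the function names and sample values are illustrative assumptions. The relay's final state is the same for a sequence, for a time-stretched copy of it, and for its local extrema alone.

```python
def relay_states(xs, alpha, beta, initial=0):
    """Binary relay: switch on when x >= alpha, off when x <= beta, else hold."""
    if not beta < alpha:
        raise ValueError("need beta < alpha")
    state, states = initial, []
    for x in xs:
        if x >= alpha:
            state = 1
        elif x <= beta:
            state = 0
        states.append(state)
    return states

def local_extrema(xs):
    """Keep the first point, the last point, and every turning point in between."""
    dedup = [x for i, x in enumerate(xs) if i == 0 or x != xs[i - 1]]
    if len(dedup) <= 2:
        return dedup
    out = [dedup[0]]
    for prev, cur, nxt in zip(dedup, dedup[1:], dedup[2:]):
        if (cur - prev) * (nxt - cur) < 0:  # direction of travel reverses at cur
            out.append(cur)
    out.append(dedup[-1])
    return out

signal = [0.0, 0.2, 0.6, 0.9, 0.7, 0.5, 0.3, 0.4, 0.8]
stretched = [x for x in signal for _ in range(3)]  # same path, three times slower

final = relay_states(signal, alpha=0.8, beta=0.35)[-1]
final_stretched = relay_states(stretched, alpha=0.8, beta=0.35)[-1]
final_extrema = relay_states(local_extrema(signal), alpha=0.8, beta=0.35)[-1]
print(final, final_stretched, final_extrema)
```

Because only the turning points matter, a memory that stores extrema rather than every token loses nothing for this kind of operator, which is the intuition behind the extremum stack in the abstract.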

2605.23602 2026-05-25 cs.CV

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

Beibei Lin, Xiao Cao, Jingyuan Guo, Robby T. Tan

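Affiliations: National University of Singapore

AI summary: Existing 3D Gaussian Splatting (3DGS) methods render high-quality novel views in clear daytime scenes but perform poorly in nighttime glow regions, mainly because structural features such as textures and edges are missing. GlowGS combines a diffusion model with a Vision Foundation Model (VFM) and rests on two ideas, semantic feature generation and novel-view semantic learning, to produce implicit structural cues and optimize rendered views without ground truth. The authors report improved semantic accuracy and visual quality for 3D reconstruction of nighttime glow scenes.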

Comments Accepted by CVPR Findings 2026

English abstract

Existing 3DGS methods effectively render high-quality novel views in clear-day scenes. However, they struggle with night scenes, particularly in glow regions, due to the lack of structural features such as textures and edges, which are key cues for splatting-based reconstruction. To address this problem, we leverage a diffusion model and a Vision Foundation Model (VFM) to compensate for missing structural cues. Our method consists of two key novel ideas: semantic feature generation and novel-view semantic learning. First, semantic feature generation produces high-quality semantic features as implicit structural cues for novel views. Specifically, a diffusion model synthesizes novel views with unknown camera poses from training views, while a VFM evaluates their quality. Once high-quality novel views are identified, the VFM extracts robust features to construct the semantic feature bank. Second, novel-view semantic learning enables 3DGS to optimize rendered novel views without requiring ground truth. It achieves this by extracting semantic features from a rendered novel view, searching the feature bank for the most similar features, and minimizing their distance. This process enforces implicit structural constraints, ensuring semantically coherent, artifact-free rendered views. Extensive experiments demonstrate the effectiveness of our GlowGS in generating semantically accurate 3D views, showing significant improvements over existing methods.
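The novel-view semantic learning step (extract a feature, find its closest match in the bank, minimize the distance) can be sketched as below. The abstract does not name the distance, so cosine distance is an assumption here, as are the function names and the tiny two-dimensional vectors that stand in for VFM features.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def nearest_feature_loss(rendered_feature, feature_bank):
    """Return (index of the most similar bank entry, cosine distance to it).

    In a training loop the distance would be the quantity to minimize,
    pulling the rendered view's feature toward its nearest bank feature.
    """
    sims = [cosine_similarity(rendered_feature, f) for f in feature_bank]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, 1.0 - sims[best]

bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical bank features
index, loss = nearest_feature_loss([0.9, 0.1], bank)
print(index, round(loss, 4))
```

No ground-truth image of the novel view is needed for this loss, since the target comes from the bank rather than from a paired photograph.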