arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.11146 2026-05-25 cs.CV cs.AI

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

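AI summary: This paper proposes DiNa-LRM, a diffusion-native latent reward model that addresses the need for reward functions in preference optimization of diffusion and flow-matching models. The method performs preference learning directly on the noisy states of the diffusion process and introduces a Thurstone likelihood whose uncertainty is calibrated to the diffusion noise, improving the reward model's discriminative robustness and computational efficiency. Experiments show that DiNa-LRM substantially outperforms existing diffusion-based reward baselines on image alignment tasks, matches state-of-the-art vision-language models at lower computational cost, and improves the dynamics of preference optimization.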

Comments
Accepted by ICML 2026. Code: https://github.com/HKUST-C4G/diffusion-rm

Abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
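
As a rough illustration of the Thurstone-style likelihood mentioned in the abstract, the sketch below scores a preference pair with a probit comparison whose variance grows with the diffusion noise level. It is a minimal sketch, not the authors' implementation: the linear `noise_dependent_sigma` schedule, its `sigma_min` and `sigma_max` defaults, and all function names are assumptions made here for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def thurstone_preference_prob(r_win: float, r_lose: float,
                              sigma_win: float, sigma_lose: float) -> float:
    """P(winner preferred) under a Thurstone model with per-item uncertainty."""
    scale = math.sqrt(sigma_win ** 2 + sigma_lose ** 2)
    return normal_cdf((r_win - r_lose) / scale)


def noise_dependent_sigma(t: float, sigma_min: float = 0.1,
                          sigma_max: float = 1.0) -> float:
    """Hypothetical schedule: uncertainty grows linearly with noise level t in [0, 1]."""
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_nll(r_win: float, r_lose: float, t: float) -> float:
    """Negative log-likelihood of one preference pair scored at noise level t."""
    sigma = noise_dependent_sigma(t)
    p = thurstone_preference_prob(r_win, r_lose, sigma, sigma)
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

With equal rewards the preference probability is 0.5, and for a fixed positive reward gap the loss is larger at higher noise levels, because this sketch assumes the comparison is less certain there.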

2602.04431 2026-05-25 cs.LG cs.GT

MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

Jonathan Nöther, Adish Singla, Goran Radanovic

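AI summary: This paper studies the safe design of LLM-based multi-agent systems when some agents fail or behave adversarially. Inspired by Stackelberg security games, the authors propose MaMa, a new algorithm that automatically designs agentic systems that remain safe in the worst case through a game between a Meta-Adversary and a Meta-Agent. Experiments show that the resulting systems defend effectively against worst-case attacks and generalize well across different attack objectives and underlying LLMs.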

Abstract

LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. Inspired by Stackelberg security games, we formalize this problem as a game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary-Meta-Agent (MaMa), a novel algorithm inspired by this formalization for automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.
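
The leader-follower structure described in the abstract can be sketched as a max-min selection: the designer picks the design whose worst-case safety, over every subset of compromised agents, is highest. The sketch below replaces the paper's LLM-based adversarial search with exhaustive enumeration, and `toy_safety`, the design encoding (a set of guarded agents), and the penalty values are hypothetical choices for illustration only.

```python
from itertools import combinations


def worst_case_safety(design, agents, k, safety):
    """Best-responding adversary: compromise up to k agents to minimize safety."""
    subsets = [frozenset(c)
               for size in range(k + 1)
               for c in combinations(agents, size)]
    worst = min(subsets, key=lambda s: safety(design, s))
    return safety(design, worst), worst


def select_design(designs, agents, k, safety):
    """Designer (leader) picks the design with the best worst-case safety."""
    return max(designs, key=lambda d: worst_case_safety(d, agents, k, safety)[0])


def toy_safety(design, compromised):
    """Hypothetical score: a compromised guarded agent costs 0.1, an unguarded one 0.5."""
    penalty = sum(0.1 if agent in design else 0.5 for agent in compromised)
    return 1.0 - penalty


agents = ["planner", "coder", "reviewer"]
designs = [frozenset(), frozenset({"coder"}), frozenset(agents)]
best = select_design(designs, agents, k=1, safety=toy_safety)
```

In this toy setting the fully guarded design wins, because the other two leave at least one unguarded agent for the adversary to compromise.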

2601.21513 2026-05-25 cs.LG

Cascaded Transfer: Learning Many Tasks under Budget Constraints

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Mathilde Mougeot, Argyris Kalogeratos

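AI summary: In distributed applications such as substation-level electricity demand forecasting or federated learning, separate models must be trained for a large number of related tasks whose relationships are unknown. This paper proposes a new Cascaded Transfer Learning (CTL) paradigm that builds a tree rooted at a source task so that model parameters pass from task to task, level by level, while respecting a global training budget. The method constructs a spanning tree by minimizing an objective that combines inter-task distances with the budget constraint, producing a geometry-aware, depth-bounded transfer graph, and the paper theoretically analyzes how transfer error accumulates and attenuates along cascade paths. Experiments show that CTL achieves more accurate and more cost-effective model adaptation than existing methods across a variety of task collections, with the largest gains when the budget is tight.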

Abstract

In distributed applications, such as energy demand forecasting at the substation level or federated learning, a large number of related tasks must be learned by different models, while the exact task relationships are unknown. We propose the novel Cascaded Transfer Learning (CTL) paradigm in which model parameters cascade hierarchically through tasks organized as a rooted tree, respecting a global training budget. Starting from a source task, the tree specifies the order in which tasks are learned and refined, with the budget allocated along its branches. We design cascade mechanisms based on spanning trees that connect all tasks by minimizing an objective combining pairwise task distances and the available training budget, which yield geometry-aware and depth-bounded transfer graphs. We theoretically characterize how transfer errors accumulate and attenuate along cascade paths: errors introduced at any upstream node are contracted by every downstream refinement, and balanced tree topologies bound this accumulation. Experiments on synthetic and real many-task settings, time-series forecasting and image classification, show that CTL enables more accurate and cost-effective adaptation across large task collections than alternative approaches, with the largest gains at the tightest budgets.
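
To make the rooted-tree cascade concrete, the sketch below grows a minimum spanning tree from a source task with Prim's algorithm and records the order in which tasks are attached, which is also a valid training order (every parent precedes its children). This is a simplification under stated assumptions: the paper's objective also involves the training budget, whereas this sketch uses pairwise distances only, and the even `split_budget` rule is a hypothetical placeholder.

```python
def cascade_tree(dist, root=0):
    """Prim's algorithm: attach the closest unattached task at each step.

    Returns (parent, order): parent maps each task to the task it is
    fine-tuned from, and order lists tasks in the sequence they are attached.
    """
    n = len(dist)
    parent = {root: None}
    order = [root]
    while len(order) < n:
        _, u, v = min(
            (dist[u][v], u, v)
            for u in order
            for v in range(n)
            if v not in parent
        )
        parent[v] = u
        order.append(v)
    return parent, order


def split_budget(parent, total_budget):
    """Hypothetical even split of the global budget over non-root tasks."""
    children = [task for task, p in parent.items() if p is not None]
    return {task: total_budget / len(children) for task in children}


# Symmetric pairwise task distances for four tasks (illustrative values).
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
parent, order = cascade_tree(dist, root=0)
budget = split_budget(parent, total_budget=90.0)
```

Here the tree degenerates into the chain 0, 1, 2, 3; a depth-bounded variant would need an extra term that penalizes long paths.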

2601.14180 2026-05-25 cs.CV

Progressive $\mathcal{J}$-Invariant Self-supervised Learning for Low-Dose CT Denoising

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

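AI summary: This paper studies self-supervised learning for low-dose CT image denoising, aiming to reduce the dependence on paired normal-dose CT data. To address the low training efficiency and limited performance that existing methods suffer because of restricted receptive fields, it proposes a progressive $\mathcal{J}$-invariant self-supervised learning method that improves denoising through a step-wise blind-spot denoising mechanism and the injection of controlled noise. Experiments show that the method outperforms existing self-supervised methods on the Mayo low-dose CT dataset and matches or exceeds several supervised denoising methods.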

Abstract

Self-supervised learning has been increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to collect. However, many existing self-supervised blind-spot denoising methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields. To mitigate this issue, we propose a novel Progressive $\mathcal{J}$-invariant Learning method that maximizes the use of $\mathcal{J}$-invariance to enhance LDCT denoising performance. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained learning for denoising. Furthermore, we explicitly inject a combination of controlled Gaussian and Poisson noise during training to regularize the denoising process and mitigate overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.

2601.03715 2026-05-25 cs.LG cs.AI

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li

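AI summary: R$^3$L is a reinforcement learning method that combines language-guided exploration, pivotal credit assignment, and positive amplification to address the exploration and exploitation difficulties that arise when training the reasoning and agentic capabilities of large language models. It synthesizes high-quality trajectories through a reflect-then-retry mechanism, uses language feedback to locate errors and repair failed trajectories, updates only the diverging trajectory suffix to sharpen credit assignment, and upweights successful trajectories to stabilize training. Experiments show that R$^3$L delivers significant gains over baselines on multiple tasks while maintaining training stability.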

Abstract

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
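
The suffix-only update and the upweighting of successes can be pictured with a small masking sketch. It is illustrative only: the step-level trajectory representation, the function names, and the `alpha` factor are assumptions made here and are not taken from the released code.

```python
def divergence_index(base, retry):
    """Length of the prefix shared by a failed attempt and its retry."""
    shared = 0
    for a, b in zip(base, retry):
        if a != b:
            break
        shared += 1
    return shared


def suffix_mask(trajectory, pivot):
    """1.0 where a step receives gradient (diverging suffix), 0.0 on the shared prefix."""
    return [0.0 if i < pivot else 1.0 for i in range(len(trajectory))]


def amplified_weights(rewards, alpha=2.0):
    """Hypothetical upweighting: trajectories with positive reward get weight alpha."""
    return [alpha if r > 0 else 1.0 for r in rewards]


failed = ["open", "read", "edit_wrong", "submit"]
retried = ["open", "read", "edit_right", "test", "submit"]

pivot = divergence_index(failed, retried)   # first step where the two differ
mask = suffix_mask(retried, pivot)          # gradient only after the pivot
weights = amplified_weights([0.0, 1.0])     # failed attempt, successful retry
```

Multiplying per-step losses by `mask` and per-trajectory losses by `weights` would restrict the update to the diverging suffix while letting the successful retry dominate.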

2512.20901 2026-05-25 cs.CV

Benchmarking and Enhancing VLM for Compressed Image Understanding

Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

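AI summary: As image compression becomes ubiquitous, improving how well vision-language models (VLMs) understand compressed images grows increasingly important. This paper builds the first comprehensive benchmark for evaluating VLMs across compression codecs and tasks, analyzes the sources of the performance gap on compressed images, and finds that the gap can be mitigated simply by improving the model's generalization. Building on this, the authors propose a universal VLM adaptor that improves model performance by 10%-30% across multiple compression formats and bitrates, providing a useful reference for applying VLMs to compressed-image tasks.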

Comments
The paper is accepted by ICML 2026

Abstract

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks and encompassing over one million compressed images. Next, we analyze the source of the performance gap, attributing it to a) information loss during compression and b) generalization failure of the VLM. We visualize these gaps with concrete examples and identify that, for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images. The source code is available at https://github.com/bblgbr/CompressVLMBench.

2512.15767 2026-05-25 cs.LG cs.AI

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta

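AI summary: This study proposes a graph-neural-network-based hybrid twin framework to address the "ignorance model" that arises in physical simulation from model simplifications or unmodeled effects. By combining a physics-based model with a data-driven method, the approach uses a graph neural network to learn the missing physics from sparse spatial measurements, improving simulation accuracy and interpretability while reducing data requirements. Experiments show that the framework generalizes well and corrects effectively on nonlinear heat transfer problems across different meshes, geometries, and load positions.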

Comments
27 pages, 14 figures

Abstract

Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance with the Finite Element Method (FEM). However, these models often deviate from reality because of unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn the full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making fully data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response and can therefore be learned with significantly less data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data that measure the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this difficulty by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability while minimizing data requirements.

2512.07078 2026-05-25 cs.CV cs.LG

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li

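AI summary: Targeting the core challenges of small object detection in complex scenes, this paper proposes DFIR-DETR, which uses frequency-domain iterative refinement and dynamic feature aggregation to address the shortcomings of existing networks in attention allocation, feature upsampling, and high-frequency information retention. The method achieves significant performance gains on the NEU-DET and VisDrone datasets at low computational cost, validating its effectiveness across different detection tasks.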

Abstract

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

2511.22521 2026-05-25 cs.CV cs.AI

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Ser-Nam Lim, Rajiv Ramnath

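AI summary: DocVAL is a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA) that transfers the precise spatial reasoning of large vision-language models (VLMs) to more efficient compact models. It combines teacher-generated spatial reasoning supervision, a rule-based dual-mode validator that filters low-quality training signals, and a two-stage training procedure with iterative refinement, so that the final student model runs standalone without OCR or detection modules. Experiments show that DocVAL significantly improves the localization performance of compact models on multiple benchmarks, and the work introduces mAP as a new localization metric.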

Abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Code/Data: https://github.com/ahmad-shirazi/DocVAL
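
The abstract reports its gains in ANLS (Average Normalized Levenshtein Similarity) points. For readers unfamiliar with the metric, the sketch below follows the commonly used definition with a 0.5 threshold: each prediction scores one minus its normalized edit distance to the closest reference, or zero if that distance reaches the threshold. The lowercasing and whitespace stripping are assumptions about preprocessing, and the function names are illustrative rather than taken from the DocVAL code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Mean over questions of the best thresholded similarity to any reference."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


score = anls(["abcd", "xyz"], [["abcf"], ["abc"]])
```

In the example, the first answer is one edit away from its reference (similarity 0.75) and the second shares nothing with its reference (similarity 0), so the average is 0.375, or 37.5 ANLS points on a 0-100 scale.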

2511.15503 2026-05-25 cs.AR cs.DC cs.LG cs.PF

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

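AI summary: This paper presents DCC, a data-centric compiler for machine learning kernels on processing-in-memory (PIM) architectures, which targets the performance bottleneck caused by mismatched data layouts between the host processor and PIM cores when running memory-intensive workloads such as large language models. DCC jointly optimizes data rearrangement and compute code generation and combines a multi-layer PIM abstraction with a performance prediction model to improve execution efficiency across different PIM devices. Experiments show significant speedups on a range of machine learning kernels and on end-to-end LLM inference.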

Abstract

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
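
The layout mismatch that motivates the data rearrangement can be shown with two index mappings. This is a toy model under stated assumptions, not DCC's actual abstraction: it treats memory as `num_banks` equal-sized banks, assumes the element count divides evenly, and uses hypothetical function names.

```python
def host_layout(i, num_banks):
    """Interleaved layout: consecutive elements rotate across banks."""
    return i % num_banks, i // num_banks          # (bank, offset)


def pim_layout(i, num_banks, n):
    """Bank-local layout: each bank holds one contiguous chunk of the array."""
    chunk = n // num_banks                        # assumes n % num_banks == 0
    return i // chunk, i % chunk                  # (bank, offset)


def to_banks(n, num_banks, layout):
    """Materialize which element index lands in each bank slot."""
    banks = [[None] * (n // num_banks) for _ in range(num_banks)]
    for i in range(n):
        bank, offset = layout(i)
        banks[bank][offset] = i
    return banks


n, num_banks = 8, 4
host_view = to_banks(n, num_banks, lambda i: host_layout(i, num_banks))
pim_view = to_banks(n, num_banks, lambda i: pim_layout(i, num_banks, n))

# Elements that must move when switching from the host layout to the PIM layout.
moved = sum(host_layout(i, num_banks) != pim_layout(i, num_banks, n)
            for i in range(n))
```

Even in this eight-element case, six of the eight elements change position, which suggests why a compiler that ignores rearrangement cost can pick a poor schedule.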

2511.13904 2026-05-25 cs.CV

Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment

Yuqiang Lin, Sam Lockyer, Shucheng Zhang, Florian Stanek, Markus Zarbock, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

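AI summary: This paper proposes EASE-MCVT, an edge-assisted multi-camera vehicle tracking framework that addresses the real-time and scalability shortcomings of existing methods. The framework uses a distributed edge-server architecture: object detection, single-camera tracking, and feature extraction run at the edge, and only lightweight metadata is sent to the central server, enabling efficient cross-camera association. It is optimized at both the algorithmic and system levels, including dynamic workload allocation, a server-side re-match module, and a self-supervised camera link model. Experiments show that it achieves real-time processing while preserving tracking accuracy, offering a practical solution for city-scale real-time traffic management.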

Abstract

Cameras are a core sensing modality in modern intelligent transportation systems (ITS), providing rich visual information on road-user activities. Multi-Camera Vehicle Tracking (MCVT) uses this data to reconstruct vehicle trajectories across camera networks, supporting applications such as traffic flow prediction and optimisation. However, most existing MCVT studies emphasise tracking accuracy while paying limited attention to real-time performance and scalability, both essential for real-world and city-scale deployment. To address this gap, we propose Edge-Assisted, Scalable and Efficient MCVT (EASE-MCVT), a distributed edge-server framework designed for real-time throughput and scalable operation. On the edge side, each camera stream is processed through object detection, single-camera tracking, geo-mapping and feature extraction, while only lightweight metadata, including vehicle locations and appearance features, is sent to the central server for cross-camera association. To improve both tracking accuracy and system efficiency, EASE-MCVT is optimised from algorithmic and system perspectives. Algorithmically, it introduces a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module to reconnect fragmented tracklets, and a self-supervised camera link model that learns spatio-temporal constraints to accelerate and stabilise cross-camera association. At the system level, it integrates production-oriented data engineering components to standardise deployment and data exchange for large-scale operation. To the best of our knowledge, EASE-MCVT is the first MCVT framework explicitly designed to address both real-time performance and scalability in a distributed edge-server setting. Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy, paving the way for city-wide real-time traffic management.

2511.10404 2026-05-25 cs.CL

DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam

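AI summary: This paper proposes DELICATE, a novel neuro-symbolic method for entity linking in historical Italian texts that combines a BERT-based encoder with contextual information from Wikidata and selects knowledge base entities according to temporal plausibility and entity type consistency. The study also builds ENEIDE, a multi-domain entity linking corpus covering literary and political texts from the 19th and 20th centuries. Experiments show that DELICATE outperforms other entity linking models on historical text and that its confidence scores and feature sensitivity make its results more interpretable.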

Abstract

In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the humanities due to complex document typologies, a lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models on historical Italian, even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE's confidence scores and feature sensitivity provide results that are more explainable and interpretable than those of purely neural methods.

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

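AI summary: This paper studies imitation-learning-based robot control policies for X-ray-guided spine procedures, in particular the feasibility and challenges of cannula insertion in vertebroplasty. The authors build a highly realistic simulation environment and a dataset of correct trajectories with bi-planar X-ray sequences, which they use to train imitation learning policies that rely solely on visual information. Experiments show that the policy achieves safe cannula insertion across a variety of spinal anatomies and initial conditions, laying a foundation for future lightweight, CT-free intra-operative robotic spine navigation.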

Abstract

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2510.15060 2026-05-25 cs.CV

A solution to generalized learning from small training sets found in infant repeated visual experiences of individual objects

Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith

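AI summary: This study examines how infants learn object categories from repeated visual experiences in daily life, analyzing head-camera images from 87 mealtime recordings of 14 one-year-old infants covering 8 early-learned object categories. It finds that each infant's visual experience of each category is highly skewed: a few objects are seen frequently, while other instances appear rarely. Graph-theoretic analysis reveals a "lumpy" within-category structure in which high similarity and high variability coexist. Experiments show that artificial training sets with this distribution support generalization to novel instances from very few examples, offering new insight into visual recognition and learning in both humans and machines.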

Comments
28 pages, 7 figures, 3 tables

英文摘要

One-year-old infants rapidly form and generalize categories of the everyday objects they encounter. Here we provide evidence on infants' daily-life visual experiences for 8 early-learned object categories. Using a corpus of infant head-camera images recorded at mealtimes (87 mealtimes captured by 14 infants), we measure the frequency of the unique instances of each category and the variability of the visual experiences of each instance. The distribution of instances is highly skewed, containing, for each infant and category, many images of the same few objects along with fewer images of other instances. Graph-theoretic measures of the similarity structure for individual categories reveal a lumpy mix of high similarity and high variability, organized into multiple but interconnected clusters of high-similarity images. In computational experiments, we show that artificially created training sets characterized by a lumpy distribution of similarities support generalization to novel instances after very few training experiences. We discuss implications for visual object recognition, and for learning more generally, by both humans and machines.
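One simple way to quantify the skew described above is the share of all images that comes from the few most frequent instances. The sketch below is only an illustration: the function name `top_k_share` and the counts are hypothetical, not the paper's measure or data.

```python
def top_k_share(instance_counts, k):
    """Fraction of all images that come from the k most frequent instances."""
    if k < 0:
        raise ValueError("k must be non-negative")
    total = sum(instance_counts)
    if total == 0:
        return 0.0
    # Sort counts from most to least frequent and keep the top k.
    top = sorted(instance_counts, reverse=True)[:k]
    return sum(top) / total


# Hypothetical image counts per unique object for one infant and one category.
cup_counts = [50, 30, 10, 5, 5]
share = top_k_share(cup_counts, 2)  # share of images from the two most-seen cups
```

A value close to 1 for small `k` indicates that a handful of objects dominates the infant's experience of the category.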

2510.07869 2026-05-25 cs.RO

USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

USIM 和 U0:面向通用水下机器人的视觉-语言-动作数据集与模型

Junwen Gu, Zhiheng Wu, Pengxuan Si, Shuang Qiu, Zhentao Zhang, Yukai Feng, Luoyang Sun, Laien Luo, Lianyi Yu, Jian Wang, Zhengxing Wu

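AI Summary This paper proposes a vision-language-action framework for general-purpose underwater robots, targeting general intelligence for multi-task execution underwater. The authors build USIM, a large-scale simulation-based dataset, and design U0, a vision-language-action model that adds target pose estimation as an auxiliary task to improve spatial awareness and performs well on tasks ranging from obstacle-avoidance navigation to three-dimensional mobile manipulation. Experiments show that U0 achieves state-of-the-art offline action prediction error and online task success rate, supporting the feasibility of general-purpose intelligence in underwater robotics.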

详情
Comments
Project Page: https://vincentgu2000.github.io/u0project/
AI中文摘要

水下环境对机器人导航和操作提出了独特挑战。现有研究主要关注特定任务方法,而针对多任务执行的通用智能研究仍然稀缺。为填补这一空白,我们提出一个面向通用水下机器人的统一框架,该框架集成了由语言指令驱动的感知和动作。首先,我们开发了一个数据合成管道来构建 USIM,这是一个基于模拟的数据集,包含来自 2275 条轨迹的超过 905K 帧,总计约 25 小时的 BlueROV2 交互。此外,我们提出了 U0,一个能够执行从避障导航到三维移动操作等各种任务的视觉-语言-动作(VLA)模型。该模型具有基于卷积-注意力的感知(CAP)模块,该模块将目标姿态估计作为辅助任务,以显式增强模型的空间感知能力。在评估方面,我们建立了一个系统评估框架和一个自动化管道,涵盖离线指标和在线任务执行。实验结果表明,USIM 数据集显著增强了现有 VLA 模型适应水下场景的能力。值得注意的是,我们的 U0 模型实现了最先进的性能:它将离线平均动作预测误差降低到 0.0359,并实现了 43.1% 的总体在线成功率,相比现有竞争基线(低于 37.6%)提升了 5.5%,其中导航任务成功率高达 87.5%。这些结果验证了水下机器人通用智能的可行性,为可扩展数据集合成和水下具身智能体提供了基础。

英文摘要

Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task-specific methods, studies on general-purpose intelligence for multi-task execution remain scarce. To address this gap, we propose a unified framework for general-purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation-based dataset that comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision-language-action (VLA) model capable of executing various tasks from obstacle-avoidance navigation to three-dimensional mobile manipulation. The model features a convolution-attention-based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly improves the ability of existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state-of-the-art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, an improvement of 5.5 percentage points over existing competitive baselines (below 37.6%), with success rates on navigation tasks reaching as high as 87.5%. These results validate the feasibility of general-purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.
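The dataset and success-rate figures in the abstract can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers quoted above ("over 905K frames", "approximately 25 hours", "below 37.6%"), so the derived values are rough estimates rather than figures reported by the authors.

```python
# Rounded figures quoted in the abstract.
frames = 905_000
trajectories = 2275
hours = 25

# Implied average frame rate and average trajectory length.
frames_per_second = frames / (hours * 3600)
frames_per_trajectory = frames / trajectories

# The reported gain is a difference of success rates, i.e. percentage points.
u0_success = 43.1
best_baseline_success = 37.6
gain_points = u0_success - best_baseline_success
```

Under these assumptions the data corresponds to roughly 10 frames per second and about 400 frames per trajectory, and the 5.5 figure is the gap between 43.1% and 37.6% rather than a relative change.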

2509.19858 2026-05-25 cs.CL

Benchmarking Gaslighting Attacks Against Speech Large Language Models

针对语音大语言模型的气灯攻击基准测试

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou

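AI Summary As Speech Large Language Models (Speech LLMs) are widely deployed in voice applications, robustness to manipulative or adversarial input becomes especially important. This paper introduces a new adversarial attack, the gaslighting attack, which uses carefully crafted prompts to mislead model reasoning and thereby assess the vulnerability of Speech LLMs, and it proposes five manipulation strategies for testing robustness across tasks. In the experiments, the five attack strategies reduce model accuracy by 24.3% on average, exposing significant behavioral vulnerabilities in current speech AI systems and the need for better robustness and reliability.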

详情
Comments
5 pages, 2 figures, 3 tables
AI中文摘要

随着语音大语言模型(Speech LLMs)越来越多地集成到基于语音的应用中,确保其对操纵性或对抗性输入的鲁棒性变得至关重要。尽管先前的工作研究了基于文本的LLMs和视觉语言模型中的对抗性攻击,但基于语音交互的独特认知和感知挑战仍未得到充分探索。相比之下,语音具有固有的模糊性、连续性和感知多样性,这使得对抗性攻击更难检测。在本文中,我们引入了气灯攻击,即精心设计的提示,旨在误导、覆盖或扭曲模型推理,以评估语音LLMs的脆弱性。具体来说,我们构建了五种操纵策略:愤怒、认知干扰、讽刺、隐晦和专业否定,旨在测试模型在不同任务上的鲁棒性。值得注意的是,我们的框架捕获了性能下降和行为响应,包括未经请求的道歉和拒绝,以诊断不同维度的易感性。此外,还进行了声学扰动实验以评估多模态鲁棒性。为了量化模型脆弱性,在5个语音和多模态LLMs上,对来自5个不同数据集的超过10,000个测试样本进行全面评估,结果显示在五种气灯攻击下平均准确率下降24.3%,表明显著的行为脆弱性。这些发现强调了需要更具弹性和可信赖的基于语音的AI系统。

英文摘要

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast to those modalities, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. Our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, we evaluate 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets and find an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

2509.06896 2026-05-25 cs.LG stat.ML

Are Targeted Data Poisoning Attacks as Effective as We Think?

定向数据投毒攻击是否如我们想象中那么有效?

William Xu, Chenyu Zhang, Yihan Wang, Matthew Y. R. Yang, Zuoqiu Liu, Gautam Kamath, Yaoliang Yu, Yiwei Lu

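AI Summary This paper studies how effective targeted data poisoning attacks really are. It points out that existing evaluations rely on randomly selected target samples and so fail to reflect worst-case attack effectiveness. The authors argue that evaluation should focus on the samples that are hardest to poison and, using only information from the clean model, propose a method for identifying both the easiest and the hardest samples to poison, enabling stricter worst-case evaluation and proactive defense.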

详情
AI中文摘要

定向数据投毒攻击通过向训练数据中注入恶意样本来操纵模型对特定测试样本的预测。然而,现有评估通常报告随机选择目标上的平均攻击成功率,掩盖了真实的最坏情况效果。我们认为正确的评估应聚焦于最难投毒的样本。同样的推理适用于防御:由于定向攻击在分布层面不留下痕迹,防御者应主动识别最脆弱的样本并应用定向对策。给定一个测试数据集,本文仅基于清洁模型信息识别最容易和最难投毒的样本。具体而言,我们利用清洁训练动态提供粗粒度评估,并利用投毒距离和预算对投毒类别进行细粒度分类。实验表明,这些指标能够可靠地按投毒脆弱性对样本分层,从而实现严格的最坏情况评估和主动的脆弱性感知防御。

英文摘要

Targeted data poisoning attacks manipulate model predictions on specific test samples by injecting malicious data into training. Yet existing evaluations report average attack success rates over randomly selected targets, obscuring true worst-case effectiveness. We argue that the right evaluation focuses on the hardest samples to poison. The same reasoning applies to defense: since targeted attacks leave no footprint at the distribution level, defenders should proactively identify the most vulnerable samples and apply targeted countermeasures. Given a test dataset, this paper identifies both the easiest-to-poison and the hardest-to-poison examples based only on clean-model information. Specifically, we offer coarse evaluations using clean training dynamics, and fine-grained classification on poison class using poison distances and budgets. Our experiments show these metrics reliably stratify samples by poisoning vulnerability, enabling both rigorous worst-case evaluation and proactive vulnerability-aware defense.

2508.13663 2026-05-25 cs.AI cs.LG

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

具有软实体约束的知识图谱交互式查询回答

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

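AI Summary This paper studies interactive query answering on knowledge graphs with soft entity constraints, aiming to handle real-world queries whose constraints are vague or context-dependent. The authors propose two efficient methods that learn soft constraints by tuning a small number of parameters or training a small neural network, without disrupting the ranking structure of the original query results, thereby improving the relevance of the answers. Experiments show that the methods incorporate user preferences while preserving the original query answering performance, offering a more flexible way to interact with knowledge graphs.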

详情
Comments
Accepted in Transactions on Machine Learning Research (2026)
AI中文摘要

针对不完整知识图谱的查询回答方法检索可能成为答案的实体,这在由于缺失边而无法通过直接图遍历达到此类答案时特别有用。然而,现有方法侧重于使用一阶逻辑形式化的查询。在实践中,许多现实世界的查询涉及固有模糊或上下文依赖的约束,例如对属性或相关类别的偏好。针对这一差距,我们引入了具有软约束的查询回答问题。我们形式化了该问题,并提出了两种高效方法,旨在通过融入软约束来调整查询答案分数,同时不破坏查询的原始答案。这些方法是轻量级的,只需调整两个参数或训练一个小型神经网络来捕获软约束,同时保持原始排序结构。为了评估该任务,我们通过生成带有软约束的数据集来扩展现有的QA基准。我们的实验表明,我们的方法能够捕获软约束,同时保持稳健的查询回答性能,并增加很少的开销。通过我们的工作,我们探索了一种与图数据库交互的新颖灵活方式,允许用户通过交互式提供示例来指定其偏好。

英文摘要

Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring only the tuning of two parameters or the training of a small neural network that captures soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.

2508.12247 2026-05-25 cs.LG cs.AI

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

STM3: 多尺度曼巴混合模型用于长期时空时间序列预测

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu

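AI Summary This paper proposes STM3, a deep learning model for long-term spatio-temporal time-series prediction that addresses multiscale information extraction and spatial dependency modeling. STM3 combines a Multiscale Mamba architecture with a Disentangled Mixture-of-Experts (DMoE) framework and introduces an adaptive graph causal network to capture complex spatio-temporal dependencies efficiently. A stable routing strategy and a causal contrastive learning strategy support robust representation learning and keep the information at different scales distinguishable. Experiments on multiple real-world datasets show superior prediction performance.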

详情
Comments
Accepted by KDD 2026
AI中文摘要

近年来,时空时间序列预测发展迅速,但现有深度学习方法难以高效学习复杂的长期时空依赖。长期时空依赖学习带来两个新挑战:1)长期时间序列自然包含多尺度信息,难以高效提取;2)不同节点的多尺度时间信息高度相关且难以建模。为解决这些问题,我们提出时空多尺度曼巴混合模型(STM3)。STM3在新型分离式混合专家(DMoE)框架内集成多尺度曼巴架构,以高效捕获多样的多尺度信息,同时利用自适应图因果网络建模复杂的空间依赖。为确保鲁棒的表示学习,我们引入稳定路由策略和因果对比学习策略,与层次信息聚合协同工作,保证尺度可区分性。我们理论上证明STM3实现了优越的路由平滑性,并保证了每个专家的模式分离。在跨领域的10个真实世界基准上的大量实验表明,STM3具有优越性能,在长期时空时间序列预测中达到了最先进的结果。值得注意的是,在PEMSD8数据集上,它取得了显著改进,在MAE、RMSE和MAPE上分别超过第二好的模型7.1%、8.5%和15.9%。代码可在https://github.com/IfReasonable/STM3_KDD26获取。

英文摘要

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

2506.20537 2026-05-25 cs.LG

Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Melt Pool Dynamics in Laser Powder Bed Fusion

基于有限元分析调控的物理信息机器学习用于激光粉末床熔融熔池动力学模拟加速

R. Sharma, Y. B. Guo

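AI Summary This study addresses the high computational cost of simulating melt pool dynamics in Laser Powder Bed Fusion (LPBF). It proposes a physics-informed neural network regulated by finite element analysis (FEA-PINN) to improve simulation efficiency while preserving accuracy. By introducing a strategy for capturing dynamic phase change and a correction mechanism that enforces physical consistency, the method mitigates the accuracy degradation of conventional physics-informed neural networks on time-dependent problems. The reported results indicate that FEA-PINN significantly reduces computational cost at accuracy comparable to FEA.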

详情
Comments
Further investigation revealed that the current version reflects an incomplete formulation and limited validation of the proposed method. We have since developed a substantially revised and extended study with updated assumptions and results, and therefore withdraw this version to prevent citation of superseded findings
AI中文摘要

高效模拟激光粉末床熔融(LPBF)对于工艺预测至关重要,因为传统数值方法(如有限元分析,FEA)存在计算成本高昂的持久问题。虽然物理信息神经网络(PINN)可以用少量训练数据预测解场,并通过迁移学习实现新工艺参数的泛化,但由于残差累积以及难以捕捉LPBF过程中固有的陡峭空间和时间梯度,它在时间相关问题中精度下降。为克服这一问题,本研究开发了一个高效的建模框架——有限元分析调控的物理信息神经网络(FEA-PINN),以加速LPBF过程中熔池动力学现象的预测,同时保持FEA的精度。FEA-PINN的创新体现在两个方面。首先,在PINN模型内部开发了一种新策略来捕捉粉末-液体-固体的动态相变,从而能够跟踪激光熔化过程中的材料状态。该模型进一步纳入了温度相关的材料属性、粉末床的相变行为、马兰戈尼对流以及熔池内的自然对流。其次,FEA-PINN框架在推理过程中集成了校正性的FEA模拟,以强制执行物理一致性、减少误差漂移并捕捉陡峭梯度。对比分析表明,FEA-PINN在显著降低计算成本的同时,达到了与FEA相当的精度。该框架已针对LPBF中单道扫描的基准FEA数据进行了验证。

英文摘要

Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process prediction because traditional numerical methods such as finite element analysis (FEA) remain computationally expensive. While a Physics-Informed Neural Network (PINN) can predict solution fields from little training data and can generalize to new process parameters via transfer learning, it suffers from accuracy degradation in time-dependent problems due to the accumulation of residuals and the difficulty of capturing the steep spatial and temporal gradients inherent in the LPBF process. To overcome this issue, this study develops an efficient modeling framework, FEA-Regulated Physics-Informed Neural Network (FEA-PINN), to accelerate the prediction of melt pool dynamics in an LPBF process while maintaining FEA accuracy. FEA-PINN is novel in two respects. First, a strategy has been developed within the PINN model to capture the dynamic powder-liquid-solid phase change, enabling the tracking of material status during laser melting. The model further incorporates temperature-dependent material properties, phase change behavior of the powder bed, Marangoni convection, and natural convection within the melt pool. Second, the FEA-PINN framework integrates corrective FEA simulations during inference to enforce physical consistency, reduce error drift, and capture the steep gradients. A comparative analysis shows that FEA-PINN achieves accuracy comparable to FEA while significantly reducing computational cost. The framework has been validated against benchmark FEA data for single-track scanning in LPBF.

2506.14135 2026-05-25 cs.RO cs.CV

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

GAF: 高斯动作场作为机器人操作中动态世界建模的4D表示

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

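AI Summary This paper proposes the Gaussian Action Field (GAF), a 4D representation for dynamic world modeling in robotic manipulation. GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes to model dynamic scenes and manipulation actions in 4D. The method reasons about actions directly from the motion-aware 4D representation and improves manipulation precision through three related outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action. Experiments show that GAF outperforms existing methods in both reconstruction quality and manipulation success rate.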

详情
Comments
https://ChaiYing1.github.io/projects/GAF/
AI中文摘要

准确的场景感知对于基于视觉的机器人操作至关重要。现有方法通常遵循视觉到动作(V-A)范式,直接从视觉输入预测动作,或视觉到3D到动作(V-3D-A)范式,利用中间3D表示。然而,由于操作场景的复杂性和动态性,这些方法常常面临动作不准确的问题。在本文中,我们采用V-4D-A框架,通过高斯动作场(GAF)从运动感知的4D表示中直接进行动作推理。GAF通过引入可学习的运动属性扩展了3D高斯溅射(3DGS),实现了动态场景和操作动作的4D建模。为了学习时变场景几何和动作感知的机器人运动,GAF提供三个相互关联的输出:当前场景的重建、未来帧的预测以及通过高斯运动估计的初始动作。此外,我们采用一个动作-视觉对齐的去噪框架,以GAF生成的初始动作和高斯感知的统一表示为条件,进一步获得更精确的动作。大量实验表明,GAF在重建质量上实现了显著改进,PSNR提高+11.5385 dB,SSIM提高+0.3864,LPIPS降低-0.5574,同时在机器人操作任务中,相比最先进方法,平均成功率提升+7.3%。

英文摘要

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements over state-of-the-art methods: GAF improves reconstruction quality by +11.5385 dB in PSNR, +0.3864 in SSIM, and -0.5574 in LPIPS, and raises the average success rate in robotic manipulation tasks by +7.3%.
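To give a feel for the size of the reported PSNR gain, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and converts a gain in dB into the equivalent reduction in mean squared error. The function names are illustrative, and the conversion is exact only for a single image; an average PSNR gain over a test set does not translate exactly into an average MSE ratio.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_factor(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same max_value)."""
    return 10.0 ** (psnr_gain_db / 10.0)


# PSNR gain quoted in the abstract.
factor = mse_reduction_factor(11.5385)
```

Under this reading, a gain of about 11.5 dB corresponds to a per-image MSE that is roughly 14 times smaller.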

2506.05438 2026-05-25 cs.LG cs.AI

An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

一种用于动态健康指标构建的无监督框架及其在滚动轴承预测中的应用

Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

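AI Summary This paper proposes an unsupervised framework, requiring no expert knowledge, for constructing a dynamic health indicator (HI) that improves degradation trend modeling and remaining-life prediction for rolling bearings. The method extracts degradation features automatically with a skip-connection-based autoencoder and, in the resulting feature space, introduces an HI-generating module with an embedded inner prediction block that explicitly models the temporal dependence between HI states, capturing the dynamic information in the degradation process. Experiments on two bearing lifecycle datasets show that the proposed dynamic HI outperforms existing methods and improves prognostic performance.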

详情
AI中文摘要

健康指标(HI)在滚动轴承的退化评估和预测中起着关键作用。尽管已有多种HI构建方法被研究,但大多数依赖于专家知识进行特征提取,并忽略了捕捉序列退化过程中隐藏的动态信息,这限制了所构建HI在退化趋势表示和预测中的能力。为解决这些问题,通过一种无监督框架构建了考虑HI级时间依赖性的新型动态HI。具体而言,由基于跳跃连接的自编码器组成的退化特征学习模块首先将原始信号映射到代表性退化特征空间(DFS),以自动提取必要的退化特征,无需专家知识。随后,在该DFS中,提出了一种嵌入内部HI预测模块的新型HI生成模块用于动态HI构建,其中过去和当前HI状态之间的时间依赖性被保证并显式建模。在此基础上,动态HI捕捉了退化过程固有的动态内容,确保其在退化趋势建模和未来退化预测中的有效性。在两个轴承生命周期数据集上的实验结果表明,所提出的HI构建方法优于对比方法,且构建的动态HI在预测任务中表现更优。

英文摘要

Health indicators (HIs) play a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook the dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI to represent degradation trends and support prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamics of the degradation process, ensuring its effectiveness for degradation trend modeling and future degradation prognostics. Experimental results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and that the constructed dynamic HI is superior for prognostic tasks.

2502.20349 2026-05-25 q-bio.NC cs.AI

Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior

自然主义计算认知科学:迈向能够捕捉自然行为全范围的通用模型与理论

Wilka Carvalho, Andrew Lampinen

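AI Summary This paper discusses how recent advances in artificial intelligence can help build generalizable cognitive science theories that cover the full range of natural situations and behaviors. It argues that more naturalistic experimental paradigms and computational models help in understanding natural intelligence more accurately and in making theories generalize. The paper reviews related work in cognitive science, neuroscience, and AI, and suggests that integrating progress across these fields allows human cognition to be explained and modeled better while retaining experimental control and theoretical depth.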

详情
AI中文摘要

认知科学如何构建能够涵盖自然情境与行为全范围的通用理论?我们认为,人工智能(AI)的进展为认知科学提供了及时的机会,使其能够采用日益自然化的刺激、任务和行为进行实验,并构建能够适应这些变化的计算模型。我们首先回顾了涵盖神经科学、认知科学和AI的日益增长的研究,这些研究表明,纳入更广泛的自然主义实验范式及其相应模型,可能是解决自然智能某些方面并确保理论泛化的必要条件。我们回顾了认知科学和神经科学中的案例,其中自然主义范式引发了不同的行为或涉及不同的过程。然后,我们讨论了AI的最新进展,表明从自然主义数据中学习会产生定性的不同行为模式和泛化模式,并探讨了这些发现如何影响我们从认知建模中得出的结论,以及如何帮助产生关于认知和神经现象根源的新假设。接着,我们建议整合AI和认知科学的最新进展,将使我们能够处理更自然的现象,而不放弃实验控制或对理论理解基础的追求。我们提供了关于方法论实践如何有助于自然主义计算认知科学中累积进展的实用指导,并描绘了一条构建能够解决自然认知实际问题的计算模型的道路,同时对这些模型所依据的过程和原则进行还原性理解。

英文摘要

How can cognitive science build generalizable theories that span the full scope of natural situations and behaviors? We argue that progress in Artificial Intelligence (AI) offers timely opportunities for cognitive science to embrace experiments with increasingly naturalistic stimuli, tasks, and behaviors; and computational models that can accommodate these changes. We first review a growing body of research spanning neuroscience, cognitive science, and AI that suggests that incorporating a broader range of naturalistic experimental paradigms, and models that accommodate them, may be necessary to resolve some aspects of natural intelligence and ensure that our theories generalize. We review cases from cognitive science and neuroscience where naturalistic paradigms elicit distinct behaviors or engage different processes. We then discuss recent progress in AI that shows that learning from naturalistic data yields qualitatively different patterns of behavior and generalization, and examine how these findings impact the conclusions we draw from cognitive modeling, and can help yield new hypotheses for the roots of cognitive and neural phenomena. We then suggest that integrating recent progress in AI and cognitive science will enable us to engage with more naturalistic phenomena without giving up experimental control or the pursuit of theoretically grounded understanding. We offer practical guidance on how methodological practices can contribute to cumulative progress in naturalistic computational cognitive science, and illustrate a path towards building computational models that solve the real problems of natural cognition, together with a reductive understanding of the processes and principles by which they do so.

2502.13731 2026-05-25 cs.AI

Robust Counterfactual Inference in Markov Decision Processes

马尔可夫决策过程中的鲁棒反事实推断

Jessica Lally, Milad Kazemi, Nicola Paoletti

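AI Summary This paper addresses a key limitation of existing counterfactual inference methods for Markov Decision Processes (MDPs) and proposes a new non-parametric approach. Existing methods rely on a specific causal model to identify counterfactuals, yet many causal models are consistent with the observational and interventional distributions, and they yield different counterfactual distributions. By computing tight bounds on counterfactual transition probabilities across all compatible causal models, the approach makes counterfactual inference efficient and scalable, and it then derives robust counterfactual policies that optimize worst-case reward. Experiments on several case studies show improved robustness.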

详情
AI中文摘要

本文解决了马尔可夫决策过程(MDP)中现有反事实推断方法的一个关键局限性。当前方法假设特定的因果模型以使反事实可识别。然而,通常存在许多与MDP的观测分布和干预分布一致的因果模型,每个模型产生不同的反事实分布,因此固定一个特定的因果模型限制了反事实推断的有效性(和有用性)。我们提出了一种新颖的非参数方法,该方法在所有兼容因果模型上计算反事实转移概率的紧界。与先前需要求解规模过大(变量随MDP大小呈指数增长)的优化问题的方法不同,我们的方法为这些界提供了闭式表达式,使得计算对于非平凡MDP高度高效且可扩展。一旦构建了这样的区间反事实MDP,我们的方法识别出鲁棒的反事实策略,该策略针对不确定的区间MDP概率优化最坏情况奖励。我们在各种案例研究上评估了我们的方法,证明了相比现有方法具有改进的鲁棒性。

英文摘要

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
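Optimising worst-case reward over interval transition probabilities usually relies on an inner step that picks, within the intervals, the distribution minimising the expected value of the next state. The sketch below shows the standard greedy solution to that inner problem for generic interval MDPs; it is not taken from the paper, and the function name and numbers are hypothetical.

```python
def worst_case_expectation(lower, upper, values):
    """Minimise sum(p[s] * values[s]) subject to lower <= p <= upper, sum(p) = 1.

    Greedy rule: start every state at its lower bound, then hand the remaining
    probability mass to the lowest-valued states first.
    """
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals admit no valid probability distribution")
    probs = list(lower)
    remaining = 1.0 - sum(lower)
    # Visit states in order of increasing value.
    for s in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[s] - lower[s], remaining)
        probs[s] += extra
        remaining -= extra
    expectation = sum(p * v for p, v in zip(probs, values))
    return expectation, probs


# Hypothetical intervals over three successor states and their values.
expectation, probs = worst_case_expectation(
    lower=[0.1, 0.2, 0.3], upper=[0.5, 0.6, 0.7], values=[1.0, 0.0, 2.0]
)
```

In this example the free mass of 0.4 all goes to the zero-valued state, giving probabilities of about (0.1, 0.6, 0.3) and a worst-case expectation of about 0.7. A robust value iteration would call this routine once per state-action pair in each sweep.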

2501.08222 2026-05-25 cs.RO

Data-driven Spatial Classification using Multi-Arm Bandits for Monitoring with Energy-Constrained Mobile Robots

基于多臂老虎机的数据驱动空间分类用于能量受限移动机器人监测

Xiaoshan Lin, Siddharth Nayak, Stefano Di Cairano, Abraham P. Vinod

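AI Summary This paper studies spatial classification for environmental monitoring with cooperating mobile robots, with the goal of quickly dividing a search area into interesting and uninteresting regions. It proposes a bi-level strategy based on a multi-armed bandit framework: a high-level planner chooses the regions to visit from data collected online, and a low-level planner coordinates paths via integer programming while accounting for sensor noise and energy constraints. The approach shows good classification efficiency and task completion performance in both simulation and physical robot experiments.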

详情
Comments
8 pages, 6 figures. See https://www.youtube.com/watch?v=gzulpOcVYzg for an overview of the approach along with videos of the hardware experiments
AI中文摘要

我们考虑使用协调移动机器人团队收集的数据进行监测的空间分类问题。此类分类问题出现在包括搜索救援和精准农业在内的多个应用中。具体而言,我们希望使用移动传感器和移动充电站团队,尽可能快地将搜索环境的区域分类为有趣和无趣。我们开发了一种数据驱动策略,该策略适应传感数据中的噪声和传感器的有限能量容量,并为团队生成无碰撞运动计划。我们提出了一种双层方法,其中高层规划器利用多臂老虎机框架,根据在线收集的数据确定无人机接下来要访问的潜在感兴趣区域。然后,基于整数规划的低层路径规划器协调团队访问已确定区域的路径,并满足物理约束。我们描述了所提方法的若干理论特性,包括任意时间保证和任务完成时间。我们在仿真中展示了我们方法的有效性,并在使用移动机器人的物理实验中进一步验证了这些观察结果。

英文摘要

We consider the spatial classification problem for monitoring using data collected by a coordinated team of mobile robots. Such classification problems arise in several applications including search-and-rescue and precision agriculture. Specifically, we want to classify the regions of a search environment into interesting and uninteresting as quickly as possible using a team of mobile sensors and mobile charging stations. We develop a data-driven strategy that accommodates the noise in sensed data and the limited energy capacity of the sensors, and generates collision-free motion plans for the team. We propose a bi-level approach, where a high-level planner leverages a multi-armed bandit framework to determine the potential regions of interest for the drones to visit next based on the data collected online. Then, a low-level path planner based on integer programming coordinates the paths for the team to visit the determined regions subject to the physical constraints. We characterize several theoretical properties of the proposed approach, including anytime guarantees and task completion time. We show the efficacy of our approach in simulation, and further validate these observations in physical experiments using mobile robots.
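The abstract does not say which bandit algorithm the high-level planner uses. As a rough illustration of how a multi-armed bandit can pick the next region to sense, the sketch below applies the classic UCB1 rule, treating each region as an arm; the choice of UCB1, the function name, and the numbers are assumptions, and energy and collision constraints are ignored.

```python
import math


def ucb_select(counts, means, exploration=2.0):
    """Return the index of the region to visit next under the UCB1 rule.

    counts[i] is how often region i has been sensed so far, and means[i] is
    the average observed "interestingness" of region i.
    """
    # Visit every region once before comparing confidence bounds.
    for region, n in enumerate(counts):
        if n == 0:
            return region
    total = sum(counts)
    scores = [
        m + math.sqrt(exploration * math.log(total) / n)
        for m, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical statistics for three regions; the third has barely been sensed.
next_region = ucb_select(counts=[10, 10, 1], means=[0.5, 0.5, 0.4])
```

The rarely visited region wins here because its uncertainty bonus outweighs its slightly lower mean, which is how the rule trades off exploring noisy regions against revisiting promising ones.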

2412.19098 2026-05-25 cs.LG

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

SyMerge:从无干扰到协同合并的单层自适应方法

Aecheon Jung, Seunghwan Lee, Dongyoon Han, Sungeun Hong

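AI Summary SyMerge is a lightweight model merging framework that aims to create synergy between tasks through single-layer adaptation rather than merely avoiding task interference. It jointly optimizes the merging coefficients and a single task-specific layer, and its expert-guided self-labeling objective improves the stability and performance of merging. The study shows that SyMerge can merge models trained from different initializations and achieves state-of-the-art results on vision, dense prediction, and natural language processing benchmarks.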

详情
Comments
Accepted at ICML 2026
AI中文摘要

模型合并将独立训练的模型组合成一个多任务模型。然而,大多数现有方法主要关注避免任务干扰。我们认为其更大的潜力在于实现任务协同,即任务之间主动相互改进。我们识别出跨任务性能,由不同任务之间的编码器和预测器的兼容性定义,作为合并质量的关键指标。我们证明仅适应单个任务特定层就足以诱导这种协同。本研究提出SyMerge,一个轻量级框架,联合优化合并系数和单个任务特定层。我们采用专家引导的自标签目标,提供超越熵最小化的稳定监督。有趣的是,我们进一步表明SyMerge成功合并了从不同初始化训练的模型,而标准方法在此情况下失效。我们极简但有原则的方法在视觉、密集预测和NLP基准上达到了最先进的结果。我们的代码可在https://aim-skku.github.io/SyMerge获取。

英文摘要

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. Our code is available at https://aim-skku.github.io/SyMerge
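To make "merging coefficients" concrete, the sketch below merges models in the common task-arithmetic style: each expert contributes its difference from a shared base model, scaled by a coefficient. This form is an assumption for illustration and not necessarily SyMerge's formulation; parameters are plain Python lists, and the coefficients are fixed by hand rather than optimized.

```python
def merge_parameters(base, experts, coefficients):
    """Merge experts as base + sum_i coefficient_i * (expert_i - base).

    base and each expert map parameter names to lists of floats.
    """
    if len(experts) != len(coefficients):
        raise ValueError("need exactly one coefficient per expert")
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + sum(c * (expert[name][i] - b)
                    for expert, c in zip(experts, coefficients))
            for i, b in enumerate(base_values)
        ]
    return merged


# Hypothetical one-tensor models: a base and two fine-tuned experts.
base = {"w": [1.0, 2.0]}
experts = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
merged = merge_parameters(base, experts, coefficients=[0.5, 0.5])
```

In a learned variant, the coefficients would be trainable parameters updated by gradient descent on an unlabeled objective, which is where a self-labeling or entropy-based loss would enter.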

2412.14642 2026-05-25 cs.CL

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Speak-to-Structure:评估大语言模型在开放域自然语言驱动的分子生成中的表现

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Yatao Bian, Dongzhan Zhou, Xiao-yong Wei, Qing Li

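AI Summary Large Language Models (LLMs) have recently shown great potential in natural language-driven molecule discovery, but existing datasets and benchmarks are built mainly on one-to-one mappings: they evaluate a model's ability to retrieve a single pre-defined answer and overlook its capacity to generate diverse, valid molecular candidates. The authors therefore propose S^2-Bench, the first benchmark for evaluating LLMs in open-domain natural language-driven molecule generation, designed for one-to-many relationships and challenging models to show genuine molecular understanding and open-ended generation. They also introduce OpenMolIns, a large-scale instruction tuning dataset with which Llama3.1-8B surpasses strong LLMs such as GPT-4o and Claude-3.5 on the benchmark, and a comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design.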

详情
Comments
Accepted by KDD 2026. Our code and datasets are fully accessible at https://github.com/phenixace/S2-TOMG-Bench and https://huggingface.co/datasets/phenixace/S2-TOMG-Bench
AI中文摘要

近期,大语言模型(LLMs)在自然语言驱动的分子发现中展现出巨大潜力。然而,现有的分子-文本对齐数据集和基准主要基于一对一映射,衡量LLMs检索单一预定义答案的能力,而非其生成多样且同样有效的候选分子的创造性潜力。为填补这一关键空白,我们提出Speak-to-Structure(S^2-Bench),这是首个评估LLMs在开放域自然语言驱动分子生成中的基准。S^2-Bench专为一对多关系设计,挑战LLMs展现真正的分子理解和开放生成能力。我们的基准包括三个关键任务:分子编辑(MolEdit)、分子优化(MolOpt)和定制分子生成(MolCustom),每个任务探索分子发现的不同方面。我们还引入OpenMolIns,一个大规模指令微调数据集,使Llama3.1-8B在S^2-Bench上超越最强大的LLMs如GPT-4o和Claude-3.5。我们对31个LLMs的全面评估将焦点从简单的模式回忆转向现实的分子设计,为自然语言驱动分子发现中更强大的LLMs铺平道路。我们的代码和数据集完全可通过GitHub仓库(https://github.com/phenixace/S2-TOMG-Bench)和Huggingface数据集(https://huggingface.co/datasets/phenixace/S2-TOMG-Bench)获取。

英文摘要

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are fully accessible through the GitHub repository (https://github.com/phenixace/S2-TOMG-Bench) and the Hugging Face dataset (https://huggingface.co/datasets/phenixace/S2-TOMG-Bench).

2406.02883 2026-05-25 cs.LG cs.CR

Nonlinear Transformations Against Unlearnable Datasets

针对不可学习数据集的非线性变换

Thushari Hapuarachchi, Jing Lin, Kaiqi Xiong, Mohamed Rahouti, Gitte Ost

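AI summary This paper studies whether nonlinear transformations allow deep learning models to learn from "unlearnable" datasets, which are conventionally assumed to resist training. The authors propose an effective nonlinear transformation framework and show through extensive experiments that deep neural networks can learn effectively from unlearnable data produced by a range of data protection methods, breaking such data more effectively than the recently proposed linear separable technique. The experiments report improved test accuracy on unlearnable CIFAR10 datasets generated by these methods, indicating that existing protection methods are insufficient to prevent unauthorized use of data and that stronger protection mechanisms are urgently needed.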

Details
AI Chinese abstract

自动化爬取是深度学习模型中未经数据所有者授权收集数据的常见方法。近期研究开始解决这种数据收集方法带来的隐私问题。显著的方法包括Deepconfuse、误差最小化、误差最大化(也称为对抗性投毒)、神经正切泛化攻击、合成、自回归、单像素捷径、自集成保护、纠缠特征、鲁棒误差最小化、虚伪和TensorClog。这些方法生成的数据称为“不可学习”样本,阻止深度学习模型“学习”。在本研究中,我们调查并设计了一个有效的非线性变换框架,并进行大量实验,证明深度神经网络能够有效从上述十二种方法产生的传统上被认为不可学习的数据/样本中学习。与研究人员最近提出的线性可分技术相比,所提出的方法提高了破解不可学习数据的能力。具体来说,我们的大量实验表明,对于这些十二种数据保护方法生成的不可学习CIFAR10数据集(除单像素捷径外),改进范围为0.34%至249.59%。此外,与线性可分技术相比,所提出的框架在自回归和REM方法上实现了超过100%的测试准确率提升。我们的发现表明,这些方法不足以防止机器学习模型中数据的未经授权使用。迫切需要开发更强大的保护机制,有效阻止攻击者在未经所有者适当授权的情况下访问数据。

English abstract

Automated scraping stands out as a common method for collecting data for deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. Notable approaches include Deepconfuse, error-minimizing, error-maximizing (also known as adversarial poisoning), Neural Tangent Generalization Attack, synthetic, autoregressive, One-Pixel Shortcut, Self-Ensemble Protection, Entangled Features, Robust Error-Minimizing, Hypocritical, and TensorClog. The data generated by these approaches, called "unlearnable" examples, are meant to prevent deep learning models from "learning" from them. In this research, we investigate and devise an effective nonlinear transformation framework and conduct extensive experiments to demonstrate that a deep neural network can effectively learn from the data/examples traditionally considered unlearnable produced by the above twelve approaches. The resulting approach improves the ability to break unlearnable data compared to the linear separable technique recently proposed by researchers. Specifically, our extensive experiments show that the improvement ranges from 0.34% to 249.59% for the unlearnable CIFAR10 datasets generated by those twelve data protection approaches, except for One-Pixel Shortcut. Moreover, the proposed framework achieves over 100% improvement in test accuracy for the Autoregressive and REM approaches compared to the linear separable technique. Our findings suggest that these approaches are inadequate in preventing unauthorized uses of data in machine learning models. There is an urgent need to develop more robust protection mechanisms that effectively thwart an attacker from accessing data without proper authorization from the owners.

2403.12401 2026-05-25 cs.CV

RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

RT-NeRV: 通过残差标记化重新思考混合神经视频表示

Yunjie Xu, Xiang Feng, Chengkai Wang, Alan Wee-Chung Liew, Xuefei Yin, Yanming Zhu

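AI summary This paper proposes RT-NeRV, a hybrid neural video representation method that addresses the difficulty existing methods have in preserving fine details at low bitrates. Its core idea is residual tokenization: shallow residual features and inter-frame residual cues are discretized into compact residual tokens, so that this information can be transmitted efficiently and used for reconstruction. The method pairs a residual tokenizer with a residual-aware codebook learning strategy, which improves reconstruction quality and training stability, and it outperforms existing hybrid NeRV methods on video regression and restoration tasks.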

Details
Comments
Under Review
AI Chinese abstract

神经视频表示(NeRV)通过将视频表示为紧凑的神经网络并实现高效解码,已成为视频压缩的一种有前景的范式。混合NeRV方法通过内容自适应嵌入进一步提高了重建质量,但在低比特率下仍难以保留精细细节。一个关键限制是,浅层残差支持信息虽然对重建非常有益,但其连续形式的传输成本高昂,因此未被充分利用。在本文中,我们重新思考混合NeRV,并提出了RT-NeRV,一种用于混合神经视频表示的残差标记化框架。核心思想是将浅层残差特征和帧间残差线索离散化为紧凑的残差标记,从而使得信息丰富的重建支持能够高效传输并被解码器利用。为此,我们设计了一个残差标记化器,并结合了一种残差感知的码本学习策略,该策略提高了标记利用率并稳定了训练。RT-NeRV可以轻松集成到现代混合NeRV主机中,持续增强细节保留、重建质量以及比特率-质量权衡。在视频回归和相关恢复任务上的大量实验表明,RT-NeRV优于强混合NeRV基线,并与近期基于INR的视频压缩方法保持竞争力。这些结果表明,残差标记化是推进混合神经视频表示的一个有效且互补的方向。

English abstract

Neural Representations for Videos (NeRV) have emerged as a promising paradigm for video compression by representing videos as compact neural networks with efficient decoding. Hybrid NeRV methods further improve reconstruction quality through content-adaptive embeddings, but still struggle to preserve fine details at low bitrates. A key limitation is that shallow residual support information, although highly beneficial for reconstruction, is costly to transmit in its continuous form and is therefore underutilized. In this paper, we rethink hybrid NeRV and present RT-NeRV, a residual tokenization framework for hybrid neural video representations. The core idea is to discretize shallow residual features and inter-frame residual cues into compact residual tokens, allowing informative reconstruction support to be transmitted efficiently and exploited by the decoder. To this end, we design a residual tokenizer together with a residual-aware codebook learning strategy that improves token utilization and stabilizes training. RT-NeRV can be readily integrated into modern hybrid NeRV hosts, consistently enhancing detail preservation, reconstruction quality, and bitrate-quality trade-offs. Extensive experiments on video regression and related restoration tasks show that RT-NeRV outperforms strong hybrid NeRV baselines and remains competitive with recent INR-based video compression methods. These results demonstrate that residual tokenization is an effective and complementary direction for advancing hybrid neural video representations.
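
The abstract does not describe how the residual tokenizer is built, so the following is only a minimal sketch of generic vector quantization, offered as one plausible reading of "discretizing residual features into tokens". The function names `tokenize` and `detokenize`, the toy codebook, and the nearest-neighbour assignment rule are all assumptions for illustration and are not taken from RT-NeRV.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(residuals, codebook):
    """Map each residual vector to the index of its nearest codebook entry.

    The returned integer indices play the role of compact "tokens":
    only the indices need to be transmitted, not the continuous vectors.
    (Hypothetical helper, not the paper's tokenizer.)
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(r, codebook[k]))
        for r in residuals
    ]


def detokenize(tokens, codebook):
    """Recover approximate residual vectors by looking tokens up in the codebook."""
    return [codebook[k] for k in tokens]


# Toy data: three 2-D codebook entries and three residual vectors (made up).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
residuals = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.4]]

tokens = tokenize(residuals, codebook)          # [0, 1, 2]
approx_residuals = detokenize(tokens, codebook)
```

In this toy setting each residual costs one small integer to transmit instead of a continuous vector, at the price of a quantization error that depends on how well the codebook covers the residual distribution.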

2605.23673 2026-05-25 cs.LG

Relevant Walk Search for Explaining Graph Neural Networks

用于解释图神经网络的相关游走搜索

Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

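AI summary This paper addresses the explainability of graph neural networks (GNNs) and proposes an efficient method for finding relevant walks that reveal important information flows in the network. Identifying relevant walks with layer-wise relevance propagation for GNNs (GNN-LRP) has computational complexity exponential in the network depth, which limits its use on large-scale problems. The authors therefore design polynomial-time algorithms that find the top-K relevant walks exactly at the neuron level and approximately at the node level. Experiments show that the algorithms perform well at scale and are useful across application domains, including epidemiology, molecular, and natural language benchmarks.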

Details
Journal ref
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:38301-38324, 2023
Comments
Published in ICML 2023
AI Chinese abstract

图神经网络(GNN)已成为图分析的重要机器学习工具,其可解释性对于安全性、公平性和鲁棒性至关重要。GNN的逐层相关性传播(GNN-LRP)评估游走的相关性以揭示网络中的重要信息流,并提供高阶解释,已被证明优于低阶(即节点/边级)解释。然而,通过GNN-LRP识别相关游走需要相对于网络深度的指数级计算复杂度,本文将对这一问题进行改进。具体来说,我们提出了多项式时间算法来寻找前K个相关游走,这大大减少了计算量,从而提高了GNN-LRP在大规模问题上的适用性。我们提出的算法基于最大积算法(一种在概率图模型中寻找最大似然配置的常用工具),并且可以在神经元级别精确地找到最相关的游走,在节点级别近似地找到。我们的实验展示了我们的算法在规模上的性能及其在应用领域(即流行病学、分子和自然语言基准)中的实用性。我们在 https://github.com/xiong-ping/rel_walk_gnnlrp 上提供代码。

English abstract

Graph Neural Networks (GNNs) have become important machine learning tools for graph analysis, and their explainability is crucial for safety, fairness, and robustness. Layer-wise relevance propagation for GNNs (GNN-LRP) evaluates the relevance of walks to reveal important information flows in the network, and provides higher-order explanations, which have been shown to be superior to the lower-order, i.e., node-/edge-level, explanations. However, identifying relevant walks by GNN-LRP requires exponential computational complexity with respect to the network depth, which we will remedy in this paper. Specifically, we propose polynomial-time algorithms for finding top-K relevant walks, which drastically reduces the computation and thus increases the applicability of GNN-LRP to large-scale problems. Our proposed algorithms are based on the max-product algorithm, a common tool for finding the maximum likelihood configurations in probabilistic graphical models, and can find the most relevant walks exactly at the neuron level and approximately at the node level. Our experiments demonstrate the performance of our algorithms at scale and their utility across application domains, i.e., on epidemiology, molecular, and natural language benchmarks. We provide our code at https://github.com/xiong-ping/rel_walk_gnnlrp.
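
To illustrate why a max-product recursion can avoid enumerating every walk, here is a minimal sketch under simplifying assumptions that are not taken from the paper: a walk's score is assumed to factorize into a product of non-negative per-step scores, and `step_scores[t][i][j]` is a hypothetical score for stepping from node `i` to node `j` at layer `t`. The actual GNN-LRP relevances and the paper's top-K procedure are defined by the authors. This toy only finds the single best walk, in the spirit of Viterbi decoding, and compares it with brute-force enumeration.

```python
from itertools import product


def best_walk_max_product(step_scores):
    """Find the walk with the largest product of per-step scores.

    Runs in O(depth * n^2) time instead of enumerating n^(depth + 1) walks.
    Assumes all scores are non-negative (an assumption of this sketch).
    """
    n = len(step_scores[0])
    best = [1.0] * n   # best[j]: top score of any partial walk ending at node j
    back = []          # back[t][j]: predecessor of j on that best partial walk
    for scores in step_scores:
        new_best, pointers = [], []
        for j in range(n):
            i_star = max(range(n), key=lambda i: best[i] * scores[i][j])
            new_best.append(best[i_star] * scores[i_star][j])
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final node to recover the full walk.
    end = max(range(n), key=lambda j: best[j])
    walk = [end]
    for pointers in reversed(back):
        walk.append(pointers[walk[-1]])
    walk.reverse()
    return best[end], walk


def best_walk_brute_force(step_scores):
    """Reference implementation that scores every possible walk."""
    n, depth = len(step_scores[0]), len(step_scores)
    best_score, best_walk = -1.0, None
    for walk in product(range(n), repeat=depth + 1):
        score = 1.0
        for t in range(depth):
            score *= step_scores[t][walk[t]][walk[t + 1]]
        if score > best_score:
            best_score, best_walk = score, list(walk)
    return best_score, best_walk


# Toy example: 2 nodes, 2 layers of made-up step scores.
step_scores = [
    [[0.1, 0.9], [0.5, 0.5]],
    [[0.8, 0.2], [0.3, 0.7]],
]
dp_score, dp_walk = best_walk_max_product(step_scores)       # walk [0, 1, 1]
bf_score, bf_walk = best_walk_brute_force(step_scores)
```

The dynamic program keeps only the best partial walk per node at each layer, which is what makes the cost grow polynomially rather than exponentially with depth in this simplified setting.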