arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.29868 2026-05-19 cs.AI cs.LO

Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning

Oliver Schön, Lars Lindemann

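AI summary: This paper studies the spatiotemporal robustness of temporal logic tasks through multi-objective reasoning. It proposes a new definition of spatiotemporal robustness that accounts for spatial and temporal perturbations jointly, and notes its relevance to interacting systems such as multi-agent robotics, smart cities, and air traffic control.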

Comments 30 pages, 6 figures, to be published at the 38th International Conference on Computer Aided Verification 2026

Abstract

The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete-time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi-agent robotics, smart cities, and air traffic control. We define STR as a multi-objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto-optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi-objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under-approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi-objective reasoning.
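
The Pareto-optimal reading of STR can be illustrated with a small sketch. It assumes a hypothetical list of admissible (spatial margin, temporal shift) pairs and keeps only the pairs that no other pair dominates in both coordinates. The values, the "larger is better" convention, and the function name are illustrative assumptions, not taken from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (larger is better)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical admissible perturbation pairs: (spatial margin, temporal shift in steps).
admissible = [(0.5, 0), (0.4, 1), (0.4, 0), (0.1, 3), (0.3, 2), (0.1, 2)]
front = pareto_front(admissible)
```

Because the two coordinates are only partially ordered, the result is a set of trade-offs rather than a single robustness number.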

2603.26720 2026-05-19 cs.RO cs.AI

SutureFormer: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

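AI summary: This paper proposes SutureFormer, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via interpolation to learn surgical needle trajectory prediction, reducing Average Displacement Error by 58.6% relative to the strongest baseline.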

Abstract

Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureFormer, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureFormer encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureFormer reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.
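
A minimal sketch of how actions made of a discrete direction and a continuous magnitude could be rolled out into pixel-space waypoints. The number of direction bins, the function names, and the sample actions are assumptions for illustration; the abstract does not specify them, and this is not the learned policy itself.

```python
import math

NUM_DIRECTIONS = 8  # assumed number of direction bins; the abstract does not give one

def apply_action(point, direction_bin, magnitude):
    """Move a pixel-space point by `magnitude` pixels along a discrete direction."""
    angle = 2.0 * math.pi * direction_bin / NUM_DIRECTIONS
    x, y = point
    return (x + magnitude * math.cos(angle), y + magnitude * math.sin(angle))

def rollout(start, actions):
    """Autoregressively apply (direction_bin, magnitude) actions from `start`."""
    waypoints = [start]
    for direction_bin, magnitude in actions:
        # Each new waypoint depends on the previous one, as in step-by-step motion.
        waypoints.append(apply_action(waypoints[-1], direction_bin, magnitude))
    return waypoints

# Hypothetical actions: 5 px along bin 0, 3 px along bin 2, 1 px along bin 4.
path = rollout((100.0, 50.0), [(0, 5.0), (2, 3.0), (4, 1.0)])
```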

2603.20216 2026-05-19 cs.CL cs.AI cs.LG

Locally Coherent Parallel Decoding in Diffusion Language Models

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

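AI summary: This paper proposes CoDiLA, which adds a small auxiliary autoregressive model to address coherence problems in the parallel decoding of diffusion language models, improving the accuracy-speed trade-off on code generation tasks.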

Comments Accepted at ICML 2026

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
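
The failure mode of independent sampling can be seen in a toy example. The sketch below assumes a hypothetical joint distribution over two tokens produced in the same step (matching bracket pairs) and measures how much probability independent per-position sampling places on pairs the joint distribution never produces. The distribution and variable names are illustrative assumptions.

```python
from itertools import product

# Hypothetical joint distribution over two tokens generated in the same step.
joint = {("(", ")"): 0.5, ("[", "]"): 0.5}

def marginals(joint):
    """Per-position marginal distributions of a two-token joint distribution."""
    first, second = {}, {}
    for (a, b), p in joint.items():
        first[a] = first.get(a, 0.0) + p
        second[b] = second.get(b, 0.0) + p
    return first, second

first, second = marginals(joint)

# Independent sampling draws each position from its own marginal, so the
# probability of a pair is the product of the two marginal probabilities.
independent = {(a, b): first[a] * second[b] for a, b in product(first, second)}

# Probability mass that lands on pairs the joint distribution never produces.
incoherent_mass = sum(p for pair, p in independent.items() if pair not in joint)
```

In this toy case half of the independently sampled pairs are mismatched brackets, which is the kind of broken multi-token structure the abstract describes.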

2603.17751 2026-05-19 cs.RO cs.SY eess.SY

Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow

Jianghong Dong, Chunying Yang, Mengchi Cai, Chaoyi Chen, Qing Xu, Jianqiang Wang, Jiawei Wang, Keqiang Li

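AI summary: This paper proposes MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a testbed for studying complex interactions between connected and autonomous vehicles (CAVs) and human-driven vehicles (HDVs) in mixed traffic. Its Mixed Digital Twin concept combines Mixed Reality with Digital Twin to improve experimental flexibility and scalability.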

Journal ref 2026 in Journal of Intelligent and Connected Vehicles

Abstract

In emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. In particular, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.

2603.17577 2026-05-19 cs.LG cs.AI stat.ML

Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

Felix Schur

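AI summary: This paper studies whether latent actions and environment dynamics can be recovered from offline trajectories when actions are never observed. Under a demonstrator-diversity assumption, it proves that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels when certain conditions hold, establishing demonstrator diversity as a principled source of identifiability for offline RL data.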

Abstract

Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.
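
The mixture structure can be made concrete for one fixed state. In the sketch below, `T` holds hypothetical latent action-conditioned transition probabilities and `Pi` holds hypothetical demonstrator policies. Their product is the observable conditional distribution, and every column of it sums to one. All numbers and names are assumptions for illustration; the sketch shows only the forward factorization, not the identifiability argument.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def column_sums(M):
    """Sum of each column of a matrix given as a list of rows."""
    return [sum(row[j] for row in M) for j in range(len(M[0]))]

# Hypothetical numbers for one fixed state o_t.
# T[o'][a]: probability of next observation o' under latent action a.
T = [[0.9, 0.1],
     [0.1, 0.2],
     [0.0, 0.7]]
# Pi[a][e]: probability that demonstrator e picks latent action a.
Pi = [[1.0, 0.5, 0.2],
      [0.0, 0.5, 0.8]]

# P[o'][e] = sum_a T[o'][a] * Pi[a][e], the observable p(o_{t+1} | o_t, e).
P = matmul(T, Pi)
```

Only `P` is observable from action-free data; the identifiability question is when `T` and `Pi` can be recovered from it up to relabeling of the latent actions.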

2603.14936 2026-05-19 cs.CV

Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang

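AI summary: This paper proposes a Hierarchical Relevance Feedback-Driven (HRFD) framework that aligns multi-dimensional features in text-to-image diffusion models, addressing the gap between user intent and its expression and improving inference of multi-dimensional preferences.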

Abstract

Users often possess a clear visual intent but struggle to articulate it precisely in language. This intention-expression gap makes aligning generated images with latent visual preferences a fundamental challenge in text-to-image diffusion models. Existing methods either require model training, sacrificing flexibility, or rely on textual feedback, imposing a heavy cognitive burden. Although recent training-free methods use click-based binary preference feedback to reduce user effort, they force Foundation Models (FMs) to infer preferences at the semantic level. When faced with multi-dimensional preferences, FMs suffer from inference overload and fail to identify exact preferred feature values under conflicting user signals. Consequently, a flexible framework for multi-dimensional feature alignment remains absent. To address this, we propose a Hierarchical Relevance Feedback-Driven (HRFD) framework. Recognizing that multiple features struggle to converge simultaneously, HRFD organizes them into a three-tier hierarchy and adapts relevance feedback to enforce coarse-to-fine convergence, minimizing cognitive load. To bypass FM inference overload, HRFD decouples the process into independent single-feature preference inference tasks. Furthermore, to overcome FMs' failure in identifying preferred values, HRFD employs statistical inference to quantify the distribution divergence of features between "liked" and "disliked" image sets, achieving robust and transparent preference measurement. Crucially, HRFD operates entirely within the external text space, remaining strictly training-free and model-agnostic. Extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches.

2603.14371 2026-05-19 cs.RO cs.AI

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

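AI summary: This paper proposes OxyGen, a unified KV cache management approach that improves VLA inference efficiency under multi-task parallelism. Cross-task KV sharing and cross-frame continuous batching reduce redundant computation and resource contention, yielding higher language throughput and action frequency on device.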

Comments Preprint

Abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $π_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.

2603.13708 2026-05-19 cs.CV

RSEdit: Text-Guided Image Editing for Remote Sensing

Chen Zhenyuan, Zhang Zechuan, Zhang Feng

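AI summary: This paper proposes RSEdit, a family of generative models for text-guided remote sensing image editing. By studying conditioning strategies for building editing models from text-to-image models, it produces instruction-faithful edits while preserving geospatial structure.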

Comments accepted by IEEE GRSL

Abstract

In this paper, we explore text-guided image editing in the remote sensing domain using generative modeling. We propose RSEdit, a collection of models from U-Net to DiT with various configurations. Specifically, we present the first comprehensive study of conditioning strategies for building image editing models from off-the-shelf text-to-image ones. Our experiments show that RSEdit achieves the best instruction-faithful edits while preserving geospatial structure. We release the code at https://github.com/Bili-Sakura/RSEdit-Preview and checkpoints at https://huggingface.co/collections/BiliSakura/rsedit.

2603.09668 2026-05-19 cs.CV

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng Cui

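AI summary: This paper proposes DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Wind is represented as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM), which enables reconstruction of wind-driven object dynamics. The paper also introduces the WD-Objects dataset and reports experiments in which the method significantly outperforms prior dynamic scene modeling approaches in reconstruction accuracy and simulation fidelity.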

Comments Accepted by ICLR 2026. Project page: https://zju3dv.github.io/DiffWind/

Abstract

Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.

2603.09405 2026-05-19 cs.CV

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

Zhe Li, Xiaoyu Ding, Jiaxin Zheng, Yongtao Wang

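AI summary: This paper proposes YOLO-NAS-Bench, a surrogate benchmark for YOLO-style detectors whose predictor is sharpened by a self-evolving mechanism, enabling efficient evaluation in YOLO architecture search.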

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

Abstract

Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's $R^2$ from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures. The code is available at https://github.com/VDIGPKU/YOLO-NAS-Bench.
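
Ranking consistency of a surrogate predictor is commonly measured with a rank correlation. The sketch below computes the plain Kendall tau (tau-a) between predicted and measured scores. It is a stand-in for illustration only: the abstract reports a "Sparse Kendall Tau" variant whose exact definition is not given here, and the scores below are hypothetical.

```python
def kendall_tau(predicted, actual):
    """Kendall rank correlation (tau-a) between two equally long score lists."""
    n = len(predicted)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant when both lists order it the same way.
            s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical surrogate predictions and measured accuracies for five architectures.
predicted_map = [0.30, 0.35, 0.33, 0.40, 0.38]
measured_map = [0.29, 0.36, 0.37, 0.41, 0.39]
tau = kendall_tau(predicted_map, measured_map)
```

Here nine of the ten pairs are ordered consistently and one is swapped, so tau is 0.8.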

2603.08462 2026-05-19 cs.LG

Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi

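AI summary: This paper casts efficient reasoning as a lossy compression problem under the information bottleneck principle. It introduces the Conditional Information Bottleneck (CIB) principle to close a theoretical gap that arises when the naive information bottleneck is applied to transformers, and uses a semantic prior to compress reasoning more efficiently, improving accuracy at moderate compression and reducing inference cost.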

Abstract

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods reduce cost via fine-tuning with heuristic length penalties, suppressing both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace $Z$ acts as a computational bridge that contains only the information about the response $Y$ that is not directly accessible from the prompt $X$. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting approaches, we introduce a semantic prior that measures token cost by surprisal under a language model. Crucially, the prior is queried only for token-level log-probabilities, adding negligible overhead to the training loop. Empirically, our CIB objective prunes reasoning redundancy while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop. These gains generalize across model families and task domains, confirming CIB as a domain-agnostic CoT compression framework.
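
A rough sketch of the idea that token cost can be measured by surprisal under a prior, and that a uniform prior reduces this to a length penalty. The reward form `task_reward - beta * cost`, the weight `beta`, the vocabulary size, and the log-probabilities are assumptions chosen for illustration; the abstract does not state the exact objective.

```python
import math

def surprisal_cost(token_logprobs):
    """Total surprisal (negative log-probability, in nats) of a reasoning trace."""
    return -sum(token_logprobs)

def shaped_reward(task_reward, token_logprobs, beta):
    """Task reward minus a compression penalty weighted by `beta` (assumed form)."""
    return task_reward - beta * surprisal_cost(token_logprobs)

# Under a uniform prior over a vocabulary of size V every token costs log V,
# so the penalty is proportional to the trace length (a length penalty).
V = 1000
trace_length = 12
uniform_logprobs = [-math.log(V)] * trace_length
uniform_cost = surprisal_cost(uniform_logprobs)

# Hypothetical log-probabilities from a prior language model: predictable
# tokens are cheap, surprising ones are expensive.
semantic_logprobs = [math.log(0.9), math.log(0.8), math.log(0.01)]
semantic_cost = surprisal_cost(semantic_logprobs)
```

With the semantic prior, almost all of the cost of the three-token trace comes from the single low-probability token, whereas token counting would charge each token equally.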

2603.08290 2026-05-19 cs.LG cs.AI

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon, Dongkuk Si, Chulhee Yun

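AI summary: This study examines the implicit bias of Sharpness-Aware Minimization (SAM) on linearly separable binary classification. It finds that SAM behaves differently at depth L=2 than at depth L=1 and exhibits a phenomenon called sequential feature amplification.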

Comments Accepted to ICLR 2026, 84 pages, 35 figures

详情
AI中文摘要

我们研究了在训练L层线性对角网络时,sharpness-aware minimization (SAM) 的隐式偏见。对于线性模型(L=1),ℓ∞-SAM和ℓ2-SAM都能恢复ℓ2最大间隔分类器,与梯度下降(GD)一致。然而,对于深度L=2,行为发生剧烈变化——即使在单例数据集上。对于ℓ∞-SAM,极限方向依赖于初始化,并可能收敛到零向量或任何标准基向量,与GD的极限方向形成鲜明对比。对于ℓ2-SAM,我们证明其极限方向与GD的ℓ1最大间隔解一致,但有限时间动态表现出我们称之为“顺序特征放大”的现象,即预测器最初依赖于次要坐标,然后逐渐转向更大的坐标。我们的理论分析将这种现象归因于ℓ2-SAM在扰动中应用的梯度归一化因子,该因子在早期放大次要坐标,允许主要坐标在后期主导。合成和真实数据实验验证了我们的发现。

英文摘要

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
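
For orientation, a minimal sketch of one SAM update on a 2-layer diagonal linear network with a single training example. It uses the standard SAM recipe: perturb the weights along the gradient on a ball of radius `rho` (normalized gradient for the $\ell_2$ ball, gradient sign for the $\ell_\infty$ ball), then descend using the gradient at the perturbed point. The parameterization `beta = u * v`, the logistic loss, and all numeric values are assumptions, and the sketch does not reproduce the paper's analysis.

```python
import math

def loss_grad(params, x, y):
    """Gradient of log(1 + exp(-y * <u * v, x>)) with respect to (u, v)."""
    u, v = params
    margin = y * sum(ui * vi * xi for ui, vi, xi in zip(u, v, x))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. <beta, x>
    grad_u = [coeff * vi * xi for vi, xi in zip(v, x)]
    grad_v = [coeff * ui * xi for ui, xi in zip(u, x)]
    return grad_u, grad_v

def sam_perturbation(grads, rho, norm="l2"):
    """Ascent perturbation on the rho-ball of the chosen norm."""
    if norm == "l2":
        scale = math.sqrt(sum(g * g for group in grads for g in group))
        if scale == 0.0:
            return [[0.0 for _ in group] for group in grads]
        # The gradient normalization factor: divide by the full gradient norm.
        return [[rho * g / scale for g in group] for group in grads]
    # l_inf ball: every coordinate moves by rho along the sign of its gradient.
    return [[rho * math.copysign(1.0, g) if g != 0.0 else 0.0 for g in group]
            for group in grads]

def sam_step(params, x, y, lr, rho, norm="l2"):
    """One SAM update: gradient at the perturbed point, applied to the original point."""
    grads = loss_grad(params, x, y)
    eps = sam_perturbation(grads, rho, norm)
    perturbed = tuple([p + e for p, e in zip(group, e_group)]
                      for group, e_group in zip(params, eps))
    grads_at_perturbed = loss_grad(perturbed, x, y)
    return tuple([p - lr * g for p, g in zip(group, g_group)]
                 for group, g_group in zip(params, grads_at_perturbed))

# Hypothetical single example with one major and one minor coordinate.
params = ([0.1, 0.1], [0.1, 0.1])
x, y = [1.0, 0.2], 1.0
new_params = sam_step(params, x, y, lr=0.1, rho=0.05, norm="l2")
```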

2603.07900 2026-05-19 cs.AI

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

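AI summary: This paper proposes EveryQuery, an electronic health record foundation model that achieves zero-shot clinical prediction through task-conditioned pretraining. It directly estimates the likelihood of an outcome occurring in a future window rather than generating future events, and outperforms an autoregressive baseline on most of the sampled prediction tasks.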

Abstract

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

2603.04727 2026-05-19 cs.CV cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

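AI summary: This paper evaluates the zero-shot anomaly detection performance of multimodal large language models in real-world settings. It finds a conservative bias; class-specific instructions improve the F1-score, but recall remains the key bottleneck.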

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
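
The link between a conservative bias and a low F1-score follows directly from the metric definitions. The sketch below uses hypothetical confusion counts, not results from the paper, for a detector that almost always answers "normal".

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for a conservative detector: it rarely predicts
# "anomalous", so it makes no false alarms but misses most anomalies.
precision, recall, f1 = precision_recall_f1(tp=5, fp=0, fn=95)
```

Precision is perfect here, yet F1 stays below 0.1 because it is the harmonic mean of precision and recall and is pulled toward the smaller of the two.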

2603.04161 2026-05-19 cs.CL

Traces of Social Competence in Large Language Models

Tom Kouwenhoven, Michiel van der Meer, Max van Duijn

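AI summary: This paper studies how large language models perform on the False Belief Test, using Bayesian logistic regression to analyse how model size and post-training affect socio-cognitive competence. Scaling model size helps performance, though not strictly; explicating propositional attitudes alters response patterns, and further reasoning-oriented fine-tuning amplifies this effect.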

Comments Presented at the 2026 Conference on Computational Natural Language Learning (CoNLL)

English abstract

The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.

2602.22941 2026-05-19 cs.CV

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs

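AI summary: This paper presents a method for reconstructing the velocity and stroke rate of canoe sprint team boats from panned and zoomed video recordings. It uses YOLOv8 to detect buoys and athletes, estimates homographies from the known buoy grid, estimates the boat position via a U-Net-based boat tip calibration, tracks robustly with optical flow, and finally extracts stroke rate. Experiments report a MAPE of 0.011 for velocity and 0.009 for stroke rate, giving accurate, automated feedback.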

English abstract

Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalize the estimation of the boat position by learning a boat-specific athlete offset using a U-Net-based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity MAPE of 0.011 [0.008 0.014] (Spearman rho=0.974) and a stroke rate MAPE of 0.009 [0.006 0.013] (Spearman rho=0.975). The methods provide coaches with highly accurate, automated feedback with minimal manual initialization work required, and without requiring sensors.
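
The evaluation metric above, mean absolute percentage error (MAPE), can be sketched as follows. The velocity samples are hypothetical, and the function returns a fraction (0.011 would mean 1.1%); whether the paper reports fractions or percentages is not stated in the abstract.

```python
def mape(reference: list[float], estimate: list[float]) -> float:
    """Mean absolute percentage error, returned as a fraction of the reference."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    return sum(abs(e - r) / abs(r) for r, e in zip(reference, estimate)) / len(reference)


# Hypothetical boat velocities in m/s: GPS reference vs. video-based estimate.
gps_velocity = [5.00, 5.20, 5.10, 4.90]
video_velocity = [5.05, 5.15, 5.16, 4.85]

error = mape(gps_velocity, video_velocity)
print(f"velocity MAPE: {error:.4f}")
```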

2602.18217 2026-05-19 cs.CL

Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox

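AI summary: This paper proposes an information-theoretic measure of storage cost that quantifies how much contextual information must be maintained during sentence comprehension. The cost is estimated with neural language models and validated in English: it recovers processing asymmetries in center embeddings and relative clauses and predicts reading-time variance.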

Comments Accepted to CoNLL 2026

English abstract

Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.

2602.12703 2026-05-19 cs.LG

SWING: Unlocking Implicit Graph Representations for Graph Random Features

Alessandro Manenti, Avinava Dubey, Arijit Sehanobish, Cesare Alippi, Krzysztof Choromanski

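AI summary: SWING walks in a continuous space rather than over graph nodes, which enables efficient computation of Graph Random Features on implicitly represented graphs (i-graphs). Its core is a customized Gumbel-softmax sampling mechanism that combines random features with importance sampling, so no explicit graph has to be materialized.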

English abstract

We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (i-graphs), where edge weights are defined as bi-variate functions of feature vectors in the corresponding nodes. These classes of graphs include several prominent examples, such as ε-neighborhood graphs, which are used regularly in machine learning. Rather than conducting walks on the graphs' nodes, these methods rely on walks in the continuous spaces in which the graphs are embedded. To accurately and efficiently approximate the original combinatorial calculations, SWING applies a customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This algorithm is of independent interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide a detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.
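
For readers unfamiliar with Gumbel-softmax sampling, here is a minimal sketch of the plain (not SWING's customized) mechanism: adding Gumbel noise to log-weights and taking a temperature-scaled softmax yields a relaxed one-hot vector whose argmax is distributed according to the weights. The edge weights, temperature, and function name are illustrative assumptions.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Return a relaxed one-hot vector: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-300)          # guard against log(0)
        gumbel = -math.log(-math.log(u))       # Gumbel(0, 1) via inverse transform
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                           # subtract the max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
weights = [0.5, 0.3, 0.2]                      # hypothetical normalized edge weights
logits = [math.log(w) for w in weights]

counts = [0] * len(weights)
num_draws = 20000
for _ in range(num_draws):
    y = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
    counts[y.index(max(y))] += 1               # argmax recovers a Gumbel-max sample

freqs = [c / num_draws for c in counts]
print(freqs)                                   # empirical frequencies track `weights`
```

Lower values of `tau` push each sample closer to a hard one-hot choice, while higher values give smoother vectors.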

2602.12015 2026-05-19 cs.CL

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Angelo Ziletti, Leonardo D'Ambrosi

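AI summary: This paper proposes CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) to separate two distinct causes of output diversity, input ambiguity and model instability, and improves failure prediction on a clinical Text-to-SQL benchmark.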

Journal ref
Proceedings of the 7th Clinical Natural Language Processing Workshop 2026

English abstract

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
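
As background on the Schur complement step, the sketch below eliminates one side of a bipartite graph. The abstract does not say which matrix or which block CLUES uses, so this example rests on assumptions: it takes the graph Laplacian of a hypothetical interpretation-answer graph and eliminates the interpretation nodes, leaving an effective coupling matrix over answers. The function name and weights are illustrative.

```python
def schur_complement_bipartite(B):
    """Eliminate interpretation nodes from a bipartite graph Laplacian.

    B[i][j] >= 0 is the weight between interpretation i and answer j.
    The Laplacian is [[D_I, -B], [-B^T, D_A]] with diagonal degree blocks,
    so the Schur complement over answers is S = D_A - B^T D_I^{-1} B.
    """
    m, n = len(B), len(B[0])
    row_deg = [sum(B[i]) for i in range(m)]                       # diagonal of D_I
    col_deg = [sum(B[i][j] for i in range(m)) for j in range(n)]  # diagonal of D_A
    if any(d == 0 for d in row_deg):
        raise ValueError("every interpretation needs at least one edge")
    S = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            coupling = sum(B[i][j] * B[i][k] / row_deg[i] for i in range(m))
            S[j][k] = (col_deg[j] if j == k else 0.0) - coupling
    return S


# Hypothetical graph: interpretation 0 supports answers 0 and 1,
# interpretation 1 supports only answer 2.
B = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
S = schur_complement_bipartite(B)
for row in S:
    print(row)
```

In this toy graph, answers 0 and 1 become coupled because they share an interpretation, while answer 2 stays isolated.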

2602.09805 2026-05-19 cs.CL cs.AI cs.LG

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

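AI summary: This paper proposes a trace-optional evaluation protocol that decomposes the token efficiency of large language models into three observables: completion rate, conditional correctness, and generated length. Where workload metadata is available, generated length is normalized by it, and the protocol is used to assess reasoning efficiency and verbosity across tasks.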

Comments Preprint (under review). 29 pages, 4 figures

English abstract

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
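
One way to read the three-observable decomposition is sketched below. It assumes token efficiency means correct answers per generated token, so that efficiency = completion rate × conditional correctness / mean generated length. This identity, the `Run` record, and the sample data are assumptions made for illustration; the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool  # a final answer was produced (no truncation)
    correct: bool    # the final answer is correct; only meaningful if completed
    tokens: int      # generated length


def decompose(runs):
    """Return (completion rate, conditional correctness, mean length, efficiency)."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    done = [r for r in runs if r.completed]
    completion_rate = len(done) / n
    conditional_correctness = sum(r.correct for r in done) / len(done) if done else 0.0
    mean_length = sum(r.tokens for r in runs) / n
    efficiency = completion_rate * conditional_correctness / mean_length
    return completion_rate, conditional_correctness, mean_length, efficiency


# Hypothetical runs: one is truncated at the context limit, one completes but is wrong.
runs = [
    Run(True, True, 400),
    Run(True, False, 600),
    Run(False, False, 2000),
    Run(True, True, 1000),
]
cr, cc, ml, eff = decompose(runs)
print(cr, cc, ml, eff)
```

Under this reading, the product equals the number of correct answers divided by the total tokens generated, and each factor points to a different failure mode: low completion rate suggests truncation, low conditional correctness suggests faulty logic, and high mean length suggests verbosity.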

2602.08206 2026-05-19 cs.CV

Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

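AI summary: This paper proposes GR-CoT, a geospatial-reasoning-driven framework for open-vocabulary semantic segmentation in remote sensing. An offline knowledge distillation stream and an online instance reasoning stream address semantic ambiguity, improving segmentation performance and semantic coherence in complex scenes.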

Comments 5 pages, 3 figures

English abstract

Open-vocabulary semantic segmentation has become an important direction in remote sensing, as it enables recognition beyond predefined land-cover categories. However, existing methods mainly depend on passive visual-text matching and often struggle with semantic ambiguity in geographically complex scenes, especially when different classes exhibit similar spectral or structural patterns. To address this issue, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework for remote sensing open-vocabulary semantic segmentation. GR-CoT consists of an offline knowledge distillation stream and an online instance reasoning stream. The former constructs category interpretation standards for confusing classes, while the latter performs macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis to generate an image-adaptive vocabulary for downstream segmentation. Experiments on the LoveDA and GID5 benchmarks indicate that the proposed framework improves overall segmentation performance and yields more semantically coherent predictions in complex scenes.

2602.07618 2026-05-19 cs.LG stat.ML

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger, Stefanie Jegelka, Ron Levie

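AI summary: This study examines the approximation capabilities of dense neural networks and shows that, under natural constraints on weights and on input and output dimensions, densely connected ReLU networks cannot approximate every continuous function. This exposes an intrinsic limitation of networks with dense layers and motivates sparse connectivity as a necessary ingredient for true universality.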

English abstract

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

2602.06866 2026-05-19 cs.LG

T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh

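AI summary: This paper proposes T-STAR, whose two-stage structure separates consistent demand patterns from short-term fluctuations to improve short-term probabilistic demand forecasting. Experiments show that it outperforms existing methods in both deterministic and probabilistic accuracy and is robust across stations and time periods.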

Comments This work has been submitted to Transportation Research Part C

English abstract

Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.'s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR's ability to transfer to previously unseen service areas without retraining. These results underscore the framework's potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.

2602.05156 2026-05-19 cs.RO cs.SY eess.SY

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

Dong Ho Kang, Aaron Kim, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

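AI summary: This paper presents the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, an embedded distal phalanx, and compliant pulp to shape contact behavior. A strain-energy-based bending-indentation model guides the fingertip design and explains how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, fingernail-mediated dorsal-contact force transmission, and proprioceptive observability, along with successful edge-sensitive manipulation tasks such as paper singulation, card picking, and orange peeling. The results indicate that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism offers a principled approach to precise manipulation.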

English abstract

We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, an embedded distal phalanx, and compliant pulp to shape contact behavior during manipulation. By mechanically organizing how contact is initiated, supported, and transmitted at the fingertip, this structure creates stable and task-relevant contact conditions across diverse object geometries and grasp orientations. We develop a strain-energy-based bending-indentation model to guide the fingertip design and to explain how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, improved fingernail-mediated dorsal-contact force transmission and proprioceptive observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. These results show that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism provides a principled approach to precise manipulation. Our project page is at: https://platohand.github.io

2602.03797 2026-05-19 cs.LG

Manifold Random Features

Ananya Parashar, Derek Long, Dwaipayan Saha, Krzysztof Choromanski

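AI summary: This paper proposes a new method that uses a discretization of the manifold and the recently introduced Graph Random Features (GRFs) to learn continuous fields on manifolds, and thereby approximates bi-variate functions (in particular, kernels) defined on general manifolds. The method yields positive and bounded features, which are key to accurate, low-variance approximation.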

English abstract

We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show a deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we re-discover a recently introduced mechanism of Gaussian kernel approximation, applied in particular to improve linear-attention Transformers, by considering simple random walks on graphs and bypassing the original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify it in thorough experimental studies.
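
To make the idea of positive random features concrete, the sketch below uses a standard Euclidean construction for the Gaussian kernel, not the MRF algorithm on manifolds. With w drawn from a standard normal, the feature exp(w·x - |x|²) is strictly positive and the expected product of two such features equals exp(-|x - y|²/2). The dimension, feature count, and test points are illustrative assumptions.

```python
import math
import random


def positive_random_features(x, omegas):
    """phi(x)_k = exp(w_k . x - |x|^2) / sqrt(m); every entry is strictly positive."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm)
        for w in omegas
    ]


def gaussian_kernel(x, y):
    """Exact Gaussian kernel exp(-|x - y|^2 / 2)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)
num_features, dim = 20000, 2
omegas = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

x, y = [0.3, 0.1], [0.2, -0.1]
phi_x = positive_random_features(x, omegas)
phi_y = positive_random_features(y, omegas)

approx = sum(a * b for a, b in zip(phi_x, phi_y))  # dot product of feature maps
exact = gaussian_kernel(x, y)
print(approx, exact)
```

Because the kernel is replaced by a dot product of feature vectors, kernel matrix products can be computed without forming the full pairwise matrix, which is the property linear-attention methods exploit.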

2602.03664 2026-05-19 cs.AI cs.LG

Mitigating Conversational Inertia in Multi-Turn Agents

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

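AI summary: This paper studies conversational inertia in multi-turn agents and proposes Context Preference Learning to calibrate model preferences, reducing inertia and improving performance.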

Comments ICML2026

English abstract

Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multi-turn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

2601.19667 2026-05-19 cs.CL cs.AI cs.IR cs.LG

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier

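AI summary: SynCABEL uses large language models to generate context-rich synthetic training examples, addressing the scarcity of expert-annotated data in supervised biomedical entity linking, and sets new state-of-the-art results on three multilingual benchmarks.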

Comments 7 pages, 5 figures

English abstract

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

2601.16880 2026-05-19 cs.LG cs.IT math.IT

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

Bethan Evans, Jared Tanner

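AI summary: This paper derives the minimal-norm weight perturbations a deep network needs to achieve a specified change in output and discusses what determines their size. It applies the result to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and shows empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy.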

English abstract

The minimal-norm weight perturbations of DNNs required to achieve a specified change in output are derived, and the factors determining their size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz-constant-based robustness guarantees; both are observed to be of the same order, which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and it is shown empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.
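
The simplest instance of a minimal-norm weight perturbation can be worked out for a single linear layer y = Wx, which is an assumption made here for illustration; the abstract does not give the paper's formulae. To shift the output by δ for a fixed input x, the smallest Frobenius-norm update is the rank-one matrix δxᵀ/|x|², whose norm is |δ|/|x|. The input and target shift below are hypothetical.

```python
import math


def minimal_perturbation(x, delta):
    """Smallest Frobenius-norm dW with dW @ x == delta: delta x^T / |x|^2."""
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0:
        raise ValueError("x must be non-zero")
    return [[d * v / sq_norm for v in x] for d in delta]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


x = [3.0, 4.0]        # hypothetical layer input, |x| = 5
delta = [1.0, -2.0]   # hypothetical desired output shift

dW = minimal_perturbation(x, delta)
shift = matvec(dW, x)                                         # reproduces delta
fro_norm = math.sqrt(sum(v * v for row in dW for v in row))   # equals |delta| / |x|
print(shift, fro_norm)
```

The update is rank one and its size scales inversely with the input norm, so inputs with large activations need smaller weight changes to produce the same output shift.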

2601.14330 2026-05-19 cs.CV cs.LG

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

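AI summary: This paper proposes LURE, which reconstructs the latent space and guides the sampling trajectory to achieve high-fidelity reawakening of multiple erased concepts, addressing the gradient conflicts and feature entanglement that naive reconstruction causes in multi-concept scenarios.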

English abstract

Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

2601.13839 2026-05-19 cs.CV

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli

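AI summary: This paper introduces DisasterVQA, a dataset for perception and reasoning in disaster scenes with 1,395 real-world images and 4,405 expert-curated question-answer pairs. It benchmarks seven state-of-the-art vision-language models for disaster response and finds that they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation.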

Comments Accepted at ICWSM 2026

English abstract

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.