2606.00109 2026-06-02 cs.CV cs.AI cs.LG

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

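Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

Affiliations: Tianjin University

AI summary: Proposes VDSB-GWSyn, a framework based on a Diffusion Schrödinger Bridge that uses a shape prior and vessel segmentation constraints to generate controllable, high-fidelity guidewire samples, substantially improving downstream guidewire endpoint localization accuracy.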

Comments Early accept to MICCAI 2026

Abstract

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01 px to 7.71 px and increasing PCK at 3 px from 52.63% to 86.27%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.
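The two localization metrics quoted in the abstract can be made concrete with a small example. This is a minimal sketch, assuming MPE means the mean Euclidean pixel error between predicted and ground-truth endpoints and PCK at 3 px means the fraction of predictions within 3 pixels of the ground truth; the abstract does not define either metric, and the function names are hypothetical.

```python
import math

def endpoint_errors(pred, truth):
    """Euclidean distance in pixels between each predicted and true endpoint."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]

def mean_pixel_error(pred, truth):
    """Assumed reading of MPE: the average endpoint distance in pixels."""
    errs = endpoint_errors(pred, truth)
    return sum(errs) / len(errs)

def pck(pred, truth, threshold_px=3.0):
    """Assumed reading of PCK: share of endpoints within threshold_px."""
    errs = endpoint_errors(pred, truth)
    return sum(e <= threshold_px for e in errs) / len(errs)

# Toy endpoints (x, y) in pixels; the distances are 2, 5 and 10.
pred = [(10.0, 10.0), (53.0, 44.0), (100.0, 100.0)]
truth = [(10.0, 12.0), (50.0, 40.0), (100.0, 110.0)]
print(mean_pixel_error(pred, truth))
print(pck(pred, truth))
```

Under these definitions, one distant outlier raises MPE sharply but changes PCK by only one sample, which may be why the abstract reports both.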

2606.00105 2026-06-02 cs.CV cs.AI

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

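Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang, Shu Wu

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, UCAS

AI summary: Proposes Visual-Noise Guided In-Context Distillation (VGID), which builds a teacher distribution through dual-modal intervention and distills it to achieve parameter-level unlearning in multimodal large language models, balancing unlearning effectiveness against model utility.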

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.

2606.00104 2026-06-02 cs.RO cs.AI

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

Erdem Uysal, Timo Kehrer, Sebastiano Panichella

Affiliations: Institute of Computer Science, University of Bern; AI4I - The Italian Institute of Artificial Intelligence

AI summary: Proposes an LLM-based planner-executor agent architecture that decouples high-level mission planning from low-level control and adds a constraint enforcement layer and bounded replanning, giving UAVs explainable, constrainable autonomous flight.

Comments Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

Abstract

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain and constrain and that require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
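The constraint enforcement layer and bounded replanning described in the abstract can be illustrated with a short sketch. It is not the PEACE implementation: the circular geofence, the `Limits` fields, the `send` and `replan` callbacks, and the choice to replan on a constraint violation as well as on an action failure are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    min_alt_m: float
    max_alt_m: float
    fence_center: tuple      # (x, y) in metres, local frame (assumed)
    fence_radius_m: float    # circular horizontal geofence (assumed shape)

def check_waypoint(x, y, alt, limits):
    """Return the list of violated constraints; empty means allowed."""
    violations = []
    if not (limits.min_alt_m <= alt <= limits.max_alt_m):
        violations.append("altitude")
    dx, dy = x - limits.fence_center[0], y - limits.fence_center[1]
    if (dx * dx + dy * dy) ** 0.5 > limits.fence_radius_m:
        violations.append("geofence")
    return violations

def execute_plan(plan, limits, send, replan, max_replans=2):
    """Check every step before sending it; replan at most max_replans times."""
    replans = 0
    queue = list(plan)
    executed = []
    while queue:
        step = queue.pop(0)
        ok = not check_waypoint(*step, limits) and send(step)
        if ok:
            executed.append(step)
            continue
        if replans >= max_replans:
            return executed, False        # give up: replanning budget spent
        replans += 1
        queue = list(replan(step, queue))  # planner proposes a new remainder
    return executed, True

limits = Limits(1.0, 30.0, (0.0, 0.0), 50.0)
plan = [(0.0, 0.0, 5.0), (100.0, 0.0, 5.0)]   # second step leaves the fence
result = execute_plan(plan, limits, send=lambda s: True,
                      replan=lambda failed, rest: [(40.0, 0.0, 5.0)] + list(rest))
print(result)
```

Keeping the check outside the planner means a rejected step never reaches the vehicle, and the replanning cap bounds how many extra planner calls a single mission can trigger.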

2606.00103 2026-06-02 cs.AI

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

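Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

Affiliations: East China Normal University; Ant Group

AI summary: Proposes a multi-turn interactive reasoning evaluation framework that treats reasoning as active evidence acquisition and belief updating, and benchmarks large language models on 474 executable games for success rate, interaction efficiency, contextual robustness, and metacognitive adaptation.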

Comments preprint version, under review

Abstract

We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this framework, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.

2606.00102 2026-06-02 cs.AI math.PR

On the evolution of the concept of probability as a mirror of the evolution of reason

Jean-Louis Le Mouël, Vincent Courtillot, Dominique Gibert, Vladimir Kossobokov, Jean-Baptiste Boulé, Pierpaolo Zuddas, Fernando Lopes, Païkan Marccagi, Alexis Maineult

Affiliations: Académie des Sciences, Institut de France, Paris, France; DeepField Sensing, France; Institute of Earthquake Prediction Theory and Mathematical Geophysics, Russian Academy of Sciences, Moscow, Russia; Accademia Nazionale delle Scienze detta dei XL, Roma, Italia; Muséum National d’Histoire Naturelle, CNRS UMR7196, INSERM U1154, Paris, France; Sorbonne Université, CNRS, METIS, UMR7619, Paris, France; Laboratoire de Géologie de l’ENS, UMR 8538, Paris, France

AI summary: Reads the development of probability theory, from a historical and epistemological perspective, as a transformation of rationality itself, and discusses the roles and limits of probability, fuzzy logic, and deep learning in scientific rationality.

Comments 44 pages, 7 figures

Abstract

Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertainty. This article interprets that evolution not merely as a mathematical history, but as a transformation of rationality itself. From Pascal and Fermat's combinatorial symmetry to the inductive logic of Bayes and Laplace, from Poisson's statistics of events to Kolmogorov's axiomatic formalization, probability progressively incorporated uncertainty, time, and coherence into scientific judgment. This trajectory reaches a mature epistemological form in modern Bayesian inference, especially in Tarantola's view of probability as a logic of information, where prior knowledge and data are combined coherently. Yet this framework also exposes a limit: probability quantifies uncertainty about well-defined propositions, but does not by itself formalize the vagueness of the concepts used to describe them. The article therefore examines how rationality extends beyond probability. Fuzzy logic is presented as a rigorous language for graded meaning and qualitative judgment, while deep learning is analyzed as a distinct, powerful mode of prediction based on geometric interpolation and optimization rather than explicit inference. By situating probability, fuzzy logic, and deep learning in a common historical and epistemological perspective, the article clarifies their roles and limits. It argues that contemporary scientific rationality cannot be reduced to data-driven performance alone, but requires the explicit articulation of uncertainty, vagueness, and inference.

2606.00101 2026-06-02 cs.CV cs.AI

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng

Affiliations: School of Informatics, Xiamen University; China Academy of Information and Communications Technology; AI Transcend Pte. Ltd.

AI summary: To address existing datasets' reliance on low-quality open-source models and watermarked commercial samples, proposes CoCoVideo-26K, a contrastive dataset covering 13 commercial generators, and CoCoDetect, a detection framework combining contrastive learning with a confidence-gated multimodal large language model, for robust detection of high-fidelity AI-generated video.

Comments Accepted by CVPR 2026

Abstract

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
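The confidence gate mentioned in the abstract can be pictured as a simple routing rule. The sketch below is only an illustration: the thresholds, the three-way decision, and the `mllm_judge` callback are assumptions, and the abstract does not say how CoCoDetect measures confidence.

```python
def route(p_fake, low=0.2, high=0.8):
    """Keep the fast classifier's verdict when it is confident, else defer.

    p_fake is the backbone's probability that the video is fake; the
    thresholds low and high are hypothetical values, not from the paper.
    """
    if p_fake >= high:
        return "fake"
    if p_fake <= low:
        return "real"
    return "defer_to_mllm"

def detect(p_fake, mllm_judge, low=0.2, high=0.8):
    """Call the (expensive) MLLM only for the uncertain middle band."""
    decision = route(p_fake, low, high)
    return mllm_judge() if decision == "defer_to_mllm" else decision

print(detect(0.95, mllm_judge=lambda: "real"))  # confident: MLLM not called
print(detect(0.50, mllm_judge=lambda: "real"))  # uncertain: MLLM decides
```

A gate of this shape trades cost for accuracy: widening the band between `low` and `high` sends more videos to the slower reasoning model.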

2606.00100 2026-06-02 cs.CV cs.AI

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao, Yang Yang, Hua Guo, Wenchuan Wu, Qiyuan Tian

Affiliations: School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University; Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford; Department of Radiology & Biomedical Imaging, University of California San Francisco

AI summary: Proposes CoilDrop-MRI, which drops data along the coil dimension and uses it as the self-supervised training target; combined with unrolled image-domain and k-space architectures, it reconstructs parallel MRI without fully sampled data and outperforms existing self-supervised methods on multi-site, multi-field-strength, multi-modality datasets.

Abstract

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.
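The coil-wise partition described in the abstract can be sketched as a split over coil indices rather than over k-space locations. This is a minimal illustration under assumptions: the data layout (one list of complex samples per coil), the random choice of dropped coils, and the function name are hypothetical, and the sketch covers only the input/target split, not the unrolled network or the loss.

```python
import random

def coil_dropout_split(kspace_by_coil, n_drop, seed=None):
    """Split multi-coil data into input coils and held-out target coils.

    kspace_by_coil: list indexed by coil, each entry that coil's samples.
    n_drop: number of coils withheld from the input and used as targets.
    """
    n_coils = len(kspace_by_coil)
    if not 0 < n_drop < n_coils:
        raise ValueError("n_drop must leave at least one input and one target coil")
    rng = random.Random(seed)
    dropped = set(rng.sample(range(n_coils), n_drop))
    inputs = {c: kspace_by_coil[c] for c in range(n_coils) if c not in dropped}
    targets = {c: kspace_by_coil[c] for c in sorted(dropped)}
    return inputs, targets

# Toy data: 8 coils with 4 complex samples each.
kspace = [[complex(c, i) for i in range(4)] for c in range(8)]
inputs, targets = coil_dropout_split(kspace, n_drop=2, seed=0)
print(sorted(inputs), sorted(targets))
```

Because the two sets share no coil, a network that predicts the target coils from the input coils has to rely on correlation across receiver coils, which is the property the abstract says the method is meant to exploit.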

2606.00098 2026-06-02 cs.CV eess.IV

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

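Izaldein Al-Zyoud, Abdulmotaleb El Saddik

Affiliations: University of Central Florida

AI summary: Proposes segmentation-guided spatial indexing: a frozen FaRL parser assigns semantic labels to the patch tokens of DINOv3 ViT-L/16, and only semantically relevant regions are selected for classification, yielding generalizable and explainable deepfake detection.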

Abstract

We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
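The select-then-pool order described in the abstract can be shown on toy data. The sketch assumes mean pooling and a plain linear score; the abstract says only that retained tokens are pooled and classified by a linear probe, so the pooling choice, the toy vectors, and the function names are assumptions.

```python
def pool_region(tokens, labels, target):
    """Mean-pool only the patch tokens whose parser label equals target."""
    keep = [t for t, lab in zip(tokens, labels) if lab == target]
    if not keep:
        return None  # the region is absent from this face crop
    dim = len(keep[0])
    return [sum(t[d] for t in keep) / len(keep) for d in range(dim)]

def linear_probe(feature, weights, bias):
    """Linear score on the pooled feature; positive could mean 'fake'."""
    return sum(w * f for w, f in zip(weights, feature)) + bias

# Three 2-D patch tokens with labels from a (hypothetical) face parser.
tokens = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
labels = ["mouth", "mouth", "hair"]
mouth_feature = pool_region(tokens, labels, "mouth")
print(mouth_feature, linear_probe(mouth_feature, [0.5, -1.0], 0.25))
```

The hair token never enters the pooled feature, so any decision made from `mouth_feature` is attributable to mouth tokens by construction, which matches the structural attribution the abstract describes.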

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

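Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

Affiliations: School of Computer Science and Technology, East China Normal University; Bosch Corporate Research; King Abdullah University of Science and Technology

AI summary: Proposes the Hierarchical Semantic-Geometric Map (HSGM), which turns 3D geometric information into a structured representation that VLMs can interpret and combines high-level VLM semantic planning with classical path planning for zero-shot vision-language navigation, reaching state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.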

Abstract

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

2606.00093 2026-06-02 cs.CL cs.HC physics.data-an

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

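Delip Rao, Chris Callison-Burch

Affiliations: University of Pennsylvania

AI summary: Surveys 24 recent papers, shows that most agreement metrics are redundant for binary judgment criteria, highlights that Cohen's κ adds information, and provides a reporting checklist.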

Comments 12 pages

Abstract

Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $κ$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $ρ$, Kendall's $τ_b$, the phi coefficient $ϕ$, and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen's $κ$ is the one agreement coefficient that adds information: it shares $ϕ$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $κ$ or Krippendorff's $α$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
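The binary equivalence and the shared numerator stated in the abstract can be checked numerically from a 2x2 confusion matrix. The sketch uses the standard closed forms for the phi coefficient and Cohen's κ; the counts and function names are illustrative assumptions, with `tp`, `fp`, `fn`, `tn` counted with the human label as reference.

```python
import math

def phi(tp, fp, fn, tn):
    """Phi coefficient (equal to MCC on a 2x2 table)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: same numerator as phi (times 2), different normalizer."""
    num = 2 * (tp * tn - fp * fn)
    den = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return num / den

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def expand(tp, fp, fn, tn):
    """Rebuild the paired 0/1 label vectors (human, judge) from the counts."""
    human = [1] * tp + [0] * fp + [1] * fn + [0] * tn
    judge = [1] * tp + [1] * fp + [0] * fn + [0] * tn
    return human, judge

skewed = (40, 10, 5, 45)      # judge says MET more often than the human
balanced = (40, 10, 10, 40)   # equal positive-label rates
print(phi(*skewed), pearson(*expand(*skewed)), cohen_kappa(*skewed))
print(phi(*balanced), cohen_kappa(*balanced))
```

On the skewed table Pearson's r matches phi while κ comes out lower; on the balanced table the two coincide, consistent with the abstract's point that the gap reflects the difference between the judge's and the human's positive-label rates.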

2606.00092 2026-06-02 cs.CV cs.AI

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

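Devansh Lalwani, Swapnil Bhat, Maulik Shah

Affiliations: Turocrates AI Private Limited

AI summary: To address inaccurate attention-map localization in weakly-supervised whole-slide image classification, proposes a consistency training method that combines cellular sheaves with attention, reaching patch-level AUC 0.940 on Camelyon16 and raising attention AUC from 0.717 to 0.953.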

Abstract

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier's attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.
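The notion of local disagreement on a cellular sheaf can be illustrated with the usual edge-wise energy: each endpoint's feature is mapped into the edge space by its own linear map, and the squared difference measures how much the two patches disagree. This is a generic sketch of that construction, not the paper's model; the per-node summation, the identity maps, and all names are assumptions.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def edge_disagreement(f_u, x_u, f_v, x_v):
    """Squared distance between the two endpoint features in the edge space."""
    pu, pv = matvec(f_u, x_u), matvec(f_v, x_v)
    return sum((a - b) ** 2 for a, b in zip(pu, pv))

def disagreement_field(features, edges, maps):
    """Per-node score: sum of disagreements over the node's incident edges.

    maps[(u, v)] holds the pair of restriction maps (F_u, F_v) for that edge.
    """
    field = [0.0] * len(features)
    for (u, v) in edges:
        f_u, f_v = maps[(u, v)]
        d = edge_disagreement(f_u, features[u], f_v, features[v])
        field[u] += d
        field[v] += d
    return field

identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # patch 2 differs
edges = [(0, 1), (1, 2)]
maps = {e: (identity, identity) for e in edges}
print(disagreement_field(features, edges, maps))
```

With learned maps in place of the identity, which neighbours count as agreeing becomes trainable, which is the degree of freedom the attention-conditional objective in the abstract acts on.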

2606.00091 2026-06-02 cs.CL cs.AI

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

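Sangdae Nam

Affiliations: Sangdae Nam

AI summary: Proposes DLLM-JEPA, which pairs the Joint Embedding Predictive Architecture with masked diffusion language models, removing the need for explicit multi-view data and for two gradient-carrying forward passes, improving accuracy on several tasks while reducing training FLOPs.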

Comments 17 pages, 4 figures, 13 tables. Accepted at SPIGM Workshop, ICML 2026

Abstract

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
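The idea of obtaining two views of one input through different masking rates can be sketched at the token level. This is an illustration only: independent per-token masking, the two rates, the `[MASK]` placeholder, and the function names are assumptions, and the sketch stops before any encoder or JEPA loss.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate, rng):
    """Replace each token with MASK independently with probability rate."""
    return [MASK if rng.random() < rate else t for t in tokens]

def two_views(tokens, low_rate=0.15, high_rate=0.6, seed=0):
    """Build a lightly masked and a heavily masked view of the same input."""
    rng = random.Random(seed)
    return mask_tokens(tokens, low_rate, rng), mask_tokens(tokens, high_rate, rng)

tokens = "a train travels 60 miles in 2 hours".split()
light, heavy = two_views(tokens)
print(light)
print(heavy)
```

Both views come from a single sequence, so no paired data such as text-code pairs is needed to form them, which is the requirement the abstract says the diffusion setting removes.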

2606.00090 2026-06-02 cs.RO cs.AI

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

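Barak Or

Affiliations: STATE16

AI summary: Reviews silent failures in Physical AI systems, where black-box models issue physical actions that look plausible but are wrong, and proposes a taxonomy of runtime guardrails and evaluation requirements.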

Comments 23 pages

英文摘要

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-language-action models, and world-model-based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black-box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state-estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black-box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical-action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms.

2606.00089 2026-06-02 cs.RO cs.AI

Can Predicted Dynamics Exist in the Physical World?

物理世界中是否存在可预测的动态?

Barak Or

Affiliations: STATE16; Technion - Israel Institute of Technology; Reichman University; Google-Reichman AI Tech School

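AI summary: This paper proposes physical admissibility as a prediction-control interface that evaluates whether a decoded proposal is physically executable using kinematic, dynamic, and direct-to-composed horizon conditions; experiments show that the approach identifies invalid proposals effectively while preserving task progress.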

Comments 17 pages

详情
AI中文摘要

预测性物理AI系统输出状态展开、动作块和潜在计划,但低均方根误差(RMSE)并不意味特定提案在物理上可执行。我们将物理可接受性定义为预测-控制接口:在执行前,将解码提案视为候选动态,并使用运动学、动力学和直接到组合的视界条件进行评估。通过不是任务成功的证明;拒绝标识指定物理包络的违反,并给出组件级原因。在Hugging Face LeRobot PushT上,受控伪造表明一步预测RMSE和标准化动态残差达到接收者操作特征曲线下面积(AUC)0.982和0.972,仅运动学条件达到AUC 0.592,完整门控达到AUC 0.957并带有条件级归因。在基于重放的干预实验中,基于残差的过滤器和完整物理可接受性门控阻止了87-89%的无效提案,同时保持平均进度接近0.998。

英文摘要

Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imply that a particular proposal is physically executable. We formulate physical admissibility as a prediction-control interface: before execution, a decoded proposal is treated as candidate dynamics and evaluated using kinematic, dynamic, and direct-to-composed horizon conditions. Passing is not a certificate of task success; rejection identifies violation of the specified physical envelope and gives a component-level reason. On Hugging Face LeRobot PushT, controlled falsification shows that one-step prediction-RMSE and standardized dynamics residuals reach area under the receiver operating characteristic curve (AUC) 0.982 and 0.972, kinematic-only conditions reach AUC 0.592, and the full gate reaches AUC 0.957 with condition-level attribution. In replay-based intervention experiments, residual-based filters and the full physical-admissibility gate prevent 87-89% of invalid proposals while preserving mean progress near 0.998.
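The gate described in this abstract can be pictured with a minimal sketch. It is not the paper's implementation: it assumes a proposal is a short rollout of 2-D positions, uses a speed bound as a stand-in for the kinematic condition and an acceleration bound as a stand-in for the dynamic condition, and omits the direct-to-composed horizon condition entirely. The function name `admissibility_gate` and the limits `v_max` and `a_max` are hypothetical.

```python
from typing import List, Sequence, Tuple


def _norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a small vector."""
    return sum(c * c for c in vec) ** 0.5


def _finite_diff(seq: Sequence[Sequence[float]], dt: float) -> List[List[float]]:
    """First-order finite differences between consecutive vectors."""
    return [[(b - a) / dt for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]


def admissibility_gate(
    positions: Sequence[Sequence[float]], dt: float, v_max: float, a_max: float
) -> Tuple[bool, List[str]]:
    """Check a proposed rollout before execution (hypothetical helper).

    Returns (passed, reasons). An empty reason list means the proposal stayed
    inside the assumed envelope; it says nothing about task success.
    """
    reasons: List[str] = []
    velocities = _finite_diff(positions, dt)
    if any(_norm(v) > v_max for v in velocities):
        reasons.append("kinematic: speed limit exceeded")
    accelerations = _finite_diff(velocities, dt)
    if any(_norm(a) > a_max for a in accelerations):
        reasons.append("dynamic: acceleration limit exceeded")
    return (not reasons, reasons)


smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
jumpy = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
smooth_result = admissibility_gate(smooth, dt=1.0, v_max=2.0, a_max=1.0)
jumpy_result = admissibility_gate(jumpy, dt=1.0, v_max=2.0, a_max=1.0)
```

Returning the list of violated conditions, rather than a bare boolean, mirrors the component-level reason that the abstract attributes to a rejection.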

2606.00087 2026-06-02 cs.CV cs.AI

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

结构化视觉证据分解用于阻塞性睡眠呼吸暂停低通气综合征的证据驱动多模态筛查

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Tencent Youtu Lab; ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University; National University of Singapore

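AI summary: Proposes EviOSAHS, a framework that decomposes each facial image into seven anatomical queries, generates structured evidence cards, and combines them with clinical information for high-sensitivity OSAHS screening.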

详情
AI中文摘要

有效的阻塞性睡眠呼吸暂停低通气综合征(OSAHS)多导睡眠图前筛查需要结合临床风险因素与可见的颅面和颈部线索。直接提示通用多模态基础模型进行医学是/否决策可能产生不稳定、校准不良的输出。我们提出EviOSAHS,一个证据驱动的多模态推理框架,将仅基于图像的解剖证据获取与最终临床判定分离。每张正面面部图像被分解为七个固定的解剖查询,涵盖颈部、下巴、嘴巴、面/颈脂肪、下颌、中面部和鼻子。视觉响应被转换为结构化证据卡,记录目标解剖结构、可见性、风险方向、证据强度、置信度和简洁摘要。这些卡片仅在最后阶段与清理后的临床档案结合,由大型语言模型进行平衡的二元筛查判定。我们在642名受试者队列上评估了EviOSAHS,将正常受试者映射为筛查阴性,轻度、中度或重度OSAHS受试者映射为筛查阳性。EviOSAHS实现了88.47%的准确率、94.86%的灵敏度、93.74%的F1分数和5.14%的假阴性率,在统一协议下优于仅临床提示、直接多模态提示和朴素两阶段流水线。消融实验表明,七问题视觉分解和平衡最终判定对高灵敏度工作点至关重要。对4,494个视觉输出的问题级审计显示100%的结构化解析率和93.88%的高可见率。EviOSAHS为二元多导睡眠图前OSAHS筛查提供了一个可审计、高灵敏度的工作流程,但应被视为分诊助手而非诊断系统。在临床部署前需要进行前瞻性验证、外部测试和校准的工作点控制。

英文摘要

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
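The evidence-card structure named in this abstract can be sketched as a small record type. The field names follow the six items the abstract lists; the field types, the value vocabularies, and the helper `expected_visual_outputs` are assumptions for illustration only. The helper also checks that seven queries per subject over the 642-subject cohort gives the 4,494 visual outputs reported in the audit.

```python
from dataclasses import dataclass

# The seven fixed anatomical queries listed in the abstract.
ANATOMICAL_QUERIES = (
    "neck", "chin", "mouth", "face/neck fat", "lower jaw", "midface", "nose",
)


@dataclass(frozen=True)
class EvidenceCard:
    """One structured card per (image, query) pair; types are assumed."""
    target_anatomy: str
    visibility: str          # e.g. "high" or "low" (vocabulary assumed)
    risk_direction: str      # e.g. "increases" or "neutral" (vocabulary assumed)
    evidence_strength: str   # e.g. "weak" or "strong" (vocabulary assumed)
    confidence: float
    summary: str


def expected_visual_outputs(num_subjects: int) -> int:
    """Number of visual outputs if every subject gets every query once."""
    return num_subjects * len(ANATOMICAL_QUERIES)


card = EvidenceCard("neck", "high", "increases", "strong", 0.8, "Short, wide neck.")
total_outputs = expected_visual_outputs(642)
```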

2606.00086 2026-06-02 cs.RO

Whole-Body Inverse Kinematics with Graph Diffusion

基于图扩散的全身逆运动学

Helong Huang, Kai Tan, Feng Wen, Guowei Huang, Xingyue Quan

Affiliations: Large Model Algorithm Lab, Huawei

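AI summary: Proposes GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics that represents the robot as a kinematic graph and introduces hierarchical message passing and torso-aware conditioning, yielding accurate and stable IK solutions for multi-branch robots.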

详情
AI中文摘要

逆运动学(IK)是机器人学中的一个基本问题,需要生成满足目标末端执行器位姿的关节配置。现有方法通常难以在多种机器人形态间泛化,并且无法有效建模IK的多模态特性,特别是在具有多个运动学分支的关节系统中。在这项工作中,我们提出了GraphDiff-IK,一种结构感知的图扩散逆运动学框架。具体来说,我们将机器人表示为从机器人URDF构建的运动学图,其中节点对应驱动关节,边编码运动学依赖关系。基于这种表示,我们将IK表述为条件图扩散过程,直接在机器人图上生成关节配置。为了更好地捕捉关节系统中的结构依赖关系,我们进一步引入了一种结构感知的图推理框架,具有分层阶段式消息传递和针对多分支机器人的躯干感知条件。此外,我们结合了带噪声的正向运动学反馈和任务空间监督,以提高去噪过程中的几何一致性。所提出的框架提供了一种统一的公式,自然支持单臂机器人、双臂系统以及具有躯干或腰部结构的关节机器人。在多种机器人平台上的大量实验表明,所提出的方法实现了准确且稳定的IK性能,同时保留了为冗余机器人系统生成多个可行解的能力。

英文摘要

Inverse kinematics (IK) is a fundamental problem in robotics, requiring the generation of joint configurations that satisfy target end-effector poses. Existing approaches often struggle to generalize across diverse robot morphologies and to effectively model the multi-modal nature of IK, particularly in articulated systems with multiple kinematic branches. In this work, we propose GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics. Specifically, we represent the robot as a kinematic graph constructed from the robot URDF, where nodes correspond to actuated joints and edges encode kinematic dependencies. Building upon this representation, we formulate IK as a conditional graph diffusion process that directly generates joint configurations on the robot graph. To better capture structural dependencies in articulated systems, we further introduce a structure-aware graph reasoning framework with hierarchical stage-wise message passing and torso-aware conditioning for multi-branch robots. In addition, we incorporate noisy forward kinematics feedback and task-space supervision to improve geometric consistency during denoising. The proposed framework provides a unified formulation that naturally supports single-arm robots, dual-arm systems, and articulated robots with torso or waist structures. Extensive experiments on diverse robotic platforms demonstrate that the proposed method achieves accurate and stable IK performance while preserving the ability to generate multiple feasible solutions for redundant robotic systems.

2606.00085 2026-06-02 cs.RO

Balancing Accuracy and Efficiency: Adaptive Dynamics Orchestration for Model Predictive Control

平衡精度与效率:模型预测控制的自适应动力学编排

Francesco Cancelliere, Aniket Datar, Giovanni Muscato, Xuesu Xiao

Affiliations: Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI, USA

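AI summary: Proposes the Adaptive Dynamics Orchestration (ADO) framework, which evaluates model residuals through online counterfactual rollouts and dynamically selects the dynamics model best suited to the current navigation context, balancing computational efficiency against predictive accuracy.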

Comments 8 pages, 7 figures

详情
AI中文摘要

自主导航的模型预测控制(MPC)面临模型精度与实时效率之间的基本权衡。高保真动力学模型能够准确预测轨迹展开过程中复杂的车辆-地形交互,但计算成本高,增加推理延迟并降低控制频率。相反,轻量级模型支持快速更新和密集采样,但在安全关键条件下可能产生错误预测,导致灾难性故障如车辆侧翻。为解决这一权衡,我们提出自适应动力学编排(ADO),一种根据当前导航上下文动态选择最合适动力学模型的框架。ADO维护一个涵盖不同精度-效率特征的模型库,并通过在线反事实滚动(即执行的控制动作在模型库中重放以评估预测差异)的残差误差,持续细化地形条件性能估计。这些估计实时指导模型选择,平衡计算效率与预测精度。在越野地面机器人上的真实实验表明,与固定低延迟基线相比,ADO显著降低了建模误差,同时接近最高保真模型的精度而不产生其计算成本,从而在复杂地形中实现更可靠和有效的导航。

英文摘要

Model Predictive Control (MPC) for autonomous navigation faces a fundamental trade-off between model accuracy and real-time efficiency. High-fidelity dynamics models can accurately predict complex vehicle-terrain interactions during trajectory rollouts, but incur significant computational cost, increasing inference latency and reducing control frequency. Conversely, lightweight models enable fast updates and dense sampling, yet may produce erroneous predictions under safety-critical conditions, potentially leading to catastrophic failures such as vehicle rollover. To address this trade-off, we propose Adaptive Dynamics Orchestration (ADO), a framework that dynamically selects the most appropriate dynamics model for the current navigation context. ADO maintains a library of models spanning diverse accuracy-efficiency profiles and continuously refines terrain-conditioned performance estimates using residual errors from online counterfactual rollouts, where executed control actions are replayed across the model library to assess predictive discrepancy. These estimates guide model selection in real time, balancing computational efficiency and predictive accuracy. Real-world experiments on an off-road ground robot demonstrate that ADO significantly reduces modeling error compared to a fixed low-latency baseline, while approaching the accuracy of the highest-fidelity model without incurring its computational cost, resulting in more reliable and effective navigation in challenging terrain.
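The selection loop described in this abstract can be sketched as follows. This is not ADO's actual algorithm: the selection rule (the cheapest model whose estimated error is within a tolerance), the exponential moving average, and all names (`DynamicsOrchestrator`, `tolerance`, `alpha`) are assumptions made for the example.

```python
from typing import Dict, Tuple


class DynamicsOrchestrator:
    """Toy model selector keyed on terrain-conditioned residual estimates."""

    def __init__(self, latencies: Dict[str, float], tolerance: float, alpha: float = 0.5):
        self.latencies = dict(latencies)  # model name -> inference latency
        self.tolerance = tolerance        # acceptable residual error (assumed)
        self.alpha = alpha                # moving-average weight (assumed)
        self.error: Dict[Tuple[str, str], float] = {}

    def update(self, terrain: str, residuals: Dict[str, float]) -> None:
        """Fold in residuals from replaying the executed action through each model."""
        for model, residual in residuals.items():
            key = (terrain, model)
            prev = self.error.get(key)
            self.error[key] = (
                residual if prev is None
                else (1 - self.alpha) * prev + self.alpha * residual
            )

    def select(self, terrain: str) -> str:
        """Pick the cheapest model that is accurate enough on this terrain."""
        known = {m: self.error[(terrain, m)]
                 for m in self.latencies if (terrain, m) in self.error}
        if not known:
            # No estimate yet: fall back to the slowest model,
            # assumed here to be the highest-fidelity one.
            return max(self.latencies, key=self.latencies.get)
        accurate = [m for m, e in known.items() if e <= self.tolerance]
        if accurate:
            return min(accurate, key=self.latencies.get)
        return min(known, key=known.get)


orch = DynamicsOrchestrator({"light": 1.0, "heavy": 10.0}, tolerance=0.1)
orch.update("flat", {"light": 0.05, "heavy": 0.01})
orch.update("rocky", {"light": 0.50, "heavy": 0.05})
```

On flat terrain the lightweight model is within tolerance and is chosen for its lower latency; on rocky terrain only the heavy model qualifies.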

2606.00083 2026-06-02 cs.LG cs.AI cs.RO

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励:VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

Affiliations: University of Amsterdam; Catholic University of Leuven; Toyota Research Institute; Toyota Motor Europe

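AI summary: Proposes Demo2Reward, which uses a few expert demonstrations to optimize the prompt instruction of a VLM reward model at test time, reducing false positives while preserving true positives and improving downstream policy learning without additional training.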

详情
AI中文摘要

强化学习依赖于准确的奖励函数,但在现实应用(如机器人技术)中,这些函数通常是手工设计的,甚至不可用。最近的研究探索了预训练视觉-语言模型(VLM)作为奖励模型的零样本推理能力。然而,如果没有仔细的提示工程,这些方法往往会产生次优的奖励,其中假阳性预测会严重降低下游策略学习。在机器人技术中,通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward,一种测试时自适应技术,基于少量演示(3-10条轨迹)优化奖励模型的语言指令,以减少假阳性同时保持真阳性。关键是,这在策略学习期间不需要额外的模型训练或计算资源。我们表明,Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后,我们证明Demo2Reward有效迁移到真实世界的机器人学习场景,无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior to policy training. We propose Demo2Reward, a test-time adaptation technique that optimizes the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computational resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

2606.00082 2026-06-02 cs.LG cs.AI stat.ML

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

Hoeffding概念瓶颈模型及其在俯视图像中的应用

Clément Bénard, Manon Arfib, Christophe Labreuche, Victor Quétu

Affiliations: Thales cortAIx-Labs; Université Paris-Saclay, CentraleSupélec

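AI summary: To address the poor explainability and information leakage of linear concept bottleneck models, proposes HCBM, a non-linear sparse aggregation method based on the Hoeffding functional decomposition, proves its robustness to inter-concept leakage, and shows that it outperforms conventional linear CBMs on classification and overhead-image object detection.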

详情
AI中文摘要

深度学习算法的可解释性对于高风险决策的计算机视觉应用至关重要。概念瓶颈模型(CBM)最近在基于高级概念瓶颈的分类问题上展示了提供可解释且准确预测的潜力。现有的CBM方法依赖概念分数的线性聚合来计算预测。然而,这种线性方法通常使用大量概念,这削弱了可解释性并有利于信息泄露。通常,概念与输出logits之间的潜在关系不是线性的。因此,我们引入了Hoeffding概念瓶颈模型(HCBM),该模型基于梯度提升树的Hoeffding泛函分解,提供概念分数的非线性和稀疏聚合,并使用素蕴含生成紧凑预测。HCBM被证明对概念间泄露具有鲁棒性,并在大量实验中优于标准线性CBM。除了分类,HCBM还可以适应目标检测,我们专注于一个具有挑战性的俯视图像案例,以展示HCBM在这些设置中的高性能。

英文摘要

Explainability of deep learning algorithms is critical for computer-vision applications with high-stakes decisions. Concept bottleneck models (CBM) have recently shown promise in providing explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proven to be robust to inter-concept leakage and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.

2606.00081 2026-06-02 cs.LG cs.AI cs.SD

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer: 一种融合统计特征的混合多分支Transformer用于DAS模式识别

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

Affiliations: IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems Lille, France; IMT Mines Ales, Institut Mines-Telecom, Ales, France

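AI summary: To handle the high dimensionality and complex spatio-temporal patterns of DAS data, proposes DAStatFormer, a hybrid multibranch Transformer that extracts 24 ANOVA-selected statistical features and uses Gated Transformer Networks, reaching up to 99.4% accuracy while reducing data size by orders of magnitude.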

详情
AI中文摘要

分布式声学传感(DAS)通过光纤实现大规模监测,但其高维度和复杂的时空模式使得事件分类具有挑战性。现有的深度学习方法——CNN、循环模型和Transformer变体——要么无法捕获长程依赖,要么需要以高昂成本处理原始DAS矩阵。我们提出DAStatFormer,一种混合多分支Transformer,将紧凑的多域统计特征与门控Transformer网络相结合。我们不是使用原始信号,而是从每个通道的时域、波形和频域提取24个ANOVA选择的属性,将数据量减少数个数量级,同时保留判别信息。每个域通过专用的逐步骤和逐通道注意力分支处理,并通过自适应门控机制融合。在开放的$\Phi$-OTDR基准测试和真实场景DAS数据集上的实验表明,DAStatFormer实现了高达99.4%的准确率和接近完美的实际性能,同时使用的参数和推理成本显著低于DASFormer和DeepViT等模型。这些结果证明了其适用于可扩展、实时的DAS监测。我们在https://github.com/MichelD-git/DAStatFormer发布代码。

英文摘要

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches (CNNs, recurrent models, and Transformer variants) either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open $Φ$-OTDR benchmark and a real-scenario DAS dataset show that DAStatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer
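The idea of replacing a raw channel with a handful of per-channel statistics can be illustrated with a minimal sketch. The abstract does not name the 24 ANOVA-selected attributes, so the six features below (two temporal, two waveform, two spectral) are hypothetical examples, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math
from typing import Dict, Sequence


def channel_features(x: Sequence[float]) -> Dict[str, float]:
    """Summarise one channel with a few statistics (illustrative choice only)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    crest = peak / rms if rms > 0 else 0.0
    # Naive O(n^2) DFT magnitudes for bins 1..n//2 (bin 0, the mean, is skipped).
    mags = [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x)))
        for k in range(1, n // 2 + 1)
    ]
    dominant_bin = 1 + max(range(len(mags)), key=mags.__getitem__) if mags else 0
    spectral_energy = sum(m * m for m in mags)
    return {
        "mean": mean, "std": std,                      # temporal
        "rms": rms, "crest_factor": crest,             # waveform
        "dominant_bin": float(dominant_bin),           # spectral
        "spectral_energy": spectral_energy,            # spectral
    }


# Two full sine cycles over 16 samples: the energy sits in DFT bin 2.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
features = channel_features(signal)
```

Here 16 raw samples collapse to six numbers; the same reduction applied per channel is what shrinks the input by orders of magnitude for long recordings.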

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

Affiliations: Inria Chile Research Center

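AI summary: To address the poor generalization of plankton classification models, proposes the unified dataset Planktonzilla-17M (17.4 million images, of which 3.74 million plankton images span 602 taxonomic classes) and compares supervised learning with CLIP-style training, finding that a supervised classifier matches or exceeds CLIP-style training that uses taxonomic lineage as text, and that existing biological foundation models perform poorly in marine imaging.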

详情
AI中文摘要

海洋浮游生物支撑着水生食物网,并在全球二氧化碳封存中发挥关键作用,因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好,但由于训练数据集孤立且标签不一致,无法跨仪器和环境泛化。为解决这一问题,我们引入了Planktonzilla-17M,这是一个统一的数据集,整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像,具有标准化的分类学和地理环境元数据,其中包括374万张浮游生物图像,涵盖602个分类类群,其中201个在物种级别被识别,使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集,我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现,当使用分类谱系作为文本时,监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到,BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能,凸显了当前生物基础模型在海洋成像领域的局限性。

英文摘要

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image--text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

2606.00079 2026-06-02 cs.LG cs.AI

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE: 面向MoE大语言模型量化的频谱能量引导比特分配

Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu

Affiliations: School of Microelectronics, University of Science and Technology of China; College of Computing and Data Science, Nanyang Technological University; School of Electrical and Electronic Engineering, Nanyang Technological University

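AI summary: Proposes BitsMoE, a framework that uses SVD decomposition and spectral-energy-guided mixed-precision bit allocation to address accuracy loss in ultra-low-bit quantization of MoE models, improving average accuracy by 27.83 percentage points over GPTQ under 2-bit quantization on Qwen3-30B-A3B-Base.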

Comments 29 pages, 6 figures, 9 tables. Code and models are available at https://github.com/zjiayu064/BitsMoE

详情
AI中文摘要

混合专家(MoE)大语言模型通过稀疏专家激活减少了每词元的计算量,但由于所有专家权重必须常驻内存,其部署仍然占用大量内存。现有的MoE压缩方法在超低位宽场景下表现不佳:剪枝不可逆地移除模型容量,而粗粒度量化无法根据异构的专家和权重方向重要性分配比特。我们提出BitsMoE,一种面向MoE大语言模型量化的频谱能量引导比特分配框架。BitsMoE通过SVD将每个MoE层分解为共享基和专家特定的频谱因子,保留共享基不进行量化以保持跨专家的共同结构,并使用专家特定因子作为细粒度量化单元。为确定每个单元的比特宽度,BitsMoE将频谱混合精度量化建模为激活感知的重建替代问题,并求解一个整数线性规划,在固定比特预算下最小化估计的重建损失。在多个MoE大语言模型上的实验表明,BitsMoE在超低位宽场景下显著降低了下游任务准确率下降。在Qwen3-30B-A3B-Base上进行2比特量化时,BitsMoE相比GPTQ加速量化12.3倍,平均准确率提升27.83个百分点,解码速度提升1.76倍。我们的模型和代码已在https://github.com/zjiayu064/BitsMoE公开。

英文摘要

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.

2606.00078 2026-06-02 cs.CV cs.AI

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

基于流的生成建模优化压缩感知应用中的采样策略

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

Affiliations: Eindhoven University of Technology

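AI summary: Proposes a task-aware flow-based generative framework that trains a flow model to optimize subsampling masks in compressed sensing, substantially improving image classification, image reconstruction, and MRI acceleration.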

详情
AI中文摘要

信号处理和医学成像中的许多现代应用需要在严格的资源约束下获取高维信号。传统采样理论表明,准确重建信号所需的测量次数与信号的维数成正比,这一要求往往过于昂贵或不切实际。压缩感知通过证明稀疏信号可以在较少的测量下恢复(前提是测量算子满足某些条件)挑战了这一观念。这项概念验证研究提出了一种任务感知的基于流的生成框架——对传统流匹配训练范式的重新表述,其中流模型被训练用于优化压缩感知应用中的子采样。我们建立了所提出的学习子采样掩码框架的基本可行性,该框架显著提升了压缩感知在图像分类、图像重建和MRI加速中的性能。在图像重建任务中,我们的方法展示了最先进的性能,在CelebA数据集上以5%的子采样率实现了25.17 dB的峰值信噪比,在重建8倍加速的MRI测量(fastMRI数据集)时以最小的计算开销达到了29.24 dB。这些结果突显了生成流模型中任务条件化的有效性,并揭示了表示学习策略的一个有前景的方向。总体而言,所提出的框架提供了一种统一、灵活的方法来设计数据和任务驱动的感知方案,有望适用于广泛的逆问题。

英文摘要

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework, a reformulation of the conventional Flow Matching training paradigm in which a flow model is trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework of learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving a peak signal-to-noise ratio of 25.17 dB at a subsampling rate of 5% on the CelebA dataset and 29.24 dB when reconstructing $8\times$ accelerated MRI measurements (fastMRI dataset) with minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can potentially be adapted to a broad range of inverse problems.

2606.00077 2026-06-02 cs.CV cs.AI

Improved Belief-Attention in Vision Task

视觉任务中的改进信念注意力

Guoqiang Zhang

Affiliations: University of Exeter

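AI summary: Proposes Belief2-Attention, which extends Belief-Attention by using both the perpendicular and projected components and adds an extra inner-product matrix to capture richer token correlation, improving performance on vision tasks.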

详情
AI中文摘要

最近,Belief-Attention \cite{Guoqiang25BeliefAttention} 被提出,它首先对基于 softmax 的 $V$ 向量加权求和进行关于原始 $V$ 向量的正交投影,然后将垂直分量作为 Transformer 中的残差信号以提升性能。在本文中,我们首先进行消融研究,表明投影分量也携带关于标记相关性的信息,不应被忽略。然后,我们提出通过同时利用垂直分量和投影分量来扩展 Belief-Attention。具体地,投影分量经过某种激活函数,然后进行线性映射,再与所考虑的标记合并。概念上讲,投影分量的神经块可以视为新注意力块内的两层前馈网络(FFN)。此外,注意到标准注意力通过内积矩阵 $QK^T$ 捕获标记相关性。我们提出向 $QK^T$ 引入额外的内积矩阵 $ZZ^T$ 以捕获更丰富的标记相关性。我们将新模块称为 Belief2-Attention。可以很容易地证明 Belief2-Attention 比标准注意力更具表达能力。然后,我们验证了 Belief2-Attention 在图像分类和分割等视觉任务中的有效性。

英文摘要

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then taking the perpendicular component as the residual signal in Transformer for performance improvement. In this paper, we first conduct an ablation study showing that the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component passes through an activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ to $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.
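The decomposition mentioned in this abstract can be written out as a short worked equation. This is one reading of the abstract, not the paper's stated formulation: it assumes the attention output $a_i$ of token $i$ is projected onto that token's own value vector $v_i$, and it assumes the usual $1/\sqrt{d}$ scaling, which the abstract does not mention.

```latex
\begin{align}
a_i &= \sum_j \operatorname{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) v_j
      && \text{(standard attention output)} \\
a_i^{\parallel} &= \frac{a_i^\top v_i}{\lVert v_i \rVert^2}\, v_i
      && \text{(projected component)} \\
a_i^{\perp} &= a_i - a_i^{\parallel},
      \qquad \left(a_i^{\perp}\right)^\top v_i = 0
      && \text{(perpendicular component)}
\end{align}
```

Under this reading, Belief-Attention keeps only $a_i^{\perp}$ as the residual signal, while Belief2-Attention additionally routes $a_i^{\parallel}$ through an activation and a linear map.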

2606.00076 2026-06-02 cs.CV

DefocusTrackerAI -- A Generalized Framework for the Automatic Detection of Defocused Particle Images

DefocusTrackerAI -- 一种用于自动检测离焦粒子图像的通用框架

Gonçalo Coutinho, Ana S. Moita, António L. N. Moreira, Massimiliano Rossi

Affiliations: IN+ Center for Innovation, Technology and Policy Research, Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal; CINAMIL - Military Academy Research Center, Military Academy, Portugal; Department of Industrial Engineering, Alma Mater Studiorum University of Bologna, Bologna, Italy

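AI summary: Proposes DefocusTrackerAI, a generalized YOLOv9-based deep-learning framework for automatic detection and position estimation of defocused particle images, achieving high recall and low uncertainty across multiple optical configurations.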

Comments 24 pages, 10 figures

详情
AI中文摘要

本工作介绍了DefocusTrackerAI,一个通用的深度学习框架,用于自动检测和位置估计来自任何光学配置的离焦粒子图像,同时不损害不确定性和召回率,旨在作为开源项目DefocusTracker的后续。我们从两个知名的目标检测模型Faster R-CNN和YOLOv9的直接比较中选择了深度神经网络架构,这些模型在包含不同直径的像散和非像散离焦粒子图像的多样化且特征丰富的合成图像集上进行了训练。对合成数据的模型评估表明,首先,YOLOv9优于Faster R-CNN,实现了更高的召回率和更低的不确定性,特别是在高粒子图像密度下;其次,YOLOv9提供了增强的空间分辨率,对于粒子图像密度N_s高达0.5,不确定性值在0.1到0.4像素之间,优于最先进的算法。我们证明了我们的模型能够在多种光学设置和不同光照条件下检测像散和非像散离焦粒子图像。此外,我们成功地将模型应用于真实的DPT实验,包括荧光和阴影图数据,表明它们可以用于传统DPT应用之外,包括喷雾和液滴的跟踪。基于YOLOv9的预训练、即用型DefocusTrackerAI版本可在https://gitlab.com/goncalo.coutinho/defocustrackerAI-main/-/tree/7e0f11f649ebad50e20dca5b9545f26ca303ebe0获取,并可用于高精度自动检测任何类型的离焦粒子图像。结合合适的深度位置校准方法,它可作为三维离焦粒子跟踪的有效第一步。

英文摘要

The present work introduces DefocusTrackerAI, a generalized deep-learning framework for the automatic detection and position estimation of defocused particle images from any kind of optical configuration without compromising uncertainty and recall, intended as a follow-up of the open-source project DefocusTracker. We selected the deep neural network architecture from the direct comparison of two well-known object detection models, Faster R-CNN and YOLOv9, trained on a diverse and feature-rich synthetic image set containing astigmatic and non-astigmatic defocused particle images of varying diameters. The model evaluation on synthetic data showed that, first, YOLOv9 outperforms Faster R-CNN, achieving higher recall and lower uncertainty, particularly at high particle image densities; and second, that YOLOv9 provides enhanced spatial resolution, with uncertainty values between 0.1 and 0.4 pixels for particle image densities N_s up to 0.5, outperforming state-of-the-art algorithms. We demonstrated that our models are able to detect astigmatic and non-astigmatic defocused particle images in multiple optical setups with varying lighting conditions. In addition, we successfully applied our models to real defocusing particle tracking (DPT) experiments, including fluorescence and shadowgraph data, showing that they can be used beyond conventional DPT applications, including the tracking of sprays and droplets. A pre-trained, ready-to-use version of DefocusTrackerAI based on YOLOv9 is available at https://gitlab.com/goncalo.coutinho/defocustrackerAI-main/-/tree/7e0f11f649ebad50e20dca5b9545f26ca303ebe0 and can be used for automatic detection of defocused particle images of any kind with high accuracy. In combination with a suitable calibration approach for the depth position, it can be used as an effective first step for three-dimensional defocusing particle tracking.

2606.00069 2026-06-02 cs.RO eess.IV

Invascal: Inverse-Vacuity Self-Calibration for Uncertainty-Aware LiDAR Range-View Semantic Segmentation

Invascal: 面向不确定性感知激光雷达距离视图语义分割的逆空性自校准

Kerim Turacan, Hannes Reichert, Andrei Bolandut, Konrad Doll

Affiliations: Faculty of Engineering and Computer Science, University of Applied Sciences Aschaffenburg

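AI summary: Proposes an architecture-agnostic uncertainty-aware Adapter Head that decomposes the prediction into a Preference Head and a Strength Head, and designs the inverse-vacuity self-calibration objective (Invascal) to supervise the strength signal, yielding reliable, well-calibrated uncertainty estimates while preserving segmentation accuracy.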

Comments Accepted for publication at the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

详情
AI中文摘要

激光雷达语义分割是自动驾驶车辆和移动机器人的核心感知能力。然而,安全运行还取决于知道预测何时不可靠。现有方法通常依赖softmax置信度,这往往校准不良且过度自信,而来自蒙特卡洛dropout或集成方法的更强不确定性估计对于实时使用通常计算成本高昂。为此,我们引入了一种新颖的、与架构无关的不确定性感知适配器头。它将预测分解为用于类别排名的偏好头和用于细化不确定性评估的强度头,从而能够原则性地构建证据狄利克雷表示。基于此设计,我们提出了逆空性自校准目标(Invascal),它直接监督强度信号以产生可靠且校准良好的不确定性估计,同时防止证据无节制增长。我们在多个激光雷达数据集和骨干架构上评估了我们的框架。我们与确定性训练、蒙特卡洛dropout和集成方法以及先前的证据方法进行了比较。我们的方法在最小计算开销下,持续改进了不确定性校准,优于传统的确定性方法。同时,它保持了有竞争力的分割精度,而先前的证据方法往往会出现性能下降。

英文摘要

LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real-time use. To this end, we introduce a novel, architecture-agnostic uncertainty-aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse-vacuity self-calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well-calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.
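The evidential Dirichlet representation mentioned in this abstract can be illustrated with the standard subjective-logic quantities. The sketch below is the textbook evidential formulation, not the Invascal objective or the paper's Preference/Strength decomposition, neither of which is specified in the abstract; the function name `dirichlet_summary` is hypothetical.

```python
from typing import List, Sequence, Tuple


def dirichlet_summary(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Expected class probabilities and vacuity for non-negative evidence.

    alpha_k = evidence_k + 1, S = sum(alpha), p_k = alpha_k / S, vacuity = K / S.
    Vacuity is 1 with no evidence and shrinks toward 0 as evidence accumulates.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    vacuity = num_classes / strength
    return probs, vacuity


no_evidence = dirichlet_summary([0.0, 0.0, 0.0, 0.0])
some_evidence = dirichlet_summary([6.0, 0.0, 0.0, 0.0])
```

With no evidence the prediction is uniform and vacuity is 1; six units of evidence for the first of four classes give a probability of 0.7 for that class and a vacuity of 0.4. Because vacuity depends only on the total strength, unbounded evidence drives it to zero, which is the runaway growth the abstract says the objective guards against.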

2606.00066 2026-06-02 cs.SD eess.AS

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

DUET: 扩散与流匹配驱动的文本转语音的统一双空间情感控制

Xu Zhang, Longbing Cao, Zhangkai Wu

发表机构 * Frontier AI Research Centre, Macquarie University(前沿人工智能研究中心,麦考瑞大学)

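AI Summary Proposes the DUET framework, which achieves fine-grained emotion control in pretrained diffusion and flow-matching TTS models through dual-space control (hidden-space steering plus mel-spectrogram gradient guidance), outperforming 10 supervised baselines.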

详情
AI中文摘要

基于扩散和流匹配的文本转语音(TTS)模型在自然度方面表现出色,但由于情感信号与说话人身份纠缠,往往缺乏显式的情感控制。我们发现情感嵌入作为冻结隐藏状态的线性可解码方向出现,几乎与编码说话人身份的方向正交。这启发了一个即插即用框架DUET,用于对预训练的扩散和流匹配TTS模型进行情感控制。在生成过程中,DUET统一双空间控制,在单步更新中实现细粒度情感干预:隐空间引导沿目标情感方向移动生成,而梅尔谱引导通过从可微分声码器反向传播的梯度细化频谱细节。我们在三个数据集上的五个架构多样的预训练TTS骨干上验证了DUET,它跨范式超越了10个有监督的最先进情感TTS基线,并获得了最高的人类评分情感适宜性。为了进一步展示其定性行为,我们将DUET部署在Ameca人形机器人上,使其产生丰富表现力的情感语音,展示了即插即用情感交互在具身智能体中的巨大潜力。

英文摘要

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.
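The hidden-space half of this idea can be illustrated with a generic activation-steering sketch. It assumes the emotion direction is estimated as a normalized difference of class means over frozen hidden states; the abstract only says the direction is linearly decodable, so this estimator, the function names, and the toy vectors are all hypothetical. The mel-space guidance through a differentiable vocoder is not shown.

```python
import math

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def emotion_direction(target_states, neutral_states):
    """Assumed estimator: normalized difference of mean hidden states."""
    mu_target = mean_vector(target_states)
    mu_neutral = mean_vector(neutral_states)
    return unit([a - b for a, b in zip(mu_target, mu_neutral)])

def steer(hidden, direction, scale):
    """Shift one hidden state along the emotion direction."""
    return [h + scale * d for h, d in zip(hidden, direction)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-dimensional hidden states (illustrative values only).
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
target = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
direction = emotion_direction(target, neutral)
steered = steer([0.5, 0.5, 0.5], direction, scale=2.0)
```

A cosine near zero between `direction` and a separately estimated speaker direction would correspond to the near-orthogonality the abstract reports, which is what makes a shift along one direction leave the other largely untouched.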

2606.00063 2026-06-02 cs.RO math-ph math.MP physics.flu-dyn

Linear Motility Maps in Nonlinear Viscous Fluids

非线性粘性流体中的线性运动映射

Yishun Zhou, Shai Revzen

发表机构 * Department of Robotics, University of Michigan(机器人学系,密歇根大学) Departments of Electrical Engineering and Computer Science, and Ecology and Evolutionary Biology(电气工程与计算机科学系、生态与进化生物学系)

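AI Summary Shows that, at low Reynolds number, linear motility maps extend to power-law fluids, and that Carreau-Yasuda fluids can violate this linearity to produce net motion whose direction can switch with speed.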

详情
AI中文摘要

已知在低雷诺数流体中运动的系统受“运动映射”支配,该映射线性地将形状变化率与通过流体的本体框架速度联系起来。其结果是“珀塞尔扇贝定理”——经历时间上前后相同路径的形状变化(往复身体变形)的运动系统无法实现净位移,无论这些变化的速度如何。我们证明线性速度运动映射扩展到任何幂律粘度(即Ostwald-de Waele流体),因此也适用于中间剪切范围内的许多生物流体。我们还表明,在Carreau-Yasuda流体中,线性速度性质可以被违反,使用由两个不等质量且具有不等阻力系数的质量组成的“尺蠖”模型进行往复运动,从而产生净运动。有趣的是,运动方向可以通过改变速度来切换。我们的结果表明,几何力学的线性运动映射可用于分析和设计幂律流体中的运动,并且某些非线性阻力关系(如Carreau-Yasuda)可用于产生净运动,看似违反了“扇贝定理”。

英文摘要

Systems moving in low Reynolds number fluid regimes are known to be governed by a "motility map" which linearly relates their shape change rates to their body frame velocity through the fluid. A consequence of this is "Purcell's Scallop Theorem": a locomotion system that undergoes shape changes that follow the same path forward and backward in time (reciprocal body deformations) cannot achieve net displacement, regardless of the pacing of those changes. We show that linear-in-velocity motility maps extend to any power law viscosity (a.k.a. Ostwald-de Waele fluid), and therefore to many biological fluids in intermediate shear ranges. We also show that the linear-in-velocity property can be violated in Carreau-Yasuda fluids to produce net motion using an "inchworm" model consisting of two unequal masses with unequal drag coefficients performing reciprocal motions. Interestingly, the direction of motion can be switched by changing speeds. Our results show that the linear motility map of geometric mechanics can be used to analyze and design locomotion in power-law fluids, and that some nonlinear drag relationships such as Carreau-Yasuda can be exploited to generate net locomotion in seeming violation of the "scallop theorem".
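A toy one-dimensional version of the inchworm argument can be checked numerically. This sketch assumes two inertia-free bodies whose drag forces must balance, scalar drag laws `c * g(v)`, and a shear rate equal to the body speed (unit length scale); these simplifications, the parameter values, and the function names are illustrative assumptions rather than the paper's model. For power-law drag `g(v) = |v|^n sign(v)` the balance gives `v1 = -r/(1+r) * dL/dt` with `r = (c2/c1)^(1/n)`, which is linear in the gap rate, so a reciprocal stroke yields no net displacement at any pacing. With the Carreau-Yasuda viscosity `eta_inf + (eta0 - eta_inf) * (1 + (lam*rate)^a)^((n-1)/a)` the split depends on speed, so opening fast and closing slowly leaves a net shift.

```python
import math

def power_law_drag(v, n=0.5):
    """Drag per unit coefficient in a power-law fluid: |v|**n with the sign of v."""
    return math.copysign(abs(v) ** n, v)

def carreau_yasuda_drag(v, eta0=1.0, eta_inf=0.1, lam=1.0, a=2.0, n=0.5):
    """Drag per unit coefficient with Carreau-Yasuda viscosity at shear rate |v|."""
    eta = eta_inf + (eta0 - eta_inf) * (1.0 + (lam * abs(v)) ** a) ** ((n - 1.0) / a)
    return eta * v

def rear_velocity(gap_rate, drag, c1, c2):
    """Solve c1*drag(v1) + c2*drag(v1 + gap_rate) = 0 for v1 by bisection.

    The left-hand side increases with v1 and the root lies between 0 and -gap_rate.
    """
    lo, hi = min(0.0, -gap_rate), max(0.0, -gap_rate)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if c1 * drag(mid) + c2 * drag(mid + gap_rate) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def net_displacement(drag, c1=1.0, c2=5.0, amplitude=1.0,
                     open_speed=10.0, close_speed=0.1):
    """Displacement of body 1 over one reciprocal stroke: fast opening, slow closing."""
    v_open = rear_velocity(+open_speed, drag, c1, c2)
    v_close = rear_velocity(-close_speed, drag, c1, c2)
    return v_open * (amplitude / open_speed) + v_close * (amplitude / close_speed)

net_power_law = net_displacement(power_law_drag)
net_carreau = net_displacement(carreau_yasuda_drag)
```

In this toy setup `net_power_law` vanishes up to rounding while `net_carreau` does not, and setting `c1 == c2` removes the effect again, consistent with the abstract's emphasis on unequal drag coefficients.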

2606.00062 2026-06-02 cs.CL

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

跨实体金融情感分析的图增强检索:一项比较研究

Rajan Bastakoti, Sagar Bhetwal, Nirajan Acharya, Gaurav Kumar Gupta

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

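AI Summary Proposes a two-hop graph-augmented retrieval architecture (Graph-RAG) that builds a sentiment-weighted knowledge graph and combines dense retrieval with graph traversal; compared with standard vector retrieval, it significantly improves entity recall and answer relevancy on complex queries in cross-entity financial sentiment analysis.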

详情
AI中文摘要

检索增强生成(RAG)已成为将大语言模型锚定到特定领域语料库的基础方法,然而传统的基于向量的RAG系统在捕捉支撑金融市场分析的结构化多实体关系方面存在根本性局限。本文对一种新颖的两跳图增强检索架构(Graph-RAG)与标准纯向量基线在跨实体金融情感分析中进行了全面比较研究。我们的系统从覆盖10只主要科技股的255篇新闻文章中构建了一个包含59个股票实体的情感加权知识图谱,然后通过强度过滤的图遍历(沿INFLUENCES边)增强密集检索,以揭示纯向量搜索无法获取的关系证据。我们使用语义相似度、实体召回率、RAGAS指标、延迟基准和消融研究,在100个有依据的查询(30个直接查询,70个关系查询)上评估了两种架构。Graph-RAG在实体召回率上实现了统计显著的提升(+6.4%,p < 0.001,Wilcoxon符号秩检验),并为复杂的多实体查询提供了显著更相关的答案(答案相关性+11.7%),增益集中在关系型问题类型(+16.1%)。关键的是,这些改进在答案质量上没有可测量的成本(语义相似度变化+0.001,Cohen's d = 0.078),平均延迟适度增加22.6%,但延迟方差降低了80%。对图遍历强度阈值的消融研究揭示了其与答案质量的倒U型关系,确定tau = 0.5为最优值,而生产默认值为tau = 0.7。这些发现刻画了图增强检索固有的精度-覆盖率权衡,并为构建多实体金融分析RAG系统的从业者提供了可操作的架构指导。

英文摘要

Retrieval-Augmented Generation (RAG) has become foundational for grounding large language models in domain-specific corpora, yet conventional vector-based RAG systems are fundamentally limited in their ability to capture the structured, multi-entity relationships that underpin financial market analysis. This paper presents a comprehensive comparative study of a novel two-hop Graph-RAG architecture versus a standard vector-only baseline for cross-entity financial sentiment analysis. Our system constructs a sentiment-weighted knowledge graph of 59 equity entities from 255 news articles covering 10 major technology stocks, then augments dense retrieval with intensity-filtered graph traversal over INFLUENCES edges to surface relational evidence inaccessible to vector search alone. We evaluate both architectures on 100 grounded queries (30 Direct, 70 Relational) using semantic similarity, entity recall, RAGAS metrics, latency benchmarks, and ablation studies. Graph-RAG achieves a statistically significant improvement in entity recall (+6.4%, p < 0.001, Wilcoxon signed-rank) and delivers substantially more relevant answers for complex multi-entity queries (+11.7% Answer Relevancy), with gains concentrating in relational question types (+16.1%). Critically, these improvements come at no measurable cost to answer quality (delta = +0.001 semantic similarity, Cohen's d = 0.078), with a modest 22.6% increase in mean latency offset by an 80% reduction in latency variance. An ablation study on the graph traversal intensity threshold reveals an inverted-U relationship with answer quality, identifying tau = 0.5 as optimal over the production default of tau = 0.7. These findings characterize a precision-for-coverage trade-off inherent to graph-augmented retrieval and provide actionable architectural guidance for practitioners building RAG systems for multi-entity financial analysis.
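The intensity-filtered two-hop traversal can be sketched as follows. The graph layout (a dictionary of `INFLUENCES` edges with an intensity in [0, 1]), the entity names, the edge weights, and the function name are all hypothetical; the abstract does not describe the data structures or how traversal results are merged with the dense hits.

```python
def two_hop_neighbors(graph, seeds, tau):
    """Collect entities within two INFLUENCES hops of the seeds.

    graph: {source: [(target, intensity), ...]}
    Only edges with intensity >= tau are followed.
    """
    seeds = set(seeds)
    reached = set()
    frontier = set(seeds)
    for _ in range(2):  # two hops
        next_frontier = set()
        for node in frontier:
            for target, intensity in graph.get(node, []):
                if intensity >= tau and target not in seeds and target not in reached:
                    reached.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return reached

# Toy graph with made-up entities and intensities.
graph = {
    "A": [("B", 0.8), ("C", 0.4)],
    "B": [("D", 0.6)],
    "C": [("E", 0.9)],
}
strict = two_hop_neighbors(graph, ["A"], tau=0.7)
medium = two_hop_neighbors(graph, ["A"], tau=0.5)
loose = two_hop_neighbors(graph, ["A"], tau=0.3)
```

Lowering `tau` widens coverage at the cost of admitting weaker edges, which is one way to picture the precision-for-coverage trade-off and the inverted-U ablation result the abstract describes.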

2606.00059 2026-06-02 cs.RO cs.LG

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电系统参数辨识中最优实验设计的强化学习方法

Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, Salzburg University of Applied Sciences, Salzburg, Austria(约瑟夫·雷斯尔智能与安全工业自动化中心,萨尔茨堡应用技术大学,萨尔茨堡,奥地利) Paris Lodron University of Salzburg, Salzburg, Austria(萨尔茨堡巴黎洛登伦大学,萨尔茨堡,奥地利)

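AI Summary Proposes a reinforcement learning agent that autonomously satisfies safety constraints through reward shaping and learns optimal excitation signals for the Quanser Aero 2 testbed, achieving competitive estimation accuracy on all three identified parameters with a safety violation rate of only 0.75%.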

Comments Accepted at DEXA AI4IP 2026

详情
AI中文摘要

信息丰富的激励信号对于机电系统的精确系统辨识至关重要,然而经典系统辨识方法需要专家知识和手工设计的信号以满足硬件安全约束,限制了其通用性。我们提出一种强化学习智能体,为Quanser Aero 2测试平台学习最优激励信号,同时通过奖励塑形自主强制执行安全约束。在10个独立训练种子的评估中,我们的综合智能体在所有三个辨识参数上均实现了具有竞争力的估计精度,优于经典基线方法,且仅产生0.75%的安全违规。

英文摘要

Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.
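Reward shaping for a safety constraint can be illustrated with a generic penalty term. The abstract does not state the reward, the constrained quantity, or the penalty form, so the single angle limit, the linear penalty, the weight, and both function names below are hypothetical placeholders rather than the authors' design.

```python
def shaped_reward(information_reward, angle, angle_limit, penalty_weight=10.0):
    """Subtract a penalty proportional to how far the angle exceeds its limit."""
    excess = max(0.0, abs(angle) - angle_limit)
    return information_reward - penalty_weight * excess

def violation_rate(angles, angle_limit):
    """Fraction of time steps whose angle lies outside the allowed range."""
    violations = sum(1 for a in angles if abs(a) > angle_limit)
    return violations / len(angles)

inside = shaped_reward(1.0, angle=0.2, angle_limit=0.5)
outside = shaped_reward(1.0, angle=0.7, angle_limit=0.5)
rate = violation_rate([0.1, 0.6, -0.7, 0.2], angle_limit=0.5)
```

A soft penalty of this kind discourages violations without forbidding them, which is compatible with a small but nonzero violation rate such as the one reported in the abstract.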