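arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.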

2605.23903 2026-05-25 cs.CV Updated version

From Activation to Causality: Causal Discovery of Visual Representations in the Human Brain

Yuval Golbari, Navve Wasserman, Matias Cosarinsky, Roman Beliy, Aude Oliva, Antonio Torralba, Michal Irani, Tamar Rott Shaham

Affiliations: Weizmann Institute of Science; Massachusetts Institute of Technology

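AI summary: This paper studies how to identify the brain regions associated with specific visual concepts in the human brain. It proposes BrainCause, an automated framework that combines generative models with brain models to generate controlled stimuli and run causal validation, separating regions that genuinely represent a concept from regions driven only by correlated visual or semantic cues. The method recovers known functional localizations and finds new candidate representations. The validation shows that relying on activation strength alone can produce many false positives, which underscores the importance of causal validation.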

Abstract

Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse functional regions (e.g., faces, places) through activation maximization, identifying regions that activate strongly for a target concept relative to other concepts. Yet strong activation alone does not establish that a region represents the concept itself, as responses may instead be driven by correlated visual or semantic cues. We introduce BrainCause, an automated framework that combines generative and brain models to synthesize controlled stimuli and validate neural representations through targeted causal testing. Given a query specifying a concept of interest, our framework constructs targeted stimulus sets comprising concept images, counterfactual edits that remove the target concept while preserving other image content, and images with candidate correlated distractors. It then uses an image-to-fMRI encoding model to predict brain responses and searches for representations that respond specifically to the target concept over correlated alternatives. BrainCause returns validated candidate representations and proposes follow-up fMRI experiments to further test or extend its discoveries. Our approach successfully recovers known functional localizations and identifies new candidate representations across dozens of concepts, validated on both predicted and measured fMRI data. Critically, we show that without causal validation, a large fraction of localizations would be false positives, confirming that activation alone is insufficient evidence of representation.
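The causal test described in the abstract can be sketched as a two-part check on predicted responses. This is a minimal illustration, not the BrainCause implementation: the function name, the mean-difference criterion, and the `margin` threshold are all assumptions made for the example.

```python
from statistics import mean

def is_concept_specific(concept, counterfactual, distractor, margin=0.5):
    """Return True if a region's predicted responses pass both checks.

    concept:        responses to images containing the target concept
    counterfactual: responses to the same images with the concept edited out
    distractor:     responses to images with correlated cues but no concept
    margin:         hypothetical minimum response gap (arbitrary units)
    """
    drop_when_removed = mean(concept) - mean(counterfactual)
    gap_over_distractors = mean(concept) - mean(distractor)
    return drop_when_removed >= margin and gap_over_distractors >= margin

# A region whose response disappears when the concept is removed.
specific = is_concept_specific([2.1, 1.9, 2.0], [0.3, 0.2, 0.4], [0.5, 0.6, 0.4])

# A region that activates strongly but keeps responding after removal,
# which suggests a correlated cue is driving it.
cue_driven = is_concept_specific([2.0, 2.2, 1.8], [1.9, 2.0, 1.8], [2.1, 1.9, 2.0])
```

Both regions respond strongly to concept images, so activation alone would flag both. Only the first survives the counterfactual and distractor checks.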

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

Affiliations: University of Toronto & Vector Institute; Google; Technical University of Munich

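AI summary: Visual geometry transformers perform well in multi-view 3D reconstruction, but their computational cost grows quadratically with input sequence length, which limits efficiency and scalability. This paper proposes a simple, general remedy: limit the number of key/value tokens each query interacts with during global attention. The method is a two-stage framework that first selects which frames to keep, favoring diverse scene coverage, and then removes redundant tokens within those frames using a layer-aware sparsification strategy based on attention entropy. Experiments show that the method speeds up visual geometry transformers by more than 85% while maintaining or improving performance.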

Comments Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt

Abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
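The two signals named in the abstract, diversity across frames and entropy of the attention pattern, can be illustrated with a small sketch. The abstract does not give the selection rule, so greedy farthest-point sampling stands in here as one common diversity heuristic, and the per-frame feature vectors are hypothetical. How entropy maps to a per-layer token budget is not stated in the abstract and is left out.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one query's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def select_diverse_frames(features, k):
    """Greedy farthest-point sampling over per-frame feature vectors.

    Assumed stand-in for a diversity-based inter-frame selection step.
    """
    selected = [0]  # start from the first frame
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        # Pick the frame farthest from everything selected so far.
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(f, features[nxt]))
                    for d, f in zip(min_dist, features)]
    return sorted(selected)

# Five frames: two near the origin, two near (5, 0), one near (0, 4).
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.1), (0.0, 4.0)]
kept = select_diverse_frames(frames, 3)

uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])  # diffuse attention
peaked = attention_entropy([1.0, 0.0, 0.0, 0.0])       # concentrated attention
```

The sampler keeps one frame from each cluster instead of two near-duplicates. Diffuse attention has high entropy and concentrated attention has zero entropy, which is the kind of layer-level signal the abstract refers to.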

2605.23891 2026-05-25 cs.CV Updated version

Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Xiao Cao, Yansong Qu, Xiangzhen Chang, Wen Xiao, Jiakui Hu, Heyuan Li, Jialun Liu, Zhiyong Huang, Xuelong Li

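AI summary: This paper proposes Smart-Insertion-V, an end-to-end dual-stream framework for high-quality, mask-free video object insertion. An image stream synchronously guides video generation, and a closed-loop feedback mechanism makes insertion more robust. Dual-World-View RoPE and a Decoupled Guidance Module address feature entanglement and style leakage and improve semantic alignment and style adaptation. Experiments show state-of-the-art results in both the plausibility of insertion positions and visual harmony.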

Abstract

Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose Smart-Insertion-V, an end-to-end Dual-Stream framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a Closed-loop Feedback mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design Dual-World-View RoPE to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a Decoupled Guidance Module that leverages a Vision-Language Model for semantic reasoning while preserving the original temporal guidance with the native text encoder. To bridge the data gap for the harmonious reference insertion task, we propose a data curation pipeline and will release an open-source dataset. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.

2605.23889 2026-05-25 cs.CV Updated version

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang

Affiliations: HKUST(GZ) (Hong Kong University of Science and Technology); Horizon Robotics; CASIA (Institute of Automation, Chinese Academy of Sciences); CSU

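AI summary: HorizonStream is a long-horizon attention model for streaming 3D reconstruction that targets the drift, jitter, and collapse that temporal heterogeneity causes in online reconstruction. It introduces the notion of an evidence influence kernel and factorizes geometric propagation into a long-range factor and a short-range factor, handled by Geometric Linear Attention and Geometric Local Attention respectively. This gives multi-timescale propagation of geometric information and stable spatial matching. Experiments show that HorizonStream, trained on only 48-frame clips, stably handles sequences of more than 10,000 frames.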

Abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
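The idea of channel-wise decay in linear attention can be sketched as a generic recurrence. This is not the paper's Geometric Linear Attention: the decay rates below are fixed constants rather than learned parameters, and the function name and shapes are assumptions for illustration.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Causal linear attention with one decay rate per key channel.

    The state is a d_k x d_v matrix updated once per step:
        state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
    so memory stays constant however long the sequence is.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        # Read out: output[j] = sum_i q[i] * state[i][j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

# One piece of evidence arrives at step 0, then nothing else.
queries = [[1.0, 1.0]] * 3
keys = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [0.0], [0.0]]

# Channel 0 forgets quickly (0.5), channel 1 keeps evidence indefinitely (1.0).
outputs = decayed_linear_attention(queries, keys, values, decay=[0.5, 1.0])
```

The fast channel halves its contribution at each step while the slow channel persists. That is one way short-lived and persistent evidence can coexist in a single bounded state.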

2605.23888 2026-05-25 cs.CV Updated version

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Katharina Schmid, Nicolas von Lützow, Jozef Hladký, Angela Dai, Matthias Nießner

Affiliations: Technical University of Munich; Computing Systems Lab, Huawei Technologies

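AI summary: This paper proposes GenRecon, a method for high-quality multi-view 3D scene reconstruction built on a generative prior. It splits the scene into local, overlapping chunks and performs conditional generation on each, which scales reconstruction to large scenes. It uses the generative shape model Trellis.2 as the prior and proposes a projection-based conditioning mechanism that lifts multi-view image features into a 3D representation aligned with the generative model, yielding geometrically consistent and view-consistent reconstructions. On indoor environments, the method outperforms existing reconstruction methods by 16%.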

Comments Project page: https://kasothaphie.github.io/GenRecon/

Abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.
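Tiling a scene with overlapping chunks, as the abstract describes, can be sketched with axis-aligned boxes. The chunk size, overlap, and the regular grid layout are assumptions for illustration. The abstract does not say how GenRecon places or sizes its chunks.

```python
from itertools import product

def chunk_starts(extent, chunk_size, overlap):
    """Start offsets of overlapping chunks covering [0, extent] on one axis."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    starts = [0.0]
    # Keep adding chunks until the last one reaches the end of the axis.
    while starts[-1] + chunk_size < extent:
        starts.append(starts[-1] + stride)
    return starts

def tile_scene(extents, chunk_size, overlap):
    """Origins of cubic chunks that together cover a box of the given extents."""
    axes = [chunk_starts(e, chunk_size, overlap) for e in extents]
    return list(product(*axes))

# A 10 m x 4 m x 3 m room, 4 m chunks, 1 m overlap (all hypothetical values).
origins = tile_scene((10.0, 4.0, 3.0), chunk_size=4.0, overlap=1.0)
```

Each chunk would then be generated conditionally, with the overlap giving neighboring chunks shared context at their boundaries.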

2605.23883 2026-05-25 cs.CV cs.AI Updated version

Causal Generative Modeling with Foundation Models

Aneesh Komanduri, Xintao Wu

Affiliations: University of Arkansas

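AI summary: This paper studies how to use pretrained foundation models for causal generative modeling, with the aim of improving counterfactual reasoning in AI systems. It proposes FM-CGM, a modular framework whose three core components (a concept extractor, a concept manipulator, and a counterfactual generator) provide end-to-end visual causal reasoning. The method combines a reasoning model for causal inference with a text-to-image diffusion model and introduces a Causal Semantic Guidance mechanism, supporting zero-shot causal discovery and counterfactual image generation.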

Abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
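The requirement that an intervention reach descendant concepts while leaving other regions untouched can be illustrated on a small causal graph. The graph and concept names below are hypothetical, and this sketch covers only the graph bookkeeping, not the cross-attention mechanism of CSG.

```python
from collections import deque

def descendants(graph, node):
    """All concepts reachable from `node` in a DAG given as {parent: [children]}."""
    seen, queue = set(), deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph.get(current, []))
    return seen

# Hypothetical concept graph extracted from an image.
causal_graph = {
    "weather": ["ground", "sky"],
    "ground": ["reflections"],
    "sky": [],
    "reflections": [],
    "car": [],
}

# Intervening on "weather" should update its descendants
# and leave every other concept unchanged.
affected = descendants(causal_graph, "weather")
invariant = set(causal_graph) - affected - {"weather"}
```

Here the intervention touches the ground, sky, and reflections, while the car is the invariant region a counterfactual generator should preserve.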

2605.23845 2026-05-25 cs.CV Updated version

Learning a Particle Dynamics Model with Real-world Videos

Chanho Kim, Suhas V. Sumukh, Li Fuxin

Affiliations: Oregon State University

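AI summary: This paper proposes a method for learning a particle dynamics model from unlabeled real-world videos, aiming to overcome the limits that traditional physics simulators and synthetic-data world models face in real scenes. Built on a Gaussian splatting framework, the method learns position and rotation changes of dense Gaussian particles directly through rendering supervision, with no particle-level annotations. The authors also release a real-world dataset of about 500 videos of diverse object interactions.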

Comments CVPR 2026 Findings

Abstract

Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.

2605.23840 2026-05-25 cs.CV Updated version

MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry

Adam Tlemsani, Yingdian Li, Maxime Giot, Naim Slim, Christopher J. Peters, Abhijeet Ghosh, Daniel S. Elson

Affiliations: Department of Computing, Imperial College London; Hamlyn Centre for Robotic Surgery, Imperial College London; Department of Surgery and Cancer, Imperial College London; Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

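AI summary: This study proposes MuellerPT, a physics-guided pretraining method that addresses the difficulty of supervised learning in Mueller polarimetric imaging for biomedical tissue analysis, which stems from scarce annotations and domain shift. By predicting Lu-Chipman decomposition maps from each pixel's 4x4 Mueller matrix, the method learns transferable dense representations and yields clear gains on few-shot segmentation and classification. Experiments show that MuellerPT beats models without pretraining in label efficiency and cross-specimen transfer.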

Comments Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

Abstract

Mueller matrix imaging provides rich, physically meaningful contrast for biomedical tissue analysis, but supervised learning is hindered by scarce dense annotations and strong domain shifts across specimens and acquisition settings. We introduce MuellerPT, a physics guided pre-training approach that learns transferable dense representations by predicting Lu-Chipman decomposition maps from per-pixel 4x4 Mueller matrices. To scale pre-training, we collected a new large Multispectral Animal Polarimetric Organ dataset (MAP-Org). The pre-trained encoder is adapted with a segmentation head for grey vs. white matter segmentation in lamb brain. A classification head is used for colorectal cancer vs. non-cancer classification. Both segmentation and classification are evaluated across few-shot learning scenarios. In segmentation, MuellerPT improves label efficiency and cross specimen transfer compared to models without pre-training, achieving an absolute DICE gain of over 20% compared to the baseline trained from scratch when using 5% of the training data. In classification, MuellerPT also enhances label efficiency, improving overall accuracy by 8% compared to the baseline when using 1% of the training data. We demonstrate MuellerPT's robustness to domain shift with a qualitative evaluation of its predicted Lu-Chipman maps on an ex vivo human oesophagus sample. These results suggest that predicting Lu-Chipman decomposition is an effective and practical pretext task for robust biomedical inference from Mueller polarimetry and can pave the way for future work on label efficient Mueller imaging.
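The segmentation result above is reported as an absolute DICE gain. For readers unfamiliar with the metric, a minimal sketch of the Dice coefficient on binary masks follows. The masks and the two hypothetical models are made up for illustration and are not data from the paper.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks (0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

target = [1, 1, 1, 0, 0, 0]
from_scratch = [1, 0, 0, 0, 0, 1]   # hypothetical baseline prediction
pretrained = [1, 1, 0, 0, 0, 0]     # hypothetical pretrained prediction

# "Absolute gain" means a difference in score, not a ratio.
absolute_gain = dice(pretrained, target) - dice(from_scratch, target)
```

An absolute gain of 0.2 corresponds to 20 percentage points, which is how the abstract's "over 20%" should be read.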

2605.23826 2026-05-25 cs.CV cs.CL Updated version

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

Affiliations: University of Illinois at Urbana-Champaign

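AI summary: This paper studies how to retrieve keyframes from long videos to support question answering and proposes ToolMerge, a keyframe retrieval method based on decomposing queries into tool calls and merging the results. A large language model decomposes the query into several tool calls, and boolean operators merge the per-tool rankings to locate relevant frames more precisely. Experiments use the authors' M2M benchmark. ToolMerge is competitive across tasks and outperforms other methods by 5% on caption retrieval.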

Abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge.
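Merging per-tool rankings with boolean operators can be sketched as follows. The abstract does not say how the operators act on scores, so this example assumes fuzzy-logic semantics (AND as minimum, OR as maximum, NOT as complement) over per-frame scores in [0, 1]. The tool names and the nested-tuple plan format are hypothetical and are not taken from the ToolMerge code.

```python
def evaluate(plan, tool_scores):
    """Recursively evaluate a merge plan into one score per frame.

    A plan is either a tool name (str) or a tuple (operator, *sub_plans).
    """
    if isinstance(plan, str):
        return tool_scores[plan]
    op, *args = plan
    parts = [evaluate(arg, tool_scores) for arg in args]
    if op == "AND":
        return [min(scores) for scores in zip(*parts)]
    if op == "OR":
        return [max(scores) for scores in zip(*parts)]
    if op == "NOT":
        return [1.0 - score for score in parts[0]]
    raise ValueError(f"unknown operator: {op}")

def top_k(scores, k):
    """Indices of the k highest-scoring frames, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Hypothetical per-frame scores from two tools over four frames.
tool_scores = {
    "detect:dog": [0.9, 0.8, 0.1, 0.2],
    "ocr:EXIT": [0.1, 0.7, 0.9, 0.0],
}

# Query: "the dog next to the EXIT sign" requires both cues in the same frame.
merged = evaluate(("AND", "detect:dog", "ocr:EXIT"), tool_scores)
best = top_k(merged, 1)
```

Frame 0 has the strongest dog score and frame 2 the strongest sign score, but only frame 1 satisfies both, so the conjunction ranks it first.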

2605.23819 2026-05-25 cs.CV cs.AI Updated version

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

Affiliations: ANITI; Brown University; CNRS

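AI summary: This paper examines a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs), the study interpolates continuously between discriminative and generative training objectives within a fixed architecture, isolating the effect of the objective, and evaluates on six human-alignment benchmarks covering perceptual similarity, gloss perception, human response uncertainty, and more. Human alignment peaks at intermediate points between the two objectives rather than at either extreme, indicating that alignment comes from balancing them rather than choosing one.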

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
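How a single mixing coefficient can interpolate between the two objectives is sketched below. In a JEM, the classifier logits define both a class posterior and an energy over inputs, E(x) = -logsumexp(logits). The convex-combination form and the contrastive surrogate for the generative term are assumptions made for this example. The abstract does not give the exact loss, and a real generative term needs model samples from an MCMC sampler, which is omitted here.

```python
import math

def logsumexp(logits):
    """Numerically stable log(sum(exp(z)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(z - peak) for z in logits))

def cross_entropy(logits, label):
    """Discriminative term: -log p(y | x)."""
    return logsumexp(logits) - logits[label]

def energy(logits):
    """JEM energy of an input: E(x) = -logsumexp over class logits."""
    return -logsumexp(logits)

def hybrid_loss(data_logits, label, sample_logits, alpha):
    """Assumed mixing: alpha = 0 is purely discriminative, alpha = 1 purely generative.

    The generative term is a contrastive surrogate: lower the energy of real
    data relative to a model sample (sample_logits would come from a sampler).
    """
    discriminative = cross_entropy(data_logits, label)
    generative = energy(data_logits) - energy(sample_logits)
    return (1.0 - alpha) * discriminative + alpha * generative

data_logits, sample_logits = [2.0, 0.0], [0.0, 0.0]
endpoints = [hybrid_loss(data_logits, 0, sample_logits, a) for a in (0.0, 1.0)]
midpoint = hybrid_loss(data_logits, 0, sample_logits, 0.5)
```

Sweeping `alpha` between 0 and 1 with the architecture held fixed mirrors the experimental design the abstract describes.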

2605.23797 2026-05-25 cs.LG cs.CV Updated version

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

Affiliations: University of Technology Sydney

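AI summary: This paper studies out-of-distribution (OOD) detection with pretrained vision-language models (VLMs), which aims to identify inputs from unknown classes. Existing methods mostly rely on heuristic rules to mine negative labels from an unlabeled corpus, but the mined negatives suffer from severe sampling bias. The authors propose a debiased negative mining method that corrects the bias by indirectly estimating the distribution of negative labels, and recast it as Monte-Carlo sampling based on ID labels and the unlabeled corpus. Experiments show new state-of-the-art results across several OOD detection setups.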

Comments KDD 2026

Abstract

Aiming at identifying unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities to ID labels and negative labels, i.e., those semantically different from ID labels. Due to the unavailability of target OOD labels, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments empirically manifest that our method establishes a new state-of-the-art in a variety of OOD detection setups. Code is publicly available at https://github.com/60pen9/Debiased-Negative-Mining-Improves-OOD-Detection-with-Pre-trained-VLMs.

2605.23790 2026-05-25 cs.CV Updated version

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Romaric Mazna, Jean Martinet, Sai Deepesh Pokala

Affiliations: i3S/CNRS, Université Côte d’Azur

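AI summary: This paper studies saliency prediction from event camera data and proposes SEST, a transformer-based model that predicts salient regions from event data. To cope with the lack of large-scale annotated datasets and strong baselines for event data, the authors introduce an event-native pretraining strategy and synthetic supervision, and build two new benchmark datasets. Experiments show that SEST outperforms existing event-based saliency methods and transfers to real event data. The authors describe it as the first application of deep learning to event-based saliency prediction.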

Abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model, trained on synthetic data, transfers to real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

2605.23777 2026-05-25 cs.CV Updated version

Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

FB Pena, D Crabi, Sandro C Izidoro, Érick O Rodrigues, G Bernardes

Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnológica Federal do Paraná (UTFPR), Pato Branco, State of Parana, Brazil

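AI summary: This paper proposes a machine-learning framework for grading emerald gemstones and creates a public dataset. The framework automates grading from image acquisition through final classification, removing the subjectivity of manual grading. The study is described as the first to combine machine learning with image processing for emerald grading. It reports 98% classification accuracy and releases a dataset of 192 emerald images with their extracted and pre-processed features.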

DOI: 10.1007/s10044-021-01041-4
Journal ref: Pattern Analysis and Applications 2022

Abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones: specialists visually inspect them and decide which of the available reference stones is most similar to the inspected stone. This procedure is very subjective, as different specialists may end up with different grading choices. This work proposes a complete framework that covers everything from image acquisition to the final stone categorization. The proposal automates the entire process apart from placing the stone in the chamber built for image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the dataset used, which contains 192 images of emerald stones along with their extracted and pre-processed features.

2605.23775 2026-05-25 cs.CV Updated version

A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques

João VC Mazzochin, Giovani Bernardes Vitor, Gustavo Tiecker, Elioenai MF Diniz, Gilson A Oliveira, Marcelo Trentin, Érick O Rodrigues

Affiliations: Graduate Program of Production and Systems Engineering, Universidade Tecnológica Federal do Paraná (UTFPR); Institute of Technological Sciences, Universidade Federal de Itajubá (UNIFEI); Business School, Universidade Federal do Paraná (UFPR); Graduate Program of Electrical and Computer Engineering, Universidade Tecnológica Federal do Paraná (UTFPR)

AI总结本文提出了一种基于条件生成对抗网络（cGANs）和图像处理技术的新型木材原木计数方法，旨在解决精确计数中的挑战。该方法结合图像处理技术处理噪声和交叉重叠问题，并利用连通组件算法实现高效计数。研究还公开了一个包含466张图像、约13,048根桉树原木的数据库，实验表明该方法在像素级和原木级准确率上分别达到96.4%和92.3%，具有较高的实用价值和实时处理能力，适用于林业管理、资源优化等实际场景。

详情

DOI: 10.3390/f16020237
Journal ref: Forests 2025

AI中文摘要

本研究解决了精确木材计数的挑战，所提出方法论的应用可涵盖从材料管理、监控和安全科学到木材交通监测、木材体积估计等自动化方法。我们引入了一种利用条件生成对抗网络（cGANs）进行桉木图像分割的方法，结合专门的图像处理技术处理噪声和交叉，并采用连通分量算法进行高效计数。为支持本研究，我们创建并公开了一个包含466张图像、约13,048根桉木的全面数据库，用于训练和验证。我们的方法表现出稳健性能，平均像素精度达到96.4%，原木计数精度达到92.3%，其他指标如F1分数在0.879至0.933之间，IoU值在0.784至0.875之间，进一步验证了其有效性。该实现效率高，在NVIDIA T4 GPU上每张图像平均处理时间为0.713秒，适合实时应用。该方法对运营林业具有重要实际意义，能够实现更准确的库存管理，减少人工计数的错误，并优化资源配置。此外，模型的分割能力为桉木堆体积估计等高级应用奠定了基础，有助于对林业运营进行更全面和精细的分析。该方法在处理复杂场景（包括交叉原木和变化的环境条件）方面的成功，使其成为相关工业领域实际应用的有价值工具。

英文摘要

This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology's success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
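
The counting stage described above can be illustrated with a minimal sketch. It assumes the cGAN has already produced a binary mask (1 = log cross-section, 0 = background) and that touching logs have been separated by the post-processing step. The function name `count_logs`, the `min_area` noise filter, and the toy mask are illustrative assumptions, not the authors' implementation.

```python
from collections import deque


def count_logs(mask, min_area=1):
    """Count 4-connected foreground regions in a binary mask (list of rows of 0/1).

    Regions smaller than `min_area` pixels are treated as noise and skipped
    (a hypothetical filter, not taken from the paper).
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill of one component, tracking its area.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


toy_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
]
print(count_logs(toy_mask))              # 4
print(count_logs(toy_mask, min_area=2))  # 3 (the single-pixel region is dropped)
```

Counting accuracy in this scheme depends entirely on how well intersecting logs are split before labelling, which is why the abstract pairs the algorithm with dedicated intersection handling.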

2605.23771 2026-05-25 cs.CV cs.AI cs.MA 版本更新

PhotoFlow: Agentic 3D Virtual Photography Missions

PhotoFlow: 智能体式3D虚拟摄影任务

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Northeastern University（东北大学）； University of California, Los Angeles（加州大学洛杉矶分校）； Cornell University（康奈尔大学）； Shanghai AI Laboratory（上海人工智能实验室）； Sichuan University（四川大学）

AI总结 PhotoFlow 是一种用于虚拟摄影的智能代理系统，能够在没有预设相机参数或参考图像的情况下，根据语言指令在3D场景中生成符合语义意图的高质量照片。该系统由三个模块组成：Director 生成多样化的相机候选方案，Reviewer 进行视觉评估与参数筛选，Reflector 则通过失败经验优化搜索策略。研究还提出了 VPhotoBench 基准，包含多个 Blender 场景和语言条件摄影任务，实验表明 PhotoFlow 在多轮渲染预算下表现出色，是首个在任意 Blender 场景中实现语言条件虚拟摄影的可执行代理系统。

详情

AI中文摘要

虚拟摄影要求智能体进入一个预制的3D场景，没有预设的相机姿态或参考图像，从场景信息和语言意图中推断合适的镜头，选择可执行的相机参数，并渲染最终照片。视觉-语言模型的最新进展使这种空间智能体越来越可行，但该任务强调两种难以同时评估的能力：复杂的3D空间理解和抽象审美判断。我们引入了PhotoFlow，一个导演-评审-反思智能体，用于闭环相机搜索。导演构建软摄影蓝图并提议多样化的候选相机；评审结合规则检查、视觉批评和成对优胜者选择；反思将失败转化为区域记忆、死区抑制和高探索重定位。我们还引入了VPhotoBench，一个包含47个开源许可的Blender场景和141个语言条件摄影任务的基准，涵盖主体放置、关系构图和氛围/风格。在保留实验中，PhotoFlow在六轮渲染预算下，在一次性预测、单链反思、锚点库选择和随机搜索中取得了最强的外部质量-对齐复合指标和成功率。据我们所知，这是第一项将任意Blender场景中的语言条件虚拟摄影作为可执行智能体任务的工作，我们的结果表明，以LLM为中心的空间智能体已经可以在旨在挑战3D推理和审美选择的设置中产生强大的照片。

英文摘要

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23747 2026-05-25 cs.CV 版本更新

Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

复兴密集材质分割：稳定的视觉Transformer与泛化悖论

Allan Kazakov, Duygu Cakir, Hilal Kurt İrfanoğlu, Yavuz İrfanoğlu

发表机构 * Bahcesehir University, Istanbul, Turkey（巴塞希尔大学，伊斯坦布尔，土耳其）； Poder Bilişim Teknolojileri Sanayi ve Ticaret A.Ş., Istanbul, Turkey（Poder信息科技工业和贸易股份有限公司，伊斯坦布尔，土耳其）； Galatasaray University, Istanbul, Turkey（加拉塔萨雷大学，伊斯坦布尔，土耳其）

AI总结本文旨在复兴Apple密集材质分割（Apple-DMS）基准，解决该基准因日益被偏向几何的基础模型掩盖而陷入停滞的问题。研究提出了一种稳定的训练方案，包括高保真logit投影、查询熵正则化和符合物理规律的数据增强流程，使优化后的SegFormer-B5在原始数据划分上达到0.4572 mIoU。同时，作者揭示了“泛化悖论”：将数据集重新划分为80/10/10虽然可把指标提升至0.5276 mIoU，却会降低模型在真实场景中的分布外性能，因此强调应使用原始数据划分来推动物理感知人工智能研究。

详情

AI中文摘要

材质分割，即对物理表面属性进行像素级分类，仍然是计算机视觉中的一个挑战性问题，需要区别于以物体为中心解析的物理化学理解。尽管引入了严格的Apple密集材质分割（DMS）数据集，该基准测试仍遭受衰退和停滞，日益被偏向几何的基础模型所掩盖。在本文中，我们复兴Apple-DMS基准测试，建立现代视觉Transformer基线。我们对SegFormer和Mask2Former架构进行了详尽评估，揭示标准训练范式由于高方差梯度而在无定形纹理场上失败。为解决此问题，我们引入了一种稳定的训练方案，包括高保真logit投影、查询熵正则化以及领域特定、符合物理的增强流程。我们优化的SegFormer-B5在原始数据集划分上达到了0.4572 mIoU的新最先进水平（SOTA），显著超越了先前的卷积基线。此外，我们识别出一个关键的“泛化悖论”：虽然将数据集重新划分为数据丰富的80/10/10划分将指标提升至0.5276 mIoU，但专家定性分析表明这导致了分布同质化，严重降低了真实世界、分布外性能。通过发布我们恢复的数据集索引和稳健的训练框架，我们证明材质感知远未解决，并敦促社区利用严格的原始划分推动物理基础人工智能的真正进展。

英文摘要

Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.
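
Both headline numbers (0.4572 and 0.5276) are mean intersection-over-union scores, so a short sketch of how mIoU is typically computed may help when comparing the two splits. The convention of skipping classes absent from both prediction and ground truth is an assumption here; the abstract does not say which convention the authors use, and the labels below are toy data.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for flat lists of per-pixel class ids.

    Classes that appear in neither `pred` nor `target` are skipped
    (an assumed convention, not confirmed by the source).
    """
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three material classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Because the metric averages over classes rather than pixels, a change in how rare materials are distributed across the train and test splits can move mIoU without any change in the model, which is consistent with the re-partitioning effect the abstract reports.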

2605.23719 2026-05-25 cs.CV cs.AI 版本更新

Weierstrass Positional Encoding for Vision Transformers

Weierstrass位置编码用于视觉Transformer

Zhihang Xin, Rui Wang, Xitong Hu, Xiaojun Wu

发表机构 * School of Mathematics and Data Science, Jiangnan University（江南大学数学与数据科学学院）； School of Artificial Intelligence and Computer Science, Jiangnan University（江南大学人工智能与计算机科学学院）

AI总结视觉Transformer在计算机视觉中取得了显著成功，但其常用的可学习一维位置编码在图像分块展平后削弱了图像的二维空间结构。为解决这一问题，本文提出了一种基于魏尔斯特拉斯椭圆函数的位置编码方法（WePE），将归一化的二维分块坐标映射到复平面，并利用该椭圆函数及其导数构建具有双周期特性的四维位置特征，从而更准确地保留图像分块的几何关系和空间邻近性先验。该方法具有数学理论支撑，其晶格结构与图像网格的规则几何自然匹配；借助预计算查找表，不会带来明显的计算或内存开销，可即插即用地集成到现有视觉Transformer中。实验表明其在大多数设置下带来一致的性能提升。

详情

AI中文摘要

视觉Transformer在计算机视觉中取得了显著成功，但它们通常使用可学习的一维位置编码，这削弱了图像块展平后固有的二维空间结构。现有的位置编码往往缺乏几何约束，并且不保持欧氏空间距离与序列索引距离之间的单调关系，限制了ViTs利用空间邻近先验的能力。受周期性在位置编码中实用性的启发，我们提出了Weierstrass椭圆位置编码（WePE），这是一种在复数域中编码二维坐标的数学基础方法。WePE将归一化的二维块坐标映射到复平面，并使用Weierstrass椭圆函数及其导数构建紧凑的四维位置特征。双周期性提供了二维位置的原则性表示，其固有的晶格结构自然匹配图像块网格的规则几何形状。其非线性几何特性有助于更忠实地建模空间距离关系，而代数加法公式使得任意块对之间的相对位置信息可以直接从其绝对编码中推导出来。WePE是即插即用的且与分辨率无关，可以无缝集成到现有的ViTs中。大量实验表明，WePE在大多数设置中带来一致的性能提升。通过预计算的查找表，这些改进不会引入明显的计算或内存开销。额外的分析和消融研究进一步验证了所提方法的有效性。

英文摘要

Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.
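
As a rough illustration of the construction described above, the sketch below evaluates the Weierstrass function and its derivative with truncated lattice sums and stacks their real and imaginary parts into a four-dimensional feature. The square lattice (periods 1 and i), the truncation radius `n`, the cell-centre coordinate mapping, and the function names are all assumptions made for this example; the paper's actual parameterization and normalization are not given in the abstract.

```python
def weierstrass(z, w1=1.0, w2=1.0j, n=8):
    """Truncated lattice sums for the Weierstrass function and its derivative.

    P(z)  = 1/z^2 + sum_{w != 0} [1/(z - w)^2 - 1/w^2]
    P'(z) = -2 * sum_{w} 1/(z - w)^3
    with w = a*w1 + b*w2 and |a|, |b| <= n (truncation is an approximation).
    """
    p = 1.0 / z**2
    dp = -2.0 / z**3
    for a in range(-n, n + 1):
        for b in range(-n, n + 1):
            if a == 0 and b == 0:
                continue
            w = a * w1 + b * w2
            p += 1.0 / (z - w) ** 2 - 1.0 / w**2
            dp += -2.0 / (z - w) ** 3
    return p, dp


def wepe_features(row, col, grid_h, grid_w):
    """Four-dimensional positional feature for one patch (illustrative only)."""
    # Cell-centre coordinates lie strictly inside (0, 1), away from lattice poles.
    z = complex((col + 0.5) / grid_w, (row + 0.5) / grid_h)
    p, dp = weierstrass(z)
    return [p.real, p.imag, dp.real, dp.imag]


# Lookup table for a 14 x 14 patch grid, computed once before training.
table = [[wepe_features(r, c, 14, 14) for c in range(14)] for r in range(14)]
print(len(table), len(table[0]), len(table[0][0]))
```

Since the features depend only on grid position, a table like this can be precomputed once, which matches the abstract's remark about lookup tables. Values grow quickly near the lattice points, so a practical version would likely need some rescaling before the features are fed to the network.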

2605.23699 2026-05-25 cs.CV 版本更新

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

CRONOS：视频模型中反事实物理一致性的基准测试

León Begiristain, Olaf Dünkel, Adam Kortylewski

发表机构 * University of Freiburg（弗莱堡大学）； Max Planck Institute for Informatics（马克斯·普朗克信息学研究所）； CISPA Helmholtz Center for Information Security（CISPA亥姆霍兹信息安全中心）

AI总结本文提出CRONOS，一个基于干预的视频模型评估基准，用于检验模型在面对视觉输入变化时对物理事件预测的反事实一致性。该基准构建于高度逼真的Unreal Engine环境中，通过系统性地改变视角、场景、物体类别和外观等因素，而保持物理事件类型不变，从而评估模型对这些变化的鲁棒性。实验表明，当前主流视频生成模型在面对不同干预时，其预测质量存在显著下降，突显了现有模型在物理一致性方面的不足。CRONOS为研究和改进视频模型的物理理解能力提供了可控且可复现的测试平台。

Comments 27 pages, 12 figures

详情

AI中文摘要

视频预测日益被视为通向通用世界模型的途径，但目前尚不清楚这些系统是学习了潜在的因果结构，还是仅仅利用表面的视觉相关性进行未来预测。我们引入了CRONOS，一个基于干预的基准，旨在评估反事实物理一致性：即模型对物理事件的预测是否对视觉输入中的受控变化（如场景上下文、视角、物体外观和物体类别的变化）做出适当响应。CRONOS构建在逼真的Unreal Engine环境中，能够跨不同场景和动态生成受控、高保真的视频。与之前的基准相比，CRONOS系统性地干预了四个关键因素——视角、场景、物体类别和物体外观——同时保持底层物理事件类型（如碰撞、遮挡或坠落）不变。我们对近期开源视频生成器的评估揭示了反事实物理一致性的显著失败：同一物理事件类型的预测质量受到外观、环境，尤其是视角变化的影响。CRONOS提供了一个受控且可重复的测试平台，用于诊断不同干预下生成视频质量的变化，为开发在多种条件变化下表现一致的模型确立了具体目标。数据集和代码可在我们的项目页面获取。

英文摘要

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by changes in appearance, environment, and, in particular, viewpoint. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently when multiple conditions change. The dataset and code are available at our project page.
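
The intervention design can be pictured as varying one factor at a time around a base configuration while the event type stays fixed. The sketch below is only a schematic of that idea; the factor values, the `single_factor_interventions` helper, and the dictionary layout are hypothetical and do not reflect the benchmark's actual configuration format.

```python
# Hypothetical factor values; the benchmark's real options are not listed in the abstract.
FACTORS = {
    "viewpoint": ["front", "side"],
    "scene": ["kitchen", "street"],
    "object_category": ["ball", "box"],
    "object_appearance": ["red", "textured"],
}


def single_factor_interventions(base, factors):
    """Return (factor_name, variant) pairs that change exactly one factor of `base`.

    The physical event type stored in `base["event"]` is never modified.
    """
    variants = []
    for name, values in factors.items():
        for value in values:
            if value != base[name]:
                variant = dict(base)
                variant[name] = value
                variants.append((name, variant))
    return variants


base = {
    "event": "collision",
    "viewpoint": "front",
    "scene": "kitchen",
    "object_category": "ball",
    "object_appearance": "red",
}
for factor, variant in single_factor_interventions(base, FACTORS):
    print(factor, variant)
```

Comparing a model's prediction quality on `base` against each variant isolates which factor a quality drop is attributable to, which is the kind of diagnosis the abstract describes for viewpoint changes.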

2605.18214 2026-05-25 cs.CV 版本更新

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

EgoInteract: 用于交互理解和预测的合成自我中心视频生成

Rosario Leonardi, Francesco Ragusa, Daniele Materia, Alessandro Passanisi, James Fort, Jakob Engel, Giovanni Maria Farinella

发表机构 * Department of Mathematics and Computer Science, University of Catania（卡塔尼亚大学数学与计算机科学系）； Next Vision s.r.l.（Next Vision公司）； Reality Labs Research, Meta（Meta现实实验室）

AI总结本文提出EgoInteract，一个可控的模拟器，用于生成具有精细时空标注的以自我为中心的合成视频，旨在解决真实数据收集困难以及交互模式覆盖有限的问题。该模拟器支持对相机、人体和手部运动、物体操作及场景构图的精确控制，生成的视频数据可用于时序动作分割、下一时段活跃物体检测、交互预测等任务。实验表明，基于该模拟器训练的模型在多个真实世界的以自我为中心数据集上均取得了优于现有方法的性能，验证了其有效性和泛化能力。

详情

AI中文摘要

收集具有密集时空标注的大规模自我中心视频数据集成本高昂、速度缓慢，且常受环境偏差、隐私约束和交互模式覆盖有限的限制。虽然合成数据在多个视觉领域显示出巨大潜力，但其在自我中心感知中的应用仍相对未被充分探索，尤其是对于需要时间一致的人-物交互的任务。在这项工作中，我们引入了EgoInteract，一个用于自我中心视频生成的可控模拟器，旨在建模细粒度的自我中心交互及其时间动态。该模拟器能够精确控制相机、人体和手部运动、物体操作以及跨不同环境的场景组成。基于此框架，我们生成一个带有密集时空标注的合成自我中心视频数据集，用于时间动作分割、下一活动物体检测、交互预测和手-物交互检测。我们评估了在模拟数据上训练的模型在多个真实世界自我中心基准上的表现，这些基准涵盖不同环境、物体类别和交互模式。结果表明，在各项任务和数据集上，我们的方法相较于强基线有一致的改进，展示了基于模拟方法的有效性和可迁移性。

英文摘要

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

2605.07919 2026-05-25 cs.CV 版本更新

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

MedVIGIL: 在视觉证据受损下评估可信的医学视觉语言模型

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

发表机构 * University of Georgia（佐治亚大学）； Harvard Medical School（哈佛医学院）； Chungbuk National University（忠北国立大学）； Chungnam National University Hospital（忠南国立大学医院）； National University of Singapore（新加坡国立大学）； New York University（纽约大学）； University of Sydney（悉尼大学）； New Jersey Institute of Technology（新泽西理工学院）

AI总结本文提出MedVIGIL，一个用于评估医疗视觉-语言模型（VLMs）在面对失效视觉证据时可信度的基准测试。研究关注模型在图像或问题被篡改时是否仍能正确拒绝回答，而非给出流畅但错误的答案。MedVIGIL包含300个由放射科专家标注的案例，提供了多种评估指标和复合得分，用于衡量模型在不同失效场景下的表现，并公开了16个视觉模型和两个纯文本基线的评估结果。

详情

AI中文摘要

医学视觉语言模型（VLM）通常在完整的图像-问题对上进行评估，但可信的临床应用需要更强的性质：模型必须能够识别答案的证据基础何时失效。我们通过扰动证据下的静默失败来研究这一问题，其中需要视觉信息的医学问题与错误前提、措辞扰动、仅知识改写或ROI损坏的图像配对，但模型返回流畅的非拒绝答案。我们引入了MedVIGIL，一个从四个公共医学VQA来源中提取的300例评估套件，由四位委员会认证的放射科医生全程监督：每个黄金答案、拒绝选项、候选答案集、释义、错误前提陷阱、ROI框和临床风险等级均由临床医生撰写。两位主治放射科医生并行注释每个案例，一位高级放射科医生整合发布的清单，第四位未参与构建的放射科医生独立回答每个探针以提供人类参考基线。发布包含2556个MCQ探针、240个反事实三元组、医生裁定的风险等级和可回答性标志、ROI框以及配对的开放式变体。我们报告了七个以正确性为条件的审计指标，汇总为MedVIGIL复合评分（MCS），并审计了16个具备视觉能力的模型和两个纯文本基线。独立放射科医生的MCS得分为83.3，静默失败率为5.8%，比最强的受审计模型（Claude Opus 4.7，69.2）高出14.1个复合分。基准和评估工具已公开发布。

英文摘要

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
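
One plausible reading of the silent-failure rate is the share of broken-evidence probes on which a model still gives a non-refusal answer. The sketch below encodes that reading only; the field names `should_refuse` and `answer`, the `REFUSE` label, and the exact denominator are assumptions, since the abstract does not define the seven audit metrics in detail.

```python
REFUSE = "REFUSE"  # hypothetical label for the refusal option


def silent_failure_rate(probes):
    """Fraction of broken-evidence probes answered without refusing.

    probes: list of dicts with
      'should_refuse' (bool): True when the evidence was perturbed so that
                              refusing is the correct behaviour
      'answer' (str): the option the model selected
    """
    broken = [p for p in probes if p["should_refuse"]]
    if not broken:
        return 0.0
    silent = sum(1 for p in broken if p["answer"] != REFUSE)
    return silent / len(broken)


probes = [
    {"should_refuse": True, "answer": "B"},       # silent failure
    {"should_refuse": True, "answer": REFUSE},    # correct refusal
    {"should_refuse": False, "answer": "A"},      # intact probe, not counted
    {"should_refuse": True, "answer": REFUSE},    # correct refusal
    {"should_refuse": True, "answer": "C"},       # silent failure
]
print(silent_failure_rate(probes))  # 0.5
```

Under this reading, a model can score well on intact probes and still have a high silent-failure rate, which is the gap the benchmark is built to expose.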

2604.21502 2026-05-25 cs.CV 版本更新

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

VFM$^{4}$SDG：揭示VFM在单域广义目标检测中的力量

Yupeng Zhang, Ruize Han, Ningnan Guo, Wei Feng, Song Wang, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University（天津大学智能与计算学院）； Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology（深圳先进技术大学计算机科学与人工智能学院）

AI总结该研究针对单域通用目标检测（SDGOD）中因环境变化导致的性能下降问题，提出了一种基于视觉基础模型（VFM）的新型框架VFM$^{4}$SDG。通过分析发现，检测器在跨域场景下的性能下降主要源于关系结构的不稳定，而VFM在严重域偏移下仍能保持稳定的关系和物体响应，因此被用作跨域稳定性先验。该方法通过引入冻结的VFM，分别在编码器和解码器中进行关系先验蒸馏和语义-上下文查询增强，有效提升了检测器的跨域鲁棒性，并在多个基准测试中取得了显著优势。

详情

AI中文摘要

现实世界中的天气、光照和成像变化常常引起严重的域偏移，导致单源检测器在未见环境中性能下降。现有的单域广义目标检测（SDGOD）方法主要依赖于数据增强或域不变学习，而很大程度上忽略了域偏移如何破坏检测器的预测稳定性。通过分析实验，我们发现性能下降主要由漏检增加主导。进一步分析表明，这一现象源于DETR风格检测器的跨域稳定性降低：域偏移破坏了编码器侧的物体-背景和实例间关系，并进一步削弱了解码器查询与真实物体之间的语义-空间绑定。受此启发，我们发现视觉基础模型（VFM）在严重偏移下仍能保持稳定的关系结构和物体响应，使其成为补偿检测器退化的合适跨域稳定性先验。为此，我们提出了VFM$^{4}$SDG，一个用于SDGOD的双先验学习框架，它将冻结的VFM引入编码器表示学习和解码器查询建模。具体来说，我们提出了跨域稳定关系先验蒸馏，将VFM中的稳定物体-背景和实例间关系蒸馏到编码器中，补偿关系退化。同时，我们提出了基于语义-上下文先验的查询增强，在查询进入解码器层之前注入类别语义原型和全局物体上下文，增强语义-空间查询-物体绑定稳定性。大量实验表明，VFM$^{4}$SDG在标准SDGOD基准和两个主流基于DETR的检测框架上显著优于现有先进方法，证明了其有效性、鲁棒性和泛化性。

英文摘要

Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced cross-domain stability in DETR-style detectors: domain shift disrupts encoder-side object-background and inter-instance relations, and further weakens the semantic-spatial binding between decoder queries and real objects. Motivated by this, we find that vision foundation models (VFMs) still preserve stable relational structures and object responses under severe shifts, making them suitable cross-domain stability priors to compensate for detector degradation. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen VFM into encoder representation learning and decoder query modeling. Specifically, we propose Cross-domain Stable Relational Prior Distillation to distill stable object-background and inter-instance relations from the VFM into the encoder, compensating for relational degradation. Meanwhile, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category semantic prototypes and global object context into queries before they enter the decoder layer, enhancing semantic-spatial query-object binding stability. Extensive experiments show that VFM$^{4}$SDG significantly outperforms existing advanced methods on standard SDGOD benchmarks and two mainstream DETR-based detection frameworks, demonstrating its effectiveness, robustness, and generality.

2604.10077 2026-05-25 cs.CV 版本更新

渐进式 $\mathcal{J}$-不变自监督学习用于低剂量CT去噪

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

发表机构 * IWR, Heidelberg University, Heidelberg 69120, Baden-Württemberg, Germany； Silicon Austria Labs, Linz 4040, Upper Austria, Austria； Institute of Science Tokyo, Tokyo, Japan； College of Medicine & Biological Information Engineering, Northeastern University, Shenyang 110169, Liaoning, China； Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang 110169, Liaoning, China； Department of Epidemiology & Global Health, Umeå University, Umeå 90187, Sweden

AI总结本文研究了低剂量CT图像去噪中的自监督学习方法，旨在减少对配对正常剂量CT数据的依赖。为了解决现有方法因感受野受限导致的训练效率低和性能不足的问题，提出了一种渐进式$\mathcal{J}$-不变自监督学习方法，通过逐步盲区去噪机制和引入控制噪声来提升去噪效果。实验表明，该方法在Mayo低剂量CT数据集上优于现有自监督方法，并达到或超越了部分监督去噪方法的性能。

详情

AI中文摘要

自监督学习越来越多地被研究用于低剂量计算机断层扫描（LDCT）图像去噪，因为它减轻了对通常难以收集的配对正常剂量CT（NDCT）数据的依赖。然而，许多现有的自监督盲点去噪方法由于感受野受限，存在训练效率低下和性能次优的问题。为了缓解这一问题，我们提出了一种新颖的渐进式 $\mathcal{J}$-不变学习，最大化利用 $\mathcal{J}$-不变性来增强LDCT去噪性能。我们引入了一种逐步盲点去噪机制，以渐进方式强制执行条件独立性，从而实现更细粒度的去噪学习。此外，我们在训练过程中显式注入受控的高斯噪声和泊松噪声的组合，以正则化去噪过程并减轻过拟合。在Mayo LDCT数据集上的大量实验表明，所提出的方法持续优于现有的自监督方法，并实现了与几种代表性监督去噪方法相当或更好的性能。

英文摘要

Self-supervised learning has been increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to collect. However, many existing self-supervised blind-spot denoising methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields. To mitigate this issue, we propose a novel Progressive $\mathcal{J}$-invariant Learning method that maximizes the use of $\mathcal{J}$-invariance to enhance LDCT denoising performance. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained learning for denoising. Furthermore, we explicitly inject a combination of controlled Gaussian and Poisson noise during training to regularize the denoising process and mitigate overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.
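
For readers unfamiliar with $\mathcal{J}$-invariance, the sketch below shows the generic blind-spot setup that such methods build on: pixels are partitioned into disjoint subsets, and the loss for a subset is computed only on pixels that were hidden from the network input. This is a plain grid-partition illustration under assumed names (`grid_partitions`, `masked_mse`, `stride`); it does not implement the progressive, step-wise mechanism or the noise injection proposed in the paper.

```python
def grid_partitions(height, width, stride):
    """Split pixel coordinates into stride*stride disjoint subsets J.

    Pixels sharing the same (row % stride, col % stride) land in the same
    subset, so neighbours of a masked pixel stay visible to the denoiser.
    """
    parts = {}
    for r in range(height):
        for c in range(width):
            parts.setdefault((r % stride, c % stride), []).append((r, c))
    return list(parts.values())


def masked_mse(pred, noisy, subset):
    """Mean squared error restricted to the masked pixel subset J."""
    return sum((pred[r][c] - noisy[r][c]) ** 2 for r, c in subset) / len(subset)


# Toy 4 x 4 example: `pred` stands in for the denoiser output obtained while
# the pixels of one subset were hidden from its input.
noisy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pred = [[value + 1.0 for value in row] for row in noisy]
subsets = grid_partitions(4, 4, stride=2)
print(len(subsets), masked_mse(pred, noisy, subsets[0]))  # 4 1.0
```

Because the network never sees the pixels it is scored on, it cannot learn the identity mapping, which is what makes training on noisy images alone possible.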

2512.20901 2026-05-25 cs.CV 版本更新

Benchmarking and Enhancing VLM for Compressed Image Understanding

基准测试与增强VLM对压缩图像的理解

Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

发表机构 * Institute for AI Industry Research, Tsinghua University, Beijing, China（清华大学智能产业研究院）； Beihang University, Beijing, China（北京航空航天大学）； Beijing University of Technology, Beijing, China（北京工业大学）

AI总结随着图像压缩技术的广泛应用，如何提升视觉语言模型（VLM）对压缩图像的理解能力变得尤为重要。本文首次构建了一个全面的基准，用于评估VLM在不同压缩编码和任务下的表现，并分析了模型在压缩图像上的性能差距来源，发现仅通过增强模型泛化能力可以有效缓解这一问题。基于此，作者提出了一种通用的VLM适配器，能够在多种压缩格式和比特率下提升模型性能10%-30%，为VLM在压缩图像任务中的应用提供了重要参考。

Comments The paper is accepted by ICML 2026

详情

AI中文摘要

自主X光引导脊柱手术的机器人控制策略学习研究

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； Technical University of Munich（慕尼黑工业大学）； Johns Hopkins School of Medicine（约翰霍普金斯医学院）

AI总结本文研究了基于模仿学习的机器人控制策略在X射线引导脊柱手术中的应用，特别是在椎体成形术中导管插入任务中的可行性与挑战。研究构建了一个高度逼真的仿真环境，并构建了包含正确操作轨迹和双平面X射线序列的数据集，用于训练仅依赖视觉信息的模仿学习策略。实验表明，该策略在多种脊柱解剖结构和初始条件下均能实现安全的导管插入，为未来轻量化、无需CT的术中脊柱机器人导航提供了基础。

详情

DOI: 10.1007/s11548-026-03716-x

AI中文摘要

基于模仿学习的机器人控制策略在基于视频的机器人学中重新受到关注。然而，对于稀疏输入的X光引导手术（如脊柱内固定），这种方法是否适用尚不清楚。我们研究了在双平面引导的套管针插入中模仿策略学习的可行性、机遇和挑战。我们开发了一个用于可扩展、自动化模拟X光引导脊柱手术的计算机沙盒，具有高度逼真性。我们整理了一个包含正确轨迹和相应双平面X光序列的数据集，模拟了提供者的逐步对齐过程。然后，我们训练了用于规划和开环控制的模仿学习策略，该策略仅基于视觉信息在椎体成形术环境中迭代对齐套管针。这种精确控制的设置提供了对该方法局限性和能力的见解。我们的策略在68.5%的案例中首次尝试成功，在不同椎体水平上保持了安全的椎弓根内轨迹。该策略迁移到了复杂解剖结构（包括骨折）以及不同的解剖结构和初始位置。在真实X光上的展开表明，具有合理轨迹的部分仿真到真实迁移是可能的。尽管这些初步结果令人鼓舞，但我们还发现了局限性，特别是在入口点精度方面。当前的结果为未来的努力提供了明确的基准，而借助更稳健的先验和领域知识，此类模型可能为未来实现轻量级、无CT的机器人术中脊柱导航奠定基础。

英文摘要

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2510.15060 2026-05-25 cs.CV 版本更新

A solution to generalized learning from small training sets found in infant repeated visual experiences of individual objects

从婴儿个体物体重复视觉经验中发现的小训练集泛化学习问题的解决方案

Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith

发表机构 * Department of Computer Science, Luddy School of Informatics, Computing, and Engineering（卢迪信息学、计算与工程学院计算机科学系）； Department of Psychological and Brain Sciences（心理与脑科学系）

AI总结该研究探讨了婴儿在日常生活中通过重复视觉经验学习物体类别的方式，分析了14名一岁婴儿在用餐时拍摄的87段头部摄像头图像，涉及8类早期学习的物体。研究发现，每个婴儿对每个类别的视觉体验呈现高度偏态分布，即少数物体被频繁观看，而其他实例较少。通过图论方法分析，发现这些类别内部存在高相似性与高变异性并存的“块状”结构。实验表明，这种分布特征的人工训练集能够在极少样本的情况下支持模型对新实例的泛化，为人类和机器的视觉识别及学习机制提供了新见解。

Comments 28 pages, 7 figures, 3 tables

详情

AI中文摘要

一岁婴儿能快速形成并泛化他们遇到的日常物体类别。这里我们提供了关于婴儿日常视觉经验中8个早期学习物体类别的证据。使用婴儿在用餐时间记录的头戴摄像机图像语料库（14名婴儿记录的87次用餐时间），我们测量了每个类别独特实例的频率以及每个实例视觉经验的变异性。实例分布高度偏斜，对于每个婴儿和类别，包含大量同一少数物体的图像以及较少其他实例的图像。单个类别相似性结构的图论度量揭示了高相似度和高变异性的混合，组织成多个但相互连接的高相似度图像簇。在计算实验中，我们表明，以相似性团块分布为特征的人工创建的训练集在非常少的训练经验后支持对新实例的泛化。我们讨论了对视觉物体识别以及更一般的学习（包括人类和机器）的启示。

英文摘要

One-year-old infants rapidly form and generalize categories of the everyday objects they encounter. Here we provide evidence on infants' daily-life visual experiences for 8 early-learned object categories. Using a corpus of infant head-camera images recorded at mealtimes (87 mealtimes captured by 14 infants), we measure the frequency of the unique instances of each category and the variability of the visual experiences of each instance. The distribution of instances is highly skewed, containing, for each infant and category, many images of the same few objects along with fewer images of other instances. Graph-theoretic measures of the similarity structure for individual categories reveal a lumpy mix of high similarity and high variability, organized into multiple but interconnected clusters of high-similarity images. In computational experiments, we show that artificially created training sets characterized by a lumpy distribution of similarities support generalization to novel instances after very few training experiences. We discuss implications for visual object recognition, and for learning more generally, by both humans and machines.

2506.14135 2026-05-25 cs.RO cs.CV 版本更新

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

GAF: 高斯动作场作为机器人操作中动态世界建模的4D表示

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

发表机构 * Tsinghua University（清华大学）； Beijing Normal University（北京师范大学）； Shadow AI

AI总结本文提出了一种基于高斯动作场（GAF）的四维表示方法，用于机器人操作中的动态世界建模。GAF通过引入可学习的运动属性，扩展了三维高斯点绘（3DGS），实现了对动态场景和操作动作的四维建模。该方法能够直接从运动感知的四维表示中进行动作推理，并通过重建当前场景、预测未来帧和估计初始动作三个相关输出，提升操作精度。实验表明，GAF在重建质量和机器人操作成功率方面均优于现有方法。

Comments https://ChaiYing1.github.io/projects/GAF/

详情

AI中文摘要

准确的场景感知对于基于视觉的机器人操作至关重要。现有方法通常遵循视觉到动作（V-A）范式，直接从视觉输入预测动作，或视觉到3D到动作（V-3D-A）范式，利用中间3D表示。然而，由于操作场景的复杂性和动态性，这些方法常常面临动作不准确的问题。在本文中，我们采用V-4D-A框架，通过高斯动作场（GAF）从运动感知的4D表示中直接进行动作推理。GAF通过引入可学习的运动属性扩展了3D高斯溅射（3DGS），实现了动态场景和操作动作的4D建模。为了学习时变场景几何和动作感知的机器人运动，GAF提供三个相互关联的输出：当前场景的重建、未来帧的预测以及通过高斯运动估计的初始动作。此外，我们采用一个动作-视觉对齐的去噪框架，以GAF生成的初始动作和高斯感知的统一表示为条件，进一步获得更精确的动作。大量实验表明，GAF在重建质量上实现了显著改进，PSNR提高+11.5385 dB，SSIM提高+0.3864，LPIPS降低-0.5574，同时在机器人操作任务中，相比最先进方法，平均成功率提升+7.3%。

英文摘要

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements: GAF improves reconstruction quality by 11.5385 dB in PSNR, 0.3864 in SSIM, and 0.5574 in LPIPS (lower is better), while raising the average success rate in robotic manipulation tasks by 7.3% over state-of-the-art methods.

2403.12401 2026-05-25 cs.CV 版本更新

RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

RT-NeRV: 通过残差标记化重新思考混合神经视频表示

Yunjie Xu, Xiang Feng, Chengkai Wang, Alan Wee-Chung Liew, Xuefei Yin, Yanming Zhu

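Affiliations: Ningbo University; Hangzhou Dianzi University; Griffith University

AI summary: This paper proposes RT-NeRV, a hybrid neural video representation that addresses the difficulty existing methods have in preserving fine detail at low bitrates. Its core idea is residual tokenization: shallow residual features and inter-frame residual cues are discretized into compact residual tokens that can be transmitted efficiently and used by the decoder for reconstruction. A residual tokenizer and a residual-aware codebook learning strategy improve reconstruction quality and training stability, and the method outperforms existing hybrid NeRV methods on video regression and restoration tasks.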

Comments Under Review

AI中文摘要

神经视频表示（NeRV）通过将视频表示为紧凑的神经网络并实现高效解码，已成为视频压缩的一种有前景的范式。混合NeRV方法通过内容自适应嵌入进一步提高了重建质量，但在低比特率下仍难以保留精细细节。一个关键限制是，浅层残差支持信息虽然对重建非常有益，但其连续形式的传输成本高昂，因此未被充分利用。在本文中，我们重新思考混合NeRV，并提出了RT-NeRV，一种用于混合神经视频表示的残差标记化框架。核心思想是将浅层残差特征和帧间残差线索离散化为紧凑的残差标记，从而使得信息丰富的重建支持能够高效传输并被解码器利用。为此，我们设计了一个残差标记化器，并结合了一种残差感知的码本学习策略，该策略提高了标记利用率并稳定了训练。RT-NeRV可以轻松集成到现代混合NeRV主机中，持续增强细节保留、重建质量以及比特率-质量权衡。在视频回归和相关恢复任务上的大量实验表明，RT-NeRV优于强混合NeRV基线，并与近期基于INR的视频压缩方法保持竞争力。这些结果表明，残差标记化是推进混合神经视频表示的一个有效且互补的方向。

英文摘要

Neural Representations for Videos (NeRV) have emerged as a promising paradigm for video compression by representing videos as compact neural networks with efficient decoding. Hybrid NeRV methods further improve reconstruction quality through content-adaptive embeddings, but still struggle to preserve fine details at low bitrates. A key limitation is that shallow residual support information, although highly beneficial for reconstruction, is costly to transmit in its continuous form and is therefore underutilized. In this paper, we rethink hybrid NeRV and present RT-NeRV, a residual tokenization framework for hybrid neural video representations. The core idea is to discretize shallow residual features and inter-frame residual cues into compact residual tokens, allowing informative reconstruction support to be transmitted efficiently and exploited by the decoder. To this end, we design a residual tokenizer together with a residual-aware codebook learning strategy that improves token utilization and stabilizes training. RT-NeRV can be readily integrated into modern hybrid NeRV hosts, consistently enhancing detail preservation, reconstruction quality, and bitrate-quality trade-offs. Extensive experiments on video regression and related restoration tasks show that RT-NeRV outperforms strong hybrid NeRV baselines and remains competitive with recent INR-based video compression methods. These results demonstrate that residual tokenization is an effective and complementary direction for advancing hybrid neural video representations.

2605.23672 2026-05-25 cs.CV 版本更新

DualMem: 绕过目标性瓶颈以实现开放世界目标检测中校准的未知流过滤

Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

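Affiliations: School of Artificial Intelligence, Guangzhou University; School of Computer Science and Cyber Engineering, Guangzhou University; Institute of Computing Science and Technology, Guangzhou University

AI summary: Open-world object detection (OWOD) requires a detector to localize known classes while also identifying unknown objects for future incremental learning. The paper finds that background false positives make up a large share of the unknown prediction stream of current strong OWOD detectors, and traces the problem to an information bottleneck at the objectness head. It proposes DualMem, a calibrated post-hoc filter that runs a non-parametric likelihood ratio test in a frozen SigLIP feature space to screen unknown proposals, improving the accuracy of unknown-object identification while leaving known-class detection performance unchanged.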

AI中文摘要

开放世界目标检测（OWOD）要求检测器定位已知类别，同时识别未知对象以进行未来的增量学习。我们发现，强OWOD检测器的未知预测流受到严重污染：在M-OWODB上，对于PROB、OW-DETR和HypOW，未来任务的正未知样本仅占未知预测的不到10%，而背景假阳性则占46-71%。我们表明，这不是信息缺失问题，而是目标性头部的信息瓶颈。在PROB任务1上，对256维解码器查询的线性探针在正负未知区分上达到了0.908的AUROC，但最终的一维目标性标量降至0.642。一个冻结的SigLIP特征，无需访问检测器，在过滤阶段独立恢复了大部分这种提议级别的可分离性（AUROC = 0.871）。基于这一发现，我们提出DualMem，一种校准的后验过滤器，它假设一个小的、图像不相交的、标注了未来任务对象的校准分割，并在冻结的SigLIP特征空间中执行非参数似然比检验。DualMem使用k近邻正记忆来保护未来任务对象，并使用负记忆来抑制类似背景的提议。其决策阈值通过Neyman-Pearson校准选择，为用户提供了假未知抑制与新奇召回之间的显式权衡。在M-OWODB任务1上的PROB、OW-DETR和HypOW中，DualMem将每幅图像的背景型假未知提议减少了44.9%-66.3%，平均减少56.6%。在PROB任务1上，它使自然K-means原型基线的减少量翻倍以上，同时保持已知类别的mAP不变，因为已知检测绕过过滤器。

英文摘要

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
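The filtering idea in the abstract (a k-nearest-neighbor likelihood ratio between a positive and a negative memory, with a threshold calibrated on held-out positives) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the DualMem implementation: the synthetic Gaussian features stand in for frozen SigLIP features, the kNN density-ratio formula is the textbook estimator, and the choice to bound the miss rate on positives (in the Neyman-Pearson spirit) is an assumption. All function names are hypothetical.

```python
import numpy as np


def kth_neighbor_distance(queries, memory, k):
    """Distance from each query to its k-th nearest neighbor in memory."""
    dists = np.linalg.norm(queries[:, None, :] - memory[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, k - 1]


def knn_log_likelihood_ratio(queries, pos_memory, neg_memory, k=5):
    """log p_pos(x) - log p_neg(x) using kNN density estimates.

    A kNN density estimate is proportional to k / (n * r_k^d), so the log ratio
    reduces to d * (log r_neg - log r_pos) + log(n_neg / n_pos).
    """
    dim = queries.shape[1]
    r_pos = kth_neighbor_distance(queries, pos_memory, k)
    r_neg = kth_neighbor_distance(queries, neg_memory, k)
    tiny = 1e-12  # guards against log(0) for exact duplicates
    return (dim * (np.log(r_neg + tiny) - np.log(r_pos + tiny))
            + np.log(len(neg_memory) / len(pos_memory)))


def recall_constrained_threshold(calib_pos_scores, min_recall=0.95):
    """Largest threshold that still keeps at least min_recall of the positives."""
    ordered = np.sort(calib_pos_scores)
    index = min(int(np.floor((1.0 - min_recall) * len(ordered))), len(ordered) - 1)
    return float(ordered[index])


# Synthetic stand-ins for frozen image features (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
pos_memory = rng.normal(+1.0, 1.0, (200, dim))   # future-task objects
neg_memory = rng.normal(-1.0, 1.0, (200, dim))   # background-like proposals
calib_pos = rng.normal(+1.0, 1.0, (200, dim))    # held-out calibration positives
test_neg = rng.normal(-1.0, 1.0, (200, dim))     # unseen background proposals

calib_scores = knn_log_likelihood_ratio(calib_pos, pos_memory, neg_memory)
threshold = recall_constrained_threshold(calib_scores, min_recall=0.95)

calib_recall = float(np.mean(calib_scores >= threshold))
test_scores = knn_log_likelihood_ratio(test_neg, pos_memory, neg_memory)
suppressed = float(np.mean(test_scores < threshold))
print(f"calibration recall={calib_recall:.3f}, background suppressed={suppressed:.3f}")
```

Proposals scoring below the threshold would be dropped from the unknown stream; raising `min_recall` protects more true unknowns at the cost of letting more background through, which is the trade-off the abstract describes.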

2605.23629 2026-05-25 cs.CV 版本更新

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

DDX-TRACE: 视觉语言模型中医学诊断轨迹的基准

Jiazhen Pan, Weixiang Shen, Jun Li, Julian Canisius, Felix Bitzer, Paula Roßmüller, Jiancheng Yang, Virginie Kreutzinger, Daniel Rueckert, Benedikt Wiestler

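Affiliations: Technical University of Munich; TUM University Hospital; Munich Center for Machine Learning; LMU Munich; Aalto University; Imperial College London

AI summary: DDX-TRACE is a benchmark for evaluating how vision-language models perform during the medical diagnostic process. It focuses on neuroradiology and contains 211 challenging cases. The benchmark simulates a realistic diagnostic workup: starting from limited clinical information, the model must request imaging studies step by step, update its differential diagnosis, and commit to a final diagnosis. The study finds that scoring only the final answer can misrepresent diagnostic quality, and that by examining the diagnostic trajectory DDX-TRACE exposes key weaknesses in evidence acquisition, uncertainty updating, and reasoning.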

Comments 41 pages

AI中文摘要

医学诊断并非来自完全指定的病例的单次预测。它是一个序贯工作流程：临床医生决定获取哪些证据，修订鉴别诊断，并在诊断得到充分支持时停止。大多数医学AI基准则提前揭示相关背景，仅对最终答案评分，使得无依据的正确猜测、过早闭合、低效工作流以及不良的不确定性更新变得不可见。我们引入了DDX-TRACE，一个由医生裁决的多模态神经放射学基准，在211个具有挑战性的病例中评估隐藏证据下的诊断轨迹。每个病例从有限的临床病史开始；模型以自由形式请求影像研究，在可用时接收匹配的图像包，每轮后更新概率性鉴别诊断，并以定位的最终诊断结束。评估最先进的VLM，我们发现最终诊断分数可能严重歪曲工作流质量：模型可能在没有必要证据的情况下猜测合理的诊断，请求有用的研究但误解原始图像，或者低效地获取证据同时更新不确定性不佳。受控证据变体隔离了规划、视觉证据提取和下游鉴别推理中的瓶颈。DDX-TRACE将医学AI评估从最终答案转向证据支持的诊断轨迹。

英文摘要

Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

2605.23610 2026-05-25 cs.CV cs.AI 版本更新

PathNavigate: 一种无需训练的病理学代理，具有惊喜引导扫描和共享幻灯片记忆用于全切片图像VQA

Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

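Affiliations: School of Comp. Science & Technology, Xi’an Jiaotong University; Tencent Jarvis Lab; University of Cambridge

AI summary: PathNavigate is a training-free pathology question-answering agent for whole-slide image visual question answering (WSI-VQA), where the key pathological evidence must be located efficiently under a limited inspection budget. It follows a scan-search-readout routine: a shared online memory module produces a pool of abnormal regions, and question-conditioned relevance then selects high-magnification targets within that pool, improving answer accuracy and interpretability. Experiments show that, with all models kept frozen, PathNavigate achieves higher efficiency and more reliable evidence-selection trajectories.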

AI中文摘要

全切片图像视觉问答（WSI-VQA）将病理学视为极端上下文搜索问题：为了回答自由形式的临床查询，系统必须首先在严格的检查预算下导航千兆像素切片，以定位稀疏的高分辨率证据。现有方法主要分为两种范式：i）监督式病理学多模态大语言模型（MLLMs）和代理可以将定位和推理吸收到学习模块中，但它们通常将导航与任务特定的监督和重新训练耦合，限制了其实用性；ii）无需训练的病理学代理通过保持核心模型冻结来避免这种成本，但通常遵循问题优先的设计，主要从查询条件相关性构建初始候选集。这可能会遗漏问题中未提及的决定性形态，并迫使更重的推理时脚手架。为了解决这一挑战，我们引入了PathNavigate，一种无需训练的病理学代理，基于扫描-搜索-读出流程构建。在问题匹配之前，PathNavigate在低放大倍数下扫描当前切片，使用共享的在线记忆模块处理冻结的病理学特征，生成一个切片特定的惊喜场，标记异常区域池。然后，它仅在此池内应用问题条件的PLIP相关性，以选择高放大倍数的搜索目标。最后，它提取局部高放大倍数证据，并使用冻结的感知器-裁决器堆栈进行回答，利用相同的在线记忆作为切片级上下文。在WSI-VQA和SlideBench-BCNB上的实验表明，所提出的扫描-搜索-读出设计提高了答案准确性，并产生了更可解释的证据选择轨迹，且效率更高。代码已在线公开。

英文摘要

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23555 2026-05-25 cs.CV 版本更新

Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

生成器-精炼器-检验器：一种用于从单目视频学习3D人体虚拟形象的三模块数据增强框架

Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, Hao Wang

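Affiliations: Nanyang Technological University

AI summary: This paper studies the challenge of reconstructing animatable 3D human avatars with realistic appearance from monocular videos. To address the difficulty existing methods have in capturing detail when data is scarce, it proposes TrioMan, a tri-module data augmentation framework with three cooperating components: a generator that produces diverse samples, a refiner that improves generation quality, and an examiner that filters for samples consistent with the human subject. Experiments show that the method outperforms existing state-of-the-art methods on multiple benchmark datasets.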

2605.23523 2026-05-25 cs.CV 版本更新

ComPose: When to Trust Hands for Object Pose Tracking

ComPose：何时信任手部进行物体姿态跟踪

Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon

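Affiliations: GIST; Yonsei Univ.; DGIST

AI summary: This paper proposes ComPose, a 6DoF object pose tracking framework for robustly tracking hand-occluded objects in RGB video. Rather than treating the hand purely as an occluder, the method uses hand motion as a complementary cue, combining object and hand cues in a unified tracking pipeline: it adaptively selects informative hand joints, fuses cues from both sources, and refines the result with geometric evidence to obtain stable and accurate object trajectories. Experiments show strong performance under severe occlusion and geometric ambiguity, and the temporally consistent 3D trajectories, obtained without external smoothing, are suitable for downstream tasks such as robot manipulation.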

Comments 22 pages, 10 figures

AI中文摘要

从视频中重建物体运动是具身AI和机器人操作的关键组成部分。尽管已经研究了多种物体姿态跟踪方法，但它们严重依赖强大的外部先验（如深度数据或3D模板），并且即使使用显式掩码，仍然极易受到手部抓取造成的严重遮挡的影响。在这项工作中，我们提出了ComPose，一个6DoF物体跟踪框架，旨在从RGB视频中进行手部感知的物体姿态估计。我们的方法不是将手部纯粹视为遮挡物，而是将手部运动协调为物体跟踪的补充线索。具体来说，我们通过在一个统一的跟踪流程中结合来自基础模型的物体和手部线索，随时间恢复多种物体运动。在此，ComPose自适应地选择信息丰富的手部关节，结合物体和手部衍生的线索进行运动估计，并使用可见的几何证据和学习到的校正来细化所得的物体运动。我们进一步在旋转和平移上强制时间一致性，从而在没有外部平滑的情况下产生稳定的3D物体轨迹。大量实验表明，我们的方法在严重手部遮挡和几何模糊下准确、高效且鲁棒。此外，所得的轨迹还可以通过使机器人能够从在线视频中重建人类动作，有效地转移到下游机器人操作中。

英文摘要

Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a complementary cue for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

2605.23522 2026-05-25 cs.LG cs.AI cs.CV 版本更新

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Precise: 用于流匹配模型强化学习后训练的SDE一致随机采样

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

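Affiliations: Peking University; Tencent Hunyuan

AI summary: This paper studies reinforcement learning (RL) post-training of flow-matching models to improve generation quality and prompt alignment. The key step is turning the deterministic sampling trajectory into a stochastic policy, using a sampler that is consistent with the underlying stochastic differential equation (SDE) and balances exploration against stability. The proposed sampler, Precise, keeps the denoising trajectory SDE-consistent while reducing excess noise, and experiments show that it outperforms existing methods in both reward optimization speed and generation quality.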

AI中文摘要

强化学习已成为提升扩散和流匹配生成器中提示对齐和感知质量的有效方法。将在线强化学习应用于流匹配的关键步骤是将确定性采样轨迹转化为随机策略，通常通过用随机微分方程替代逆向常微分方程来实现。随机采样器控制探索行为和去噪动力学，因此是策略的一部分，其设计会显著影响奖励优化性能。我们将采样器设计分解为两个相互依赖的组成部分：选择适量的随机探索，以及在强化学习中使用的少量步数下忠实地离散化得到的SDE。针对第一个组成部分，我们分析了去噪过程中探索与稳定性之间的固有张力，并推导出平衡两者的SDE调度。针对离散化挑战，我们使用一个玩具示例表明，现有采样器可能偏离流匹配过程，要么引入过多的离散化噪声，要么依赖不能保证收敛到数据分布的启发式规则。为解决这些问题，我们提出了Precise，一种新的随机采样器，平衡了有效探索与稳定性。关键地，Precise通过一种冻结干净潜变量后验均值的新颖近似，使去噪轨迹保持SDE一致，解决了标准采样器中的过度噪声问题。大量实验表明，该公式通过强化学习实现了显著更快且更稳定的奖励优化，达到了最先进的对齐分数（例如PickScore、HPSv2.1），同时匹配先前采样器的最佳域内性能所需的训练时间减少了13.1-53.2%。

英文摘要

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

2605.23518 2026-05-25 cs.CV 版本更新

高效的一步扩散修复模型：紧凑令牌压缩与线性注意力

Bingtian Qiao, Yue Shi, Yingjie Zhou, Yong Guo, Guangtao Zhai, Jiezhang Cao

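Affiliations: Shanghai Jiao Tong University; Fuzhou University; Shanghai AI Laboratory

AI summary: This paper targets the heavy computation, high memory use, and high inference latency of existing methods for real-world image super-resolution, and proposes SANA-SR, an efficient one-step restoration framework. A deep compression autoencoder with a 32x compression ratio sharply reduces the number of latent tokens, and a linear-attention DiT with LoRA fine-tuning provides high-resolution restoration with linear complexity. Experiments show strong quantitative results on multiple benchmark datasets, with a small parameter count and fast inference that make the model promising for practical deployment.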

AI中文摘要

真实图像超分辨率旨在从复杂且未知的真实退化中恢复高质量图像。然而，现有的生成式Real-ISR方法很大程度上继承了为高分辨率图像合成开发的密集潜在表示和二次成本全局建模范式，导致计算、内存使用和推理延迟随分辨率增长而不利地扩展，从而限制了实际部署。我们认为关键瓶颈不在于修复先验不足，而在于高分辨率修复过程中过多的令牌冗余和昂贵的令牌交互。受此观察启发，我们从紧凑潜在表示和线性复杂度建模的角度重新审视Real-ISR，提出了SANA-SR，一种高效的一步修复框架。具体来说，SANA-SR采用具有32倍压缩比的深度压缩自编码器，大幅减少潜在令牌，同时保留与修复相关的结构和纹理。在此紧凑潜在空间之上，我们引入了带有LoRA微调的线性注意力DiT，实现了具有线性复杂度令牌混合的高效高分辨率修复。在所有基准数据集上的大量实验表明，SANA-SR在定量性能上与现有方法高度竞争且通常更优，同时恢复出更清晰、更真实的纹理。此外，剪枝后，部署的模型运行时间为0.019秒，MACs为407.95G，参数量为344M，突显了其在移动设备上实际部署的强大潜力。

英文摘要

Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.
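The two efficiency levers named in the abstract, fewer latent tokens and linear-complexity token mixing, can be illustrated with a short sketch. It is not the SANA-SR code: the ReLU feature map is an assumed choice of kernel, the 8x comparison point is an assumed conventional autoencoder (only the 32x ratio comes from the abstract), and all function names are hypothetical. The sketch checks that the linear-time form gives the same output as the explicit N x N kernel attention it replaces.

```python
import numpy as np


def relu_feature(x):
    """Non-negative feature map; the small offset keeps the normalizer nonzero."""
    return np.maximum(x, 0.0) + 1e-6


def linear_attention(q, k, v):
    """Kernel attention in O(N * d * d_v): summarize keys/values once, then query."""
    phi_q, phi_k = relu_feature(q), relu_feature(k)
    kv = phi_k.T @ v               # (d, d_v) summary of all key/value pairs
    k_sum = phi_k.sum(axis=0)      # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ k_sum)[:, None]


def quadratic_reference(q, k, v):
    """Same kernel attention written with the explicit N x N weight matrix."""
    phi_q, phi_k = relu_feature(q), relu_feature(k)
    weights = phi_q @ phi_k.T
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


def latent_tokens(height, width, compression):
    """Number of latent tokens when each spatial side shrinks by `compression`."""
    return (height // compression) * (width // compression)


rng = np.random.default_rng(0)
q = rng.normal(size=(64, 16))
k = rng.normal(size=(64, 16))
v = rng.normal(size=(64, 8))
same = np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v))

# Token counts for a 1024 x 1024 image (8x is an assumed baseline for comparison).
tokens_8x = latent_tokens(1024, 1024, 8)
tokens_32x = latent_tokens(1024, 1024, 32)
print(same, tokens_8x, tokens_32x)
```

Both effects compound: the 32x autoencoder yields 16 times fewer tokens than the assumed 8x baseline, and the cost of mixing those tokens grows linearly rather than quadratically with their number.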

2605.23449 2026-05-25 cs.LG cs.CV math.AG 版本更新

Commutator-Induced Uncertainty in VAEs

VAE中的换位子引发的不确定性

Tahereh Dehdarirad, Michael Felsberg, Gabriel Eilertsen, Ziliang Xiong

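Affiliations: Computer Vision and Learning Systems (CVL), Linköping University, Sweden; Department of Science and Technology, Linköping University, Sweden

AI summary: Variational autoencoders (VAEs) often struggle to represent non-commutative structure in their learned latent spaces. This paper proposes a Lie group VAE framework that analyzes uncertainty from combined geometric and algebraic perspectives and separates discrete generative factors from continuous geometric transformations. By diagnosing algebraic non-commutativity and aligning the decoder's sensitivity with it, the method improves reconstruction quality and consistency with the latent structure, and shows better reconstructions and latent traversals on several benchmark datasets.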

AI中文摘要

变分自编码器（VAE）通常难以表示学习到的潜在空间中的非交换结构。对称感知的VAE通常通过代数正则化强制交换性来解决这个问题，这适用于交换变换群，但当非交换性是数据内在特性时会抑制有意义的非交换结构。我们认为，非交换性应被明确诊断并反映在重建行为中。我们引入了一个李群VAE框架，该框架结合了几何和代数视角下的不确定性，同时将离散生成因子与连续几何变换分开。在第一阶段，模型在没有结构约束的情况下进行训练，同时通过有限Baker-Campbell-Hausdorff偏差测量代数非交换性，并通过重建顺序交换测试测量解码器顺序敏感性。这些诊断揭示了在无约束训练下潜在非交换性与重建行为之间的尺度不匹配。在第二阶段，我们引入了一个具有数据驱动校准常数的变形稳定性约束，使解码器敏感性与代数非交换性对齐。我们在dSprites、3DShapes、3DCars和CelebA上评估了该框架，并与通用和对称感知基线（包括beta-VAE、CLG-VAE和CFASL）进行了比较。在合成基准上，该方法提高了重建质量，并产生了与潜在非交换结构更一致的解码器行为。定性分析显示了更清晰的顺序依赖潜在组合和更稳定的重建。在CelebA上，该模型比CFASL产生了更忠实的重建和因子特定的潜在遍历，同时在学习的潜在方向之间也表现出有意义的顺序依赖交互。

英文摘要

Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
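The link between an algebraic commutator and an order-swap test can be seen on a small, self-contained example. The sketch below is not the paper's diagnostic: it uses the rotation generators of so(3) as stand-ins for learned latent generators, a truncated Taylor series as the matrix exponential, and hypothetical function names. It shows that applying two transformations in swapped order differs, to leading order in the Baker-Campbell-Hausdorff expansion, by the commutator scaled by the squared step size.

```python
import numpy as np


def expm_taylor(matrix, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    result = np.eye(matrix.shape[0])
    term = np.eye(matrix.shape[0])
    for n in range(1, terms):
        term = term @ matrix / n
        result = result + term
    return result


def commutator(a, b):
    """Lie bracket [a, b] = ab - ba; zero exactly when the generators commute."""
    return a @ b - b @ a


def order_swap_gap(a, b):
    """exp(a) exp(b) - exp(b) exp(a): how much the order of application matters."""
    ea, eb = expm_taylor(a), expm_taylor(b)
    return ea @ eb - eb @ ea


# so(3) generators: infinitesimal rotations about the x, y, and z axes.
L_X = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
L_Y = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
L_Z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

# Two rotations about the same axis commute; rotations about x and y do not.
commuting_gap = np.linalg.norm(order_swap_gap(0.3 * L_Z, 0.5 * L_Z))
noncommuting_gap = np.linalg.norm(order_swap_gap(0.3 * L_X, 0.5 * L_Y))

# For a small step t, the swap gap is t^2 [L_X, L_Y] up to O(t^3) terms.
t = 1e-3
gap_small = order_swap_gap(t * L_X, t * L_Y)
leading_term = t ** 2 * commutator(L_X, L_Y)
print(commuting_gap, noncommuting_gap, np.linalg.norm(gap_small - leading_term))
```

In a VAE setting the analogous comparison would be made on decoded reconstructions rather than on matrices, which is where a scale mismatch between the algebraic and decoder-level quantities can appear.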

2605.23445 2026-05-25 cs.CV 版本更新

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

DFSAttn：面向高效视频生成的动态细粒度稀疏注意力

Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

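Affiliations: Peking University

AI summary: This paper proposes DFSAttn, a dynamic fine-grained sparse attention mechanism that makes diffusion transformers more efficient for video generation. To address the quality loss of existing block sparse attention at high sparsity ratios, the authors derive a theoretical lower bound on attention recall and design a training-free sparse attention framework with three core components: Hilbert curve-based token reordering, hierarchical block scoring, and sparse mask caching with adaptive ratios. Experiments show up to a 2.1x end-to-end speedup while maintaining high generation quality.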

Comments ICML 2026; 17 pages, 8 figures

AI中文摘要

扩散变换器在高品质视频生成中取得了显著成功，但其对时空3D全注意力的依赖由于注意力的二次复杂度而产生了高昂的计算成本。块稀疏注意力是一种常见方法，通过将计算集中在重要区域来缓解这一问题。然而，DiTs中的注意力图表现出固有的动态和细粒度稀疏性，这导致现有的块稀疏注意力方法在质量上显著下降，尤其是在高稀疏率下。在本文中，我们重新审视块稀疏注意力，并推导出注意力召回率的理论下界，以刻画影响其有效性的关键因素。在这些见解的指导下，我们提出了DFSAttn，一种无需训练的稀疏注意力框架，能够高效地实现动态、细粒度的稀疏化。DFSAttn包含三个核心设计：基于希尔伯特曲线的令牌重排序以实现细粒度稀疏性同时保持高效的GPU执行，分层块评分以准确估计块重要性，以及具有自适应比率的稀疏掩码缓存以平衡准确性和效率。实验结果表明，DFSAttn在高稀疏度下始终优于先前方法，在保持高生成质量的同时实现了高达2.1倍的端到端加速。我们的代码已开源，可在https://github.com/jessica-hujie/DFSAttn获取。

英文摘要

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
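Attention recall, the quantity the abstract bounds, can be read as the share of softmax attention mass that a block-sparse mask retains. The sketch below measures it on random data under simplifying assumptions: block importance is taken from the full attention map (an oracle that a practical method would have to approximate), each query block keeps its top-scoring key blocks, and all names are hypothetical. It does not reproduce DFSAttn's reordering, scoring, or caching.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def block_attention_mass(attn, block):
    """Sum attention weights inside each (query block, key block) tile."""
    n_q, n_k = attn.shape[0] // block, attn.shape[1] // block
    return attn.reshape(n_q, block, n_k, block).sum(axis=(1, 3))


def topk_block_mask(mass, keep):
    """Boolean mask keeping the `keep` heaviest key blocks for every query block."""
    order = np.argsort(-mass, axis=1)[:, :keep]
    mask = np.zeros_like(mass, dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask


def attention_recall(mass, mask):
    """Fraction of the total attention mass that survives the block mask."""
    return float(mass[mask].sum() / mass.sum())


rng = np.random.default_rng(0)
tokens, dim, block = 64, 16, 8
q = rng.normal(size=(tokens, dim))
k = rng.normal(size=(tokens, dim))
attn = softmax(q @ k.T / np.sqrt(dim))

mass = block_attention_mass(attn, block)
num_blocks = mass.shape[1]
recalls = [attention_recall(mass, topk_block_mask(mass, keep))
           for keep in range(1, num_blocks + 1)]
print([round(r, 3) for r in recalls])
```

Keeping the top `keep` of `num_blocks` key blocks can never retain less than a `keep / num_blocks` share of the mass, and it retains much more when attention is concentrated; how well blocks align with that concentration is what token reordering aims to improve.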

2605.23428 2026-05-25 cs.CV cs.MM 版本更新

使用3D卷积神经网络的在线手势识别

Yinghao Qin, Tijana Timotijevic

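Affiliations: School of Electronic Engineering and Computer Science; Queen Mary, University of London

AI summary: This paper presents an online hand gesture recognition system based on 3D convolutional neural networks that localizes and classifies gestures in real-time video streams. To improve robustness, a sliding-window approach refines the results across multiple windows. Trained on the Jester dataset, the system reaches over 98% detection accuracy and over 90% classification accuracy; on a self-collected dataset it reaches 37.5% Levenshtein accuracy, with a response time under three seconds.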

Comments Master's dissertation work written in Autumn 2020
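Levenshtein accuracy scores a predicted gesture sequence against the ground-truth sequence by edit distance. The sketch below assumes the definition commonly used in online gesture recognition, one minus the edit distance divided by the number of ground-truth gestures; the dissertation may define it differently. The gesture labels and function names are hypothetical.

```python
def levenshtein(predicted, target):
    """Minimum number of insertions, deletions, and substitutions between sequences."""
    previous = list(range(len(target) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (p != t)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def levenshtein_accuracy(predicted, target):
    """1 - distance / len(target); can go negative if predictions are very noisy."""
    return 1.0 - levenshtein(predicted, target) / len(target)


# One gesture in the stream was missed, so the distance is a single insertion.
target = ["swipe_left", "thumb_up", "zoom_in", "stop"]
predicted = ["swipe_left", "thumb_up", "stop"]
accuracy = levenshtein_accuracy(predicted, target)
print(accuracy)
```

Unlike per-clip classification accuracy, this metric penalizes missed, duplicated, and spurious detections in a continuous stream, which is why it can be far lower than offline accuracy on the same model.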

2605.23406 2026-05-25 cs.CV 版本更新

CHASD：面向LVLMs中幻觉的语言增量校准对比解码

Xiaoyi Huang, Kejia Zhang, Zhiming Luo

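Affiliations: Institute of Artificial Intelligence, Xiamen University; Department of Artificial Intelligence, Xiamen University

AI summary: This paper studies object hallucination in large vision-language models (LVLMs), which arises when language priors dominate, and proposes CHASD, a training-free contrastive decoding method. CHASD builds the negative branch from attention-guided localized visual perturbations and applies contrastive calibration only to low-confidence tokens during generation, suppressing hallucinations while preserving inference efficiency. Experiments show improved hallucination-related metrics on multiple benchmarks over existing training-free baselines.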

AI中文摘要

大型视觉-语言模型展现了强大的多模态推理能力，但当语言先验主导不足或错位的视觉证据时，它们仍然容易产生对象幻觉。无训练对比解码方法通过比较原始和扰动视觉输入的预测来缓解此问题，但现有方法要么应用可能改变有用视觉证据的全局扰动，要么在每个解码步骤调用额外的负分支。在本文中，我们观察到幻觉风险是瞬态且特定于token的：视觉注意力在生成的token间转移，而一些功能token以高置信度产生，不需要对比校准。基于这一观察，我们提出面向大型视觉-语言模型的对比幻觉感知逐步解码（CHASD），一种“按需校准”的推理时框架。CHASD使用不确定性驱动的置信门控，仅当下一token的最大概率低于阈值时激活对比分支，并通过注意力引导的局部扰动构建负分支，扰动当前显著的视觉token。这种设计减少了不必要的负分支前向传播，同时保留了高置信度步骤的原始分布。在POPE、AMBER、MME、MMHal-Bench和CHAIR上的实验表明，CHASD在强无训练基线上改进了幻觉相关指标，并具有有竞争力的推理效率。

英文摘要

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum next-token probability falls below a threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
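The gating logic described above can be sketched for a single decoding step. This is a hedged illustration rather than the CHASD algorithm: the contrastive combination used here is the common form from earlier contrastive decoding work, (1 + alpha) times the original log-probabilities minus alpha times the negative-branch log-probabilities, and the threshold, `alpha`, toy logits, and function names are all assumptions. The negative branch is passed as a callable so that it is only evaluated when the gate opens.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def gated_contrastive_step(logits, negative_branch, threshold=0.9, alpha=1.0):
    """Return (log-probabilities, whether the negative branch was used)."""
    log_probs = log_softmax(logits)
    if np.exp(log_probs).max() >= threshold:
        return log_probs, False            # confident step: distribution untouched
    neg_log_probs = log_softmax(negative_branch())
    contrast = (1.0 + alpha) * log_probs - alpha * neg_log_probs
    return log_softmax(contrast), True


calls = {"negative": 0}


def perturbed_forward():
    """Stand-in for a forward pass on a perturbed image (hypothetical logits)."""
    calls["negative"] += 1
    # Token 0 stays likely even without visual evidence: a language-prior guess.
    return np.array([2.5, 0.5, 0.0])


confident_logits = np.array([8.0, 0.0, 0.0])
uncertain_logits = np.array([2.0, 1.9, 0.0])

confident_out, used_confident = gated_contrastive_step(confident_logits, perturbed_forward)
calibrated_out, used_uncertain = gated_contrastive_step(uncertain_logits, perturbed_forward)
print(calls["negative"], int(np.argmax(uncertain_logits)), int(np.argmax(calibrated_out)))
```

In the uncertain step the token that the perturbed input still favors is down-weighted, so the choice moves to the token that depends on the visual evidence, while the confident step costs no extra forward pass.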

2605.23324 2026-05-25 cs.CV quant-ph 版本更新

Enhancing Blood Cells Classification using Hybrid Quantum Neural Networks

使用混合量子神经网络增强血细胞分类

Guilherme Cruz, Nouhaila Innan, Alberto Marchisio, Gabriel Falcao, Muhammad Shafique

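Affiliations: Center for Quantum and Topological Systems, NYUAD Research Institute, New York University Abu Dhabi, UAE; Science Division, New York University Abu Dhabi, UAE

AI summary: This paper studies how hybrid quantum-classical neural networks (HQNNs) can improve the accuracy of microscopic blood cell classification. The authors propose a modular architecture that combines a pre-trained ResNet-50 backbone, a low-dimensional latent bottleneck, and a variational quantum circuit, allowing a direct comparison between quantum-enhanced and purely classical transformation mechanisms. On two public blood cell datasets, HQNNs achieve superior or more balanced classification performance; in the harder 8-class task the F1-score rises by 0.15 percentage points (from 98.54% to 98.69%), and evaluation on IBM quantum hardware shows that the model remains robust under noise.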

Comments 11 pages, 13 figures

AI中文摘要

显微镜血细胞的准确分类仍然是医学图像分析中的关键任务，其中微小的变化和有限的数据可能挑战传统的深度学习模型。因此，在这项工作中，我们研究了混合量子-经典神经网络（HQNN）在该领域中增强特征表示和改善分类性能的潜力。我们提出了一种模块化架构，结合了预训练的ResNet-50骨干网络、低维潜在瓶颈和变分量子电路，使得量子增强和纯经典变换机制之间能够进行直接比较。为了隔离量子组件的贡献，我们评估了三种架构：HQNN模型、具有可比容量的额外非线性变换层的经典匹配模型，以及没有中间变换阶段的基线模型。在两个公开的血细胞数据集（即血细胞图像数据集和PBC数据集）上进行的实验表明，HQNN在评估指标上始终实现更优或更平衡的性能。在血细胞图像数据集中，与经典基线相比，所提出的方法将宏F1分数提高了高达3.7%，而在更具挑战性的8类场景中，F1分数从98.54%提高到98.69%，性能接近饱和。在IBM量子硬件上的额外评估表明，该模型在噪声下仍然保持鲁棒性，与模拟结果相比仅出现适度的性能下降。这些结果表明，量子特征变换可以增强判别表示，特别是在具有挑战性的分类场景中，并突显了HQNN模型在医学成像任务中的实际潜力。

英文摘要

Accurate classification of microscopic blood cells is still a critical task in medical image analysis, where subtle variations and limited data can challenge conventional deep learning models. As such, we investigate in this work the potential of Hybrid Quantum-Classical Neural Networks (HQNNs) to enhance feature representation and improve classification performance in this domain. We propose a modular architecture combining a pre-trained ResNet-50 backbone with a low-dimensional latent bottleneck and a variational quantum circuit, enabling a direct comparison between quantum-enhanced and purely classical transformation mechanisms. To isolate the contribution of the quantum component, we evaluate three architectures: a HQNN model, a Classical Matched Model with an additional nonlinear transformation layer of comparable capacity, and a baseline model without an intermediate transformation stage. Experiments conducted on two publicly available blood cell datasets, namely the Blood Cell Images dataset and the PBC dataset, demonstrate that HQNNs consistently achieve superior or more balanced performance across evaluation metrics. In the Blood Cell Images Dataset, the proposed approach improves macro F1-score by up to 3.7% compared to classical baselines, while improving the F1-score from 98.54% to 98.69% in the more challenging 8-class scenario with near-saturated performance. Additional evaluation on IBM quantum hardware shows that the model remains robust under noise, with only a modest performance degradation relative to simulated results. These results indicate that quantum feature transformations can enhance discriminative representations, particularly in challenging classification scenarios, and highlight the practical potential of HQNN models for medical imaging tasks.

2605.23323 2026-05-25 eess.IV cs.CV 版本更新

Efficient Learned Image Compression without Entropy Coding

无需熵编码的高效学习图像压缩

Hao Cao, Wenqi Guo, Zhijin Qin, Jungong Han

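Affiliations: Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University; State Key Laboratory of Space Network and Communications; Beijing National Research Center for Information Science

AI summary: This paper proposes EF-LIC, an efficient learned image compression method without entropy coding, aimed at the coding-latency bottleneck that entropy coding causes in conventional approaches. It introduces unconstrained vector quantization and a context-conditioned autoregressive transform to remove statistical and correlation redundancy, reaching compression performance comparable to conventional methods. Experiments show that EF-LIC substantially speeds up encoding and decoding while maintaining quality.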

Comments Accepted by ICML 2026

AI中文摘要

熵编码在典型的学习图像压缩（LIC）中被广泛使用，它将潜在变量转换为紧凑的比特流。然而，熵编码通常是顺序执行的，成为编码延迟的瓶颈。为了克服这一问题，我们提出了无需熵编码的学习图像压缩（EF-LIC），这是一个多速率框架，通过去除统计冗余和相关冗余，以低编码延迟生成紧凑表示。首先，我们引入无约束向量量化，并证明其索引分布接近最大熵界，从而产生最小的统计冗余。其次，我们提出了一种上下文条件自回归变换，直接重新参数化潜在变量以减少相互依赖性。理论分析表明，EF-LIC可以像带有熵编码的典型LIC一样有效地去除相关冗余，从而实现相当的压缩性能。实验表明，在Kodak数据集上使用LPIPS度量，EF-LIC相比MS-ILLM实现了高达67.86%的比特率降低。消融研究进一步表明，EF-LIC在匹配基于熵编码的变体的压缩性能的同时，实现了超过3倍的编码加速和超过5倍的解码加速。

英文摘要

Entropy coding is widely used in typical learned image compression (LIC) to convert latents into a compact bitstream. However, entropy coding is typically sequential and becomes the coding latency bottleneck. To overcome this, we present Entropy-Coding Free Learned Image Compression (EF-LIC), a multi-rate framework that generates a compact representation by removing statistical and correlation redundancy with low coding latency. First, we introduce unconstrained vector quantization and prove that its index distribution approaches the maximum-entropy bound, yielding minimal statistical redundancy. Second, we propose a context-conditioned autoregressive transform that directly reparameterizes the latents to reduce inter-dependency. Theoretical analysis shows that EF-LIC can remove correlation redundancy as effectively as typical LIC with entropy coding, leading to comparable compression performance. Experiments show EF-LIC achieves up to 67.86% bitrate reduction over MS-ILLM on Kodak with LPIPS. Ablation studies further show EF-LIC matches the compression performance of its entropy-coding-based variant while achieving over $3\times$ faster encoding and $5\times$ faster decoding.
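The claim that a near-maximum-entropy index distribution leaves little statistical redundancy can be illustrated with a small calculation. The sketch below is not the EF-LIC implementation; the function names and the toy index sequences are assumptions made for illustration. It measures the gap between the fixed-length cost of an index, log2(K) bits for a codebook of size K, and the empirical entropy of the indices. When the gap is near zero, an entropy coder would have almost nothing left to remove.

```python
import math
from collections import Counter

def index_entropy_bits(indices):
    """Empirical Shannon entropy, in bits per index, of a sequence of codebook indices."""
    counts = Counter(indices)
    total = len(indices)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def redundancy_bits(indices, codebook_size):
    """Gap between the fixed-length cost log2(K) and the empirical entropy."""
    return math.log2(codebook_size) - index_entropy_bits(indices)

# Toy index streams for a hypothetical codebook of size 4.
uniform = [0, 1, 2, 3] * 4          # every index equally likely
skewed = [0] * 13 + [1, 2, 3]       # one index dominates

print(redundancy_bits(uniform, 4))  # no gap: fixed-length coding is already optimal
print(redundancy_bits(skewed, 4))   # positive gap: entropy coding would still help
```

In this toy setting the uniform stream has no redundancy to remove, while the skewed stream does; the abstract's argument is that unconstrained vector quantization pushes the index distribution toward the first case.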


2605.23304 2026-05-25 cs.CV 版本更新

General Hazard Detection

通用危险检测

Stephanie Ng, CP Lim, SueJen Looi, Hendrik Zurlinden, David Nguyen, Lei Wei, Saeid Nahavandi, Hailing Zhou

发表机构 * Swinburne University of Technology（斯威本科技大学）； National Transport Research Organisation（国家交通运输研究组织）； Google Cloud（谷歌云）； Deakin University（迪肯大学）

AI总结本文研究了如何检测抽象概念的“危害”，并提出了一种基于语言规则而非具体图像示例的通用危害检测方法。为了解决现有系统在数据稀疏性、定义动态变化和泛化能力方面的不足，作者构建了CompliVision数据集，并设计了一个结合视觉与语言模型的框架，通过权威规范定义多领域危害概念，实现对安全合规性的有效评估。该方法引入主动学习机制，提升模型在复杂场景下的鲁棒性和适应性。

Comments 20 pages, 7 figures and 4 tables


AI中文摘要

危险作为一个抽象概念，通常通过认知层面的逻辑推理而非具体示例来定义。相比之下，现有的危险检测系统依赖于预定义的危险类别，并需要在检测或分类架构中密集收集标注示例。这种方法在处理抽象安全概念时面临三个基本挑战：(1) 噪声大且稀疏的训练数据，(2) 随上下文和时间动态演变的定义，以及(3) 对未见或新颖场景的泛化能力有限。为了解决这些局限性，我们提出了CompliVision数据集，这是第一个专为基于规则的合规评估设计的通用危险数据集，同时提供了一个用于危险评估的基线框架。我们的关键创新在于通过基于语言的规则表达安全要求，从而将危险概念与基于图像的示例解耦。我们将方法建立在权威领域法规和ISO标准之上，以定义跨多个领域的多样化危险概念。CompliVision数据集包含跨越交通、建筑和仓库环境的3,006张图像，每张图像都根据特定安全规则进行了合规性标注，并附有突出显示支持性视觉证据的自然语言解释。为了实现稳健的泛化，我们开发了一个主动学习框架，以更有效地指导和优化视觉语言模型在危险合规评估中的表现。尽管最先进的VLM表现出强大的能力，但在准确安全评估所需的细粒度、上下文相关解释方面仍存在困难。我们提出了一个通用危险检测框架来解决这一局限性，该框架结合了基于LLaVA的视觉推理与人在回路反馈。

英文摘要

Hazard, as an abstract concept, is typically defined through cognitive-level logical reasoning rather than concrete examples. In contrast, existing hazard detection systems rely on predefined hazard categories and require intensive collection of labelled examples within detection or classification architectures. This approach faces three fundamental challenges when addressing abstract safety concepts: (1) noisy and sparse training data, (2) dynamically evolving definitions that change across contexts and time, and (3) limited generalisation to unseen or novel scenarios. To address these limitations, we present the CompliVision dataset, the first general-purpose hazard dataset designed for rule-based compliance assessment, along with a baseline framework for hazard evaluation. Our key innovation is decoupling the hazard concept from image-based examples by expressing safety requirements through language-based rules. We ground our approach in authoritative domain regulations and ISO standards to define diverse hazard concepts across multiple domains. The CompliVision dataset comprises 3,006 images spanning traffic, construction, and warehouse environments, with each image annotated for compliance against specific safety rules, accompanied by natural language explanations highlighting the supporting visual evidence. To achieve robust generalisation, we develop an active learning framework to more effectively guide and refine vision-language models in assessing hazard compliance. While state-of-the-art VLMs demonstrate strong capabilities, they struggle with the fine-grained, context-dependent interpretation required for accurate safety assessment. We propose a general hazard detection framework to address this limitation, combining LLaVA-based visual reasoning with human-in-the-loop feedback.


2605.23288 2026-05-25 cs.CV 版本更新

DepthAgent: 通过样本级专家选择实现更好的通用深度估计

Jie Zhu, Girish Chandar Ganesan, Xiaoming Liu

发表机构 * Michigan State University（密歇根州立大学）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结本文提出了一种名为 DepthAgent 的视觉语言智能体，用于自适应单目深度估计。该方法通过分析场景和相机特性，选择或融合多个预训练深度模型的预测结果，从而提升在不同视角、鱼眼和全景图像等多样化相机设置下的深度估计性能。研究发现，不同模型在不同输入域上的表现存在显著差异，通过样本级专家选择与融合可以显著提升难样本的估计精度，实验表明 DepthAgent 在多个基准测试中均优于单一模型及固定融合方法。


AI中文摘要

单目度量深度估计通过大规模训练和通用相机建模取得了显著进展，但在不同相机设置（如透视、鱼眼和全景图像）下的鲁棒部署仍然具有挑战性。现有方法通常依赖单一深度估计器，忽略了不同模型编码不同的相机假设并在不同输入域下表现最佳。本文中，我们展示了深度专家在样本级上具有强互补性：模型偏好与相机几何高度相关，多模型融合在单个专家不可靠的困难样本上带来最大收益。受这些观察启发，我们提出了DepthAgent，一种用于自适应单目深度估计的视觉语言智能体。DepthAgent将现有深度模型视为冻结工具，学习分析场景和相机线索，通过多轮工具调用调用合适的专家，并为每个输入选择或融合它们的预测。为了优化这种离散决策以实现密集几何质量，我们设计了一种多奖励强化学习微调方案，共同鼓励有效的工具执行、相机/场景分析、专家选择质量和推理效率。在透视、鱼眼和全景基准上的大量实验表明，DepthAgent一致优于单个专家、固定模型融合和不同选择策略，在困难样本上取得了显著改进，突显了专家选择和融合的关键作用。代码和模型将在发表后发布。

英文摘要

Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.


2605.23274 2026-05-25 cs.CV 版本更新

U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

U-CESE：面向AI挑战赛胡志明市2025的统一基于片段的事件搜索引擎

Duc-Nhuan Le, Hoang-Phuc Nguyen, Thanh-Duy Lam, Minh-Nhut Dang, Minh-Hoang Le

发表机构 * Faculty of Information Technology, University of Science, VNU-HCM（越南国家大学胡志明市分校信息科技学院）； Vietnam National University, Ho Chi Minh City, Vietnam（越南国家大学胡志明市分校）

AI总结本文提出U-CESE，一种统一的基于片段的事件搜索引擎，用于AI Challenge HCMC 2025中的多模态事件检索任务。U-CESE整合了原有CESE的三个模块，形成统一框架，支持跨多种视频源的一致事件检索。其核心方法包括统一剪辑算法、基于JPEG文件大小变化的无训练关键帧提取方法DAKE，以及受循环神经网络启发的时序一致字幕生成框架ReCap，有效提升了大规模多模态事件检索的效率与准确性。

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)


AI中文摘要

从大规模视频数据集中检索事件由于复杂的时空和多模态信息而具有挑战性。本文介绍了U-CESE，这是我们对AI挑战赛胡志明市2025的解决方案，一个统一的基于片段的事件搜索引擎，用于跨多种视频源的多模态事件检索。在CESE的基础上，U-CESE将其三个模块集成到一个统一的框架中，确保跨查询类型的一致处理和检索。核心组件是统一剪辑算法，它将单独的剪辑算法合并为一个高效的流水线。为了处理大规模数据，我们提出了DAKE，一种轻量级、无需训练的关键帧提取方法，利用JPEG文件大小变化来识别显著的场景变化。最后，我们引入了ReCap，一个受循环神经网络启发的时序一致字幕生成框架，生成详细且上下文感知的文本描述。实验表明，U-CESE在大规模多模态事件检索中提供了稳健、一致且高效的性能。

英文摘要

Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method that uses JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by recurrent neural networks, which generates detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
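The idea behind DAKE, using JPEG file size as a cheap proxy for scene content, can be sketched in a few lines. The abstract does not say how size variations are turned into keyframe decisions, so the rule below (compare each frame's size with the last selected keyframe and apply a relative threshold) is an assumption, as are the function name and the `rel_threshold` parameter.

```python
def keyframes_by_size_change(jpeg_sizes, rel_threshold=0.2):
    """Select frames whose JPEG size departs from the last keyframe's size.

    jpeg_sizes: per-frame JPEG file sizes in bytes, in temporal order.
    rel_threshold: hypothetical relative-change threshold (0.2 means 20%).
    Returns the indices of the selected keyframes; frame 0 is always kept.
    """
    if not jpeg_sizes:
        return []
    keyframes = [0]
    reference = jpeg_sizes[0]
    for i, size in enumerate(jpeg_sizes[1:], start=1):
        # A large relative change in compressed size suggests new scene content.
        if abs(size - reference) / max(reference, 1) > rel_threshold:
            keyframes.append(i)
            reference = size
    return keyframes

# Toy sizes: a stable shot, a jump at frame 3, another change at frame 5.
sizes = [100, 102, 101, 150, 152, 90]
print(keyframes_by_size_change(sizes))
```

Because the sizes are a by-product of frame extraction, a rule of this kind needs no model inference, which fits the abstract's description of the method as lightweight and training-free.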


2605.23271 2026-05-25 cs.CV cs.AI 版本更新

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

EvalVerse：面向专业电影级视频生成的流水线感知与专家校准基准测试

Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Tencent（腾讯）； Tsinghua University（清华大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Beijing Film Academy（北京电影学院）； Stanford University（斯坦福大学）； The Chinese University of Hong Kong（香港中文大学）； Singapore Institute of Technology（新加坡理工学院）

AI总结随着生成式视频基础模型的快速发展，影视级视频生成成为研究热点，但现有的评估方法多关注生成内容是否符合提示，而忽视了其艺术质量、表演和美学表现。为解决这一问题，本文提出 EvalVerse，一个流程感知且由专家校准的评估框架，通过构建专业影视制作流程的评估体系、收集大规模专家标注数据，并结合专家校准的微调策略提升视觉语言模型的推理能力，从而实现对视频生成质量的全面评估，为未来奖励模型和评估代理的研究提供了基础支撑。


AI中文摘要

生成式视频基础模型的快速发展推动该领域向专业级电影合成迈进。为达到如此苛刻的质量，社区正转向强化学习和智能体工作流。然而，可靠的评估已成为关键瓶颈。现有基准主要评估“是否正确”（基本提示遵循），而从根本上忽略了“是否优良”（电影质量、表演和美学）。此外，当前的自动指标缺乏提供可信信号所需的领域特异性，在人类审美感知与机器评分之间造成了严重的可信度差距。为弥合这一差距，我们引入了EvalVerse，一个全面、流水线感知且专家校准的评估框架。我们将视频生成评估不仅视为一项工程任务，而是作为一个核心科学问题：主观电影专业知识的系统数字化。首先，我们将领域知识组织成与专业电影制作工作流（前期制作、制作和后期制作）一致的评估分类法。其次，我们将人类专家判断提炼为带有大规模人工标注的精选数据集。第三，我们通过专家校准的微调策略将这些知识注入视觉语言模型，使VLM能够执行显式的思维链推理。与先前工作相比，EvalVerse不仅保持与基础“正确性”指标的兼容性，还显著扩展了“优良性”标准，并将任务覆盖范围拓宽到复杂的多镜头序列和视听整合。因此，通过提供细粒度的诊断信号，EvalVerse超越了静态排行榜，为未来工作（如奖励模型和评估智能体）建立了基础基础设施。

英文摘要

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.


2605.23270 2026-05-25 cs.CV cs.AI cs.RO 版本更新

StereoGenBench：一种用于受控基线条件下立体生成的合成多相机基准

Yangzhi Cui, Feng Qiao, Nathan Jacobs

发表机构 * Washington University in St. Louis（华盛顿大学圣路易斯分校）

AI总结 StereoGenBench 是一个基于 Unreal Engine 的合成多相机基准数据集，旨在为立体生成、几何估计和可控视角合成提供精确可控的多基线配对数据。该数据集通过固定场景下六相机阵列的渲染，生成包含多基线、内参、深度、相机位姿等信息的高质量配对视图，支持对不同基线范围下的生成模型进行评估。该工作填补了现有数据集在多基线配对和可控参数方面的不足，为立体生成研究提供了标准化的测试平台。


AI中文摘要

立体图像和视频生成、立体几何估计以及条件控制视图合成需要配对数据，其中决定双目几何的变量——相机基线、内参、场景深度和相机运动——是已知且可控的。现有的立体资源提供了这些变量的子集，但据我们所知，常用于立体生成评估的资源并未在单一受控源中提供场景配对的、校准的多基线右视图真值，以及联合记录的内参、密集度量深度和每帧姿态。我们引入了StereoGenBench，一个合成的Unreal Engine基准，旨在使基线灵敏度与目标相机一致性在匹配的场景内容下可测量。每个场景使用刚性六相机横向阵列渲染，产生多达15个校准视图对；相邻基线从瞳孔间到宽基线范围采样；焦距独立采样；每个视图发布RGB、度量深度、内参、每对基线和每帧姿态。数据集划分包括窄基线和宽基线两个评估族，以及一个仅训练族用于更广泛的全对覆盖。我们发布了数据集、评估代码、参考结果、Croissant元数据以及用于扩展的生成代码/配置（兼容资产）。数据集可在https://huggingface.co/datasets/stereo-dataset/stereo-dataset获取。

英文摘要

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset
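The figure of "up to 15 calibrated view pairs" follows from choosing 2 cameras out of 6. The sketch below enumerates those pairs and their baselines for a lateral array. The camera offsets are invented for illustration and are not taken from the dataset; `view_pairs` is a hypothetical helper, not part of the released code.

```python
from itertools import combinations

def view_pairs(camera_positions_m):
    """List (left_index, right_index, baseline_m) for every unordered camera pair.

    camera_positions_m: lateral offsets of the cameras along the rig, in meters.
    """
    pairs = []
    for (i, xi), (j, xj) in combinations(enumerate(camera_positions_m), 2):
        pairs.append((i, j, abs(xj - xi)))
    return pairs

# Hypothetical six-camera rig: adjacent spacing starts near an inter-pupillary
# distance (about 0.065 m) and widens toward the end of the array.
positions = [0.0, 0.065, 0.13, 0.26, 0.52, 1.04]
pairs = view_pairs(positions)

print(len(pairs))                      # number of view pairs from six cameras
print(max(b for _, _, b in pairs))     # widest baseline in this toy rig
```

A single rendered scene therefore provides many baselines at once, which is what lets baseline sensitivity be measured under matched scene content.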


2605.23216 2026-05-25 cs.CV 版本更新

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

CaST-Bench：面向视频问答的因果链时空推理基准

Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

发表机构 * Woven by Toyota（丰田公司）； The University of Tokyo（东京大学）

AI总结 CaST-Bench 是一个用于评估视频问答中因果链引导的时空推理能力的新基准，旨在解决现有模型在因果推理方面缺乏细致、可验证证据的问题。该基准通过人类与AI协作构建了包含2066个问题的高质量数据集，每个问题都附带有时间片段和边界框标注的因果链证据。研究还设计了新的评估指标，全面衡量模型在答案正确性和视觉证据推理方面的能力，揭示了当前视觉语言模型在构建精确因果链方面的不足，为未来模型改进指明了方向。

Comments CVPR 2026


AI中文摘要

视频中的因果推理对视觉语言模型（VLM）是一个重大挑战，因为它需要超越表面感知，深入理解因果机制。然而，现有基准很少提供严格评估这一能力所需的细粒度、有依据的证据。为填补这一空白，我们引入了CaST-Bench，一个用于因果链时空视频推理的基准。CaST-Bench提出复杂的因果问题，要求模型识别并定位多个时空证据组成的链条。通过人机协作流程，我们构建了一个高质量数据集，包含1015个视频上的2066个问题，因果链由时间片段和边界框轨迹标注。此外，我们设计了一套全面的评估方案，包含新颖的指标，不仅评估答案正确性，还评估基于视觉证据的推理能力。这种证据基础对于通过减轻虚假相关性来提高准确性，以及通过使模型更透明来增强用户信任至关重要。我们的实验表明，当前的VLM在因果问题上表现不佳，主要原因是它们构建精确且有依据的因果链的能力有限。这为改进未来VLM指明了一个重要方向。

英文摘要

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.


2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO 版本更新

Lipschitz Optimization for Formal Verification of Homographies

单应性矩阵形式化验证的Lipschitz优化

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

发表机构 * Joby Aviation（Joby航空）； Safe Intelligence

AI总结本文研究了针对视觉神经网络在安全关键领域应用的正式鲁棒性验证问题，特别关注相机运动引起的3D扰动对图像生成过程的影响。作者提出了一种基于李普希茨优化和分段连续性分析的验证方法，建立了相机姿态到像素值的闭式映射，并推导出对扰动像素值的紧致线性界。该方法适用于具有平面结构的场景，如增强现实、自动驾驶和机器人操作等，并在多个基准测试中验证了其有效性，相比现有方法在速度和边界紧致性方面均有提升。

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026


AI中文摘要

在受监管行业中采用视觉神经网络需要形式化的鲁棒性保证，尤其是在医疗、自动驾驶和航空航天等安全关键领域。然而，当前方法局限于不完整的统计验证或对$\ell_p$范数和仿射变换的鲁棒性，仅覆盖了图像形成过程中一小部分扰动。特别是，对相机运动的鲁棒性仍然是一个开放问题，尽管它是部署许多视觉应用的关键。我们提出了一种形式化验证方法，针对捕获相机的3D运动扰动鲁棒性。我们首先建立了从相机位姿到像素值的闭式映射。通过分析所得单应性矩阵的连续性性质，我们展示了如何将最近关于Lipschitz优化和分段连续性的工作扩展到推导扰动像素值的紧线性边界。我们的方法适用于以平面结构为主的场景，例如增强现实中的地面、自动驾驶中的道路标记和交通标志，或机器人操作中的平面工作空间。这实现了对投影几何变换的首次形式化验证，无需复杂仿真、替代网络或显式图像形成模型。我们验证了实现，并展示了相比先前工作最高89%的加速和7%更紧的边界。然后，我们在VNN-COMP基准上评估了我们的方法，揭示了投影扰动的系统性弱点。最后，我们在一个安全关键的跑道分类器上进行了真实世界案例研究，突出了对相机运动的实际漏洞，并解决了学习模型认证中的一个关键挑战。数据和代码公开在https://github.com/jeangud/homography-verification。

英文摘要

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification.
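For readers unfamiliar with why planar scenes admit a closed-form pose-to-pixel mapping, the textbook plane-induced homography gives the general shape. The abstract does not state the paper's own formulation, so the notation and sign convention below are assumptions: $K$ is the intrinsic matrix, a point moves between camera frames as $X_2 = R X_1 + t$, and the plane satisfies $n^\top X_1 = d$ in the first camera frame.

```latex
% Points on the plane satisfy n^T X_1 / d = 1, so the translation can be
% folded into a linear map acting on X_1:
\[
  X_2 = R X_1 + t = \left( R + \frac{t\, n^\top}{d} \right) X_1 .
\]
% Projecting with the intrinsics K (x \sim K X) gives the pixel-to-pixel map:
\[
  x_2 \sim H x_1, \qquad H = K \left( R + \frac{t\, n^\top}{d} \right) K^{-1} .
\]
```

Each entry of $H$ is a smooth function of the pose $(R, t)$, and the pixel coordinate is a ratio of linear forms in those entries. This kind of structure is what makes continuity and Lipschitz-style analysis of the perturbed pixel values plausible for planar scenes.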


2605.23187 2026-05-25 cs.CV cs.RO 版本更新

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

IntentionNav: 一种基于隐式人类指令的意图驱动目标导航基准

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

发表机构 * The University of Manchester（曼彻斯特大学）； A*STAR； Responsible AI Research Centre, Adelaide University（阿德莱德大学负责任人工智能研究中心）； University of Bedfordshire（贝德福德郡大学）

AI总结 IntentionNav 是一个用于意图驱动对象导航的新基准，旨在评估智能体从隐含人类指令中推断目标物体并完成导航任务的能力。该基准不直接提供目标物体名称，而是通过自然语言指令隐含表达需求，要求智能体理解意图、识别目标并完成导航。研究引入了四种意图模式和多种指令风格，支持对目标推理、语言鲁棒性及导航成功率的细致分析，揭示了当前视觉语言模型在理解隐含意图和完成精准导航任务方面仍面临挑战。

Comments preprint


AI中文摘要

现有的目标导航基准通常告诉具身智能体要找到哪个物体类别，例如微波炉或椅子。面向人类的具身AI经常被问到一些不那么直接的问题：“我需要热一下这个食物”或“房间感觉很闷”。智能体必须推断出能够满足需求的物体，找到一个场景中的实例，并决定是否已达到目标。我们将这种设置研究为意图驱动的目标导航，并引入IntentionNav，一个用于从隐式人类指令进行主动目标搜索的诊断基准。每个episode提供一个自由文本意图、RGB-D观测和位姿，但隐藏目标物体名称。IntentionNav包含176个Isaac Sim场景和64个目标类别上的500个意图。每个意图以四种受控指令风格重写，并标注四种意图模式之一，将表面措辞与语义线索类型分离，同时保持几何匹配。这种配对设计支持对目标推断、语言鲁棒性、邻域可达性和终端成功（而非仅聚合成功）的分析。我们使用一个固定的主动导航智能体评估了三个VLM。模型在48.3%的episode中识别出预期目标，在68.7%中进入其2米邻域，但仅在24.9%中成功终止，并在5.5%中达到接地1米成功。事件脚本意图的成功率最高（28.7%），而物理状态和可供性意图的成功率较低（分别为19.2%和18.5%），表明间接人类意图仍然是主动具身搜索中目标选择、视觉验证和终端定位的瓶颈。

英文摘要

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.


2605.23183 2026-05-25 eess.IV cs.CV 版本更新

GMENet: Generative Mixture of Experts Network for Multi-Center Glioma Diagnosis with Incomplete Imaging Sequences

GMENet: 用于多中心胶质瘤诊断的生成式专家混合网络（不完整成像序列）

Pengfei Song, Fangjin Liu, Wenwen Zeng, Yonghuang Wu, Chengqian Zhao, Feiyu Yin, Xuan Xie, Jinhua Yu

发表机构 * School of Biomedical Engineering and Technology Innovation, Fudan University（复旦大学生物医学工程与技术创新学院）； Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University（复旦大学脑启发智能科学技术研究院）； Intelligent Diagnosis and Treatment Laboratory for Brain Diseases, Joint Laboratory of Neurosurgery Department of Huashan Hospital and School of Information Science and Technology, Fudan University（脑病智能诊断与治疗实验室，华山医院神经外科部门联合实验室，复旦大学信息科学学院）

AI总结当前胶质瘤诊断通常结合分子特征与组织病理学信息进行临床决策，但在实际应用中，不同中心的影像协议不统一，导致MRI序列不完整，限制了现有模型的临床适用性。为此，本文提出GMENet，一种用于多中心胶质瘤诊断的生成专家混合网络。该方法通过跨注意力门控生成模块合成缺失的影像特征，并引入动态加权专家融合模块实现多任务预测，有效提升了模型在不完整数据下的诊断性能和跨中心适应能力。

Comments IJCAI Accept


AI中文摘要

当代胶质瘤诊断将分子特征与组织病理学相结合以指导临床决策。然而，在临床环境中，不同的成像协议导致MRI序列不完整，从而带来两个主要挑战：迫使现有框架在训练期间丢弃大量临床数据，并因此限制了其临床适用性。为解决这些限制，我们提出了GMENet，一种用于不完整成像序列的多中心胶质瘤诊断的生成式专家混合网络。首先，我们设计了一个基于交叉注意力的门控生成模块，该模块通过交叉注意力和动态门控机制从可用序列合成缺失序列特征，并引入循环一致性损失以保持语义完整性。其次，我们引入了一个动态加权专家融合模块，该模块对原始和合成的双序列特征进行专家混合交互和置信度感知融合，以进行多任务预测。我们在一个包含来自四个内部数据集和两个公共存储库的1241名受试者的多中心队列上评估了GMENet。实验表明，相对于仅完整序列的数据，GMENet将临床可用的训练数据扩大了97%。此外，它始终优于在完整数据上训练的最先进方法，在跨中心分布偏移下表现出更强的鲁棒性。

英文摘要

Contemporary glioma diagnosis integrates molecular features with histopathology to guide clinical decision-making. However, in clinical settings, divergent imaging protocols result in incomplete MRI sequences, leading to two primary challenges: forcing existing frameworks to discard a large portion of clinical data during training and consequently limiting their clinical applicability. To address these limitations, we propose GMENet, a Generative Mixture of Experts Network for multi-center glioma diagnosis with incomplete imaging sequences. Firstly, we design a Cross-attention-based Gated Generation Module that synthesizes missing sequence features from available sequences via cross-attention and dynamic gating mechanisms, incorporating a cycle-consistency loss to preserve semantic integrity. Secondly, we introduce a Dynamically Weighted Experts Fusion Module that performs mixture-of-experts interaction and confidence-aware fusion over original and synthesized dual-sequence features for multi-task prediction. We evaluate GMENet on a multi-center cohort of 1,241 subjects from four in-house datasets and two public repositories. Experiments show that GMENet expands clinically usable training data by 97% relative to complete-sequence-only data. Furthermore, it consistently outperforms state-of-the-art methods trained on complete data, demonstrating improved robustness under cross-center distribution shifts.


2605.23178 2026-05-25 cs.CV 版本更新

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

将人物组合在一起：面向多人交互场景的迭代姿态-图像生成

Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

发表机构 * Cornell University（康奈尔大学）

AI总结现有文本到图像模型在生成多人互动场景时仍面临语义多样性不足和构图准确性低的问题，常导致布局重复、姿势刻板和互动不自然。本文提出一种双模态的姿势-图像表示方法，将以人为中心的结构先验引入预训练的扩散变换模型，通过联合预测二维姿势图和对应的RGB图像，使结构与外观在学习过程中协同演化。核心方法采用跨模态对齐方案，将文本、姿势和图像表示进行绑定，确保多模态一致性，并设计迭代场景生成策略，逐步构建复杂的多人互动场景，有效分解整体生成复杂度，实验表明该方法显著提升了多人图像生成的提示对齐度和场景多样性。

Comments Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: https://cornell-vailab.github.io/PeopleComposer/


DOI: 10.1145/3799902.3811129

AI中文摘要

尽管近期取得了进展，文本到图像模型仍然难以生成语义多样且组合准确的多人交互场景，常常陷入重复布局、刻板姿态和交互基础薄弱的问题。在这项工作中，我们通过引入一种双姿态-图像表示来弥合这一差距，该表示将人物中心的结构先验引入预训练扩散Transformer。我们的模型联合预测2D姿态可视化图像及其对应的RGB图像，使得结构和外观在学习过程中共同演化。其核心是一种跨模态对齐方案，将文本、姿态和图像表示绑定在一起，确保跨模态的一致性基础。此外，我们设计了一种迭代场景构建方案，逐步生成复杂的多人交互，同时有效分解整体生成复杂性。大量实验表明，我们的方法在多人图像生成中显著提高了提示对齐度和场景多样性。

英文摘要

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.


2605.23174 2026-05-25 cs.CV 版本更新

LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement

LQ-rPPG：一种用于远程生理测量的标签量化粗到细学习框架

Jun Seong Lee, Samyeul Noh, Changki Sung, Hyun Myung

发表机构 * Electronics and Telecommunications Research Institute（电子电信研究院）； School of Electrical Engineering, Korea Advanced Institute of Science and Technology（韩国科学技术院电气工程学院）

AI总结远程光电容积图（rPPG）技术能够通过面部视频非接触地测量生理信号，在远程医疗和日常健康监测中具有重要应用前景。然而，现有基于深度学习的rPPG方法大多忽视了训练标签的质量及其对模型学习的影响，导致模型易受标签噪声和变化的影响，影响泛化性能。为此，本文提出LQ-rPPG，一种基于标签量化和粗到细学习的框架，通过将连续PPG信号转化为多比特伪标签以减少噪声，并在分层监督下逐步优化rPPG估计，从而提升模型鲁棒性和泛化能力，实验表明其在多个数据集上表现优异且计算效率显著提高。

详情

AI中文摘要

远程光电容积描记（rPPG）技术能够从面部视频中非接触式测量生理信号，在远程医疗和日常健康监测方面具有巨大潜力。受此驱动，研究者提出了多种基于深度学习的rPPG方法以改进估计性能。然而，以往的深度学习方法很少关注训练标签的质量及其对模型学习的影响。用作训练标签的接触式PPG信号通常包含由运动伪影、传感器接触不一致和形态畸变引起的噪声和变异性。这种标签不一致性可能导致模型过拟合标签噪声和变异性，从而降低泛化性能。为解决此问题，我们提出LQ-rPPG，一种标签量化的粗到细学习框架，用于鲁棒的rPPG估计。LQ-rPPG包含一个标签量化模块和一个粗到细的rPPG估计模型。标签量化模块将连续PPG信号转换为多比特量化伪标签，以降低噪声和变异性。粗到细估计模型在多比特伪标签的分层监督下逐步细化rPPG信号。这种设计减轻了对标签特定变异性的过拟合，使模型能够学习结构化和一致的表示。因此，LQ-rPPG即使在挑战性条件下也能实现鲁棒且可泛化的rPPG估计。在多个基准数据集上的实验表明，LQ-rPPG在数据集内和跨数据集评估中均取得了强劲性能，同时参数和乘累加操作分别减少88%和29%，吞吐量提高191%。代码可在https://github.com/Anonymous-repo-code/LQ-rPPG获取。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at https://github.com/Anonymous-repo-code/LQ-rPPG.
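The label quantization step can be pictured as mapping each PPG sample to one of 2^b levels, with small b giving coarse targets and larger b giving finer ones. The abstract does not specify the quantizer, so the uniform min-max scheme below, along with the function names and the toy signal, is an assumption for illustration only.

```python
def quantize_labels(signal, bits):
    """Uniformly quantize a signal into 2**bits levels over its own min-max range.

    Returns one integer level index per sample (0 .. 2**bits - 1).
    """
    lo, hi = min(signal), max(signal)
    levels = 2 ** bits
    if hi == lo:
        return [0] * len(signal)  # a flat signal carries a single level
    scale = (levels - 1) / (hi - lo)
    # Round to the nearest level; clamp to guard against floating-point overshoot.
    return [min(levels - 1, int((x - lo) * scale + 0.5)) for x in signal]

def coarse_to_fine_labels(signal, max_bits):
    """Build pseudo labels at 1..max_bits bits, from coarsest to finest."""
    return {b: quantize_labels(signal, b) for b in range(1, max_bits + 1)}

# Toy PPG-like samples already normalized to [0, 1].
ppg = [0.0, 0.1, 0.5, 0.9, 1.0]
labels = coarse_to_fine_labels(ppg, max_bits=2)
print(labels[1])  # 1-bit labels: only "low" versus "high"
print(labels[2])  # 2-bit labels: four levels
```

Small perturbations of a sample usually leave its coarse level unchanged, which gives one intuition for why quantized targets are less sensitive to label noise than the raw waveform.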


2605.23160 2026-05-25 cs.RO cs.CV 版本更新

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

语义感知引导的无人机探索：面向语言条件的三维室内建图

Nitin Vegesna, Avideh Zakhor

发表机构 * Department of Electrical Engineering and Computer Sciences（电气工程与计算机科学系）

AI总结本文提出了一种语义感知引导的无人机探索系统SAGE，用于在未知的室内3D环境中进行开放词汇的探索，能够在保持全面覆盖行为的同时，利用语义线索重新优先选择探索前沿。SAGE基于FALCON体积探索器，通过集成CLIP模型的四个关键组件，实现了语义与几何信息的联合规划，有效提升了目标发现效率。实验表明，SAGE在模拟和真实环境中均优于现有方法，尤其在目标发现速度和体积吞吐量方面表现突出。

Comments 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)


AI中文摘要

我们提出语义感知引导探索（SAGE），一个用于未知三维室内环境的开放词汇探索系统，该系统在保持覆盖导向行为的同时，允许语义提示重新优先化前沿选择。基于FALCON体积探索器，SAGE通过四个关键组件集成对比语言-图像预训练（CLIP）：以物体为中心的嵌入存储、将最近观测投影到自由-未知边界的时间缓存、用于高相似度检测的物体前沿，以及统一的语义-几何规划成本。该成本函数限制了语义重新加权的影响，确保前沿被优先化而不牺牲总覆盖率。在基于Matterport3D的仿真中，SAGE在地图-查询对上的物体发现方面优于FALCON和纯语义消融。与Finding Things in the Unknown（FTU）相比，SAGE在九个共享地图-查询对上的探索速度提高了9.0到25.9倍，平均加速13.7倍。此外，SAGE的体积吞吐量显著高于FTU。最后，我们在Modal AI Starling 2四旋翼飞行器上，在两种环境中的五次真实飞行中部署了SAGE，配备机载感知和规划以及离板CLIP推理。比较SAGE和FALCON，我们发现虽然FALCON导致更快的探索和更短的建图轨迹，但SAGE在物体发现方面优于FALCON。

英文摘要

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.

2605.23144 2026-05-25 cs.CV 版本更新

Flow Mismatching: 通过流匹配模型中的速度差异进行无监督异常检测

Shengzhe Chen, Mehrdad Moradi, Kamran Paynabar, Hao Yan

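Affiliations: Arizona State University; Georgia Institute of Technology

AI summary: This paper proposes Flow Mismatching, an unsupervised anomaly detection method that avoids the reconstruction-based paradigm and instead detects anomalies from velocity discrepancies in a flow matching model. Along affine paths from Gaussian noise to the target image, it measures the disagreement between the model-predicted velocity and the geometric path velocity to localize anomalous regions. Experiments on MVTec-AD and VisA show that it outperforms state-of-the-art reconstruction-based and recent flow matching-based methods.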

AI中文摘要

我们提出Flow Mismatching，一种无监督异常检测方法，有意避免基于重建的范式。相反，我们将流匹配视为几何动力学，并利用一个关键见解：异常发生在学习到的正常流与指向测试图像的几何路径不一致的地方。给定仅在正常图像上训练的流匹配模型，我们沿着从高斯噪声到目标图像的仿射路径探测其学习到的速度场。沿着每条路径，我们比较模型预测的速度（遵循正常生成动力学）与指向目标的速度（包含任何异常内容）。异常会导致这些速度之间的强烈局部不一致。聚合不同时间步和多条路径上的不匹配，产生像素级热图和图像级分数，无需测试时优化、特征记忆或额外校准。我们的分析表明，总体不匹配分解为一个不可约的降噪项和一个测试路径与正常路径得分函数之间的Fisher散度项，后者识别出驱动异常分离的得分差距成分，并解释了鲁棒路径聚合的有效性。在MVTec-AD和VisA上的大量实验表明，与最先进的基于重建和最近的基于流匹配的方法相比，性能优越。

英文摘要

We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
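The velocity comparison described in this abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the path convention x_t = (1 - t) z + t x, the squared-error mismatch, plain averaging over paths and time steps, and the stand-in `toy_normal_model` are all assumptions.

```python
import random


def mismatch_heatmap(target, velocity_model, num_paths=4, times=(0.2, 0.5, 0.8), seed=0):
    """Average squared gap between predicted and geometric velocity per pixel."""
    rng = random.Random(seed)
    n = len(target)
    heat = [0.0] * n
    for _ in range(num_paths):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Geometric velocity of the affine path from noise to the test image.
        geometric = [x - z for z, x in zip(noise, target)]
        for t in times:
            x_t = [(1 - t) * z + t * x for z, x in zip(noise, target)]
            predicted = velocity_model(x_t, t)
            for i in range(n):
                heat[i] += (predicted[i] - geometric[i]) ** 2
    count = num_paths * len(times)
    return [h / count for h in heat]


def toy_normal_model(normal_image):
    """Hypothetical model: exact velocity field when all normal data is one image."""
    def velocity(x_t, t):
        return [(mu - x) / (1 - t) for mu, x in zip(normal_image, x_t)]
    return velocity


normal = [0.2, 0.2, 0.2, 0.2]
test_image = [0.2, 0.2, 0.9, 0.2]  # third pixel is anomalous
heat = mismatch_heatmap(test_image, toy_normal_model(normal))
image_score = max(heat)            # one simple image-level aggregate
```

In this toy setting the mismatch vanishes on normal pixels and is large only at the anomalous pixel, which mirrors the local disagreement the abstract describes.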

2605.23068 2026-05-25 cs.CV 版本更新

RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering

RoboSurg-VQA：面向手术分割感知的视觉问答多模态基准

Chengyi Zhang, Zi Ye, Ziyang Wang

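Affiliations: Swansea University, UK; Maynooth University, Ireland; Aston University, UK

AI summary: This paper presents RoboSurg-VQA, a multimodal benchmark for evaluating segmentation-aware visual question answering in surgical scenes. Built from public surgical segmentation datasets, it pairs each frame with a fixed set of clinically motivated questions covering procedure context, anatomy, imaging modality, instrument visibility, and related attributes, with closed answer sets for consistent evaluation. Candidate answers are generated by constrained prompting and then audited by humans to improve plausibility and label consistency. The aim is more reliable visual understanding in robot-assisted surgery.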

AI中文摘要

在机器人辅助和微创手术（RMIS/MIS）中，可靠的视觉理解不仅仅需要精确的掩膜：在临床实践中，临床医生会提出关于手术过程背景、可见性、伪影以及解剖结构和手术器械存在性的语言类问题，且通常是在由遮挡、烟雾、出血和镜面高光导致的退化视图下。我们提出了RoboSurg-VQA，这是一个基于共享模式重新利用公共手术分割数据集构建的分割感知视觉问答（VQA）基准。每帧图像与一组固定的临床驱动问题配对，涵盖手术过程背景、解剖结构（包括区域）、成像模态/视图、手术伪影、图像质量以及基本可见性和空间属性，并采用封闭答案集以实现一致的评估。为了扩展标注，我们通过约束提示生成候选答案，并自动进行有效性和一致性检查，随后进行人工审计以提高合理性和标签一致性。我们报告了基准统计信息、基线合理性以及在挑战性手术条件下的常见评估挑战。代码将在https://github.com/ziyangwang007/Robosurg-VQA上提供。

英文摘要

Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights. We present RoboSurg-VQA, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets under a shared schema. Each frame is paired with a fixed set of clinically motivated questions spanning procedure context, anatomy (including region), imaging modality/view, surgical artefacts, image quality, and basic visibility and spatial attributes, with closed answer sets to enable consistent evaluation. To scale annotation, we generate candidate answers via constrained prompting with automatic validity and consistency checks, followed by human auditing to improve plausibility and label consistency. We report benchmark statistics, sanity baselines, and common evaluation challenges under challenging surgical conditions. The code will be available at https://github.com/ziyangwang007/Robosurg-VQA.

2605.23065 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering

抖动防御：通过多级 Floyd-Steinberg 抖动实现视觉基础模型的对抗鲁棒性

Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy

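Affiliations: Department of Computer Science, University of Geneva, Switzerland

AI summary: This work studies multi-level Floyd-Steinberg dithering as a lightweight input transformation that improves the robustness of vision foundation models under adversarial attack. The transformation disrupts adversarial perturbations while preserving semantic content, and it applies across downstream tasks and model architectures. Experiments show strong results across several attack settings with little degradation on clean inputs, matching or exceeding the tested baselines, including diffusion-based denoising.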

Comments Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)

AI中文摘要

视觉基础模型被广泛用作许多下游任务中的冻结骨干，使其成为对抗攻击下的单点故障。我们研究了多级 Floyd-Steinberg 误差扩散抖动作为一种轻量级、模型无关的输入变换，它在保留语义内容的同时破坏对抗扰动。与先前局限于二值抖动、灰度 CIFAR-10 和从头训练的单个小模型的工作不同，我们在六个任务（分类、分割、深度估计、检索、字幕生成、视觉问答）、两个模型家族（DINOv2、PaliGemma）以及三种强度递增的攻击（PGD、MI-FGSM、SIA）上进行了评估，还包括使用直通估计器的自适应攻击者。我们的结果表明，在中间量化级别上的 Floyd-Steinberg 抖动，尤其是与后处理模糊相结合时，超过或匹配所有测试的基线（包括基于扩散的去噪），并且在干净输入上的退化显著更小。

英文摘要

Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
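For readers unfamiliar with the transformation, the following is a minimal sketch of multi-level Floyd-Steinberg error diffusion on a grayscale image. It shows only the classic algorithm; the function name, the list-of-rows image format, and the default of four levels are assumptions, and the paper's handling of color channels, level choice, and post-processing blur is not reproduced here.

```python
def floyd_steinberg_dither(image, levels=4):
    """Dither a grayscale image (rows of floats in [0, 1]) to `levels` gray values."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    height, width = len(image), len(image[0])
    work = [list(row) for row in image]  # copy that accumulates diffused error
    out = [[0.0] * width for _ in range(height)]
    step = 1.0 / (levels - 1)
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            # Snap to the nearest of the evenly spaced gray values.
            index = min(levels - 1, max(0, int(old / step + 0.5)))
            new = index * step
            out[y][x] = new
            err = old - new
            # Classic Floyd-Steinberg weights: 7/16, 3/16, 5/16, 1/16.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


gray = [[0.5] * 8 for _ in range(8)]
binary = floyd_steinberg_dither(gray, levels=2)  # every pixel becomes 0.0 or 1.0
```

Because the quantization error of each pixel is pushed onto its unprocessed neighbors, small input perturbations are redistributed rather than passed through unchanged, which is the intuition behind using dithering as an input transformation.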

2605.23064 2026-05-25 cs.CV cs.LG 版本更新

Millimeter-wave Imaging for Anthropometric Body Measurement

毫米波成像用于人体测量

Miriam Senne, Benjamin D. Killeen, Christoph Baur, Nassir Navab, Azade Farshad

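Affiliations: Chair for Computer Aided Medical Procedures; Technical University of Munich; Rohde & Schwarz GmbH & Co. KG; Munich Center for Machine Learning; ELLIS Unit Helsinki, Dept. Computer Science, Aalto University

AI summary: This work proposes a contactless anthropometric body measurement method based on millimeter-wave radar, addressing the privacy, efficiency, and accessibility limits of conventional tools. An optimization-based framework recovers 3D human shape from mmWave point clouds and extracts a comprehensive set of body measurements. Its core contribution is a vertex-weighting strategy that, combined with the parametric body model SMPL, gives robust surface alignment and noise suppression. The result is a fast, privacy-preserving workflow that needs no undressing and no cameras, suited to clinical risk assessment across patient groups.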

AI中文摘要

身体形状和围度是临床上用于风险分层的信息性生物标志物，包括腰臀比、肢体和躯干周长等指标，然而传统工具如手动卷尺和光学扫描仪通常需要脱衣和保持姿势。这些要求减缓了工作流程，损害了尊严，并且排除了许多老年人和行动不便者。为了实现快速无接触测量，我们利用毫米波雷达，它保护隐私并能穿透典型衣物，实现快速全身采集。在这项工作中，我们提出了一个新的基于优化的框架，从体积毫米波数据中恢复3D人体形状并提取一套全面的人体测量数据。我们的方法引入了一个加权配准流程，将参数化身体模型（SMPL）直接拟合到噪声毫米波点云上。我们贡献的核心是一种顶点加权策略，该策略调节Chamfer能量函数以实现可靠的表面对齐和噪声消除。我们通过加入脚-地面约束和姿态先验进一步稳定拟合，直接优化SMPL参数。这些组件共同实现了一个快速、保护隐私的工作流程，无需摄像头或脱衣，且只需最小程度的配合，即可通过衣物提供高保真度的身体形状和测量数据，支持在诊所和护理机构中对所有年龄和活动水平的患者进行频繁的风险导向评估。

英文摘要

Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy preserving workflow that delivers high fidelity body shape and measurements through clothing without cameras or disrobing and with minimal cooperation, supporting frequent risk oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
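To make the idea of a vertex-weighted Chamfer energy concrete, here is a small sketch under stated assumptions: the function name `weighted_chamfer`, the symmetric two-term form, and the rule that a scan point inherits the weight of its nearest model vertex are illustrative choices, not the paper's formulation.

```python
import math


def weighted_chamfer(vertices, weights, points):
    """Symmetric Chamfer energy between model vertices and scan points.

    `weights[i]` is a hypothetical per-vertex reliability in [0, 1].
    """
    # Model-to-scan term: each vertex pulls toward its nearest scan point.
    model_to_scan = sum(
        w * min(math.dist(v, p) ** 2 for p in points)
        for v, w in zip(vertices, weights)
    ) / sum(weights)
    # Scan-to-model term: each scan point uses the weight of its nearest vertex,
    # so points near down-weighted vertices contribute little.
    scan_to_model = 0.0
    for p in points:
        j = min(range(len(vertices)), key=lambda k: math.dist(p, vertices[k]))
        scan_to_model += weights[j] * math.dist(p, vertices[j]) ** 2
    scan_to_model /= len(points)
    return model_to_scan + scan_to_model


verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 5.0)]  # last point is noise
uniform = weighted_chamfer(verts, [1.0, 1.0], scan)
robust = weighted_chamfer(verts, [1.0, 0.0], scan)
```

Down-weighting the vertex nearest to the stray point removes its influence on the energy, which is the kind of noise suppression a weighting strategy is meant to provide.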

2605.23045 2026-05-25 cs.CV cs.AI cs.LG 版本更新

CoMoGen: 基于掩码引导的视频生成的可控运动动力学与交互

Adil Meric, Lin Geng Foo, Mert Kiray, Benjamin Busam, Rishabh Dabral, Christian Theobalt

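Affiliations: Technical University of Munich; Max Planck Institute for Informatics, Saarland Informatics Campus; Munich Center for Machine Learning (MCML); Obsphera

AI summary: This paper proposes CoMoGen, a controllable video generation framework that produces videos with realistic interaction dynamics conditioned on an input image and a sequence of binary masks. A lightweight MaskAdapter module encodes the mask sequence into a residual signal, which is injected into a multimodal diffusion transformer (MMDiT) under a cosine-weighted schedule. Low-rank adaptation (LoRA) fine-tunes only the MMDiT layers responsible for motion generation, focusing on motion-critical components and lowering compute cost. Experiments show that CoMoGen outperforms existing methods in motion fidelity and perceptual realism, reaching state-of-the-art results.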

AI中文摘要

Imagine2Real: 通过视频生成先验实现零样本人形机器人-物体交互

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

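Affiliations: Zhejiang University; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary: Whole-body humanoid-object interaction (HOI) is bottlenecked by the scarcity of high-quality 3D data. Existing methods built on video generation priors suffer from a representation-alignment problem because they depend on geometric priors such as explicit CAD models, and from retargeting complexity because of intricate morphology retargeting. This paper proposes Imagine2Real, a geometry-free zero-shot HOI framework. It unifies robot and object motion as 4D point trajectories to resolve representation alignment, sidesteps retargeting error through sparse keypoint tracking, and uses the latent space of a behavior foundation model to produce natural motion, achieving zero-shot physical deployment in a motion-capture system.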

AI中文摘要

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

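Affiliations: NVIDIA; University of Toronto; Princeton University

AI summary: This paper studies how to reduce gradient-estimator variance when a pretrained diffusion model serves as a frozen teacher for downstream tasks such as text-to-3D generation and single-step distillation. It proposes CARV, a compute-aware variance-accounting framework that motivates a hierarchical Monte Carlo estimator: expensive upstream computation is amortized over cheap diffusion-noise resamples, combined with timestep importance sampling and a stratified inverse-CDF construction. Experiments show substantially better compute efficiency without changing the objective. In some tasks, however, lower gradient variance does not improve generation quality, which indicates that variance is no longer the bottleneck there.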

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

AI中文摘要

预训练的扩散模型作为冻结教师，为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望；其估计器方差主导了计算成本，因为每次抽取都需要昂贵的上游工作（渲染、模拟、编码）。我们引入了CARV，一个计算感知的方差核算框架，它激发了一种分层蒙特卡洛估计器：通过廉价的扩散噪声重采样来摊销昂贵的上游计算，并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中，CARV在不改变目标的情况下提供了2-3倍的有效计算乘数（主要来自摊销重用；约25%来自IS+分层）；在单步蒸馏中，相同的技术将梯度方差降低了一个数量级，但并未改善下游FID，标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
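The amortization idea can be illustrated with a small two-level Monte Carlo sketch. Everything here is an assumption made for illustration: the function names, the scalar "gradient", and the choice to draw timesteps directly from the target distribution (so no importance weights are needed). It is not the CARV estimator itself.

```python
import random


def stratified_timesteps(k, inverse_cdf, rng):
    """Draw one timestep per stratum: u_i uniform in [i/k, (i+1)/k), t_i = F^{-1}(u_i)."""
    return [inverse_cdf((i + rng.random()) / k) for i in range(k)]


def hierarchical_estimate(upstream, teacher_grad, n_outer, k_inner, inverse_cdf, seed=0):
    """Two-level estimator: one expensive upstream draw, k_inner cheap noise resamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        x = upstream(rng)  # expensive step (e.g. rendering), done once per outer draw
        timesteps = stratified_timesteps(k_inner, inverse_cdf, rng)
        # Cheap inner loop: reuse x across several (timestep, noise) pairs.
        inner = sum(teacher_grad(x, t, rng.gauss(0.0, 1.0)) for t in timesteps)
        total += inner / k_inner
    return total / n_outer
```

A usage example with a hypothetical upstream function that returns a constant and a toy teacher signal `x * t`: `hierarchical_estimate(lambda rng: 2.0, lambda x, t, eps: x * t, 50, 8, lambda u: u)` estimates 2 * E[t] with uniform timesteps, so it lands close to 1.0 while calling the upstream function only 50 times for 400 teacher evaluations.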

2605.21487 2026-05-25 cs.CV 版本更新

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Uni-Edit: 智能编辑作为统一模型调优的通用任务

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

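Affiliations: CUHK MMLab (Multimedia Laboratory, The Chinese University of Hong Kong); Meituan; TJU (Tianjin University); USTC (University of Science and Technology of China)

AI summary: This paper proposes Uni-Edit, an intelligent image editing task that serves as a general task for tuning unified multimodal models (UMMs). Unlike conventional mixed multi-task training, Uni-Edit uses one task, one training stage, and one dataset to improve image understanding, generation, and editing together. An automated, scalable data synthesis pipeline converts diverse VQA data into complex, effective editing instructions, yielding the Uni-Edit-148k dataset. Experiments on BAGEL and Janus-Pro show gains across all three capabilities.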

Comments Project Page: https://zhengdian1.github.io/Uni-Edit-proj/ Code: https://github.com/zhengdian1/Uni-Edit

AI中文摘要

目前，增强统一多模态模型（UMMs）的图像理解、生成和编辑能力主要依赖于混合多任务训练。由于固有的任务冲突，这种策略需要复杂的多阶段流水线、大量数据混合和平衡技巧，仅能实现性能折衷而非真正的相互增强。为了打破这一范式，我们提出Uni-Edit，一种智能图像编辑任务，作为UMM调优的第一个通用任务。与复杂的混合流水线不同，Uni-Edit仅使用一个任务、一个训练阶段和一个数据集，即可同时提升所有三种能力。具体来说，我们首先识别出图像编辑本质上是一个理想的通用任务，因为它自然需要视觉理解和生成。然而，现有的编辑数据依赖于过于简单的指令，严重低估了模型的理解能力。为解决这一问题，我们引入了第一个自动化且可扩展的智能编辑数据合成流水线，将多样化的VQA数据转化为复杂且有效的编辑指令，其中嵌入了问题和嵌套逻辑。由此产生了Uni-Edit-148k数据集，将多样化的推理密集型指令与高质量编辑图像配对。在BAGEL和Janus-Pro上的大量实验表明，仅对Uni-Edit进行调优即可在所有三种能力上实现全面增强，无需任何辅助操作。

英文摘要

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

2605.21139 2026-05-25 cs.CV cs.LG 版本更新

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏思考，预见行动：面向自动驾驶的认知-物理强化学习

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie

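Affiliations: NJU (Nanjing University); SJTU (Shanghai Jiao Tong University); FDU (Fudan University)

AI summary: End-to-end autonomous driving models are limited by the behavioral cloning ceiling of imitation learning. This paper proposes CoPhy, a cognitive-physical reinforcement learning framework. It distills vision-language model knowledge into a bird's-eye-view (BEV) encoder, retaining cognitive ability at zero inference cost, and builds an autoregressive BEV world model that predicts future semantic maps conditioned on candidate actions, so that the consequences of actions can be foreseen at the physical level. The driving policy is optimized with both a physical reward and a cognitive reward. CoPhy achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks and supports safer, more flexible driving control through user-defined language instructions.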

AI中文摘要

当前的端到端自动驾驶模型从根本上受到模仿学习的行为克隆上限的限制。虽然强化学习提供了更智能自主性的路径，但它需要两个缺失的基础设施：（1）理解交通语义和驾驶意图的认知基础，以及（2）能够预见候选行动后果的前瞻性物理环境。为此，我们提出了CoPhy，一个用于自动驾驶的认知-物理强化学习框架。为了蒸馏思考，我们将VLM知识蒸馏到BEV编码器中，然后完全丢弃VLM，以零推理成本保留认知能力，同时将认知通道作为可插拔接口释放，用于可选的人类语言命令。为了预见行动，我们构建了一个自回归BEV世界模型，该模型明确预测以候选行动为条件的未来语义地图，作为一个可解释的物理沙盒，从中直接推导出安全指标。基于这一双重基础设施，我们通过GRPO优化驾驶策略，采用新颖的双奖励机制：从BEV rollout导出的物理奖励强制执行硬安全约束，而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明，CoPhy不仅在NAVSIM v1和v2基准上取得了最先进的结果，而且通过认知信息化的场景合规性和通过用户定义的语言指令实现的灵活意图控制，实现了更安全的驾驶。

英文摘要

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

2605.18329 2026-05-25 cs.CV cs.LG 版本更新

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

迷失在折叠中：当交叉验证不是用于不确定性估计的深度集成时

Tristan Kirscher, Markus Bujotzek, Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Kim-Celine Kahl, Balint Kovacs, Klaus Maier-Hein

Affiliations: ICube Laboratory, CNRS UMR-7357, University of Strasbourg, Strasbourg, France; CLCC Institut-Strauss, Strasbourg, France; German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing; Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany; Faculty of Mathematics and Computer Science, University of Heidelberg, Germany; Helmholtz Imaging, German Cancer Research Center, Heidelberg, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany

AI summary: In medical image segmentation, ensemble disagreement is often used as a proxy for epistemic uncertainty, yet many studies build ensembles by K-fold cross-validation (CV) and still call them "deep ensembles" (DE), so terminology and implementation diverge. This paper compares a standard 5-fold CV ensemble with a 5-member DE on three multi-rater segmentation datasets. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability. The ensemble construction should therefore follow the research question: DE for reliability-oriented use such as selective referral, and CV ensembles as a proxy for ambiguity.

Comments Accepted for publication at MICCAI 2026

Journal ref: 29th International Conference On Medical Image Computing And Computer Assisted Intervention, Sep 2026, Strasbourg, France

AI中文摘要

集成不一致性被广泛用作医学图像分割中认知不确定性的代理。在实践中，许多研究通过K折交叉验证（CV）形成集成，却称之为“深度集成”（DE）。由于CV成员在不同的数据子集上训练，它们的不一致性混合了种子驱动变异和数据暴露效应，这可能改变不确定性的解释方式。我们审查了最近的分割不确定性研究，发现术语与实现不匹配很常见。然后，我们在三个多模态多标注者分割数据集上，在相同配置下比较了标准5折CV集成与5成员DE（固定训练集，不同随机种子）。我们评估了不确定性在校准、故障检测、歧义建模和分布偏移下的鲁棒性。DE在匹配分割精度的同时改善了校准和故障检测，而CV集成在研究数据集上有时与标注者间变异性相关性更强。因此，应选择与研究问题匹配的集成构建方式：DE用于可靠性导向的使用（如选择性转诊/故障检测），CV集成作为歧义的代理。我们提供了一个轻量级的nnU-Net修改，使得在默认流程内能够进行DE训练。

英文摘要

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
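The difference between the two constructions comes down to which cases each member sees. The sketch below only builds the training configurations; the function names and the dictionary layout are hypothetical and unrelated to the nnU-Net modification mentioned in the abstract.

```python
def cv_ensemble_members(case_ids, k=5):
    """K-fold CV ensemble: member i trains on every fold except fold i."""
    folds = [case_ids[i::k] for i in range(k)]
    members = []
    for i in range(k):
        train = sorted(c for j, fold in enumerate(folds) if j != i for c in fold)
        # Seed and training data both vary, so the two effects are confounded.
        members.append({"seed": i, "train": train})
    return members


def deep_ensemble_members(case_ids, members=5):
    """Deep ensemble: every member trains on the full set; only the seed differs."""
    return [{"seed": s, "train": sorted(case_ids)} for s in range(members)]


cases = list(range(10))
cv = cv_ensemble_members(cases)
de = deep_ensemble_members(cases)
```

In the CV ensemble each member sees a different 80% of the cases, so disagreement reflects data exposure as well as initialization; in the deep ensemble the training set is fixed and only the seed changes.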

2605.15828 2026-05-25 cs.CV 版本更新

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

并非所有任务量化平等：面向视觉几何Transformer的Fisher引导量化

Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang, Zhulin An, Yongjun Xu, Wei Zhang

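Affiliations: Department of Electronic and Computer Engineering, HKUST; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Data Science and Engineering, South China Normal University

AI summary: This paper studies effective quantization for feed-forward 3D reconstruction models such as the Visual Geometry Grounded Transformer (VGGT), to cut memory and compute overhead. Because tasks, blocks, and channels differ in sensitivity to quantization error, the authors propose Fisher-Guided Quantization (FGQ): the diagonal Fisher information matrix quantifies each component's importance per task, and these sensitivities are folded into the learnable affine transformation during calibration so that critical information is better preserved. FGQ outperforms existing baselines across several 3D vision tasks, with up to 39% relative improvement under 4-bit quantization.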

AI中文摘要

以视觉几何基础Transformer（VGGT）为代表的前馈3D重建模型，在单次前向传播中联合预测多个视觉几何任务，如深度估计、相机姿态预测和点云重建。它们已广泛应用于3D视觉应用，但其十亿级参数带来了巨大的内存和计算开销，给设备端部署带来挑战。训练后量化（PTQ）是减少这种开销的有效技术。现有的前馈3D模型PTQ方法主要关注处理重尾激活分布和构建多样化的校准数据集。然而，我们观察到前馈3D模型通过共享骨干网络预测多个几何属性，其中不同的Transformer块和隐藏通道对每个任务的贡献不同，导致不同任务、块和通道对量化误差的敏感性差异显著。因此，平等对待所有任务会过度强调不敏感的任务，并导致敏感任务上的显著精度损失。为解决此问题，我们提出面向前馈3D重建模型的Fisher引导量化（FGQ）。具体地，FGQ使用对角Fisher信息矩阵来量化不同任务、块和通道的敏感性，并在校准期间将这些敏感性纳入可学习仿射变换中，以更好地保留对每个任务最关键的通道和块。在相机姿态估计、点云重建和深度估计上的大量实验表明，FGQ在VGGT上始终优于最先进的量化基线，在4比特量化下实现了高达39%的相对改进。代码可在https://github.com/ypzhng/FGQ获取。

英文摘要

Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization. Code is available at https://github.com/ypzhng/FGQ.

2605.11596 2026-05-25 cs.CV 版本更新

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

HorizonDrive: 用于长时域驾驶仿真的自纠正自回归世界模型

Conglang Zhang, Yifan Zhan, Qingjie Wang, Zhanpeng Ouyang, Yu Li, Zihao Yang, Xiaoyang Guo, Weiqiang Ren, Qian Zhang, Zhen Dong, Yinqiang Zheng, Wei Yin, Zhengqing Chen

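Affiliations: Wuhan University; The University of Tokyo; Horizon Robotics; Tsinghua University; University of Science and Technology of China; The Chinese University of Hong Kong

AI summary: This paper proposes HorizonDrive, a self-corrective autoregressive world model for long-horizon driving simulation. Scheduled rollout recovery keeps the teacher model stable over long autoregressive rollouts, and the teacher's own rollouts then supply unbounded-horizon supervision, enabling minute-scale rollout under bounded memory. Experiments show that HorizonDrive clearly outperforms existing long-horizon streaming baselines on several metrics.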

Comments 22 pages, 14 figures. Project page: https://zcliangyue.github.io/HorizonDrive Code: https://github.com/zcliangyue/HorizonDrive

AI中文摘要

闭环驾驶仿真需要超越短时离线片段的实时交互，推动当前驾驶世界模型向自回归（AR）滚转发展。现有的AR蒸馏方法通常依赖于帧沉或学生端退化训练。前者由于快速的自我运动和场景变化，难以迁移到驾驶场景；后者受限于教师单次输出长度，仅提供有限的监督时域。一个自然的问题是：能否通过AR滚转扩展教师本身，以有限的内存成本提供无限时域的监督？关键困难在于标准教师会在自身预测下漂移，污染其提供的监督。我们的关键见解是使教师具备滚转能力，确保从其自身的AR滚转中获得可靠监督。这实例化为HorizonDrive，一个用于AR驾驶仿真的抗漂移训练与蒸馏框架。首先，计划性滚转恢复（SRR）训练基础模型从预测损坏的历史中重建真实未来片段，得到一个在长AR滚转中保持稳定的教师。其次，通过AR滚转扩展具备滚转能力的教师，在有限内存下提供长时域分布匹配监督，同时短窗口学生通过教师滚转DMD（TRD）与之对齐，以实现高效的实时部署。HorizonDrive原生支持在有限内存下的分钟级AR滚转；在nuScenes上，与最强的长时域流式基线相比，HorizonDrive将FID降低52%，FVD降低37%，并将ARE和DTW分别降低21%和9%，同时与单次驾驶视频生成器保持竞争力。

英文摘要

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.

2605.07590 2026-05-25 cs.CV 版本更新

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

超越防御：面向内在3D点云鲁棒性的流形对齐正则化

Pedro Alonso, Chongshou Li, Tianrui Li

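Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Southwest Jiaotong University

AI summary: Despite progress in point cloud robustness, existing methods rely mostly on data augmentation or defense mechanisms and overlook the geometric nature of adversarial fragility. This paper argues that adversarial vulnerability in 3D networks stems from a mismatch between the latent geometry the model learns and the intrinsic geometry of the point cloud surface, and it proposes a manifold-aligned regularization. The resulting Manifold-Aligned Point Recognition (MAPR) framework improves robustness on multiple datasets without adversarial training or extra data.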

AI中文摘要

尽管点云鲁棒性研究取得了广泛进展，现有方法主要依赖增强策略或防御机制，却忽视了对抗脆弱性的几何本质。我们假设3D网络中的对抗脆弱性源于模型学习的潜在几何与底层表面的内在几何之间的流形错位。沿输入流形的微小几何保持扰动往往在特征空间中引起不成比例的扭曲，可能导致误分类。我们通过建立3D鲁棒性的几何解释来形式化这一现象，将经典对抗理论与点云的内在结构联系起来。受此分析启发，我们提出了流形对齐点识别（MAPR），该框架通过跨内在扰动对齐预测来正则化潜在几何。MAPR为每个点云增强捕获局部曲率和扩散结构的内在特征，并应用保持内在几何保持扰动不变性的一致性损失。在不依赖对抗训练或额外数据的情况下，MAPR在多个数据集上持续提升对多种对抗攻击的鲁棒性，在ModelNet40和ScanObjectNN上分别比原始模型平均提高+20.02和+8.83个百分点的鲁棒性。

英文摘要

Despite extensive progress in point cloud robustness, existing methods primarily rely on augmentation strategies or defense mechanisms while overlooking the geometric nature of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, potentially leading to misclassifications. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness under multiple adversarial attacks across several datasets, achieving average robustness gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.

URL PDF HTML ☆

赞 0 踩 0

2605.06094 2026-05-25 cs.CV cs.AI 版本更新

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD: 通过结构化自蒸馏增强视频推理

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

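Affiliations: HUST (Huazhong University of Science and Technology); Wuhan University; Peking University; Tsinghua University

AI summary: This paper proposes VISD, a structured self-distillation framework for video reasoning. It targets the inefficient learning of video large language models on complex reasoning tasks, caused by sparse rewards and insufficient fine-grained credit assignment. VISD introduces a video-aware judge model that decomposes reasoning quality into dimensions including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for fine-grained token-level supervision. A direction-magnitude decoupling mechanism stably combines dense supervision with reinforcement learning. Experiments show that VISD outperforms strong baselines on several benchmarks and converges nearly 2x faster in optimization steps.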

AI中文摘要

训练视频大语言模型进行复杂推理仍然具有挑战性，原因在于稀疏的序列级奖励以及缺乏对长时间、时间上接地推理轨迹的细粒度信用分配。虽然具有可验证奖励的强化学习提供了可靠的监督，但它无法捕捉令牌级贡献，导致学习效率低下。相反，现有的自蒸馏方法提供密集监督，但缺乏结构和诊断特异性，并且通常与强化学习交互不稳定。在这项工作中，我们提出了VISD，一个结构化自蒸馏框架，为视频推理引入诊断上有意义的特权信息。VISD采用视频感知判断模型，将推理质量分解为多个维度，包括答案正确性、逻辑一致性和时空接地性，并使用这种结构化反馈指导教师策略进行令牌级监督。为了将密集监督与强化学习稳定集成，我们引入了方向-幅度解耦机制，其中由奖励计算的展开级优势决定更新方向，而结构化特权信号调节令牌级更新幅度。这种设计实现了语义对齐和细粒度的信用分配，提高了推理忠实度和训练效率。此外，VISD结合了课程调度和基于指数移动平均的教师稳定化，以支持长视频序列上的鲁棒优化。在多个基准上的实验表明，VISD始终优于强基线，提高了答案准确性和时空接地质量。值得注意的是，VISD在优化步骤中实现了近2倍的收敛速度，突出了结构化自监督在提高视频大语言模型性能和样本效率方面的有效性。

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
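The direction-magnitude decoupling described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the linear scaling rule, and the `min_scale`/`max_scale` parameters are hypothetical, since the abstract does not give the exact formula.

```python
def token_weights(advantage, privileged_scores, min_scale=0.5, max_scale=1.5):
    """Per-token update weights for one rollout.

    The sign comes only from the rollout-level advantage (direction).
    Per-token privileged scores in [0, 1] only rescale the size of the
    update (magnitude). The linear scaling range is an assumption.
    """
    direction = (advantage > 0) - (advantage < 0)  # -1, 0, or +1
    weights = []
    for score in privileged_scores:
        score = min(1.0, max(0.0, score))  # clamp to [0, 1]
        scale = min_scale + (max_scale - min_scale) * score
        weights.append(direction * abs(advantage) * scale)
    return weights
```

Because the scale factor is always positive, a dense per-token signal can never flip the direction chosen by the reward, which is one plausible reading of why the combination stays stable.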

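2605.06088 2026-05-25 cs.CV (version update)

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

Affiliations: Technical University of Munich; Google; Munich Center for Machine Learning; Visualais

AI summary: This paper proposes OpenGaFF, a new framework for open-vocabulary 3D scene understanding. Built on 3D Gaussian Splatting, it introduces a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance, which strengthens the coupling between geometry and semantics and improves semantic consistency in 3D space. The authors also design a structured codebook and a codebook-guided attention mechanism that enable robust open-vocabulary reasoning and reduce intra-object feature variance. Experiments show that the method outperforms existing approaches on several standard 2D and 3D open-vocabulary benchmarks, with better segmentation quality and stronger 3D semantic consistency.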

Abstract

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
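The codebook-guided retrieval step can be sketched as attention over codebook entries. This is an illustrative sketch only: the cosine similarity, the softmax temperature, and the split into `keys` and `values` are assumptions, not details confirmed by the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def codebook_attention(query, keys, values, temperature=0.1):
    """Retrieve a language feature as a similarity-weighted mix of codebook values.

    keys: one embedding per codebook entry, matched against the query.
    values: the language feature stored for each entry.
    """
    sims = [cosine(query, k) / temperature for k in keys]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    feature = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return feature, weights
```

Because every query is mapped onto the same small set of shared entries, two Gaussians from the same object tend to receive nearly identical features, which is how a codebook can reduce intra-object variance.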

2605.05997 2026-05-25 cs.CV (version update)

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

Affiliations: Tsinghua University, SIGS; Meituan; The Chinese University of Hong Kong; National University of Singapore; LMMs-Lab; University of California, Los Angeles

AI summary: This paper proposes 4DThinker, a framework that gives vision-language models (VLMs) four-dimensional (4D) dynamic spatial reasoning through dynamic latent mental imagery. It introduces an annotation-free data generation pipeline and Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text and 4D latents to strengthen the model's grasp of dynamic visual semantics. Reward-based 4D Reinforcement Learning (4DRL) further improves performance on complex reasoning tasks. Experiments show that the method outperforms existing approaches on multiple dynamic spatial reasoning benchmarks.

Comments 21 pages, 16 figures

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
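Restricting policy gradients to text tokens can be pictured as a masked loss. The sketch below uses a plain REINFORCE-style objective as a stand-in; the function name and the mean reduction are assumptions, and the paper's actual 4DRL objective may differ.

```python
def masked_policy_loss(log_probs, is_text_token, advantage):
    """Policy-gradient loss averaged over text tokens only.

    log_probs: per-position log-probabilities of the sampled sequence.
    is_text_token: True for text tokens, False for 4D latent positions.
    Latent positions are excluded, so they contribute no gradient.
    """
    selected = [lp for lp, is_text in zip(log_probs, is_text_token) if is_text]
    if not selected:
        return 0.0
    return -advantage * sum(selected) / len(selected)
```

Changing the log-probability at a masked latent position leaves the loss untouched, which mirrors the abstract's point that the reward signal only updates the textual part of the trajectory.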

2605.01018 2026-05-25 cs.CV (version update)

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

Affiliations: The University of Queensland; Stanford University; The Australian National University; Zhejiang University; Murdoch University; The University of Adelaide

AI summary: WildTableBench is a benchmark for evaluating how well multimodal foundation models understand table images in real-world settings. The study introduces a dataset of 402 real table images from different domains and 928 manually annotated questions that test structural perception and numerical reasoning. Experiments show that current mainstream multimodal models generally score low on the benchmark, with only one model exceeding 50% accuracy, which reveals significant remaining weaknesses on complex table images.

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding. Dataset: https://huggingface.co/datasets/jzhuang/WildTableBench Code: https://github.com/hjzhe/WildTableBench Leaderboard: https://hjzhe.github.io/WildTableBench

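2604.27247 2026-05-25 cs.CV (version update)

Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

Thorsten Hoeser, Verena Huber-Garcia, Sarah Asam, Ursula Gessner, Claudia Kuenzer

Affiliations: Earth Observation Center (EOC), German Aerospace Center (DLR)

AI summary: This paper aims to produce generalizable, nationwide maps of hedges and linear woody features from Earth observation data to support ecological management and conservation. It proposes a modular workflow with a flexible data interface and a deep neural network, which respectively generate woody vegetation masks and separate linear from nonlinear structures. The method was applied across Germany to three data sources of different resolutions, producing high-quality linear woody feature maps without retraining the model and performing well across multiple evaluation regions.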

Comments 33 pages, 17 figures

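AI Chinese abstract (translated)

Hedges and other linear woody features provide valuable ecosystem services in intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity, not only because of their highly varied flora but also as foraging, resting, and nesting sites for many animals and insects, including valuable pollinators. They therefore require dedicated management, protection, and attention. Systematic, large-scale mapping of these features from Earth observation data is of great importance. However, given the variation in sensor types, spatial resolutions, data acquisition conditions, and the complex landscape variability of study regions, transferable and reusable workflows for mapping linear woody features remain a key methodological challenge. We introduce a modular workflow built around two independently optimizable components: first, a flexible input data interface that consolidates heterogeneous Earth observation data into binary woody vegetation masks; second, a deep neural network trained to distinguish linear from nonlinear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps covering all of Germany from three input sources (with spatial resolutions of 0.73 m, 1 m, and 3 m) using a single trained model, without retraining. An evaluation against fine-grained reference data from biotope mapping campaigns in four federal states, together with a comparison against two existing linear woody feature maps, shows that the workflow yields competitive results at all evaluation sites nationwide. Its modular design and national-scale applicability provide a basis for scalable and generalizable mapping of linear woody features beyond Germany.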

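Visually-Guided Policy Optimization for Multimodal Reasoning

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

Affiliations: AMAP, Alibaba Group; SYSU (Sun Yat-sen University); BUPT (Beijing University of Posts and Telecommunications)

AI summary: This study addresses insufficient visual attention in vision-language models during multimodal reasoning. It proposes Visually-Guided Policy Optimization (VGPO), a framework that strengthens visual focus during reasoning through a visual attention compensation mechanism and a dual-grained advantage re-weighting strategy. Experiments show that VGPO improves performance on mathematical multimodal reasoning and vision-dependent tasks and makes notably better use of visual information.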

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.
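The dual-grained re-weighting in this abstract can be sketched roughly as below. Everything specific here is an assumption for illustration: the ratio-to-mean weights, the function name, and the use of raw activation sums are hypothetical and are not the formulas from the paper.

```python
def reweight_advantages(advantages, visual_activations):
    """Turn trajectory-level advantages into per-token advantages.

    advantages: one scalar per trajectory.
    visual_activations: for each trajectory, a list of non-negative
        per-token visual activation values.
    Inter-trajectory weight: a trajectory's total activation relative
        to the group mean.
    Intra-trajectory weight: a token's activation relative to its
        trajectory mean.
    """
    totals = [sum(acts) for acts in visual_activations]
    mean_total = sum(totals) / len(totals)
    reweighted = []
    for adv, acts, total in zip(advantages, visual_activations, totals):
        inter = total / mean_total if mean_total > 0 else 1.0
        mean_act = total / len(acts)
        token_advs = []
        for act in acts:
            intra = act / mean_act if mean_act > 0 else 1.0
            token_advs.append(adv * inter * intra)
        reweighted.append(token_advs)
    return reweighted
```

With equal starting advantages, the trajectory that attends more to the image overall receives larger per-token advantages, and within it the visually active tokens receive the largest share.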

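2604.06885 2026-05-25 cs.CV (version update)

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

Affiliations: Radiology, Department of Surgical Sciences; Antaros Medical; Molecular Imaging and Medical Physics, Department of Surgical Sciences

AI summary: This study proposes a deep regression framework based on FDG-PET/CT images to predict overall survival (OS) in patients with non-small cell lung cancer, taking a time variable as an additional input to enable time-driven survival analysis. The method extracts image features with ResNet-50 and fuses them with the time input to produce survival probability predictions that vary over time. Experiments show that the method outperforms the baseline in AUC and that an ensemble combining clinical and imaging features performs best, supporting the complementary value of multimodal data for survival prediction.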

Comments Under review

DOI: 10.1007/s10439-026-04181-y
Journal ref: Ann Biomed Eng (2026)

Abstract

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2 or 5 years. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provides an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.
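The idea of parameterizing the prediction by time can be sketched with a tiny head that takes an image embedding plus a time horizon. This is a simplified stand-in under stated assumptions: the single linear layer, the `horizon_days` normalization, and all parameter names are hypothetical, and the paper's network is not specified at this level in the abstract.

```python
import math

def survival_probability(embedding, t_days, weights, time_weight, bias,
                         horizon_days=3650.0):
    """Probability of being alive at t_days, given an image embedding.

    The time input is scaled to roughly [0, 1] and combined with the
    embedding in one linear layer followed by a sigmoid. A negative
    time_weight makes the probability fall as the horizon grows.
    """
    t = t_days / horizon_days
    z = sum(w * e for w, e in zip(weights, embedding)) + time_weight * t + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Querying the same embedding at several horizons (for example 365, 730, and 1825 days) traces out a survival curve for one patient, instead of a single prediction at a fixed interval as in the baseline.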

2603.24985 2026-05-25 cs.CV (version update)

Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan

Affiliations: Department of Systems and Computer Engineering, Carleton University; Department of Radiology, Radiation Oncology, and Medical Physics, University of Ottawa; Division of Cardiology, Department of Medicine, University of Ottawa Heart Institute

AI summary: This study addresses the challenge of left atrial wall segmentation in 3D late gadolinium enhancement MRI (LGE-MRI). It proposes a framework based on model-agnostic meta-learning with a 3D residual U-Net for segmentation from few samples (5, 10, or 20). Joint training on the left atrial wall and auxiliary left and right atrial cavity tasks, together with a boundary-aware composite loss, improves segmentation of thin structures. Experiments show that the method outperforms conventional fine-tuning in the few-shot setting and is robust across data domains, which may help reduce the annotation burden for cardiac remodeling assessment.

Comments Accepted to IEEE EMBC 2026

Abstract

Segmenting the left atrial (LA) wall from late gadolinium enhancement magnetic resonance imaging (LGE-MRI) is challenging because of its thin geometry, low contrast, and limited expert annotations. We propose a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net backbone for K-shot (K = 5, 10, 20) LA wall segmentation. The framework is meta-trained on LA wall tasks together with auxiliary LA and right atrial (RA) cavity tasks and uses a boundary-aware composite loss to improve thin-structure delineation. We evaluated MAML on a held-out clean test set and assessed its robustness under an unseen synthetic domain shift and on a local cohort. On the held-out clean test set, MAML outperformed the K-shot fine-tuning baseline at 5-shot, achieving Dice coefficient (DSC) = 0.54 versus 0.48 and Hausdorff distance (HD95) = 4.60 versus 6.40 mm. At 20-shot, MAML approached the fully supervised model trained from scratch, with DSC = 0.59 versus 0.61. Under unseen shifts, performance decreased relative to clean testing but improved consistently as K increased. At 5-shot, MAML achieved DSC = 0.52 and HD95 = 5.02 mm under the unseen synthetic shift, and DSC = 0.50 and HD95 = 5.43 mm on the local cohort. These results suggest that meta-learning can improve thin-wall delineation in low-shot adaptation and may reduce the annotation burden for atrial remodeling assessment.
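For readers unfamiliar with the Dice coefficient (DSC) values quoted above, the standard definition can be written in a few lines. The flat 0/1 list input and the convention of returning 1.0 for two empty masks are choices made for this sketch.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as flat 0/1 sequences.

    DSC = 2 * |pred AND target| / (|pred| + |target|).
    Two empty masks are treated as a perfect match (a common convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For thin structures such as the atrial wall, a shift of a single voxel removes a large share of the overlap, which is one reason DSC values around 0.5 to 0.6 are reported here even for the fully supervised model.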

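2603.17879 2026-05-25 cs.CV cs.AI (version update)

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

Podakanti Satyajith Chary, Nagarajan Ganapathy

Affiliations: Department of Engineering Science, IIT Hyderabad; Department of Biomedical Engineering, IIT Hyderabad

AI summary: This paper proposes a multi-label temporal event detection framework for video capsule endoscopy (VCE) that tackles the severe class imbalance in the Galar dataset through two core contributions: an Angular Separation Loss and a Biological State Machine decoder. Built on BiomedCLIP, the framework fuses consecutive frames with a Local Differencing Attention module to amplify pathological signals, and uses an Anatomy Context Head that conditions pathology predictions on soft anatomical activations. Experiments show a marked improvement in detection performance on the RARE-VISION test set, with higher mean average precision.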

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

Abstract

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2-3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
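The Angular Separation Loss is described as a penalty on off-diagonal cosine similarity between class prototypes. A minimal sketch of that idea follows; squaring the similarities and averaging over pairs are assumptions made here, since the abstract does not state the exact form.

```python
import math

def angular_separation_loss(prototypes):
    """Mean squared cosine similarity over all pairs of distinct prototypes.

    prototypes: one vector (list of floats) per class.
    The loss is 0 when prototypes are mutually orthogonal and 1 when
    they all point in the same direction (prototype collapse).
    """
    normed = []
    for p in prototypes:
        norm = math.sqrt(sum(x * x for x in p)) or 1.0  # guard zero vectors
        normed.append([x / norm for x in p])
    total, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(len(normed)):
            if i != j:  # skip the diagonal (self-similarity)
                cos = sum(a * b for a, b in zip(normed[i], normed[j]))
                total += cos * cos
                pairs += 1
    return total / pairs if pairs else 0.0
```

Under heavy imbalance, prototypes of rare classes tend to drift toward those of frequent classes; a penalty of this kind pushes them apart regardless of how many samples each class has.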

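2603.10688 2026-05-25 cs.RO cs.CV (version update)

GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge for Efficient Video Reward Modeling

Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign; Adobe Inc.

AI summary: This study proposes GT-SVJ, a generative-transformer-based self-supervised video judge that models video rewards more efficiently in order to align video generation models with human preferences. Unlike methods that rely on vision-language models, GT-SVJ recasts a state-of-the-art video generation model as an energy-based model, which captures subtle temporal dynamics and judges video quality precisely. Synthetic negative samples with controlled degradations push the model to learn meaningful spatiotemporal features. Experiments show that it outperforms existing methods on several benchmarks while using only 30K human annotations.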

Abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
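Two ingredients of this abstract, frame-shuffled negatives and a contrastive objective over energies, can be sketched as follows. The list of per-frame items stands in for per-frame latents, and the logistic pairwise loss is an assumption; the abstract does not specify which contrastive objective is used.

```python
import math
import random

def shuffle_frames(frames, rng):
    """Build a synthetic negative by permuting the frame order.

    frames: a list of per-frame items (stand-ins for per-frame latents).
    The permutation is forced to differ from the original order.
    """
    order = list(range(len(frames)))
    rng.shuffle(order)
    if len(order) > 1 and order == sorted(order):
        order = order[1:] + order[:1]  # rotate so the order always changes
    return [frames[i] for i in order]

def contrastive_energy_loss(energy_clean, energy_degraded):
    """Logistic loss that is small when the clean video has lower energy."""
    return math.log1p(math.exp(energy_clean - energy_degraded))

frames = ["f0", "f1", "f2", "f3"]
negative = shuffle_frames(frames, random.Random(0))
```

A shuffled clip contains exactly the same frames as the original, so the only cue that separates it from the clean video is temporal order, which is the kind of feature the negatives are meant to force the model to learn.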

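2602.05126 2026-05-25 cs.CV (version update)

CLEAR-HPV: Interpretable concept discovery for human-papillomavirus-associated morphology in whole-slide histology

Weiyi Qin, Yingci Liu-Swetz, Shiwei Tan, Hao Wang

Affiliations: Department of Computer Science, Rutgers University; Rutgers Health; Rutgers University

AI summary: CLEAR-HPV is an interpretability framework for analyzing HPV-associated morphology in histopathology slides of cervical and head and neck cancers, designed to address the limited morphologic interpretability of attention-based multiple instance learning (MIL) models. By restructuring the MIL latent space, it automatically discovers key morphologic concepts such as keratinizing, basaloid, and stromal without concept labels, and produces the corresponding spatial concept maps and compact concept-fraction vectors. This yields highly interpretable representations while preserving predictive performance.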

Abstract

Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.

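2601.19117 2026-05-25 eess.IV cs.CV stat.AP (version update)

Optimized $k$-means color quantization of digital images in machine-based and human perception-based colorspaces

Ranjan Maitra

Affiliations: Department of Statistics, Iowa State University

AI summary: This study examines $k$-means color quantization of digital images in different colorspaces, comparing RGB, CIE-XYZ, and CIE-LUV/CIE-HCL at several quantization levels. Using the Visual Information Fidelity (VIF) measure to assess image quality, it finds that $k$-means performs best in RGB space in about half of the cases, that CIE-XYZ usually does better at higher quantization levels, and that CIE-LUV does better in some cases at lower levels. It also analyzes how the distributions of hue, chromaticity, and luminance affect the choice of colorspace, giving more detailed guidance for color quantization in different settings.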

Comments 25 pages, 11 figures, 5 tables, accepted in the Journal of Electronic Imaging

DOI: 10.1117/1.JEI.35.2.023002
Journal ref: Journal of Electronic Imaging, Vol. 35, Issue 2, 023002 (Mar 2026)

Abstract

Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, $k$-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels ($k$), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for $k$-means color quantization.
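As a concrete picture of $k$-means color quantization, here is a minimal sketch of Lloyd's algorithm on RGB pixels in plain Python. It covers only the RGB case: the colorspace conversions compared in the paper are omitted, and the random initialization, fixed iteration count, and function names are assumptions for this example, not the optimized procedure the paper studies.

```python
import random

def nearest(color, centers):
    """Index of the center closest to `color` in squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(color, centers[c])),
    )

def kmeans_quantize(pixels, k, iters=20, seed=0):
    """Quantize a list of (R, G, B) tuples to at most k colors.

    Returns the quantized pixels and the final cluster centers.
    """
    rng = random.Random(seed)
    centers = [tuple(float(v) for v in p) for p in rng.sample(pixels, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            clusters[nearest(p, centers)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return [centers[nearest(p, centers)] for p in pixels], centers
```

To compare colorspaces in the spirit of the paper, one would convert the pixels to the target space, run the same routine there, and convert the centers back to RGB before scoring the result.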

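2601.15224 2026-05-25 cs.CV cs.CL (version update)

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

Affiliations: Northwestern University; University of California, Santa Barbara

AI summary: This paper presents ProgressLM, which addresses the limited ability of vision-language models to reason about task progress. It introduces Progress-Bench, a benchmark for systematically evaluating progress reasoning, and explores a human-inspired two-stage reasoning paradigm through training-free prompting and a training-based approach built on the ProgressLM-45K dataset. Experiments show that most existing models perform poorly at task progress estimation, while the training-based ProgressLM-3B achieves consistent gains even at small scale and generalizes well.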

Comments ACL 2026 Camera Ready Version

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

2601.14821 2026-05-25 cs.CV 版本更新

POTR: Post-Training 3DGS Compression

POTR：训练后3DGS压缩

Bert Ramlot, Martijn Courteaux, Peter Lambert, Glenn Van Wallendael

Affiliations * IDLab-MEDIA research group; Ghent University; imec

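AI Summary: This paper proposes POTR, a post-training compression method for 3D Gaussian Splatting (3DGS) that targets its high storage requirements. It introduces an efficient pruning technique that uses a modified 3DGS rasterizer to compute the removal effect of every splat simultaneously, substantially reducing the splat count and speeding up inference. It also proposes a training-free method to recompute lighting coefficients, which markedly lowers their entropy and increases their sparsity. Experiments show that POTR outperforms existing methods in both compression performance and inference speed.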

Comments 15 pages, 12 figures. Submitted to IEEE TCSVT, under review

详情

DOI: 10.1109/TCSVT.2026.3685779
Journal ref: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026

AI中文摘要

3D高斯泼溅（3DGS）最近在3D场景重建和实时新视角合成中成为神经辐射场（NeRF）的有力竞争者。3DGS在训练和推理速度上优于NeRF，但存储需求显著更高。为解决这一缺点，我们提出POTR，一种基于两项新技术的训练后3DGS编解码器。首先，POTR引入一种新颖的剪枝方法，使用修改后的3DGS光栅化器同时高效计算每个泼溅的单独移除效果。该技术相比其他训练后剪枝技术减少2-4倍的泼溅数量，并因此显著加速推理，实验表明其推理速度比其他压缩模型快1.5-2倍。其次，我们提出一种重新计算光照系数的新方法，在不使用任何训练的情况下显著降低其熵。我们的快速且高度并行的方案特别增加了AC光照系数的稀疏性，实验表明稀疏性从70%提升到97%，且质量损失极小。最后，我们通过简单的微调方案扩展POTR，以进一步增强剪枝、推理和率失真性能。实验表明，即使没有微调，POTR在率失真性能和推理速度上始终优于所有其他训练后压缩技术。

英文摘要

3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel pruning approach that uses a modified 3DGS rasterizer to efficiently calculate every splat's individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training pruning techniques and as a result also significantly accelerates inference with experiments demonstrating 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient sparsity, with experiments demonstrating increases from 70% to 97%, with minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance pruning, inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training compression techniques in both rate-distortion performance and inference speed.
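The sparsity figures quoted in the abstract (70% rising to 97% for AC lighting coefficients) can be read as the fraction of coefficients that are zero. The sketch below shows one way to measure that fraction. It is not POTR's code: the function name `sparsity`, the `tol` threshold, and the row-per-splat layout of the toy data are assumptions made for illustration.

```python
def sparsity(coeffs, tol=0.0):
    """Return the fraction of coefficients whose magnitude is <= tol.

    `coeffs` is assumed to be a list of rows, one row of AC coefficients
    per splat (a hypothetical layout chosen for this example).
    """
    flat = [c for row in coeffs for c in row]
    if not flat:
        raise ValueError("no coefficients given")
    zeros = sum(1 for c in flat if abs(c) <= tol)
    return zeros / len(flat)


# Toy data: two splats with three AC coefficients each.
toy = [[0.0, 0.0, 1.0],
       [0.0, 2.0, 0.0]]
print(f"sparsity = {sparsity(toy):.2%}")
```

A higher sparsity generally helps an entropy coder, since long runs of zeros are cheap to encode; how POTR actually codes the coefficients is not described in the abstract.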

2512.18363 2026-05-25 cs.CV Version update

Enhancing 3D Semantic Scene Completion with a Refinement Module

使用精化模块增强3D语义场景补全

Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song

Affiliations * National Science Center for Earthquake Engineering, Tianjin University; School of Civil Engineering, Tianjin University; Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich

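AI Summary: This paper proposes ESSC-RM, an enhanced semantic scene completion framework whose plug-and-play refinement module integrates seamlessly into existing semantic scene completion models. It follows a two-stage strategy: a base network first produces coarse voxel predictions, which are then refined under multi-scale supervision by a 3D U-Net-based prediction noise-aware module and a voxel-level local geometry module. Experiments on the SemanticKITTI dataset show that ESSC-RM significantly improves semantic prediction performance, supporting its use as a general refinement framework.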

Comments 19 pages, 8 figures

2511.17171 2026-05-25 cs.CV cs.LG Version update

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

FireScope: 基于思维链预言机的野火风险栅格预测

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

Affiliations * ETH Zurich

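AI Summary: This paper proposes FireScope, a framework that predicts wildfire risk rasters by reasoning over visual, climatic, and geographic information. It introduces FireScope-Bench, a dataset that combines Sentinel-2 satellite imagery, climate data, and expert-defined risk maps for cross-continental evaluation. Built on a vision-language model and trained with reinforcement learning and visual supervision, FireScope generates risk maps accompanied by reasoning traces, substantially improving cross-continental generalization and interpretability. The authors present it as the first work to show that language-based reasoning can improve generalization in visual generation, and as the first high-resolution wildfire risk model applicable across continents.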

Comments CVPR 2026, Project Page: https://firescope.ai/research

详情

AI中文摘要

预测野火风险是一个推理密集型的空间问题，需要整合视觉、气候和地理因素来推断连续的风险地图。现有方法缺乏可靠泛化所需的因果推理和多模态理解。我们引入了FireScope-Bench，一个大规模数据集和基准，将Sentinel-2图像和气候数据与专家定义的全美风险栅格以及欧洲的真实野火事件配对，用于跨大陆评估。基于此数据集，我们提出了FireScope，一个基于VLM的推理到生成框架，从强化学习和视觉监督中学习，通过互补的推理轨迹预测风险栅格。当在美国训练并在欧洲测试时，FireScope取得了显著的性能提升，而专家反馈和自动化分析证实其推理轨迹是忠实且有语义意义的。我们的发现表明，推理可以支撑栅格预测模型，提高泛化性和可解释性。据我们所知，这是第一个（1）证明基于语言的推理可以改善视觉生成泛化性的框架，（2）提出一个可跨大陆应用的高分辨率野火风险模型，以及（3）能够系统研究多模态火灾风险模型稳健跨大陆泛化的框架。我们相信FireScope-Bench有潜力成为推动推理驱动、可解释和可泛化空间建模的基础。数据和源代码将公开提供。

英文摘要

Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

2511.14286 2026-05-25 cs.CV Version update

NeuralBoneReg: An Instance-Specific Label-Free Point Cloud-Based Method for Multi-Modal Bone Surface Registration

NeuralBoneReg：一种用于多模态骨表面注册的实例特定无标签点云方法

Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Yunke Ao, Roman Flepp, Aidana Massalimova, Lilian Calvet, Philipp Fürnstahl

Affiliations * Research in Orthopedic Computer Science, Balgrist University Hospital, University of Zurich; AI Center, ETH Zurich

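AI Summary: In computer-assisted orthopedic surgery, accurate registration between preoperative imaging and intraoperative data is essential to surgical planning. This paper proposes NeuralBoneReg, a label-free, point cloud-based neural registration method that learns the preoperative bone model with an implicit neural network and uses a multilayer perceptron for global initialization and local refinement, achieving robust cross-modal bone surface registration. The method requires no inter-subject training data, and experiments show strong performance on multiple datasets along with good generalization across anatomies and modalities.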

详情

AI中文摘要

在计算机和机器人辅助骨科手术（CAOS）中，基于术前影像的患者特定手术计划定义了目标位置和植入物轨迹。在手术过程中，这些计划必须准确传递，依赖于术前和术中数据之间的精确交叉注册。然而，不同成像模态之间的显著异质性使得这种注册具有挑战性且容易出错。因此，鲁棒、自动且与模态无关的骨表面注册在临床上非常重要。我们提出了NeuralBoneReg，一个自监督的基于表面的框架，使用3D点云作为与模态无关的表示来注册骨表面。NeuralBoneReg包括两个模块：一个学习术前骨模型的隐式神经无符号距离场（UDF），以及一个基于MLP的注册模块，通过生成变换假设来对齐术中点云与神经UDF，从而执行全局初始化和局部细化。与最先进的监督方法不同，NeuralBoneReg以自监督方式运行，无需跨受试者的训练数据。我们在两个公开的多模态数据集上评估了NeuralBoneReg与基线方法的性能：一个腓骨和胫骨的CT-超声数据集（UltraBones100k）和一个脊柱椎骨的CT-RGB-D数据集（SpineDepth）。评估还包括一个新引入的包含股骨和骨盆的尸体的CT-超声数据集（UltraBones-Hip），该数据集将公开提供。NeuralBoneReg在所有数据集上匹配或超越现有方法，在UltraBones100k上平均RRE/RTE为1.83°/2.02 mm，在UltraBones-Hip上为1.90°/1.56 mm，在SpineDepth上为3.78°/2.80 mm。这些结果证明了跨解剖结构和模态的强泛化能力，为CAOS提供了鲁棒且准确的跨模态对齐。

英文摘要

In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.83°/2.02 mm on UltraBones100k, 1.90°/1.56 mm on UltraBones-Hip, and 3.78°/2.80 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
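The abstract reports results as RRE/RTE pairs in degrees and millimetres without defining them. The sketch below assumes the definitions commonly used in rigid registration: relative rotation error as the geodesic angle between the estimated and ground-truth rotations, and relative translation error as the Euclidean distance between the translations. The paper may compute them differently; function names and the toy transform are illustrative only.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_gt^T @ R_est.

    Both inputs are 3x3 rotation matrices given as nested lists.
    trace(R_gt^T @ R_est) equals the element-wise product sum.
    """
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors (same unit as input)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the toy example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_error_deg(rot_z(2.0), identity))           # residual angle in degrees
print(translation_error([1.0, 2.0, 2.0], [0.0, 0.0, 0.0]))  # distance in input units
```

Under these definitions, a mean RRE/RTE of 1.83°/2.02 mm would mean the estimated pose is on average rotated by under two degrees and shifted by about two millimetres relative to the ground truth.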

2511.11051 2026-05-25 cs.CV Version update

Dynamic Weight-Based Temporal Aggregation for Low-Light Video Enhancement under Extreme Noise

Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai

Affiliations * Visual Information Laboratory, University of Bristol, United Kingdom

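AI Summary: This paper studies low-light video enhancement (LLVE) under extreme noise. Because existing learning-based methods handle heavy real-world noise poorly, it proposes DWTA-Net, a new deep-learning framework with a recurrent design. The method uses a two-stage architecture: the first stage performs temporally consistent Mamba-based enhancement via multi-frame alignment, and the second stage performs recurrent refinement through optical-flow-guided temporal aggregation with dynamic weights. Experiments show that DWTA-Net outperforms state-of-the-art methods in noise suppression and detail preservation.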

详情

AI中文摘要

低光视频增强（LLVE）由于噪声、低对比度和颜色退化而具有挑战性。虽然基于学习的方法能够实现快速推理，但由于未能充分利用长期时间线索，它们在严重的真实噪声下常常失败。我们提出了DWTA-Net，一种新颖的基于深度学习的递归LLVE框架，采用递归设计。DWTA-Net采用集成的两阶段架构：第一阶段通过多帧对齐恢复局部结构和颜色，实现时间一致的基于Mamba的增强；第二阶段使用新颖的基于动态权重的时间聚合（由光流引导）进行递归细化，作为适应运动的递归去噪器。我们进一步引入了一种纹理自适应损失，在保留纹理区域细节的同时抑制均匀区域中的噪声。在真实低光视频上的实验表明，DWTA-Net实现了更强的噪声抑制和更少的伪影，与最先进的方法相比，提供了优越的视觉质量。

英文摘要

Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradation. While learning-based methods enable fast inference, they often fail under heavy real-world noise because they do not sufficiently exploit long-term temporal cues. We propose DWTA-Net, a novel deep-learning LLVE framework with a recurrent design. DWTA-Net adopts an integrated two-stage architecture: Stage I restores local structure and color via multi-frame alignment for temporally consistent Mamba-based enhancement, while Stage II performs recurrent refinement using a novel dynamic weight-based temporal aggregation guided by optical flow, functioning as a recurrent denoiser that adapts to motion. We further introduce a texture-adaptive loss that preserves fine details in textured regions while suppressing noise in homogeneous areas. Experiments on real-world low-light footage show that DWTA-Net achieves stronger noise suppression and fewer artifacts, delivering superior visual quality compared with state-of-the-art methods.

2510.00948 2026-05-25 cs.CV Version update

InfVSR: Toward Consistency-Driven Streaming Generative Video Super-Resolution

InfVSR：迈向一致性驱动的流式生成视频超分辨率

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

Affiliations * Shanghai Jiao Tong University

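AI Summary: This paper proposes InfVSR, a generative video super-resolution method that addresses the low efficiency and poor temporal consistency of processing long video sequences. It reformulates video super-resolution as an autoregressive one-step diffusion framework, adapting a pretrained DiT into a causal structure with a rolling key-value cache to enable streaming inference while maintaining local and global consistency. It also introduces patch-wise pixel supervision and cross-chunk distribution matching, which significantly improve processing efficiency, and builds an evaluation benchmark for long videos.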

Comments Code and model are available at https://github.com/Kai-Liu001/InfVSR

详情

AI中文摘要

真实世界视频通常包含数千帧。然而，现有的生成式视频超分辨率（VSR）方法在处理长序列时面临两个持续挑战：（1）由于对全长序列进行多步去噪的高成本导致的低效率；（2）时间分解导致伪影和不连续性，阻碍了良好的一致性。为突破这些限制，我们提出InfVSR，将VSR重新构建为自回归单步扩散范式，并利用视频扩散先验实现流式推理。首先，我们将预训练的DiT适配为因果结构，通过滚动KV缓存和联合视觉引导保持局部和全局一致性。其次，我们通过逐块像素监督和跨块分布匹配，高效地将扩散过程蒸馏为单步。为填补长视频评估的空白，我们构建了一个针对扩展序列的新基准，并引入语义级指标以全面评估时间一致性。我们的方法推动了长视频VSR的前沿，实现了具有增强语义一致性的最先进质量，并相比现有方法（如MGLD-VSR）提供了高达58倍的加速。我们的代码和模型可在https://github.com/Kai-Liu001/InfVSR获取。

英文摘要

Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency, because temporal decomposition causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm, and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.
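The rolling KV-cache mentioned in the abstract can be pictured as a fixed-size window over the most recent chunks, so memory stays bounded however long the stream runs. The sketch below illustrates only that general idea; it is not taken from the InfVSR repository, and the class name `RollingKVCache`, the `max_chunks` parameter, and the string placeholders standing in for key/value tensors are all hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Keep keys/values for only the most recent `max_chunks` chunks."""

    def __init__(self, max_chunks):
        if max_chunks < 1:
            raise ValueError("max_chunks must be at least 1")
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._keys = deque(maxlen=max_chunks)
        self._values = deque(maxlen=max_chunks)

    def append(self, key_chunk, value_chunk):
        """Store the keys/values produced while processing one chunk."""
        self._keys.append(key_chunk)
        self._values.append(value_chunk)

    def context(self):
        """Return cached keys and values, oldest first, for the next chunk to attend to."""
        return list(self._keys), list(self._values)


# Stream five chunks through a cache that holds three.
cache = RollingKVCache(max_chunks=3)
for i in range(5):
    cache.append(f"k{i}", f"v{i}")
keys, values = cache.context()
print(keys)    # only the three most recent chunks remain
print(values)
```

The trade-off in any such window is between memory and how far back the model can look; the abstract indicates InfVSR complements the cache with joint visual guidance to maintain global coherence, which this sketch does not model.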

2508.18958 2026-05-25 cs.CV cs.AI Version update

A drone-based framework for coral habitat mapping via weakly supervised segmentation

基于弱监督分割的无人机珊瑚栖息地制图框架

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

Affiliations * IFREMER Délégation Océan Indien (DOI); INRIA, LIRMM, Université de Montpellier, CNRS; UMR Marbec, IRD, Université de Montpellier, CNRS, Ifremer; CNRS, LIRMM, Université de Montpellier

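AI Summary: This paper proposes a drone-based weakly supervised segmentation framework for coral habitat mapping. By combining fine-scale multi-label predictions from underwater imagery with broad-coverage aerial data, it trains high-resolution segmentation models without pixel-level annotations. Validated on coral reef imagery, the method enables large-area segmentation of coral morphotypes and reaches 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU), demonstrating its efficiency and applicability for ecological monitoring.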

Comments Extended journal version of "The Point is the Mask: Scaling coral reef segmentation with weak supervision"

详情

DOI: 10.1007/s10044-026-01682-3

AI中文摘要

在大空间范围内获取像素级标注仍然是机器学习在生态应用中部署的主要瓶颈。本文提出了一种多尺度弱监督语义分割（WSSS）框架，能够利用密集的、基于分类的输出训练高分辨率分割模型。我们的方法将来自水下图像的细粒度多标签预测与广覆盖的航空数据相结合。将这些点级分类转换为粗监督掩码，用于训练无人机（UAV）正射影像上的语义分割模型。然后使用模型自身的细化预测进行第二步训练，以进一步提高空间精度，无需额外标注。我们在珊瑚礁图像上展示了该方法，实现了珊瑚形态类型的大面积分割，并展示了其整合新类别的灵活性。最终模型在人工标注的礁区上达到86.07%的像素精度和52.23%的平均交并比（mIoU），表明无需像素级标注即可获得准确的大规模珊瑚分割。通过跨尺度和跨模态连接图像分类与分割，该方法为标注不可用场景下部署分割模型提供了高效解决方案，并为生态学及其他领域的可扩展、高效监测开辟了机会。

英文摘要

Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi-scale weakly supervised semantic segmentation (WSSS) framework that enables training high-resolution segmentation models from dense, classification-based outputs. Our method combines fine-scale, multi-label predictions from underwater imagery with broad-coverage aerial data. We convert these point-level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large-area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large-scale coral segmentation can be obtained without pixel-level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
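The two headline numbers in the abstract, pixel accuracy and mean Intersection over Union (mIoU), can both be derived from a confusion matrix. The sketch below uses the standard definitions on a toy 2x2 label map; it is an illustration, not the authors' evaluation code, and the choice to skip classes absent from both prediction and ground truth when averaging IoU is an assumption.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the true class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes that never occur (assumption)
            ious.append(tp / denom)
    return sum(ious) / len(ious)


target = [[0, 0],
          [1, 1]]
pred = [[0, 1],
        [1, 1]]
cm = confusion_matrix(pred, target, num_classes=2)
print(pixel_accuracy(cm), mean_iou(cm))
```

The gap between the two reported metrics (86.07% versus 52.23%) is typical of this pairing: pixel accuracy is dominated by large, frequent classes, while mIoU weights every class equally, so rare classes pull it down.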

2507.23372 2026-05-25 cs.CV Version update

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

UniEmo: 利用可学习专家查询统一情感理解与生成

Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie

Affiliations * Harbin Institute of Technology, Shenzhen; Great Bay University; School of Computing and Information Technology; Dongguan Key Laboratory for Intelligence and Information Technology; Shenzhen Loop Area Institute; Macao Polytechnic University

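AI Summary: This paper proposes UniEmo, a unified framework for emotional understanding and generation that couples the two tasks through learnable expert queries. A hierarchical emotional understanding chain progressively extracts multi-scale emotional features, which then guide a diffusion model to generate emotionally expressive images; an emotional correlation coefficient and an emotional condition loss are introduced to improve the diversity and fidelity of the generated images. Experiments show that UniEmo outperforms state-of-the-art methods on both emotional understanding and generation tasks.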

Comments Accepted to TIP 2026

详情

DOI: 10.1109/TIP.2026.3691017
Journal ref: IEEE Transactions on Image Processing, vol. 35, pp. 5165-5180, 2026

AI中文摘要

情感理解和生成通常被视为独立的任务，然而它们本质上是互补的，可以相互增强。在本文中，我们提出UniEmo，一个无缝集成这两个任务的统一框架。关键挑战在于情感的抽象性质，需要提取对两个任务都有益的视觉表示。为此，我们提出一个带有可学习专家查询的分层情感理解链，逐步提取多尺度情感特征，从而作为统一的基础步骤。同时，我们融合这些专家查询和情感表示，以指导扩散模型生成引发情感反应的图像。为了增强生成情感图像的多样性和保真度，我们进一步在融合过程中引入情感相关系数和情感条件损失。这一步骤促进了由理解引导的情感生成的融合与对齐。反过来，我们证明联合训练允许生成部分向理解部分提供隐式反馈。此外，我们提出一种新颖的数据过滤算法，以选择由训练良好的模型生成的高质量和多样化的情感图像，这些图像显式地反馈到理解部分。这些生成驱动的双重反馈过程共同增强了模型的理解能力。大量实验表明，UniEmo在情感理解和生成任务上均显著优于现有方法。所提出方法的代码可在 https://github.com/JiuTian-VL/UniEmo 获取。

英文摘要

Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

2507.12455 2026-05-25 cs.CV Version update

A European Multi-Centre Breast Cancer MRI Dataset

Gustav Müller-Franzes, Lorena Escudero Sánchez, Nicholas Payne, Alexandra Athanasiou, Michael Kalogeropoulos, Aitor Lopez, Alfredo Miguel Soro Busto, Julia Camps Herrero, Nika Rasoolzadeh, Tianyu Zhang, Ritse Mann, Debora Jutz, Maike Bode, Christiane Kuhl, Yuan Gao, Wouter Veldhuis, Oliver Lester Saldanha, JieFu Zhu, Jakob Nikolas Kather, Daniel Truhn, Fiona J. Gilbert

Affiliations * University of Cambridge; MITERA Hospital; Ribera Salud Group; Radboud University Medical Center; University Hospital RWTH Aachen; University Medical Center Utrecht; University Hospital Carl Gustav Carus; EKFZ Technical University Dresden

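AI Summary: This study presents a publicly available European multi-centre breast cancer MRI dataset, addressing the lack of large, diverse data for AI-assisted breast MRI diagnosis. The dataset contains 741 breast MRI examinations from six clinical institutions in five European countries, covers malignant, benign, and non-lesion cases, and was acquired with varied scanners and acquisition parameters, reflecting real clinical diversity. The study also reports benchmark experiments with a transformer-based model to illustrate potential uses of the dataset and to provide reference performance for future method comparisons.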

详情

AI中文摘要

早期检测乳腺癌对于改善患者预后至关重要。虽然乳腺X线摄影仍是主要筛查手段，但磁共振成像（MRI）越来越多地被推荐作为乳腺组织致密女性及高风险女性的补充工具。然而，多参数乳腺MRI的采集和解读耗时且需要专业知识，限制了其在临床实践中的可扩展性。人工智能（AI）方法在支持乳腺MRI解读方面显示出潜力，但其发展受到大型、多样化和公开可访问数据集可用性有限的阻碍。为弥补这一差距，我们提供了一个公开可用的多中心乳腺MRI数据集，该数据集收集自五个欧洲国家的六个临床机构。该数据集包含741例接受筛查或诊断性乳腺MRI的女性检查，包括恶性、良性和非病灶病例。数据使用异构扫描仪、场强和采集协议获取，反映了真实世界的临床变异性。此外，我们报告了使用基于Transformer模型的基线基准实验，以说明该数据集的潜在用例，并为未来的方法比较提供参考性能。

英文摘要

Early detection of breast cancer is critical for improving patient outcomes. While mammography remains the primary screening modality, magnetic resonance imaging (MRI) is increasingly recommended as a supplemental tool for women with dense breast tissue and those at elevated risk. However, the acquisition and interpretation of multiparametric breast MRI are time-consuming and require specialized expertise, limiting scalability in clinical practice. Artificial intelligence (AI) methods have shown promise in supporting breast MRI interpretation, but their development is hindered by the limited availability of large, diverse, and publicly accessible datasets. To address this gap, we present a publicly available, multi-centre breast MRI dataset collected across six clinical institutions in five European countries. The dataset comprises 741 examinations from women undergoing screening or diagnostic breast MRI and includes malignant, benign, and non-lesion cases. Data were acquired using heterogeneous scanners, field strengths, and acquisition protocols, reflecting real-world clinical variability. In addition, we report baseline benchmark experiments using a transformer-based model to illustrate potential use cases of the dataset and to provide reference performance for future methodological comparisons.

2505.17015 2026-05-25 cs.CV cs.CL Version update

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-SpatialMLLM: 多模态大语言模型的多帧空间理解

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

Affiliations * FAIR, Meta; The Chinese University of Hong Kong

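AI Summary: This paper proposes Multi-SpatialMLLM, a multi-modal large language model framework designed to strengthen multi-frame spatial understanding. By integrating fundamental spatial skills such as depth perception, visual correspondence, and dynamic perception, and by building the MultiSPA dataset of more than 27 million samples, the method substantially improves performance on multi-frame spatial tasks. Experiments show that Multi-SpatialMLLM outperforms baseline models and proprietary systems on a range of spatial tasks, exhibits multi-task benefits and generalization in challenging scenarios, and can serve as a multi-frame reward annotator for robotics.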

Comments CVPR 2026 Camera Ready. 27 pages. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

详情

AI中文摘要

多模态大语言模型（MLLMs）在视觉任务中取得了快速进展，但其空间理解仍局限于单张图像，使其不适合需要多帧推理的物理世界应用。在本文中，我们提出一个框架，通过整合基本空间技能（包括深度感知、视觉对应和动态感知）来赋予MLLMs多帧空间理解能力。我们设计了一个新颖的数据管道，并收集了包含超过2700万个样本的MultiSPA数据集，涵盖多样的3D和4D场景，以支持训练。除了MultiSPA，我们还引入了一个全面的基准测试，在统一的度量标准下测试广泛的空间任务。我们的最终模型Multi-SpatialMLLM在基线和专有系统上取得了显著提升，展示了可扩展和可泛化的多帧感知能力。我们进一步观察到多任务收益和在挑战性场景中的新兴空间能力，并展示了我们的模型如何作为机器人学的多帧奖励标注器。

英文摘要

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

2503.12868 2026-05-25 cs.CV Version update

UniReg: A Universal Model for Controllable CT Image Registration

UniReg: 一种用于可控CT图像配准的通用模型

Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Cheng Chen, Dakai Jin

Affiliations * The University of Hong Kong; DAMO Academy, Alibaba Group; The First Affiliated Hospital of Zhejiang University

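AI Summary: This paper proposes UniReg, a universal and controllable CT image registration model that addresses the poor generalization of existing methods across clinical scenarios and the need to train a separate network for each task. UniReg combines the precision of task-specific learning methods with the generalization of traditional optimization methods in a unified registration framework that adaptively estimates deformation fields conditioned on anatomical structure priors, registration type constraints, and instance-specific features. Experiments on multiple CT/MR registration datasets show higher average registration accuracy than state-of-the-art methods, along with substantially lower model redundancy and training cost.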

详情

AI中文摘要

基于学习的医学图像配准在匹配传统方法精度的同时，提供了优越的计算效率。然而，现有方法在不同临床场景中泛化能力差，需要为特定配准任务（如个体间/个体内配准或解剖区域特定对齐）开发多个孤立的网络，导致开发流程繁琐。为克服这一局限，我们提出了UniReg，首个用于多场景CT图像配准的条件统一模型，它结合了任务特定学习方法的精度优势和传统优化方法的泛化能力。我们的关键创新是一个统一的配准框架，该框架根据以下条件自适应估计变形场：（1）解剖结构先验，（2）配准类型约束（个体间/个体内），以及（3）实例特定特征，从而在单个模型中实现跨异构场景的最优对齐。通过在多个CT/MR配准数据集上的全面实验，UniReg相比当前最先进的基于学习方法取得了更优的平均配准精度，同时展现出强大的跨场景泛化能力。此外，通过用一个紧凑的统一模型替代多个孤立的任务特定模型，UniReg显著降低了总体训练负担，包括总训练成本和模型冗余。

英文摘要

Learning-based medical image registration has matched the accuracy of conventional methods while offering superior computational efficiency. However, existing approaches suffer from poor generalization across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or anatomical region-specific alignment, leading to cumbersome development pipelines. To overcome this limitation, we propose UniReg, the first conditional unified model for multi-scenario CT image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified registration framework that adaptively estimates deformation fields conditioned on: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling optimal alignment across heterogeneous scenarios within a single model. Through comprehensive experiments on multiple CT/MR registration datasets, UniReg achieves superior average registration accuracy compared with current state-of-the-art learning-based methods while exhibiting strong cross-scenario generalization. Moreover, by replacing multiple isolated task-specific models with a compact unified model, UniReg substantially reduces the overall training burden in terms of total training cost and model redundancy.

2503.06684 2026-05-25 cs.CV Version update

PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

PixelPonder: 动态补丁自适应增强多条件文本到图像生成

Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang

Affiliations * Fudan University; Tencent Youtu Lab; Western University; University of Chinese Academy of Sciences

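AI Summary: This paper proposes PixelPonder, a unified control framework for multi-conditional text-to-image generation that addresses the difficulty of reconciling semantic fidelity with visual quality across multiple heterogeneous control signals. A patch-level adaptive condition selection mechanism dynamically prioritizes spatially relevant control signals at the sub-region level, giving precise local guidance without global interference, while a time-aware control injection scheme adjusts condition influence according to the denoising timestep, moving gradually from structure preservation to texture refinement. Experiments show that PixelPonder outperforms existing methods on multiple benchmark datasets in both spatial alignment accuracy and textual semantic consistency.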

详情

AI中文摘要

最近基于扩散的文本到图像生成通过视觉条件控制取得了有希望的结果。然而，现有的ControlNet类方法在处理组合视觉条件时存在困难——在多个异质控制信号之间同时保持语义保真度，同时维持高视觉质量，它们采用独立的控制分支，在去噪过程中常常引入冲突的引导，导致生成图像中出现结构失真和伪影。为了解决这个问题，我们提出了PixelPonder，一种新颖的统一控制框架，允许在单一控制结构下有效控制多个视觉条件。具体来说，我们设计了一种补丁级自适应条件选择机制，在子区域级别动态优先考虑空间相关的控制信号，实现精确的局部引导而无需全局干扰。此外，部署了一种时间感知控制注入方案，根据去噪时间步调节条件影响，逐步从结构保留过渡到纹理细化，充分利用不同类别的控制信息以促进更和谐的图像生成。大量实验表明，PixelPonder在多个基准数据集上超越了先前的方法，在保持高文本语义一致性的同时，在空间对齐精度上显示出优越的提升。

英文摘要

Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning, that is, simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. They employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.

2503.05534 2026-05-25 cs.CV Version update

S4M: 4-points to Segment Anything

S4M: 4点分割一切

Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy

Affiliations * University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; Department of General Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan

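AI Summary: This paper proposes S4M, an extension of the Segment Anything Model (SAM) that addresses the reduced segmentation accuracy caused by ambiguous point prompts in medical image segmentation. It introduces a structured four-point prompting strategy that uses extreme points or major/minor axis endpoints as an instance-level shape description, making prompts more expressive. By expanding the prompt space and adding an auxiliary "Canvas" pretext task, S4M improves the model's geometric understanding; experiments on multiple ultrasound and surgical endoscopy datasets show improved segmentation performance and reduced clinical annotation effort.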

详情

DOI: 10.1007/s11548-026-03689-x

AI中文摘要

目的：Segment Anything Model (SAM) 有望缓解医学分割中的标注瓶颈，但重叠解剖结构和模糊边界使其点提示存在歧义，导致需要反复手动细化才能获得精确掩膜。需要更好的提示策略。方法：我们提出一种结构化提示策略，使用4个点作为紧凑的实例级形状描述。受超声测量实践启发，我们研究了两种4点变体：极值点和提出的长短轴端点。SAM无法充分利用此类结构化提示，因为它将所有点等同对待，缺乏几何感知推理。为解决此问题，我们引入S4M（4点分割一切），它增强SAM以将4点解释为关系线索而非孤立点击。S4M通过角色特定嵌入扩展提示空间，并添加辅助“画布”前置任务，直接从提示草绘粗略掩膜，促进几何感知推理。结果：在超声和手术内镜的八个数据集上，在相同提示预算下，S4M比强SAM基线提升+3.42 mIoU。与三位临床医生的标注研究进一步表明，长短轴提示可实现更快的标注。结论：S4M提高了性能，减少了标注工作量，并使提示与临床实践对齐，从而在医学影像中实现更可扩展的数据集开发。我们在https://github.com/CAMMA-public/S4M发布代码和预训练模型。

英文摘要

Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging. We release our code and pretrained models at https://github.com/CAMMA-public/S4M.
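To make the "extreme points" variant of the 4-point prompt concrete, the sketch below derives the topmost, bottommost, leftmost, and rightmost foreground pixels from a binary mask and labels each with its role. This is an illustration of the general notion of extreme points, not the S4M implementation: the function name, the `(row, col)` convention, and the tie-breaking (first pixel in row-major order) are assumptions, and the paper's major/minor axis endpoint variant is not covered.

```python
def extreme_points(mask):
    """Return the four extreme foreground pixels of a binary mask.

    `mask` is a list of rows of 0/1 values. Points are (row, col) tuples.
    Ties are broken by taking the first candidate in row-major order.
    """
    pixels = [(r, c)
              for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not pixels:
        raise ValueError("mask has no foreground pixels")
    return {
        "top": min(pixels, key=lambda p: p[0]),
        "bottom": max(pixels, key=lambda p: p[0]),
        "left": min(pixels, key=lambda p: p[1]),
        "right": max(pixels, key=lambda p: p[1]),
    }


mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
for role, point in extreme_points(mask).items():
    print(role, point)
```

Keeping the role attached to each point mirrors the abstract's observation that plain SAM treats all points identically, whereas a structured prompt carries information about which point marks which side of the object.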

2502.04415 2026-05-25 cs.CV cs.AI Version update

TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis

Affiliations * Dept. of Informatics and Telecommunications; National and Kapodistrian University of Athens; Archimedes/Athena RC

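AI summary: TerraQ is a spatiotemporal question-answering system for satellite image archives that quickly retrieves the satellite images matching a natural-language query. It combines natural language processing with a spatial knowledge base and supports complex queries over image metadata and geographic entities. Its core contribution is making Earth observation data more accessible and easier to search.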

2407.03535 2026-05-25 cs.CV Version update

BVI-RLV: A Fully Registered Dataset for Low-Light Video Enhancement

Ruirui Lin, Guoxi Huang, Joanne Lin, Qi Sun, Alexandra Malyugina, David R Bull, Nantheera Anantrasirichai

Affiliations * Visual Information Laboratory, Bristol Vision Institute (BVI), University of Bristol

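AI summary: Low-light videos often contain spatiotemporally incoherent noise that degrades visual clarity and the performance of computer vision tasks. To address the lack of high-quality aligned training data for deep-learning-based enhancement of such content, the paper presents the BVI-RLV dataset, which contains more than 30k paired low-light and normal-light frames from 40 scenes with sub-pixel registration. The dataset covers dynamic motion scenarios and comes with baseline implementations of several models. Experiments show that registration substantially benefits supervised learning, and that models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluation.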

Comments arXiv admin note: text overlap with arXiv:2402.01970

Details

Abstract

Low-light videos often exhibit spatiotemporally incoherent noise, compromising visibility and degrading performance in computer vision applications. A major challenge for enhancing such content using deep learning lies in the scarcity of pixel-aligned, high-quality training data. We introduce BVI-RLV, a fully registered low-light video dataset comprising over 30k paired frames from 40 diverse scenes under two low-light conditions, each aligned with normal-light ground truth. Unlike existing datasets that rely on neutral density (ND) filters or suffer from misalignment issues, BVI-RLV achieves sub-pixel registration for 99.24% of data at full HD resolution across dynamic motion scenarios using a motorized dolly and image-based refinement. The dataset covers a wide range of motion types and realistic temporal noise. We also provide baseline implementations using four representative architectures: Convolutional Neural Network (CNN), Transformer, State Space Model (Mamba), and Diffusion Model (DM). Experiments demonstrate that registration is crucial for supervised learning, yielding up to 5.85 dB PSNR improvement compared to unregistered training. Models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluations, achieving superior performance even in real-world outdoor scenes. Our dataset is publicly available at https://doi.org/10.21227/mzny-8c77.
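
The reported gain is expressed in PSNR, so a short reminder of how that metric is computed may help when reading the numbers. The sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE) on plain nested lists; the function names and the 8-bit default peak value are assumptions made for illustration, not part of the BVI-RLV evaluation code.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    ref = [v for row in reference for v in row]
    tst = [v for row in test for v in row]
    if len(ref) != len(tst) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, tst)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a gain of 5.85 dB corresponds to the mean squared error dropping by a factor of roughly 3.8, which can be checked with `mse_ratio_for_gain(5.85)`.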

2605.22942 2026-05-25 cs.CV Version update

Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

Borja Carrillo-Perez

Affiliations * Arquimea Research Center

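AI summary: For the MaCVi 2026 Vision-to-Chart Data Association Challenge, this report proposes a lightweight improvement to the DETR-based fusion transformer baseline. A dedicated multilayer perceptron (QueryMLP) explicitly predicts each buoy's waterline contact point in the image from chart measurements and IMU pose data, giving every buoy a direct spatial prior and relieving the transformer decoder of part of the geometric reasoning. The method reaches an overall score of 0.7386 (F1 = 0.8055, mIoU = 0.6718) on the test set, placing second among the challenge submissions.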

Comments 5 pages, 3 figures. Technical report for the MaCVi 2026 Vision-to-Chart Data Association Challenge at the CVPR 2026 Workshop; 2nd place submission. Code: https://github.com/bcarrpe/macvi26-visionmap-querymlp

2605.22907 2026-05-25 cs.CV Version update

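A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement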

Haoning Xue, Jingwen Zhang, Xiaohui Wang, Diane Dagyong Kim, Yunya Song

Affiliations * Department of Communication, University of Utah; Department of Communication, University of California, Davis; Department of Media and Communication, City University of Hong Kong; Division of Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology

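AI summary: The paper proposes a model that computes the Message Sensation Value (MSV) of short videos from multimodal features in order to predict viewers' sensory and behavioral engagement. The model combines multimodal feature analysis with human evaluation of 1,200 short videos and is validated on unseen data from three short-video platforms (combined N = 14,492). MSV is positively associated with sensory engagement but has an inverted U-shaped relationship with behavioral engagement. The study deepens the theoretical understanding of short-video engagement and provides a computational tool for short-video research.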

Details

DOI: 10.5117/CCR2026.1.8.XUE

Abstract

The contemporary media landscape is characterized by sensational short videos. While prior research examines the effects of individual multimodal features, the collective impact of multimodal features on viewer engagement with short videos remains unknown. Grounded in the theoretical framework of Message Sensation Value (MSV), this study develops and tests a computational model of MSV with multimodal feature analysis and human evaluation of 1,200 short videos. This model that predicts sensory and behavioral engagement was further validated across two unseen datasets from three short video platforms (combined N = 14,492). While MSV is positively associated with sensory engagement, it shows an inverted U-shaped relationship with behavioral engagement: Higher MSV elicits stronger sensory stimulation, but moderate MSV optimizes behavioral engagement. This research advances the theoretical understanding of short video engagement and introduces a robust computational tool for short video research.
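
An inverted U-shaped relationship is commonly examined by adding a squared term to a regression and checking that its coefficient is negative. The sketch below shows that idea with a plain least-squares quadratic fit. It is an assumption about how such a pattern could be checked, not the authors' analysis, and the function names and the toy data in the usage note are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs")
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix; unknowns are ordered (c, b, a).
    A = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for col in range(i, n + 1):
                A[r][col] -= f * A[i][col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (A[i][n] - rest) / A[i][i]
    c, b, a = coef
    return a, b, c


def turning_point(a, b):
    """x-value where an inverted U (a < 0) reaches its maximum."""
    if a >= 0:
        raise ValueError("curve is not an inverted U (a must be negative)")
    return -b / (2.0 * a)
```

For toy data such as `xs = [0, 1, 2, 3, 4, 5, 6]` and `ys = [-(x - 3) ** 2 + 9 for x in xs]`, the fitted `a` is negative and `turning_point(a, b)` gives the level of the predictor at which the outcome peaks.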

2503.20066 2026-05-25 cs.RO cs.CV Version update

Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

Zhirui Dai, Hojoon Shin, Yulun Tian, Ki Myung Brian Lee, Nikolay Atanasov

Affiliations * Department of Electrical and Computer Engineering, University of California San Diego; Brain Corporation; Robotics Department, University of Michigan

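AI summary: The paper proposes a new neural implicit representation, the signed directional distance function (SDDF), to address efficiency and accuracy problems in 3D reconstruction and differentiable rendering. SDDF takes a position and a viewing direction as input and directly outputs the distance to the surface, which enables efficient and accurate geometric reconstruction. To make learning efficient, the authors combine explicit ellipsoid priors with implicit neural residuals in a differentiable hybrid representation that handles distance discontinuities at obstacle boundaries. The reported results are competitive prediction accuracy, faster prediction than SDF and NeRF, and better geometric consistency than NeRF and Gaussian Splatting.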

Details

DOI: 10.1109/tpami.2026.3688658
Journal ref: 2026 IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

Dense reconstruction and differentiable rendering are fundamental tightly connected operations in 3D vision and computer graphics. Recent neural implicit representations demonstrate compelling advantages in reconstruction fidelity and differentiability over conventional discrete representations such as meshes, point clouds, and voxels. However, many neural implicit models, such as neural radiance fields (NeRF) and signed distance function (SDF) networks, are inefficient in rendering due to the need to perform multiple queries along each camera ray. Moreover, NeRF and Gaussian Splatting methods offer impressive photometric reconstruction but often require careful supervision to achieve accurate geometric reconstruction. To address these challenges, we propose a novel representation called signed directional distance function (SDDF). Unlike SDF and similar to NeRF, SDDF has a position and viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides distance to the observed surface rather than integrating along the view ray. As a result, SDDF achieves accurate geometric reconstruction and efficient differentiable directional distance prediction. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This allows the model to handle distance discontinuities around obstacle boundaries effectively while preserving the ability for dense high-fidelity distance prediction. Through extensive evaluation against state-of-the-art representations, we show that SDDF achieves (i) competitive SDDF prediction accuracy, (ii) faster prediction speed than SDF and NeRF, and (iii) superior geometric consistency compared to NeRF and Gaussian Splatting.
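
The efficiency argument can be illustrated on a single sphere, the simplest ellipsoid. With an SDF, a ray is rendered by sphere tracing, which queries the function repeatedly along the ray. A directional distance function answers the same question in one evaluation. The sketch below is a toy analytic example, not the paper's model: the function names are hypothetical, only ray origins outside the sphere are handled, and the sign convention for interior points is deliberately left out.

```python
import math


def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere surface (negative inside)."""
    return math.dist(p, center) - radius


def sphere_directional_distance(p, v, center, radius):
    """Distance from p along unit direction v to the sphere, in one evaluation.

    Assumes p lies outside the sphere; returns math.inf if the ray misses.
    """
    oc = [pi - ci for pi, ci in zip(p, center)]
    b = sum(o * d for o, d in zip(oc, v))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return math.inf
    t = -b - math.sqrt(disc)  # nearest intersection along the ray
    return t if t >= 0 else math.inf


def sphere_trace(p, v, center, radius, eps=1e-6, max_steps=256):
    """March along the ray using SDF queries; returns (distance, queries used)."""
    t = 0.0
    for step in range(1, max_steps + 1):
        q = [pi + t * vi for pi, vi in zip(p, v)]
        d = sphere_sdf(q, center, radius)
        if d < eps:
            return t, step
        t += d  # safe step: no surface is closer than d
    return math.inf, max_steps
```

For an off-center ray such as origin `(0.9, 0.0, -3.0)` with direction `(0.0, 0.0, 1.0)` toward a unit sphere at the origin, both routines agree on the hit distance up to the tracing tolerance, but sphere tracing needs many SDF queries while the directional form needs one.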
