arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23903 2026-05-25 cs.CV Updated version

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He

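Affiliations: USTC (University of Science and Technology of China); Shanghai AI Lab; ZJU (Zhejiang University)

AI summary: Geo-Align is a reinforcement-learning framework for video re-rendering. It targets the weak control over physical scale and camera trajectory that existing methods show on real-world videos. It uses a metric 3D estimator to extract precise camera trajectories from generated videos and combines this with a perceptual reward to optimize the model, penalizing rotation and translation deviations. Experiments show that Geo-Align outperforms existing supervised baselines in camera controllability and visual fidelity, and that it works without paired data.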

English abstract

Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.
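
To make the reward described in the abstract above concrete, here is a minimal Python sketch of a trajectory-deviation reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weights `w_rot` and `w_trans`, and the choice of geodesic rotation error plus Euclidean translation error are all hypothetical. In practice the predicted poses would come from a metric 3D estimator run on the generated video; here they are plain NumPy arrays.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def trajectory_reward(R_pred, t_pred, R_gt, t_gt, w_rot=1.0, w_trans=1.0):
    """Negative weighted sum of mean rotation and translation errors.

    R_*: (T, 3, 3) per-frame rotations; t_*: (T, 3) translations in metric units.
    The weights are hypothetical and would need tuning.
    """
    rot_err = np.mean([rotation_error_deg(a, b) for a, b in zip(R_pred, R_gt)])
    diff = np.asarray(t_pred) - np.asarray(t_gt)
    trans_err = np.mean(np.linalg.norm(diff, axis=1))
    return -(w_rot * rot_err + w_trans * trans_err)

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Target trajectory: a static camera over four frames.
R_gt = np.stack([np.eye(3)] * 4)
t_gt = np.zeros((4, 3))

# A generated video whose estimated camera is rotated by 5 degrees
# and shifted by 0.5 units along every axis gets a negative reward.
reward = trajectory_reward(np.stack([rot_z(5.0)] * 4), t_gt + 0.5, R_gt, t_gt)
```

Because translation is measured in metric units, a reward of this shape is only meaningful if the estimator recovers absolute scale, which is the motivation the abstract gives for using a metric 3D estimator.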

2605.23902 2026-05-25 cs.CV Updated version

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

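Affiliations: NVIDIA

AI summary: This paper proposes PiD, a pixel diffusion decoder that addresses the inefficiency and limited detail of conventional latent decoding in high-resolution text-to-image generation. PiD reformulates latent decoding as a conditional pixel diffusion process that unifies decoding and upsampling. It denoises directly in high-resolution pixel space and produces 4x and even 8x upscaled images at low latency. With a lightweight sigma-aware adapter and DMD2 distillation, PiD runs faster while preserving visual quality. It applies to several latent representations, including conventional VAE latents and the semantic latents of recent RAE-based models.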

Comments Project Page: https://research.nvidia.com/labs/sil/projects/pid/

English abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.

2605.23897 2026-05-25 cs.CV cs.AI cs.CL Updated version

ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

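Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Shanghai Innovation Institute

AI summary: Multimodal large language models have advanced visual reasoning, but a purely textual chain of thought remains a bottleneck for questions that need fine-grained focus or view transformations. The paper proposes ETCHR, a question-conditioned image editor decoupled from the understanding model. It is trained with a two-stage recipe that addresses the language-side and generation-side gaps in turn, improving edit correctness and reasoning accuracy. Experiments show that ETCHR raises average Pass@1 by 4.61 to 5.47 points across five task families and plugs into different open- and closed-source models without further training.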

Comments Code, model and data are open-sourced at https://github.com/InternLM/ETCHR

English abstract

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model that is decoupled from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.

2605.23895 2026-05-25 cs.CV Updated version

From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain

Yuval Golbari, Navve Wasserman, Matias Cosarinsky, Roman Beliy, Aude Oliva, Antonio Torralba, Michal Irani, Tamar Rott Shaham

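Affiliations: Weizmann Institute of Science; Massachusetts Institute of Technology

AI summary: This paper studies how to identify the brain regions that represent a specific visual concept. It proposes BrainCause, an automated framework that combines generative models with brain encoding models to synthesize controlled stimuli and run causal validation. This separates regions that genuinely represent a concept from regions driven only by correlated visual or semantic cues. The method recovers known functional localizations and finds new candidate representations. The validation shows that relying on activation strength alone yields a large fraction of false positives.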

English abstract

Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse functional regions (e.g., faces, places) through activation maximization, identifying regions that activate strongly for a target concept relative to other concepts. Yet strong activation alone does not establish that a region represents the concept itself, as responses may instead be driven by correlated visual or semantic cues. We introduce BrainCause, an automated framework that combines generative and brain models to synthesize controlled stimuli and validate neural representations through targeted causal testing. Given a query specifying a concept of interest, our framework constructs targeted stimulus sets comprising concept images, counterfactual edits that remove the target concept while preserving other image content, and images with candidate correlated distractors. It then uses an image-to-fMRI encoding model to predict brain responses and searches for representations that respond specifically to the target concept over correlated alternatives. BrainCause returns validated candidate representations and proposes follow-up fMRI experiments to further test or extend its discoveries. Our approach successfully recovers known functional localizations and identifies new candidate representations across dozens of concepts, validated on both predicted and measured fMRI data. Critically, we show that without causal validation, a large fraction of localizations would be false positives, confirming that activation alone is insufficient evidence of representation.
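
The contrast between activation-only localization and the causal test described above can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the BrainCause pipeline: the function names, the thresholds, and the rule "the concept response must beat both controls by a margin" are hypothetical, and the response arrays stand in for the output of an image-to-fMRI encoding model.

```python
import numpy as np

def activation_only(concept, threshold=1.0):
    """Regions whose mean response to concept images exceeds a threshold."""
    return np.flatnonzero(concept.mean(axis=0) > threshold)

def causally_validated(concept, counterfactual, distractor, margin=0.5):
    """Regions that respond to the concept but not to either control set.

    Each array has shape (n_images, n_regions) and holds predicted responses.
    counterfactual: the same images with the target concept edited out.
    distractor: images containing only correlated, non-target content.
    """
    c = concept.mean(axis=0)
    controls = np.maximum(counterfactual.mean(axis=0), distractor.mean(axis=0))
    return np.flatnonzero(c - controls > margin)

# Region 0 is concept-specific, region 1 is driven by a correlated cue,
# region 2 is unresponsive.
concept = np.array([[2.0, 2.0, 0.1], [2.2, 1.8, 0.0]])
counterfactual = np.array([[0.1, 0.2, 0.0], [0.0, 0.1, 0.1]])
distractor = np.array([[0.2, 1.9, 0.1], [0.1, 2.0, 0.0]])

naive = activation_only(concept)
validated = causally_validated(concept, counterfactual, distractor)
```

In this toy data, region 1 passes the activation-only test but fails validation because the distractor images drive it just as strongly, which is the kind of false positive the abstract warns about.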

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

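Affiliations: University of Toronto & Vector Institute; Google; Technical University of Munich

AI summary: Visual geometry transformers perform well on multi-view 3D reconstruction, but their compute cost grows quadratically with input sequence length, which limits efficiency and scalability. This paper proposes a simple, general remedy: limit the number of key/value tokens each query interacts with in global attention. The method has two stages. It first selects which frames to keep, favoring diversity to preserve scene coverage. It then removes redundant tokens within the kept frames, using a layer-aware sparsification strategy guided by attention entropy. Experiments show that the method accelerates visual geometry transformers by over 85% on 500-image scenes while maintaining or improving baseline performance.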

Comments Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt

English abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards the more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy could play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
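
The two ingredients named in the abstract above, diversity-based frame selection and an entropy signal over attention, can be sketched as follows. This is a hedged illustration only: greedy farthest-point sampling is swapped in as one common diversity heuristic, and the abstract does not say that the paper uses it. The per-frame descriptors, function names, and the use of mean row entropy are all assumptions.

```python
import numpy as np

def select_diverse_frames(frame_feats, k):
    """Greedy farthest-point sampling over per-frame descriptors of shape (N, D).

    Starts from frame 0 and repeatedly adds the frame farthest from
    everything selected so far.
    """
    selected = [0]
    dist = np.linalg.norm(frame_feats - frame_feats[0], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        new_dist = np.linalg.norm(frame_feats - frame_feats[nxt], axis=1)
        dist = np.minimum(dist, new_dist)
    return selected

def attention_entropy(attn):
    """Mean Shannon entropy (in nats) of attention rows that each sum to 1."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

# Five frames along a line; near-duplicates sit at 0/0.1 and 5/5.1.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [10.0, 0.0]])
kept = select_diverse_frames(feats, k=3)

uniform = np.full((2, 4), 0.25)            # diffuse attention: high entropy
peaked = np.eye(4)[:2]                     # one-hot attention: near-zero entropy
h_uniform = attention_entropy(uniform)
h_peaked = attention_entropy(peaked)
```

A layer whose attention is close to one-hot concentrates on few keys, so one plausible reading is that such layers tolerate more aggressive key/value pruning than layers with diffuse attention; how the paper maps entropy to a per-layer budget is not specified in the abstract.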

2605.23891 2026-05-25 cs.CV Updated version

Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Xiao Cao, Yansong Qu, Xiangzhen, Chang, Wen Xiao, Jiakui Hu, Heyuan Li, Jialun Liu, Zhiyong Huang, Xuelong Li

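AI summary: This paper proposes Smart-Insertion-V, an end-to-end dual-stream framework for mask-free video object insertion. An image stream synchronously guides video generation, and a closed-loop feedback mechanism makes insertion more robust. Dual-World-View RoPE and a Decoupled Guidance Module address feature entanglement and style leakage, and improve semantic alignment and style adaptation. Experiments indicate that the method places objects at plausible positions and produces the most harmonious results among the compared methods.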

English abstract

Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose Smart-Insertion-V, an end-to-end Dual-Stream framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a Closed-loop Feedback mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design Dual-World-View RoPE to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a Decoupled Guidance Module that leverages a Vision-Language Model for semantic reasoning while preserving the original temporal guidance with the native text encoder. To bridge the data gap for the harmonious reference insertion task, we propose a data curation pipeline and will release an open-source dataset. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.

2605.23889 2026-05-25 cs.CV Updated version

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang

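Affiliations: HKUST(GZ) (Hong Kong University of Science and Technology); Horizon Robotics; CASIA (Institute of Automation, Chinese Academy of Sciences); CSU

AI summary: HorizonStream is a long-horizon attention model for streaming 3D reconstruction. It targets the drift, jitter, and collapse seen in online reconstruction, which the authors attribute to a mismatch between temporally heterogeneous geometric evidence and uniform influence patterns. It formalizes geometric propagation as an evidence influence kernel and factorizes it into a long-range temporal factor, handled by Geometric Linear Attention, and a short-range spatial factor, handled by Geometric Local Attention. This gives multi-timescale propagation of geometric evidence and stable spatial matching. Trained on only 48-frame clips, HorizonStream stably handles sequences of more than 10,000 frames.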

English abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
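
The idea of linear attention with channel-wise decay, and why it gives constant memory and linear time, can be sketched with a generic recurrence. This is a textbook-style illustration, not the paper's Geometric Linear Attention: the unnormalized form, the fixed (not learned) decay vector, and all names are assumptions made for the example.

```python
import numpy as np

def decayed_linear_attention(q, k, v, decay):
    """Causal linear attention with a per-channel exponential decay.

    q, k: (T, Dk); v: (T, Dv); decay: (Dk,) with entries in (0, 1).
    The state has a fixed (Dk, Dv) shape, so memory does not grow with T
    and the cost is linear in the sequence length.
    """
    T, Dk = k.shape
    Dv = v.shape[1]
    state = np.zeros((Dk, Dv))
    out = np.zeros((T, Dv))
    for t in range(T):
        # Each key channel forgets at its own rate, then absorbs the new frame.
        state = decay[:, None] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))
# Small values forget quickly (short-lived evidence); values near 1 persist.
decay = np.array([0.1, 0.5, 0.9, 0.99])
out = decayed_linear_attention(q, k, v, decay)
```

Unrolling the recurrence shows that frame s influences frame t through channel c with weight decay[c] ** (t - s), so a mix of decay rates gives a mix of timescales, which is one simple way to read the "multi-timescale propagation" in the abstract.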

2605.23888 2026-05-25 cs.CV Updated version

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Katharina Schmid, Nicolas von Lützow, Jozef Hladký, Angela Dai, Matthias Nießner

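Affiliations: Technical University of Munich; Computing Systems Lab, Huawei Technologies

AI summary: This paper presents GenRecon, a method for high-fidelity multi-view 3D scene reconstruction built on a generative prior. It splits the scene into local, overlapping chunks and runs conditional generation on each, which scales reconstruction to large scenes. Using the generative shape model Trellis.2 as the prior, it proposes a projection-based conditioning mechanism that lifts multi-view image features into a 3D representation aligned with the generative model, yielding multi-view consistent geometry. The method reconstructs indoor environments and outperforms existing reconstruction methods by 16%.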

Comments Project page: https://kasothaphie.github.io/GenRecon/

English abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.
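
The phrase "spatially-localized, overlapping chunks that together tile the scene" can be illustrated with a small tiling routine. This is a sketch under assumptions: cubic axis-aligned chunks, a fixed overlap, and the function name are hypothetical choices, and the abstract does not state how GenRecon actually lays out its chunks.

```python
import numpy as np

def chunk_origins(extent_min, extent_max, chunk_size, overlap):
    """Min corners of overlapping cubic chunks that cover an axis-aligned box.

    Consecutive chunks along an axis share `overlap` units, so the stride
    between origins is chunk_size - overlap.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    axes = []
    for lo, hi in zip(extent_min, extent_max):
        # Smallest n such that lo + stride * (n - 1) + chunk_size >= hi.
        n = max(1, int(np.ceil((hi - lo - overlap) / stride)))
        axes.append(lo + stride * np.arange(n))
    grid = np.meshgrid(*axes, indexing="ij")
    return np.stack([g.ravel() for g in grid], axis=1)

# A 10 x 4 x 3 room tiled by chunks of side 4 with an overlap of 1.
origins = chunk_origins((0, 0, 0), (10, 4, 3), chunk_size=4, overlap=1)
```

For this room the routine returns three chunks along x (starting at 0, 3, and 6) and one along each of y and z. The overlap gives neighboring chunks shared context, which is one plausible reason to prefer overlapping over disjoint tiles when the per-chunk outputs must agree at the seams.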

2605.23883 2026-05-25 cs.CV cs.AI Updated version

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano

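Affiliations: Mila - Québec AI Institute; FAIR at Meta Superintelligence Labs; McGill University; Canada CIFAR AI Chair

AI summary: Despite notable progress, multimodal large language models (MLLMs) still fall short on fine-grained understanding tasks. This paper proposes Procedurally Generated Tasks (PGT), a framework that overlays unambiguous geometric primitives on images to produce dense supervision. PGT improves visual grounding and also serves as a low-cost diagnostic tool for locating the source of perception failures. Experiments show gains of up to +20% on What'sUp and +13.3% on CV-Bench-2D, suggesting that the fine-grained perception bottleneck can be addressed with better supervision signals.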

English abstract

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.
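
A procedurally generated task of the kind described above can be sketched in a few lines: draw primitives at known positions, then derive the question and answer from those positions. This is a hypothetical example, not the PGT generator. The shapes, colors, question template, and function name are assumptions.

```python
import numpy as np

def overlay_primitives(image, square_center, circle_center, radius):
    """Draw a red square and a blue disk on a copy of an RGB image.

    Centers are (row, col) pixel coordinates assumed to lie at least
    `radius` pixels inside the image. Returns the edited image plus a
    question/answer pair derived from the known positions.
    """
    out = image.copy()
    h, w = out.shape[:2]
    sy, sx = square_center
    cy, cx = circle_center
    out[sy - radius:sy + radius, sx - radius:sx + radius] = (255, 0, 0)
    yy, xx = np.ogrid[:h, :w]
    out[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = (0, 0, 255)
    question = "Is the red square to the left of the blue circle?"
    answer = "yes" if sx < cx else "no"
    return out, question, answer

background = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in for a real photo
edited, question, answer = overlay_primitives(background, (32, 10), (32, 50), 5)
```

Because the label comes from the drawing coordinates rather than from what the underlying photo depicts, a model cannot answer from semantic priors alone, which matches the abstract's point about disentangling grounding from priors.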

2605.23878 2026-05-25 cs.CV Updated version

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

Bo Jiang, Depu Meng, Yihan Hu, Yichen Xie, Tianshuo Xu, Wei Zhan

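Affiliations: Applied Intuition; University of California, Berkeley

AI summary: Modern video generation models produce visually appealing videos but still fall short on physical and motion consistency, which limits their use as reliable world simulators. This paper proposes LaMo, a self-supervised latent motion prior that extracts motion cues from unlabeled training videos and improves physical realism without external simulators or physics data. LaMo exposes the prior through two lightweight readouts: a Motion Drift Loss used during training and Motion Prior Guidance used during sampling. Both plug into existing video diffusion models. LaMo improves physical consistency on VideoPhy and VideoPhy2 and improves motion-related dimensions on VBench.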

Comments Project Page: https://lamo-ai.github.io/

English abstract

Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

2605.23868 2026-05-25 cs.CV Updated version

Vision Transformers Need Better Token Interaction

Linxiang Su

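Affiliations: University of Szeged

AI summary: Vision Transformers (ViTs) learn strong image-level representations, but after prolonged training their patch representations become less effective for dense prediction. This paper analyzes the phenomenon and argues that it stems not only from high-norm artifacts but also from semantic diffusion, in which global semantic information spreads across patches beyond what is locally justified. The author proposes sparse attention (entmax-1.5 in place of softmax) to make token interactions more selective while keeping global connectivity, which improves ViT performance on dense prediction tasks such as semantic segmentation.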

Comments 7 pages

English abstract

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
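
For readers unfamiliar with entmax-1.5, the following NumPy sketch shows how it differs from softmax on a single row of attention scores. It uses the standard definition, p_i = max(z_i / 2 - tau, 0) ** 2 with tau chosen so that the entries sum to 1, and finds tau by bisection. The bisection solver and the function names are choices made for this illustration; the abstract does not say which implementation the paper uses.

```python
import numpy as np

def entmax15(z, n_iter=60):
    """1.5-entmax of a 1-D score vector, solved by bisection on tau."""
    half = np.asarray(z, dtype=float) / 2.0
    # At tau = max - 1 the largest entry alone contributes 1, so the sum is >= 1.
    # At tau = max every entry is clipped to 0, so the sum is 0.
    lo, hi = half.max() - 1.0, half.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        if np.sum(np.maximum(half - tau, 0.0) ** 2) >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.maximum(half - lo, 0.0) ** 2
    return p / p.sum()

def softmax(z):
    """Numerically stable softmax, for comparison."""
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

scores = np.array([3.0, 0.0, -3.0])
sparse = entmax15(scores)   # low-scoring keys get exactly zero weight
dense = softmax(scores)     # every key keeps some weight
```

On these scores entmax-1.5 puts all weight on the first key, while softmax still spreads a few percent over the others. Every key remains eligible in principle, so global connectivity is preserved, but each query mixes in fewer tokens, which is the selectivity the abstract argues for.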

2605.23861 2026-05-25 cs.LG cs.AI cs.CV Updated version

Leveraging Foundation Models for Causal Generative Modeling

Aneesh Komanduri, Xintao Wu

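Affiliations: University of Arkansas

AI summary: This paper studies how to use pretrained foundation models for causal generative modeling, with the aim of improving counterfactual reasoning in AI systems. It proposes FM-CGM, a modular framework that performs end-to-end visual causal reasoning through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. The method combines a large reasoning model with a text-to-image diffusion model and introduces Causal Semantic Guidance, supporting zero-shot causal discovery and counterfactual image generation.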

English abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
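
The requirement that an intervention should "propagate to descendant concepts while preserving invariant regions" rests on a simple graph computation, sketched below. This is a toy illustration with a hand-written causal graph. In FM-CGM the graph would come from a reasoning model and the edit would be applied through cross-attention guidance, neither of which is modeled here. All names and the example concepts are hypothetical.

```python
from collections import deque

def descendants(graph, node):
    """All nodes reachable from `node` in a DAG given as {parent: [children]}."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        cur = queue.popleft()
        if cur not in seen:
            seen.add(cur)
            queue.extend(graph.get(cur, []))
    return seen

def plan_intervention(graph, concepts, target, new_value):
    """Split concepts into: the edited one, those that may change, and invariants."""
    updated = dict(concepts)
    updated[target] = new_value
    affected = descendants(graph, target)
    invariant = set(concepts) - affected - {target}
    return updated, affected, invariant

graph = {"weather": ["ground", "sky"], "ground": ["reflections"]}
concepts = {"weather": "sunny", "ground": "dry", "sky": "clear",
            "reflections": "none", "building": "brick"}
updated, affected, invariant = plan_intervention(graph, concepts, "weather", "rainy")
```

Here changing the weather marks the ground, sky, and reflections as free to change, while the building is flagged as invariant and should look the same in the counterfactual image.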

2605.23845 2026-05-25 cs.CV Updated version

Learning a Particle Dynamics Model with Real-world Videos

Chanho Kim, Suhas V. Sumukh, Li Fuxin

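Affiliations: Oregon State University

AI summary: This paper proposes a method for learning a particle dynamics model from unlabeled real-world videos, aiming to overcome the limits that traditional physics simulators and synthetic-data world models face in real scenes. Built on a Gaussian splatting framework, the method uses rendering supervision to learn the position and rotation changes of dense Gaussian particles directly, without particle-level labels. The authors also present a real-world dataset of about 500 videos covering diverse object interactions.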

Comments CVPR 2026 Findings

English abstract

Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.

2605.23840 2026-05-25 cs.CV Updated version

MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry

Adam Tlemsani, Yingdian Li, Maxime Giot, Naim Slim, Christopher J. Peters, Abhijeet Ghosh, Daniel S. Elson

Affiliations: Department of Computing, Imperial College London; Hamlyn Centre for Robotic Surgery, Imperial College London; Department of Surgery and Cancer, Imperial College London; Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: This study proposes MuellerPT, a physics-guided pre-training method for Mueller polarimetric imaging of biomedical tissue, where supervised learning is hindered by scarce annotations and strong domain shifts. By predicting Lu-Chipman decomposition maps from each pixel's 4x4 Mueller matrix, the method learns transferable dense representations and improves few-shot segmentation and classification. Experiments show better label efficiency and cross-specimen transfer than models without pre-training.

Comments Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

详情
AI中文摘要

穆勒矩阵成像为生物医学组织分析提供了丰富、物理上有意义的对比度,但监督学习受到稀疏密集标注和跨样本及采集设置强域偏移的阻碍。我们提出MuellerPT,一种物理引导的预训练方法,通过从逐像素4x4穆勒矩阵预测Lu-Chipman分解图来学习可迁移的密集表示。为了扩展预训练,我们收集了新的多光谱动物偏振器官数据集(MAP-Org)。预训练编码器通过分割头适应于羔羊脑灰质与白质分割,并使用分类头进行结直肠癌与非癌分类。分割和分类均在少样本学习场景下评估。在分割中,与无预训练模型相比,MuellerPT提高了标签效率和跨样本迁移,在使用5%训练数据时,相比从头训练的基线实现了超过20%的绝对DICE增益。在分类中,MuellerPT也增强了标签效率,在使用1%训练数据时,相比基线总体准确率提高了8%。我们通过对离体人类食管样本预测的Lu-Chipman图进行定性评估,证明了MuellerPT对域偏移的鲁棒性。这些结果表明,预测Lu-Chipman分解是从穆勒偏振测量中进行鲁棒生物医学推断的有效且实用的预文本任务,并为未来标签高效穆勒成像的工作铺平了道路。

英文摘要

Mueller matrix imaging provides rich, physically meaningful contrast for biomedical tissue analysis, but supervised learning is hindered by scarce dense annotations and strong domain shifts across specimens and acquisition settings. We introduce MuellerPT, a physics guided pre-training approach that learns transferable dense representations by predicting Lu-Chipman decomposition maps from per-pixel 4x4 Mueller matrices. To scale pre-training, we collected a new large Multispectral Animal Polarimetric Organ dataset (MAP-Org). The pre-trained encoder is adapted with a segmentation head for grey vs. white matter segmentation in lamb brain. A classification head is used for colorectal cancer vs. non-cancer classification. Both segmentation and classification are evaluated across few-shot learning scenarios. In segmentation, MuellerPT improves label efficiency and cross specimen transfer compared to models without pre-training, achieving an absolute DICE gain of over 20% compared to the baseline trained from scratch when using 5% of the training data. In classification, MuellerPT also enhances label efficiency, improving overall accuracy by 8% compared to the baseline when using 1% of the training data. We demonstrate MuellerPT's robustness to domain shift with a qualitative evaluation of its predicted Lu-Chipman maps on an ex vivo human oesophagus sample. These results suggest that predicting Lu-Chipman decomposition is an effective and practical pretext task for robust biomedical inference from Mueller polarimetry and can pave the way for future work on label efficient Mueller imaging.

2605.23826 2026-05-25 cs.CV cs.CL (version update)

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

Affiliations: University of Illinois at Urbana-Champaign

AI summary: This paper studies how to retrieve keyframes from long videos to support question answering and proposes ToolMerge, a keyframe retrieval method based on decomposing queries into tool calls and merging the results. An LLM decomposes the query into multiple tool calls, and the per-tool rankings are merged with boolean operators to locate relevant frames more precisely. Experiments on the authors' M2M benchmark show that ToolMerge performs strongly across several tasks and, on caption retrieval in particular, outperforms other methods by 5%.

Abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, where it outperforms other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge.
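
As a rough illustration of merging per-tool rankings with boolean operators, the sketch below treats each tool's output as a per-frame relevance score in [0, 1] and combines scores with fuzzy logic (min for AND, max for OR). The fuzzy-logic choice, the function names, and the example scores are all assumptions made for illustration; the abstract does not specify how ToolMerge implements its merge.

```python
from typing import Dict, List

Scores = Dict[int, float]  # frame index -> relevance score in [0, 1]


def merge_and(a: Scores, b: Scores) -> Scores:
    """Fuzzy AND (assumed): a frame scores high only if both tools rate it high."""
    return {f: min(a.get(f, 0.0), b.get(f, 0.0)) for f in a.keys() | b.keys()}


def merge_or(a: Scores, b: Scores) -> Scores:
    """Fuzzy OR (assumed): a frame scores high if either tool rates it high."""
    return {f: max(a.get(f, 0.0), b.get(f, 0.0)) for f in a.keys() | b.keys()}


def top_k(scores: Scores, k: int) -> List[int]:
    """Return the k highest-scoring frames, ties broken by frame index."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Hypothetical per-frame scores from two tools (say, an object detector and an OCR tool).
detector = {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.1}
ocr = {0: 0.1, 1: 0.7, 2: 0.6, 3: 0.0}

both = merge_and(detector, ocr)
either = merge_or(detector, ocr)
print(top_k(both, 1))    # -> [2]
print(top_k(either, 2))  # -> [0, 2]
```

With AND, only frame 2 is rated well by both tools; with OR, the frames that either tool rates highly come first.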

2605.23819 2026-05-25 cs.CV cs.AI (version update)

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

Affiliations: ANITI; Brown University; CNRS

AI summary: This paper examines a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs) to interpolate continuously between discriminative and generative training objectives within a fixed architecture, the study isolates the effect of the learning objective and evaluates on six human-alignment benchmarks covering perceptual similarity, gloss perception, human response uncertainty, and more. Human alignment peaks at intermediate points between the generative and discriminative objectives rather than at either extreme, indicating that human-aligned vision arises from balancing the two objectives rather than choosing one.

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
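
To make the "single mixing coefficient" concrete, here is a minimal sketch that assumes the objective is a convex combination of a classification loss and a JEM-style energy term, where the classifier logits f(x) define the energy E(x) = -logsumexp f(x). The parameter name `alpha`, the convention that `alpha = 0` is purely discriminative, and the contrastive form of the generative term are assumptions; the abstract does not give the exact loss.

```python
import math
from typing import Sequence


def logsumexp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Discriminative term: -log softmax(logits)[label]."""
    return logsumexp(logits) - logits[label]


def energy(logits: Sequence[float]) -> float:
    """JEM-style energy of an input, read off the classifier logits."""
    return -logsumexp(logits)


def hybrid_loss(logits_data: Sequence[float], label: int,
                logits_sample: Sequence[float], alpha: float) -> float:
    """(1 - alpha) * discriminative + alpha * generative (assumed form).

    The generative term lowers the energy of real data relative to a
    model sample, as in contrastive training of energy-based models.
    """
    disc = cross_entropy(logits_data, label)
    gen = energy(logits_data) - energy(logits_sample)
    return (1.0 - alpha) * disc + alpha * gen


# Sweep the coefficient from purely discriminative to purely generative.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_loss([2.0, 0.0], 0, [0.0, 0.0], alpha))
```

In practice the sample logits would come from a sampler such as Langevin dynamics; here they are fixed numbers so the sketch stays self-contained.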

2605.23797 2026-05-25 cs.LG cs.CV (version update)

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

Affiliations: University of Technology Sydney

AI summary: This paper studies out-of-distribution (OOD) detection with pre-trained vision-language models (VLMs), which aims to identify inputs from unknown classes. Existing methods mainly rely on heuristic rules to mine negative labels from unlabeled corpora, but suffer from a severe false-negative problem. The authors propose a debiased negative mining method that corrects the bias by indirectly estimating the distribution of negative labels, and recast it as Monte-Carlo sampling over ID labels and the unlabeled corpus. Experiments show that the method achieves new state-of-the-art performance across a variety of OOD detection settings.

Comments: KDD 2026

Abstract

Aiming to identify unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities to ID labels and negative labels, i.e., those semantically different from ID labels. Because target OOD labels are unavailable, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments empirically demonstrate that our method establishes a new state-of-the-art in a variety of OOD detection setups. Code is publicly available at https://github.com/60pen9/Debiased-Negative-Mining-Improves-OOD-Detection-with-Pre-trained-VLMs.
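
The pipeline of "examining affinities to ID labels and negative labels" can be sketched as a softmax-mass score: the share of probability mass that falls on ID labels when the input is compared against both label sets. This is only one plausible form of such a score, written under assumptions; the function name, the temperature value, and the example similarities are hypothetical, and the debiasing step proposed in the paper is not shown because the abstract does not describe it in enough detail.

```python
import math
from typing import Sequence


def id_mass_score(id_sims: Sequence[float], neg_sims: Sequence[float],
                  temperature: float = 0.01) -> float:
    """Fraction of softmax mass on ID labels (higher = more in-distribution).

    id_sims / neg_sims are image-text similarities (e.g. cosine) between
    one input and the ID labels / mined negative labels.
    """
    m = max(list(id_sims) + list(neg_sims))  # subtract max for stability
    id_mass = sum(math.exp((s - m) / temperature) for s in id_sims)
    neg_mass = sum(math.exp((s - m) / temperature) for s in neg_sims)
    return id_mass / (id_mass + neg_mass)


# Hypothetical similarities for an in-distribution and an OOD input.
score_in = id_mass_score([0.30, 0.10], [0.12, 0.08])
score_out = id_mass_score([0.10, 0.09], [0.28, 0.11])
print(score_in > 0.5, score_out < 0.5)
```

A false negative in this setting is a mined "negative" label that is in fact close to an ID class: it pulls mass away from the ID labels and lowers the score of genuine in-distribution inputs, which is the failure the paper sets out to correct.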

2605.23790 2026-05-25 cs.CV (version update)

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Romaric Mazna, Jean Martinet, Sai Deepesh Pokala

Affiliations: i3S/CNRS, Université Côte d’Azur

AI summary: This paper studies saliency prediction from event-camera data and proposes SEST, a Transformer-based model that predicts saliency maps from event data. To overcome the lack of large-scale annotated datasets and strong baselines for event data, the authors introduce an event-native pretraining strategy and synthetic supervision, and build two new benchmark datasets. Experiments show that SEST outperforms existing event-based saliency methods and transfers well to real event data; the authors describe it as the first application of deep learning to event-based saliency prediction.

Abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, which overcomes the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data transfers to real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

2605.23777 2026-05-25 cs.CV (version update)

Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

FB Pena, D Crabi, Sandro C Izidoro, Érick O Rodrigues, G Bernardes

Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnológica Federal do Paraná (UTFPR), Pato Branco, State of Paraná, Brazil

AI summary: This paper proposes a machine-learning framework for grading emerald gemstones and creates a public dataset. The framework automates the grading process from image acquisition through final classification, avoiding the subjectivity of manual grading. The study is presented as the first to combine machine learning with image processing for emerald grading; it reports 98% classification accuracy and releases a dataset of 192 emerald images with their pre-processed features.

Journal ref: Pattern Analysis and Applications 2022

Abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones, which are visually inspected by specialists who decide which of the available reference stones is most similar to the inspected stone. This procedure is very subjective, as different specialists may end up with different grading choices. This work proposes a complete framework that spans image acquisition through to the final stone categorization. The proposal automates the entire process apart from placing the stone in the chamber built for image acquisition. It removes the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the dataset used, which contains 192 images of emerald stones along with their extracted and pre-processed features.

2605.23775 2026-05-25 cs.CV (version update)

A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques

João VC Mazzochin, Giovani Bernardes Vitor, Gustavo Tiecker, Elioenai MF Diniz, Gilson A Oliveira, Marcelo Trentin, Érick O Rodrigues

Affiliations: Graduate Program of Production and Systems Engineering, Universidade Tecnológica Federal do Paraná (UTFPR); Institute of Technological Sciences, Universidade Federal de Itajubá (UNIFEI); Business School, Universidade Federal do Paraná (UFPR); Graduate Program of Electrical and Computer Engineering, Universidade Tecnológica Federal do Paraná (UTFPR)

AI summary: This paper proposes a new method for counting wood logs based on conditional generative adversarial networks (cGANs) and image processing techniques, targeting the challenges of precise counting. The method uses image processing to handle noise and intersecting logs, and a connected-components algorithm for efficient counting. The study also releases a database of 466 images containing approximately 13,048 eucalyptus logs. Experiments report 96.4% pixel-level accuracy and 92.3% log-level accuracy, with processing fast enough for practical use in settings such as forestry management and resource optimization.

Journal ref: Forests 2025

Abstract

This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology's success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
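
The counting step can be pictured with a small connected-components sketch: once segmentation yields a binary mask with log pixels set to 1, each connected blob is counted as one log. The breadth-first search below, the 4-connectivity choice, and the toy mask are assumptions for illustration; the abstract does not say which connectivity or implementation the authors use.

```python
from collections import deque
from typing import List


def count_components(mask: List[List[int]]) -> int:
    """Count 4-connected regions of non-zero pixels in a binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New, unvisited foreground pixel: flood-fill its whole region.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Toy segmentation mask with three separate blobs.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
print(count_components(mask))  # -> 3
```

Touching logs would merge into a single component under this scheme, which is presumably why the abstract mentions dedicated processing for noise and intersections before counting.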

2605.23771 2026-05-25 cs.CV cs.AI cs.MA (version update)

PhotoFlow: Agentic 3D Virtual Photography Missions

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

Affiliations: Shanghai Jiao Tong University; Northeastern University; University of California, Los Angeles; Cornell University; Shanghai AI Laboratory; Sichuan University

AI summary: PhotoFlow is an agentic system for virtual photography that, given a language instruction and no preset camera parameters or reference image, produces photographs in a 3D scene that match the stated intent. It comprises three modules: the Director proposes diverse candidate cameras, the Reviewer performs visual evaluation and selects among candidates, and the Reflector uses past failures to refine the search strategy. The study also introduces VPhotoBench, a benchmark of Blender scenes and language-conditioned photography missions. Experiments show PhotoFlow performing best among the compared baselines under a multi-round rendering budget, and the authors describe it as the first executable agent system for language-conditioned virtual photography in arbitrary Blender scenes.

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23747 2026-05-25 cs.CV (version update)

Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

Allan Kazakov, Duygu Cakir, Hilal Kurt İrfanoğlu, Yavuz İrfanoğlu

Affiliations: Bahcesehir University, Istanbul, Turkey; Poder Bilişim Teknolojileri Sanayi ve Ticaret A.Ş., Istanbul, Turkey; Galatasaray University, Istanbul, Turkey

AI summary: This paper aims to revive the Apple Dense Material Segmentation (Apple-DMS) benchmark and address the stagnation in material segmentation as geometry-biased models have come to dominate. It proposes a stabilized training recipe, comprising High-Fidelity Logit Projection, Query Entropy Regularization, and a physics-compliant augmentation strategy, that markedly improves Vision Transformer-based segmentation models. The authors also identify a "Generalization Paradox": re-partitioning the data raises the metric but reduces generalization in real-world scenes, which underlines the importance of using the original split for research on physically grounded AI.

Abstract

Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.

2605.23719 2026-05-25 cs.CV cs.AI (version update)

Weierstrass Positional Encoding for Vision Transformers

Zhihang Xin, Rui Wang, Xitong Hu, Xiaojun Wu

Affiliations: School of Mathematics and Data Science, Jiangnan University; School of Artificial Intelligence and Computer Science, Jiangnan University

AI summary: Vision Transformers have achieved notable success in computer vision, but their commonly used learnable one-dimensional positional encodings weaken the two-dimensional spatial structure of images after patch flattening. To address this, the paper proposes a positional encoding based on the Weierstrass elliptic function (WePE), which maps 2D patch coordinates into the complex domain and builds doubly periodic four-dimensional positional features that better preserve the geometric relationships and spatial-proximity priors of image patches. The method is mathematically grounded, matches the regular structure of the image grid, adds no noticeable computational overhead, and integrates into existing Vision Transformers; experiments show performance gains in most settings.

Abstract

Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.
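
A minimal sketch of the four-dimensional feature, assuming it consists of the real and imaginary parts of the Weierstrass function ℘(z) and its derivative ℘'(z), evaluated with a truncated lattice sum. The choice of a square lattice with periods 1 and i, the truncation size, the half-patch offset that keeps coordinates off the poles, and all names are assumptions; the abstract mentions precomputed lookup tables but does not say how the function is evaluated.

```python
from typing import Tuple


def wepe_features(u: float, v: float,
                  period1: complex = 1.0 + 0.0j,
                  period2: complex = 0.0 + 1.0j,
                  n_terms: int = 20) -> Tuple[float, float, float, float]:
    """Approximate (Re wp, Im wp, Re wp', Im wp') at z = u*period1 + v*period2.

    Uses the defining series
        wp(z)  = 1/z^2 + sum_{w != 0} [1/(z - w)^2 - 1/w^2]
        wp'(z) = -2 * sum_{w} 1/(z - w)^3
    truncated to lattice points w = m*period1 + n*period2 with |m|, |n| <= n_terms.
    z must not coincide with a lattice point (the function has poles there).
    """
    z = u * period1 + v * period2
    wp = 1.0 / z**2
    dwp = -2.0 / z**3
    for m in range(-n_terms, n_terms + 1):
        for n in range(-n_terms, n_terms + 1):
            if m == 0 and n == 0:
                continue
            w = m * period1 + n * period2
            wp += 1.0 / (z - w) ** 2 - 1.0 / w**2
            dwp += -2.0 / (z - w) ** 3
    return (wp.real, wp.imag, dwp.real, dwp.imag)


# Encode the patch centres of a hypothetical 4x4 grid, normalized into (0, 1).
grid = 4
table = {(i, j): wepe_features((i + 0.5) / grid, (j + 0.5) / grid)
         for i in range(grid) for j in range(grid)}
print(len(table), len(table[(0, 0)]))
```

Because the sum is truncated, double periodicity holds only approximately here; the exact function satisfies ℘(z + w) = ℘(z) for every lattice point w, which is the property the abstract relates to the regular patch grid.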

2605.23699 2026-05-25 cs.CV (version update)

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

León Begiristain, Olaf Dünkel, Adam Kortylewski

Affiliations: University of Freiburg; Max Planck Institute for Informatics; CISPA Helmholtz Center for Information Security

AI summary: This paper proposes CRONOS, an intervention-based benchmark for video models that tests whether their predictions of physical events stay counterfactually consistent when the visual input changes. Built in a photorealistic Unreal Engine environment, the benchmark systematically varies viewpoint, scene, object category, and object appearance while keeping the physical event type fixed, in order to assess robustness to these changes. Experiments show that the prediction quality of recent open-source video generators drops substantially under different interventions, highlighting shortcomings in physical consistency. CRONOS provides a controlled and reproducible testbed for studying and improving the physical understanding of video models.

Comments: 27 pages, 12 figures

Abstract

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by changes in appearance, environment, and, in particular, viewpoint. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions. The dataset and code are available at our project page.

2605.18214 2026-05-25 cs.CV (version update)

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

Rosario Leonardi, Francesco Ragusa, Daniele Materia, Alessandro Passanisi, James Fort, Jakob Engel, Giovanni Maria Farinella

Affiliations: Department of Mathematics and Computer Science, University of Catania; Next Vision s.r.l.; Reality Labs Research, Meta

AI summary: This paper proposes EgoInteract, a controllable simulator that generates synthetic egocentric videos with fine-grained spatial and temporal annotations, aiming to address the difficulty of collecting real data and the limited coverage of interaction patterns. The simulator supports precise control over camera, human body and hand motion, object manipulation, and scene composition, and the generated videos can be used for temporal action segmentation, next-active object detection, interaction anticipation, and related tasks. Experiments show that models trained with the simulated data improve over strong baselines on multiple real-world egocentric datasets, supporting the effectiveness and transferability of the approach.

Abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

2605.07919 2026-05-25 cs.CV (version update)

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

Affiliations: University of Georgia; Harvard Medical School; Chungbuk National University; Chungnam National University Hospital; National University of Singapore; New York University; University of Sydney; New Jersey Institute of Technology

AI summary: This paper proposes MedVIGIL, a benchmark for evaluating the trustworthiness of medical vision-language models (VLMs) when the visual evidence has failed. The study asks whether a model correctly refuses to answer when the image or question has been perturbed, rather than giving a fluent but wrong answer. MedVIGIL contains 300 cases annotated by radiologists, provides several audit metrics and a composite score for measuring behavior under different failure scenarios, and reports results for 16 vision-capable models and two text-only baselines.

Abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

2604.21502 2026-05-25 cs.CV Updated version

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

VFM$^{4}$SDG:揭示VFM在单域广义目标检测中的力量

Yupeng Zhang, Ruize Han, Ningnan Guo, Wei Feng, Song Wang, Liang Wan

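Affiliations * College of Intelligence and Computing, Tianjin University; Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology

AI summary This study addresses the performance degradation caused by environmental changes in single-domain generalized object detection (SDGOD) and proposes VFM$^{4}$SDG, a new framework built on vision foundation models (VFMs). Its analysis finds that the cross-domain performance drop of detectors stems mainly from unstable relational structure, whereas VFMs keep stable relational structures and object responses under severe domain shift and are therefore used as a cross-domain stability prior. The method introduces a frozen VFM and applies relational prior distillation in the encoder and semantic-contextual query enhancement in the decoder, which improves the detector's cross-domain robustness and yields significant gains on multiple benchmarks.

Details
AI Chinese abstract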

现实世界中的天气、光照和成像变化常常引起严重的域偏移,导致单源检测器在未见环境中性能下降。现有的单域广义目标检测(SDGOD)方法主要依赖于数据增强或域不变学习,而很大程度上忽略了域偏移如何破坏检测器的预测稳定性。通过分析实验,我们发现性能下降主要由漏检增加主导。进一步分析表明,这一现象源于DETR风格检测器的跨域稳定性降低:域偏移破坏了编码器侧的物体-背景和实例间关系,并进一步削弱了解码器查询与真实物体之间的语义-空间绑定。受此启发,我们发现视觉基础模型(VFM)在严重偏移下仍能保持稳定的关系结构和物体响应,使其成为补偿检测器退化的合适跨域稳定性先验。为此,我们提出了VFM$^{4}$SDG,一个用于SDGOD的双先验学习框架,它将冻结的VFM引入编码器表示学习和解码器查询建模。具体来说,我们提出了跨域稳定关系先验蒸馏,将VFM中的稳定物体-背景和实例间关系蒸馏到编码器中,补偿关系退化。同时,我们提出了基于语义-上下文先验的查询增强,在查询进入解码器层之前注入类别语义原型和全局物体上下文,增强语义-空间查询-物体绑定稳定性。大量实验表明,VFM$^{4}$SDG在标准SDGOD基准和两个主流基于DETR的检测框架上显著优于现有先进方法,证明了其有效性、鲁棒性和泛化性。

English abstract

Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced cross-domain stability in DETR-style detectors: domain shift disrupts encoder-side object-background and inter-instance relations, and further weakens the semantic-spatial binding between decoder queries and real objects. Motivated by this, we find that vision foundation models (VFMs) still preserve stable relational structures and object responses under severe shifts, making them suitable cross-domain stability priors to compensate for detector degradation. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen VFM into encoder representation learning and decoder query modeling. Specifically, we propose Cross-domain Stable Relational Prior Distillation to distill stable object-background and inter-instance relations from the VFM into the encoder, compensating for relational degradation. Meanwhile, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category semantic prototypes and global object context into queries before they enter the decoder layer, enhancing semantic-spatial query-object binding stability. Extensive experiments show that VFM$^{4}$SDG significantly outperforms existing advanced methods on standard SDGOD benchmarks and two mainstream DETR-based detection frameworks, demonstrating its effectiveness, robustness, and generality.

2604.10077 2026-05-25 cs.CV Updated version

DocRevive: A Unified Pipeline for Document Text Restoration

DocRevive:文档文本恢复的统一流水线

Kunal Purkayastha, Ayan Banerjee, Josep Llados, Umapada Pal

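Affiliations * Computer Vision Center; Indian Statistical Institute

AI summary DocRevive is a unified document text restoration pipeline that targets the reconstruction of damaged, occluded, or incomplete text. The method combines OCR, image analysis, masked language models, and diffusion models to perform semantically coherent text restoration while preserving visual integrity. The work also builds a synthetic dataset of 30,078 degraded document images and proposes a unified context similarity metric to evaluate restoration quality, setting a new benchmark for document restoration.

Details
AI Chinese abstract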

在文档理解中,重建受损、遮挡或不完整文本的挑战仍然是一个关键但未充分探索的问题。后续的文档理解任务可以受益于文档重建过程。为此,本文提出了一种新颖的统一流水线,结合了最先进的光学字符识别(OCR)、高级图像分析、掩码语言建模和基于扩散的模型,以在保持视觉完整性的同时恢复和重建文本。我们创建了一个包含30,078张退化文档图像的合成数据集,模拟了多种文档退化场景,为恢复任务设定了基准。我们的流水线检测并识别文本,通过遮挡检测器识别退化,并使用修复模型进行语义连贯的重建。基于扩散的模块无缝地重新整合文本,匹配字体、大小和对齐方式。为了评估恢复质量,我们提出了统一上下文相似度度量(UCSM),结合了编辑相似度、语义相似度和长度相似度,并引入上下文可预测性度量,当正确文本在上下文中显而易见时,对偏差进行惩罚。我们的工作推进了文档恢复,有利于档案研究和数字保存,同时为文本重建设立了新标准。OPRB数据集和代码分别可在Hugging Face(https://huggingface.co/datasets/kpurkayastha/OPRB)和Github(https://github.com/kunalpurkayastha/DocRevive)上获取。

English abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.

2603.28767 2026-05-25 cs.CV Updated version

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Gen-Searcher: 强化搜索代理用于图像生成

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue

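Affiliations * MMLab, CUHK; UCLA; UC Berkeley; Meituan Home

AI summary This paper proposes Gen-Searcher, described as the first attempt at a search-augmented image generation agent, to address the poor performance of existing models in real-world scenarios that are knowledge-intensive or require up-to-date information, a consequence of their frozen internal knowledge. The method uses multi-hop reasoning and search to gather the textual knowledge and reference images needed for generation, and the authors build two high-quality datasets and a comprehensive benchmark, KnowGen, to evaluate model performance. Experiments show that Gen-Searcher substantially outperforms existing models on several metrics and provides an open foundation for research on search-based image generation agents.

Comments Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher

Details
AI Chinese abstract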

最近的图像生成模型在生成高保真度和逼真图像方面表现出强大能力。然而,它们从根本上受限于冻结的内部知识,因此在需要知识密集型或最新信息的现实场景中常常失败。在本文中,我们提出Gen-Searcher,作为训练搜索增强图像生成代理的首次尝试,该代理执行多跳推理和搜索,以收集基于文本的知识和参考图像,用于接地生成。为实现这一目标,我们构建了一个定制数据管道,并策划了两个高质量数据集:Gen-Searcher-SFT-10k和Gen-Searcher-RL-6k,包含多样化的搜索密集型提示和对应的真实合成图像。我们进一步引入了KnowGen,一个综合基准,明确要求搜索接地外部知识用于图像生成,并从多个维度评估模型。基于这些资源,我们使用SFT训练Gen-Searcher,随后进行具有双重奖励反馈的代理强化学习,该奖励结合了基于文本和基于图像的奖励,为GRPO训练提供更稳定和信息丰富的学习信号。实验表明,Gen-Searcher带来了显著提升,在KnowGen上使Qwen-Image提高了约16分,在WISE上提高了15分。我们希望这项工作能够作为图像生成中搜索代理的开放基础,并完全开源我们的数据、模型和代码。

English abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

2603.07615 2026-05-25 cs.LG cs.CV Updated version

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

压缩即适应:基于扩散基础模型的隐式视觉表示

Zongyu Guo, Jiajun He, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

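Affiliations * Microsoft Research Asia; University of Cambridge

AI summary This paper proposes a new representation framework that encodes a visual signal as a function, parametrized by low-rank adaptation parameters attached to a frozen visual generative model, giving an implicit representation of visual content. The approach can compress a signal such as an 81-frame video into a single compact vector, achieving high-quality perceptual video compression at extremely low bitrates. The functional representation also supports inference-time scaling and control, which improves compression performance and offers a unified framework for visual compression and generation.

Comments ICML 2026

Details
AI Chinese abstract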

现代视觉生成模型通过大规模训练获得丰富的视觉知识,但现有的视觉表示(如像素、潜变量或标记)仍独立于模型,无法直接利用这些知识进行紧凑存储或重用。在这项工作中,我们引入了一种新的视觉表示框架,将信号编码为一个函数,该函数通过附加在冻结的视觉生成模型上的低秩适应参数进行参数化。这种视觉信号的隐式表示,例如一个81帧的视频,可以进一步哈希成一个紧凑的向量,在极低比特率下实现强感知视频压缩。除了基本压缩外,这种表示的函数性质使得推理时缩放和控制成为可能,从而在压缩性能上实现额外优化。更广泛地说,由于隐式表示直接作为生成过程的函数,这提出了一个统一视觉压缩与生成的框架。

English abstract

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement of compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
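The idea of parametrizing a signal by low-rank adaptations on a frozen model can be illustrated with a minimal sketch. It assumes the usual low-rank update, in which a frozen weight matrix `W` is left untouched and only two small factors `A` and `B` are learned per signal. The function names, shapes, and numbers below are illustrative assumptions, not the paper's implementation, and the hashing step mentioned in the abstract is not covered.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector) -> Vector:
    """Compute (W + B A) x without ever forming the full update B A.

    w: frozen weight, shape (d_out, d_in)
    a: learned factor, shape (rank, d_in)
    b: learned factor, shape (d_out, rank)
    """
    base = matvec(w, x)             # frozen model path
    low = matvec(b, matvec(a, x))   # per-signal low-rank path
    return [p + q for p, q in zip(base, low)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of values stored per adapted layer: rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


# Toy example with rank 1 on a 2x2 identity weight.
W = [[1, 0], [0, 1]]
A = [[1, 1]]        # shape (1, 2)
B = [[1], [2]]      # shape (2, 1)
y = lora_forward(W, A, B, [1, 2])
```

For a hypothetical 4096 x 4096 layer, a rank-8 adaptation stores `lora_param_count(4096, 4096, 8)` values, far fewer than the 4096 * 4096 entries of the frozen weight. This size gap is what makes the adaptation parameters a candidate compact representation.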

2602.11146 2026-05-25 cs.CV cs.AI Updated version

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

超越基于VLM的奖励:扩散原生潜在奖励建模

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

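Affiliations * The Hong Kong University of Science; Huawei Hong Kong AI Framework & Data Technologies Lab; Tsinghua University; The Australian National University

AI summary This paper proposes DiNa-LRM, a diffusion-native latent reward model, to meet the need for reward functions in preference optimization of diffusion and flow-matching models. The method performs preference learning directly on the noisy states of the diffusion process and introduces a Thurstone likelihood whose uncertainty is calibrated to the diffusion noise, improving the discriminative robustness and computational efficiency of the reward model. Experiments show that DiNa-LRM substantially outperforms existing diffusion-based reward baselines on image alignment tasks, reaches performance comparable to state-of-the-art vision-language models at lower computational cost, and improves preference optimization dynamics.

Comments Accepted by ICML 2026. Code: https://github.com/HKUST-C4G/diffusion-rm

Details
AI Chinese abstract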

扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又计算高效的奖励函数。视觉语言模型(VLM)凭借其丰富的多模态先验,已成为主要的奖励提供者,用于指导对齐。然而,它们的计算和内存成本可能很高,并且通过像素空间奖励优化潜在扩散生成器会引入域不匹配,使对齐复杂化。在本文中,我们提出DiNa-LRM,一种扩散原生潜在奖励模型,直接在噪声扩散状态上制定偏好学习。我们的方法引入了一种噪声校准的Thurstone似然,具有扩散噪声依赖的不确定性。DiNa-LRM利用预训练的潜在扩散骨干网络,配备时间步条件奖励头,并支持推理时噪声集成,提供了一种扩散原生的机制用于测试时缩放和鲁棒奖励。在图像对齐基准测试中,DiNa-LRM显著优于现有的基于扩散的奖励基线,并以一小部分计算成本实现了与最先进VLM竞争的性能。在偏好优化中,我们证明DiNa-LRM改善了偏好优化动态,实现了更快且更资源高效的模型对齐。

English abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
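The noise-calibrated Thurstone likelihood named in the abstract can be illustrated with the classical Thurstone comparison model, in which each reward carries independent Gaussian noise and the preference probability is a normal CDF of the scaled reward gap. The abstract does not state the exact formula, so everything below is a sketch under assumptions. In particular, the linear `sigma_from_timestep` schedule and its default bounds are hypothetical and are not taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigma_from_timestep(t: float, sigma_min: float = 0.1, sigma_max: float = 1.0) -> float:
    """Hypothetical schedule: reward uncertainty grows linearly with noise level t in [0, 1]."""
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_prob(r_a: float, r_b: float, sigma_a: float, sigma_b: float) -> float:
    """P(a preferred over b) when each reward has independent Gaussian noise."""
    return normal_cdf((r_a - r_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2))


def thurstone_nll(r_win: float, r_lose: float, t: float) -> float:
    """Negative log-likelihood of one labelled preference pair scored at noise level t."""
    sigma = sigma_from_timestep(t)
    p = thurstone_prob(r_win, r_lose, sigma, sigma)
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

With equal rewards the probability is exactly 0.5. For a fixed reward gap, a larger uncertainty pulls the probability toward 0.5, so under this sketch a pair scored at a noisier timestep yields a less confident likelihood.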

2601.14180 2026-05-25 cs.CV Updated version

Progressive $\mathcal{J}$-Invariant Self-supervised Learning for Low-Dose CT Denoising

渐进式 $\mathcal{J}$-不变自监督学习用于低剂量CT去噪

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

Affiliations * IWR, Heidelberg University, Heidelberg 69120, Baden Württemberg, Germany; Silicon Austria Labs, Linz 4040, Upper Austria, Austria; Institute of Science Tokyo, Tokyo, Japan; College of Medicine Biological Information Engineering, Northeastern University, Shenyang 110169, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang 110169, Liaoning, China; Department of Epidemiology & Global Health, Umeå University, Umeå 90187, Sweden

AI summary This paper studies self-supervised learning for low-dose CT image denoising, aiming to reduce reliance on paired normal-dose CT data. To address the low training efficiency and limited performance that existing methods suffer because of restricted receptive fields, it proposes a progressive $\mathcal{J}$-invariant self-supervised learning method that improves denoising through a step-wise blind-spot denoising mechanism and the injection of controlled noise. Experiments show that the method outperforms existing self-supervised approaches on the Mayo low-dose CT dataset and matches or exceeds several supervised denoising methods.

Details
AI Chinese abstract

自监督学习越来越多地被研究用于低剂量计算机断层扫描(LDCT)图像去噪,因为它减轻了对通常难以收集的配对正常剂量CT(NDCT)数据的依赖。然而,许多现有的自监督盲点去噪方法由于感受野受限,存在训练效率低下和性能次优的问题。为了缓解这一问题,我们提出了一种新颖的渐进式 $\mathcal{J}$-不变学习,最大化利用 $\mathcal{J}$-不变性来增强LDCT去噪性能。我们引入了一种逐步盲点去噪机制,以渐进方式强制执行条件独立性,从而实现更细粒度的去噪学习。此外,我们在训练过程中显式注入受控的高斯噪声和泊松噪声的组合,以正则化去噪过程并减轻过拟合。在Mayo LDCT数据集上的大量实验表明,所提出的方法持续优于现有的自监督方法,并实现了与几种代表性监督去噪方法相当或更好的性能。

English abstract

Self-supervised learning has been increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to collect. However, many existing self-supervised blind-spot denoising methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields. To mitigate this issue, we propose a novel Progressive $\mathcal{J}$-invariant Learning method that maximizes the use of $\mathcal{J}$-invariance to enhance LDCT denoising performance. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained learning for denoising. Furthermore, we explicitly inject a combination of controlled Gaussian and Poisson noise during training to regularize the denoising process and mitigate overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.

2512.20901 2026-05-25 cs.CV Updated version

Benchmarking and Enhancing VLM for Compressed Image Understanding

基准测试与增强VLM对压缩图像的理解

Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

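Affiliations * Institute for AI Industry Research, Tsinghua University, Beijing, China; Beihang University, Beijing, China; Beijing University of Technology, Beijing, China

AI summary As image compression becomes ubiquitous, improving the ability of vision-language models (VLMs) to understand compressed images matters more and more. This paper builds the first comprehensive benchmark for evaluating VLMs across compression codecs and tasks, and analyses the sources of the performance gap on compressed images, finding that only the generalization component of the gap can be mitigated. Based on this, the authors propose a universal VLM adaptor that improves model performance by 10%-30% across codecs and bitrates, providing a useful reference for applying VLMs to compressed-image tasks.

Comments The paper is accepted by ICML 2026

Details
AI Chinese abstract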

随着视觉语言模型(VLM)的快速发展及其应用需求的增长,图像输入的高效压缩变得日益重要。现有VLM主要处理和理解高比特率压缩图像,而它们对低比特率压缩图像的解读能力迄今尚未被探索。本文首次引入全面基准测试,评估VLM对压缩图像的能力,涵盖多种现有广泛使用的图像编解码器和多样化任务,基准测试中包含超过一百万个压缩图像。接着,我们分析性能差距的来源,将其归因于a)压缩过程中的信息损失和b)VLM的泛化失败。我们通过具体示例可视化这些差距,并确定对于压缩图像,只有泛化差距可以缓解。最后,我们提出一个通用VLM适配器,以增强模型对现有编解码器压缩图像的性能。结果证明,单个适配器可以将VLM在不同编解码器和比特率图像上的性能提升10%-30%。我们相信,我们的基准测试和增强方法为弥合VLM与压缩图像之间的差距提供了宝贵的见解和贡献。源代码可在https://github.com/bblgbr/CompressVLMBench获取。

English abstract

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs to handle compressed images, covering widely used image codecs and a diverse set of tasks, with over one million compressed images in the benchmark. Next, we analyse the source of the performance gap, attributing it to (a) information loss during compression and (b) generalisation failure of the VLM. We visualise these gaps with concrete examples and identify that, for compressed images, only the generalisation gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images. The source code is available at https://github.com/bblgbr/CompressVLMBench.

2512.07078 2026-05-25 cs.CV cs.LG Updated version

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

DFIR-DETR:面向小目标检测的频域迭代细化与动态特征聚合

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li

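Affiliations * School of Information Engineering, Beijing Institute of Graphic Communication; School of Computing and Data Science, The University of Hong Kong; College of Computing and Data Science, Nanyang Technological University

AI summary Targeting the core challenges of small object detection in complex scenes, this paper proposes DFIR-DETR, which uses frequency-domain iterative refinement and dynamic feature aggregation to address the shortcomings of existing networks in attention allocation, feature upsampling, and high-frequency information retention. While keeping computational cost low, the method achieves marked performance gains on the NEU-DET and VisDrone datasets, supporting its effectiveness across different detection tasks.

Details
AI Chinese abstract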

复杂场景中的小目标检测暴露了神经网络设计中的基本矛盾:骨干注意力分布均匀而不考虑内容,金字塔颈部在上采样过程中放大激活幅度而不进行归一化补偿,瓶颈卷积通过累积空间滤波逐步平滑高频边缘分量。为此,我们开发了DFIR-DETR,将每个提出的模块追溯到RT-DETR基线中特定的、可测量的缺陷:忽略空间复杂性的均匀注意力、破坏上采样特征稳定性的归一化漂移,以及逐步抑制小目标所依赖的高频分量的空间卷积。在NEU-DET和VisDrone上,DFIR-DETR仅以11.7M参数和47.2 GFLOPs就达到了92.9%和51.6%的mAP50,在两个性质不同的检测领域展示了持续的性能提升。

English abstract

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

2511.22521 2026-05-25 cs.CV cs.AI Updated version

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

DocVAL:用于基于文档的视觉问答的验证链式思维蒸馏

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Ser-Nam Lim, Rajiv Ramnath

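Affiliations * Department of Computer Science and Engineering, Ohio State University, Ohio, US; Department of Computer Science, University of Central Florida, Florida, US

AI summary DocVAL is a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA) that transfers the precise spatial reasoning of large vision-language models (VLMs) to more efficient compact models. The method combines teacher-generated spatial reasoning supervision with a rule-based dual-mode validator that filters low-quality training signals, and uses a two-stage training procedure with iterative refinement, so that the final student model runs on its own without OCR or detection modules. Experiments show that DocVAL improves the localization performance of compact models on multiple benchmarks, and the work introduces mAP as a new localization metric.

Details
AI Chinese abstract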

文档视觉问答要求模型不仅正确回答问题,还要在复杂文档布局中精确定位答案。大型视觉语言模型(VLM)具有强大的空间定位能力,但其推理成本和延迟限制了实际部署。紧凑型VLM更高效,但在标准微调或蒸馏下常出现显著的定位退化。为解决这一问题,我们提出DocVAL,一种验证链式思维(CoT)蒸馏框架,将显式空间推理从大型教师模型转移到紧凑、可部署的学生VLM。DocVAL结合了(1)教师生成的空间CoT监督,(2)基于规则的双模式验证器,过滤低质量训练信号并提供细粒度像素级纠正反馈,以及(3)验证驱动的两阶段训练过程与迭代细化。文本检测仅作为训练时的监督和验证脚手架,使得最终学生模型在推理时作为纯VLM运行,无需OCR或检测。在多个文档理解基准上,DocVAL相比可比的紧凑VLM持续提升高达6-7个ANLS点。我们进一步引入平均精度(mAP)作为文档问答的定位指标,并在此新评估下报告了强大的空间定位性能。我们发布了95K验证器验证的CoT轨迹,并表明高质量、验证过的监督比扩展未过滤数据更有效,实现了高效且可信的文档定位。代码/数据:https://github.com/ahmad-shirazi/DocVAL

English abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Code/Data: https://github.com/ahmad-shirazi/DocVAL
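The "ANLS points" quoted above refer to Average Normalized Levenshtein Similarity, a common answer metric for document VQA. The sketch below follows the commonly used definition: per question, take the best similarity over the accepted answers, and zero out any similarity whose normalized edit distance reaches the threshold of 0.5. Lowercasing and whitespace stripping are assumptions about normalization, and the paper's own evaluation script may differ.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity, in [0, 1]."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The function returns a value between 0 and 1. Results are often reported on a 0-100 scale, in which case a gain of 6 points corresponds to 0.06 here.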

2511.13904 2026-05-25 cs.CV Updated version

Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment

面向实时可扩展部署的边缘辅助多摄像头车辆跟踪框架

Yuqiang Lin, Sam Lockyer, Shucheng Zhang, Florian Stanek, Markus Zarbock, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

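Affiliations * University of Bath; Starwit Technologies GmbH; University of Washington

AI summary This paper proposes EASE-MCVT, an edge-assisted multi-camera vehicle tracking framework that addresses the limited real-time performance and scalability of existing methods. The framework uses a distributed edge-server architecture: object detection, single-camera tracking, and feature extraction run at the edge, and only lightweight metadata is sent to the central server, enabling efficient cross-camera association. The work optimises both the algorithm and the system, with a dynamic workload scheme, a server-side re-match module, and a self-supervised camera link model. Experiments show real-time processing with competitive tracking accuracy, offering a practical route to city-scale real-time traffic management.

Details
AI Chinese abstract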

摄像头是现代智能交通系统中的核心传感模态,提供关于道路使用者活动的丰富视觉信息。多摄像头车辆跟踪利用这些数据重建跨摄像头网络的车辆轨迹,支持交通流预测和优化等应用。然而,现有大多数MCVT研究强调跟踪精度,而对实时性能和可扩展性关注有限,这两者对于实际城市规模部署至关重要。为弥补这一差距,我们提出边缘辅助、可扩展且高效的MCVT(EASE-MCVT),一种分布式边缘-服务器框架,专为实时吞吐量和可扩展操作设计。在边缘端,每个摄像头流通过目标检测、单摄像头跟踪、地理映射和特征提取进行处理,而仅将轻量级元数据(包括车辆位置和外观特征)发送到中央服务器进行跨摄像头关联。为提高跟踪精度和系统效率,EASE-MCVT从算法和系统角度进行了优化。算法上,它引入了用于轨迹级特征提取的动态工作负载方案、用于重新连接碎片化轨迹的服务器端重新匹配模块,以及一个自监督摄像头链接模型,该模型学习时空约束以加速和稳定跨摄像头关联。系统上,它集成了面向生产的数据工程组件,以标准化大规模操作的部署和数据交换。据我们所知,EASE-MCVT是首个明确设计用于在分布式边缘-服务器设置中同时解决实时性能和可扩展性的MCVT框架。在RoundaboutHD和CityFlow数据集上的实验表明,该框架实现了实时吞吐量并具有竞争力的跟踪精度,为城市范围的实时交通管理铺平了道路。

English abstract

Cameras are a core sensing modality in modern intelligent transportation systems (ITS), providing rich visual information on road-user activities. Multi-Camera Vehicle Tracking (MCVT) uses this data to reconstruct vehicle trajectories across camera networks, supporting applications such as traffic flow prediction and optimisation. However, most existing MCVT studies emphasise tracking accuracy while paying limited attention to real-time performance and scalability, both essential for real-world and city-scale deployment. To address this gap, we propose Edge-Assisted, Scalable and Efficient MCVT (EASE-MCVT), a distributed edge-server framework designed for real-time throughput and scalable operation. On the edge side, each camera stream is processed through object detection, single-camera tracking, geo-mapping and feature extraction, while only lightweight metadata, including vehicle locations and appearance features, is sent to the central server for cross-camera association. To improve both tracking accuracy and system efficiency, EASE-MCVT is optimised from algorithmic and system perspectives. Algorithmically, it introduces a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module to reconnect fragmented tracklets, and a self-supervised camera link model that learns spatio-temporal constraints to accelerate and stabilise cross-camera association. Systemically, it integrates production-oriented data engineering components to standardise deployment and data exchange for large-scale operation. To the best of our knowledge, EASE-MCVT is the first MCVT framework explicitly designed to address both real-time performance and scalability in a distributed edge-server setting. Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy, paving the way for city-wide real-time traffic management.

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO Updated version

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

自主X光引导脊柱手术的机器人控制策略学习研究

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

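Affiliations * Johns Hopkins University; Technical University of Munich; Johns Hopkins School of Medicine

AI summary This paper studies imitation-learning-based robot control policies for X-ray-guided spine procedures, focusing on the feasibility and challenges of cannula insertion in vertebroplasty. The authors build a highly realistic simulation environment and a dataset of correct trajectories with bi-planar X-ray sequences, used to train imitation learning policies that rely on visual information alone. Experiments show that the policy succeeds on the first attempt in 68.5% of cases while keeping safe intra-pedicular trajectories across varied spinal anatomies and initial conditions, laying groundwork for lightweight, CT-free intra-operative robotic spine navigation.

Details
AI Chinese abstract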

基于模仿学习的机器人控制策略在基于视频的机器人学中重新受到关注。然而,对于稀疏输入的X光引导手术(如脊柱内固定),这种方法是否适用尚不清楚。我们研究了在双平面引导的套管针插入中模仿策略学习的可行性、机遇和挑战。我们开发了一个用于可扩展、自动化模拟X光引导脊柱手术的计算机沙盒,具有高度逼真性。我们整理了一个包含正确轨迹和相应双平面X光序列的数据集,模拟了提供者的逐步对齐过程。然后,我们训练了用于规划和开环控制的模仿学习策略,该策略仅基于视觉信息在椎体成形术环境中迭代对齐套管针。这种精确控制的设置提供了对该方法局限性和能力的见解。我们的策略在68.5%的案例中首次尝试成功,在不同椎体水平上保持了安全的椎弓根内轨迹。该策略迁移到了复杂解剖结构(包括骨折)以及不同的解剖结构和初始位置。在真实X光上的展开表明,具有合理轨迹的部分仿真到真实迁移是可能的。尽管这些初步结果令人鼓舞,但我们还发现了局限性,特别是在入口点精度方面。当前的结果为未来的努力提供了明确的基准,而借助更稳健的先验和领域知识,此类模型可能为未来实现轻量级、无CT的机器人术中脊柱导航奠定基础。

English abstract

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2510.15060 2026-05-25 cs.CV Updated version

A solution to generalized learning from small training sets found in infant repeated visual experiences of individual objects

从婴儿个体物体重复视觉经验中发现的小训练集泛化学习问题的解决方案

Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith

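Affiliations * Department of Computer Science, Luddy School of Informatics, Computing, and Engineering; Department of Psychological and Brain Sciences

AI summary This study examines how infants learn object categories from repeated visual experience in daily life, analysing head-camera images from 87 mealtimes recorded by 14 one-year-old infants and covering 8 early-learned object categories. Each infant's visual experience of each category is highly skewed: a few objects are seen very often and other instances rarely. Graph-theoretic analysis shows that each category has a "lumpy" structure in which high similarity and high variability coexist. Experiments show that artificial training sets with this distribution support generalization to novel instances from very few examples, offering new insight into visual recognition and learning in humans and machines.

Comments 28 pages, 7 figures, 3 tables

Details
AI Chinese abstract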

一岁婴儿能快速形成并泛化他们遇到的日常物体类别。这里我们提供了关于婴儿日常视觉经验中8个早期学习物体类别的证据。使用婴儿在用餐时间记录的头戴摄像机图像语料库(14名婴儿记录的87次用餐时间),我们测量了每个类别独特实例的频率以及每个实例视觉经验的变异性。实例分布高度偏斜,对于每个婴儿和类别,包含大量同一少数物体的图像以及较少其他实例的图像。单个类别相似性结构的图论度量揭示了高相似度和高变异性的混合,组织成多个但相互连接的高相似度图像簇。在计算实验中,我们表明,以相似性团块分布为特征的人工创建的训练集在非常少的训练经验后支持对新实例的泛化。我们讨论了对视觉物体识别以及更一般的学习(包括人类和机器)的启示。

English abstract

One-year-old infants rapidly form and generalize categories of the everyday objects they encounter. Here we provide evidence on infants' daily-life visual experiences for 8 early-learned object categories. Using a corpus of infant head-camera images recorded at mealtimes (87 mealtimes captured by 14 infants), we measure the frequency of the unique instances of each category and the variability of the visual experiences of each instance. The distribution of instances is highly skewed, containing, for each infant and category, many images of the same few objects along with fewer images of other instances. Graph-theoretic measures of the similarity structure for individual categories reveal a lumpy mix of high similarity and high variability, organized into multiple but interconnected clusters of high-similarity images. In computational experiments, we show that artificially created training sets characterized by a lumpy distribution of similarities support generalization to novel instances after very few training experiences. We discuss implications for visual object recognition, and for learning more generally, by both humans and machines.

2506.14135 2026-05-25 cs.RO cs.CV Updated version

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

GAF: 高斯动作场作为机器人操作中动态世界建模的4D表示

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

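Affiliations * Tsinghua University; Beijing Normal University; Shadow AI

AI summary This paper proposes a 4D representation based on the Gaussian Action Field (GAF) for dynamic world modeling in robotic manipulation. GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, enabling 4D modeling of dynamic scenes and manipulation actions. The method reasons about actions directly from the motion-aware 4D representation and improves manipulation accuracy through three related outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action. Experiments show that GAF surpasses existing methods in both reconstruction quality and manipulation success rate.

Comments https://ChaiYing1.github.io/projects/GAF/

Details
AI Chinese abstract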

准确的场景感知对于基于视觉的机器人操作至关重要。现有方法通常遵循视觉到动作(V-A)范式,直接从视觉输入预测动作,或视觉到3D到动作(V-3D-A)范式,利用中间3D表示。然而,由于操作场景的复杂性和动态性,这些方法常常面临动作不准确的问题。在本文中,我们采用V-4D-A框架,通过高斯动作场(GAF)从运动感知的4D表示中直接进行动作推理。GAF通过引入可学习的运动属性扩展了3D高斯溅射(3DGS),实现了动态场景和操作动作的4D建模。为了学习时变场景几何和动作感知的机器人运动,GAF提供三个相互关联的输出:当前场景的重建、未来帧的预测以及通过高斯运动估计的初始动作。此外,我们采用一个动作-视觉对齐的去噪框架,以GAF生成的初始动作和高斯感知的统一表示为条件,进一步获得更精确的动作。大量实验表明,GAF在重建质量上实现了显著改进,PSNR提高+11.5385 dB,SSIM提高+0.3864,LPIPS降低-0.5574,同时在机器人操作任务中,相比最先进方法,平均成功率提升+7.3%。

English abstract

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements: GAF improves reconstruction quality by +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, and raises the average success rate in robotic manipulation tasks by 7.3% over state-of-the-art methods.

2403.12401 2026-05-25 cs.CV 版本更新

RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

RT-NeRV: 通过残差标记化重新思考混合神经视频表示

Yunjie Xu, Xiang Feng, Chengkai Wang, Alan Wee-Chung Liew, Xuefei Yin, Yanming Zhu

发表机构 * Ningbo University(宁波大学) Hangzhou Dianzi University(杭州电子科技大学) Griffith University(格里菲斯大学)

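AI Summary This paper proposes RT-NeRV, a hybrid neural video representation that targets the difficulty existing methods have in preserving detail at low bitrates. Its core idea is residual tokenization: shallow residual features and inter-frame residual cues are discretized into compact residual tokens so that this information can be transmitted efficiently and used for reconstruction. The method pairs a residual tokenizer with a residual-aware codebook learning strategy, which improves reconstruction quality and training stability, and it outperforms existing hybrid NeRV methods on several video regression and restoration tasks.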

Comments Under Review

详情
AI中文摘要

神经视频表示(NeRV)通过将视频表示为紧凑的神经网络并实现高效解码,已成为视频压缩的一种有前景的范式。混合NeRV方法通过内容自适应嵌入进一步提高了重建质量,但在低比特率下仍难以保留精细细节。一个关键限制是,浅层残差支持信息虽然对重建非常有益,但其连续形式的传输成本高昂,因此未被充分利用。在本文中,我们重新思考混合NeRV,并提出了RT-NeRV,一种用于混合神经视频表示的残差标记化框架。核心思想是将浅层残差特征和帧间残差线索离散化为紧凑的残差标记,从而使得信息丰富的重建支持能够高效传输并被解码器利用。为此,我们设计了一个残差标记化器,并结合了一种残差感知的码本学习策略,该策略提高了标记利用率并稳定了训练。RT-NeRV可以轻松集成到现代混合NeRV主机中,持续增强细节保留、重建质量以及比特率-质量权衡。在视频回归和相关恢复任务上的大量实验表明,RT-NeRV优于强混合NeRV基线,并与近期基于INR的视频压缩方法保持竞争力。这些结果表明,残差标记化是推进混合神经视频表示的一个有效且互补的方向。

英文摘要

Neural Representations for Videos (NeRV) have emerged as a promising paradigm for video compression by representing videos as compact neural networks with efficient decoding. Hybrid NeRV methods further improve reconstruction quality through content-adaptive embeddings, but still struggle to preserve fine details at low bitrates. A key limitation is that shallow residual support information, although highly beneficial for reconstruction, is costly to transmit in its continuous form and is therefore underutilized. In this paper, we rethink hybrid NeRV and present RT-NeRV, a residual tokenization framework for hybrid neural video representations. The core idea is to discretize shallow residual features and inter-frame residual cues into compact residual tokens, allowing informative reconstruction support to be transmitted efficiently and exploited by the decoder. To this end, we design a residual tokenizer together with a residual-aware codebook learning strategy that improves token utilization and stabilizes training. RT-NeRV can be readily integrated into modern hybrid NeRV hosts, consistently enhancing detail preservation, reconstruction quality, and bitrate-quality trade-offs. Extensive experiments on video regression and related restoration tasks show that RT-NeRV outperforms strong hybrid NeRV baselines and remains competitive with recent INR-based video compression methods. These results demonstrate that residual tokenization is an effective and complementary direction for advancing hybrid neural video representations.
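The abstract does not say how the residual tokenizer assigns tokens, so the following is only a minimal sketch of the general idea of discretizing continuous residual vectors against a codebook. It assumes plain nearest-neighbour vector quantization; the names `tokenize`, `detokenize`, and the toy `codebook` are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: nearest-neighbour quantization of residual vectors.
# This is NOT the RT-NeRV tokenizer, only an illustration of "discretize
# continuous residual features into compact tokens".

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(residuals, codebook):
    """Map each residual vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(r, codebook[k]))
        for r in residuals
    ]

def detokenize(tokens, codebook):
    """Recover an approximate residual for each token by codebook lookup."""
    return [codebook[t] for t in tokens]

# Toy codebook and residuals (assumed values for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
residuals = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]

tokens = tokenize(residuals, codebook)         # integer indices, cheap to transmit
reconstructed = detokenize(tokens, codebook)   # decoder-side approximation
```

Only the integer tokens need to be transmitted here; the decoder looks them up in a shared codebook, which is the efficiency argument the abstract makes for tokens over continuous residuals.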

2605.23672 2026-05-25 cs.CV 版本更新

RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

RiGS: 从单目视频中的刚性感知4D高斯泼溅

Chenyu Wu, Wanhua Li, Zhu-Tian Chen, Hanspeter Pfister

发表机构 * Harvard University(哈佛大学) Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学) University of Minnesota - Twin Cities(明尼苏达大学-双城分校)

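AI Summary Reconstructing dynamic 3D scenes from monocular video is a fundamental but highly challenging task, because real-world motion often combines long-term smooth transformations with short-term complex deformations. This paper proposes RiGS, a rigid-aware 4D Gaussian Splatting method that captures motion at multiple temporal scales simultaneously. It introduces three kinds of Gaussian primitives that represent static backgrounds, long-term low-frequency motion, and short-term high-frequency dynamics, and it uses an object-wise dynamic mask to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. Experiments show that RiGS achieves state-of-the-art performance on novel view synthesis.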

详情
AI中文摘要

从单目视频重建动态3D场景是一项基本但极具挑战性的任务,因为现实世界的运动通常涉及长期平滑变换和短期复杂变形。现有方法要么难以保持时间一致性,要么由于运动建模能力有限而无法捕捉高频动态。在这项工作中,我们提出了刚性感知4D高斯泼溅(RiGS),它同时捕捉多个时间尺度上的运动。具体来说,RiGS引入了三种类型的高斯原语:静态、刚性和瞬态,分别表示静态背景、长期低频运动和短期高频动态。提出了一种对象级动态掩码来聚合长距离时空运动信息,并指导静态和动态区域的分解。为了联合建模跨尺度的运动,允许刚性高斯根据其时间持续期转变为瞬态高斯,并且两者都在场景流引导下进行优化,提供密集的3D运动监督。大量实验表明,RiGS在新视角合成基准测试中达到了最先进的性能。代码可在 https://github.com/ladvu/RiGS 获取。

英文摘要

Reconstructing dynamic 3D scenes from monocular videos is a fundamental yet highly challenging task, as real-world motions often involve both long-term smooth transformations and short-term complex deformations. Existing methods either struggle to maintain temporal consistency or fail to capture high-frequency dynamics due to limited motion modeling capacity. In this work, we present Rigid-aware 4D Gaussian Splatting (RiGS), which simultaneously captures motions across multiple temporal scales. Specifically, RiGS introduces three types of Gaussian primitives: static, rigid, and transient, which represent static backgrounds, long-term low-frequency motions, and short-term high-frequency dynamics, respectively. An object-wise dynamic mask is proposed to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. To jointly model motion across scales, rigid Gaussians are allowed to transition into transient Gaussians based on their temporal duration, and both are optimized under scene flow guidance, providing dense 3D motion supervision. Extensive experiments demonstrate that RiGS achieves state-of-the-art performance on novel view synthesis benchmarks. Code is available at https://github.com/ladvu/RiGS.

2605.23656 2026-05-25 cs.CV 版本更新

Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

递归块对角耦合用于视觉模型的资源高效训练

Maxim Henry, Adrien Deliège, Sébastien Piérard, Marc Van Droogenbroeck

发表机构 * Montefiore Institute, University of Liège(列日大学蒙特菲奥里研究所)

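AI Summary This paper proposes RBDC, an efficient training protocol that builds wide models by recursively coupling several narrow models in a parameter-free block-diagonal way, which allows the training budget to be allocated flexibly. On ImageNet, it reduces compute by 30% relative to standard training from scratch at similar test accuracy, and it outperforms existing model growth methods at the same compute. Models trained with RBDC also perform better on downstream object detection and instance segmentation tasks.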

Comments 22 pages, 3 figures, 4 tables, and 34 references

详情
AI中文摘要

从头训练高容量视觉模型需要大量计算资源。为了提高宽目标模型的训练效率,现有的增长方法通常假设存在更窄的模型,从而掩盖了整个流程的真实计算成本。我们提出了一种高效的训练协议RBDC,该协议通过递归方式以无参数块对角耦合独立训练的窄模型来构建宽模型。这允许灵活分配所有涉及模型的可用训练预算。在ImageNet上使用视觉变换器(DeiT)和卷积网络(ResNet)进行评估,我们的RBDC训练协议显示出比标准协议从头训练的模型更好的效率,在相似测试精度下实现了30%的FLOPs减少。与模型增长文献中的训练协议相比,它在相同训练FLOPs下也实现了更高的性能。最后,我们展示了我们的模型可以作为比原始模型更好的下游目标检测和实例分割任务的主干网络。

英文摘要

Training high-capacity vision models from scratch requires substantial computational resources. To improve the training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by recursively coupling narrower, independently trained models in a parameter-free block-diagonal way. This allows flexible allocation of the available training budget across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol is considerably more efficient than the standard protocol of training models from scratch, yielding a 30% FLOPs reduction at similar test accuracy. It also achieves higher performance at the same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
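To make "parameter-free block-diagonal coupling" concrete, here is a small sketch under a simplifying assumption: each narrow model is reduced to a single linear layer, and the wide layer is formed by placing the two weight matrices on the diagonal with zeros elsewhere. The abstract does not give the actual construction, so `block_diag` and the toy weights are illustrative only.

```python
# Hypothetical sketch: couple two narrow linear layers into one wide layer.
# No new parameters are introduced; off-diagonal blocks are zero, so the wide
# layer initially behaves like the two narrow layers running side by side.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def block_diag(a, b):
    """Place matrices a and b on the diagonal of a larger zero matrix."""
    cols_a, cols_b = len(a[0]), len(b[0])
    top = [row + [0.0] * cols_b for row in a]
    bottom = [[0.0] * cols_a + row for row in b]
    return top + bottom

# Toy weights standing in for two independently trained narrow models.
narrow_a = [[1.0, 2.0], [0.0, 1.0]]
narrow_b = [[3.0, 0.0], [1.0, -1.0]]
wide = block_diag(narrow_a, narrow_b)

x_a, x_b = [1.0, 1.0], [2.0, 0.5]
wide_out = matvec(wide, x_a + x_b)                       # input is the concatenation
separate_out = matvec(narrow_a, x_a) + matvec(narrow_b, x_b)
```

In this toy setting the wide layer reproduces the concatenated narrow outputs exactly, which is why such a coupling can start further training from already-learned features; applying the same step to two coupled models again would give the recursive widening the abstract describes.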

2605.23655 2026-05-25 cs.CV cs.AI cs.LG cs.MM 版本更新

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

CVSearch:赋予多模态大语言模型认知视觉搜索能力以感知高分辨率图像

Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(深圳先进技术研究院)

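AI Summary High-resolution image perception is a key bottleneck for multimodal large language models. To resolve the tension between coverage and efficiency in visual search, this paper proposes CVSearch, a training-free adaptive framework that dynamically schedules search strategies through an Assess-then-Search workflow. It uses expert-assisted search when global information is insufficient and triggers a semantic-aware scanning mechanism on failure, which reduces object fragmentation, and it explores local detail more efficiently with a dynamic bottom-up search strategy. Experiments show that CVSearch achieves state-of-the-art accuracy and substantially better search efficiency on high-resolution benchmarks.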

Comments Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

详情
AI中文摘要

高分辨率图像感知是多模态大语言模型的一个关键瓶颈。虽然视觉搜索提供了有希望的解决方案,但现有方法在覆盖率和效率之间难以权衡。视觉专家辅助搜索效率高,但当提议失败时容易出现盲点,而基于扫描的搜索以计算冗余和语义碎片化为代价保证了覆盖率。为了解决这一困境,我们引入了CVSearch,一种无需训练的自适应框架,通过评估-搜索工作流动态调度搜索策略。具体来说,CVSearch首先在全局信息不足时调用专家辅助搜索,仅在失败时触发一种新颖的语义感知扫描机制。与刚性网格划分不同,这种高效扫描范式结合了语义引导的自适应补丁,将图像分解为语义一致的区域,有效缓解了物体碎片化。此外,我们设计了一种由视觉复杂性先验驱动的动态自底向上搜索策略,以实现对局部细节的高效且精确的迭代探索。在高分辨率基准上的大量实验表明,CVSearch在显著提高搜索效率的同时实现了最先进的准确性。代码已发布在https://github.com/liliupeng28/ICML26-CVSearch。

英文摘要

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.
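The Assess-then-Search workflow described above is essentially a staged fallback. The sketch below shows only that control flow; the three callables (`assess`, `expert_search`, `semantic_scan`) are hypothetical placeholders for components the abstract names but does not specify, and the string return values are stand-ins for real image regions.

```python
# Hypothetical sketch of the staged fallback in an assess-then-search loop.
# The callables are placeholders, not the CVSearch implementation.

def assess_then_search(question, image, assess, expert_search, semantic_scan):
    """Return (evidence, stage) using the cheapest stage that succeeds."""
    if assess(question, image):
        # Global view already carries enough information: no search needed.
        return image, "global"
    region = expert_search(question, image)
    if region is not None:
        # A visual expert proposed a usable region.
        return region, "expert"
    # Expert proposals failed: fall back to exhaustive semantic-aware scanning.
    return semantic_scan(question, image), "scan"

# Stub components for illustration only.
scan_result = assess_then_search(
    "what is written on the sign?", "full-image",
    assess=lambda q, img: False,
    expert_search=lambda q, img: None,
    semantic_scan=lambda q, img: "region-7",
)
expert_result = assess_then_search(
    "what is written on the sign?", "full-image",
    assess=lambda q, img: False,
    expert_search=lambda q, img: "crop-2",
    semantic_scan=lambda q, img: "region-7",
)
```

Ordering the stages from cheapest to most expensive is what lets such a design keep the coverage of scanning while paying for it only when the expert stage fails.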

2605.23653 2026-05-25 cs.CV 版本更新

ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction

ExpOS: 基于3D手部重建的可解释开放式手术技能评估

Roi Papo, Idan Smoller, Shlomi Laufer

发表机构 * Faculty of Data and Decision Sciences, Technion – Israel Institute of Technology, Haifa, 3200003, Israel(以色列理工学院数据与决策科学学院,海法 3200003,以色列)

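AI Summary This paper proposes ExpOS, an explainable framework for assessing open-surgery skills based on 3D hand reconstruction, aimed at automatic, feedback-oriented evaluation in surgical training. The method extracts hand poses and tool detections from surgical videos, learns discriminative temporal patterns, and uses a temporal convolutional backbone with attention-based pooling to produce frame-level importance maps, which it uses to predict skill level and provide interpretable feedback. Experiments show strong correlation with expert ratings across the surgical tasks, with the best results on fascial closure, indicating potential for scalable and practical use.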

Comments 10 pages, 4 figures

详情
AI中文摘要

及时且透明的反馈对于有效的手术培训至关重要,但目前的评估仍然依赖于专家观察,限制了可扩展性和自主实践的机会。我们提出了ExpOS,一个用于数据驱动的开放式手术技能评估的可解释框架,旨在实现自动化的、面向反馈的评估。ExpOS不依赖于专家定义的指标,而是直接从运动数据中学习判别性时间模式,并识别出最能预测技能水平的片段和行为。我们在221名医学生执行三项开放式手术任务的视频上训练和评估了该方法。从每一帧中提取手部姿态和工具检测,以推导运动学描述符和全局运动统计。使用时间卷积骨干网络和基于注意力的池化对时空手-工具动态进行建模,生成帧级重要性图。这些表示与全局运动统计融合,以预测技能水平并提供可解释的反馈。ExpOS通过注意力权重识别信息事件发生的时间,并通过全局特征分析确定哪些运动特征对预测影响最大,从而提供多层级可解释性。在各项任务中,该框架与专家评分实现了强相关性,在筋膜闭合任务上表现最佳(r = 0.778, R2 = 0.74)。这些结果表明,将弱监督时间重要性学习与可解释运动统计相结合,能够实现可扩展且可操作的手术技能评估。

英文摘要

Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R² = 0.74). These results demonstrate that combining weakly supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.
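As a rough illustration of how attention-based pooling can double as a frame-level importance map, the sketch below scores each frame feature against a single attention vector, normalizes the scores with a softmax, and uses the resulting weights both to pool the sequence and as per-frame importance. The scoring function, `attn_vector`, and the toy features are assumptions; the paper's actual architecture is not described at this level in the abstract.

```python
import math

# Hypothetical sketch: attention pooling whose weights serve as frame importance.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, attn_vector):
    """Return (pooled feature, per-frame weights) for a sequence of features."""
    scores = [sum(f * a for f, a in zip(feat, attn_vector)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights

# Three toy frames; the middle one has the largest response.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
pooled, importance = attention_pool(frames, attn_vector=[1.0, 1.0])
most_important_frame = importance.index(max(importance))
```

Because the weights sum to one over time, they can be read directly as "when" the informative events occur, which matches the temporal part of the explainability described above.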

2605.23634 2026-05-25 cs.CV cs.AI 版本更新

DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

DualMem: 绕过目标性瓶颈以实现开放世界目标检测中校准的未知流过滤

Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

发表机构 * School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院) School of Computer Science and Cyber Engineering, Guangzhou University(广州大学计算机科学与网络工程学院) Institute of Computing Science and Technology, Guangzhou University(广州大学计算科学与技术研究院)

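AI Summary Open-world object detection (OWOD) requires a detector to localize known classes while also identifying unknown objects to support future incremental learning. This paper finds that the unknown prediction streams of strong OWOD detectors contain an excessive share of background false positives, and traces the problem to an information bottleneck at the objectness head. The authors propose DualMem, a calibrated post-hoc filter that screens unknown proposals with a non-parametric likelihood ratio test in frozen SigLIP feature space, improving the precision of unknown-object identification while leaving known-class detection performance unchanged.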

详情
AI中文摘要

开放世界目标检测(OWOD)要求检测器定位已知类别,同时识别未知对象以进行未来的增量学习。我们发现,强OWOD检测器的未知预测流受到严重污染:在M-OWODB上,对于PROB、OW-DETR和HypOW,未来任务的正未知样本仅占未知预测的不到10%,而背景假阳性则占46-71%。我们表明,这不是信息缺失问题,而是目标性头部的信息瓶颈。在PROB任务1上,对256维解码器查询的线性探针在正负未知区分上达到了0.908的AUROC,但最终的一维目标性标量降至0.642。一个冻结的SigLIP特征,无需访问检测器,在过滤阶段独立恢复了大部分这种提议级别的可分离性(AUROC = 0.871)。基于这一发现,我们提出DualMem,一种校准的后验过滤器,它假设一个小的、图像不相交的、标注了未来任务对象的校准分割,并在冻结的SigLIP特征空间中执行非参数似然比检验。DualMem使用k近邻正记忆来保护未来任务对象,并使用负记忆来抑制类似背景的提议。其决策阈值通过Neyman-Pearson校准选择,为用户提供了假未知抑制与新奇召回之间的显式权衡。在M-OWODB任务1上的PROB、OW-DETR和HypOW中,DualMem将每幅图像的背景型假未知提议减少了44.9%-66.3%,平均减少56.6%。在PROB任务1上,它使自然K-means原型基线的减少量翻倍以上,同时保持已知类别的mAP不变,因为已知检测绕过过滤器。

英文摘要

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
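The filtering idea above can be sketched with two feature memories and a calibrated threshold. The code below is a simplified illustration, not the DualMem implementation: it assumes the likelihood ratio is approximated by the log ratio of k-th nearest-neighbour distances to a negative and a positive memory, and that calibration fixes the maximum fraction of held-out positives that may be rejected. Which error rate the paper actually constrains is not stated in the abstract, and all names and toy 2-D features are hypothetical.

```python
import math

# Hypothetical sketch: kNN-based likelihood-ratio filter with a calibrated threshold.

def kth_nearest_distance(x, memory, k):
    """Distance from x to its k-th nearest neighbour in memory."""
    distances = sorted(math.dist(x, m) for m in memory)
    return distances[min(k, len(distances)) - 1]

def likelihood_ratio_score(x, positive_memory, negative_memory, k=1):
    """Higher score: x lies closer to stored objects than to stored background."""
    eps = 1e-12  # avoids log(0) when x coincides with a memory entry
    d_pos = kth_nearest_distance(x, positive_memory, k)
    d_neg = kth_nearest_distance(x, negative_memory, k)
    return math.log(d_neg + eps) - math.log(d_pos + eps)

def calibrate_threshold(positive_scores, max_miss_rate):
    """Threshold that rejects at most max_miss_rate of the calibration positives."""
    ordered = sorted(positive_scores)
    allowed_misses = min(math.floor(max_miss_rate * len(ordered)), len(ordered) - 1)
    return ordered[allowed_misses]

# Toy 2-D features standing in for frozen embedding vectors.
positive_memory = [[1.0, 1.0], [1.2, 0.9]]
negative_memory = [[-1.0, -1.0], [-0.8, -1.1]]
calibration_positives = [[0.9, 1.1], [0.3, 0.3]]

scores = [
    likelihood_ratio_score(p, positive_memory, negative_memory)
    for p in calibration_positives
]
threshold = calibrate_threshold(scores, max_miss_rate=0.0)

def keep_as_unknown(x):
    """Keep a proposal only if it scores at least as high as the threshold."""
    return likelihood_ratio_score(x, positive_memory, negative_memory) >= threshold
```

Raising `max_miss_rate` moves the threshold up, suppressing more background-like proposals at the cost of novel recall, which is the explicit trade-off the abstract attributes to Neyman-Pearson calibration.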

2605.23629 2026-05-25 cs.CV 版本更新

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

DDX-TRACE: 视觉语言模型中医学诊断轨迹的基准

Jiazhen Pan, Weixiang Shen, Jun Li, Julian Canisius, Felix Bitzer, Paula Roßmüller, Jiancheng Yang, Virginie Kreutzinger, Daniel Rueckert, Benedikt Wiestler

发表机构 * Technical University of Munich(慕尼黑工业大学) TUM University Hospital(慕尼黑工业大学附属医院) Munich Center for Machine Learning(慕尼黑机器学习中心) LMU Munich(慕尼黑大学) Aalto University(阿尔托大学) Imperial College London(帝国理工学院)

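AI Summary DDX-TRACE is a benchmark for evaluating how vision-language models perform over the course of a medical diagnosis, focused on neuroradiology and comprising 211 challenging cases. It simulates a realistic diagnostic workup: starting from limited clinical information, the model must request imaging studies step by step, update its diagnostic hypotheses, and finally commit to a diagnosis. The study finds that scoring only the final answer can misrepresent a model's diagnostic quality, whereas DDX-TRACE, by examining the diagnostic trajectory, exposes key weaknesses in evidence acquisition, uncertainty updating, and reasoning.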

Comments 41 pages

详情
AI中文摘要

医学诊断并非来自完全指定的病例的单次预测。它是一个序贯工作流程:临床医生决定获取哪些证据,修订鉴别诊断,并在诊断得到充分支持时停止。大多数医学AI基准则提前揭示相关背景,仅对最终答案评分,使得无依据的正确猜测、过早闭合、低效工作流以及不良的不确定性更新变得不可见。我们引入了DDX-TRACE,一个由医生裁决的多模态神经放射学基准,在211个具有挑战性的病例中评估隐藏证据下的诊断轨迹。每个病例从有限的临床病史开始;模型以自由形式请求影像研究,在可用时接收匹配的图像包,每轮后更新概率性鉴别诊断,并以定位的最终诊断结束。评估最先进的VLM,我们发现最终诊断分数可能严重歪曲工作流质量:模型可能在没有必要证据的情况下猜测合理的诊断,请求有用的研究但误解原始图像,或者低效地获取证据同时更新不确定性不佳。受控证据变体隔离了规划、视觉证据提取和下游鉴别推理中的瓶颈。DDX-TRACE将医学AI评估从最终答案转向证据支持的诊断轨迹。

英文摘要

Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

2605.23610 2026-05-25 cs.CV cs.AI 版本更新

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

EM-Vid:无需训练的以实体为中心的记忆,用于高效且一致的多镜头视频生成

Jente Vandersanden, Matheus Gadelha, Chun-Hao P. Huang, Hyeonho Jeong, Yulia Gryaditskaya

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Adobe Research(Adobe研究院)

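AI Summary This paper proposes EM-Vid, a training-free entity-centric memory mechanism for efficient and consistent multi-shot video generation. By storing latent patches indexed by entity, it separates persistent entity information from transient scene context; combined with sparse token conditioning and a structured script format, this lowers computational cost and improves consistency. A budgeted memory update strategy and a noise-injection mechanism further provide fine-grained control over entity appearance and prevent leakage of irrelevant information.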

详情
AI中文摘要

多镜头视频生成需要在不同镜头间保持重复实体的一致外观,同时忠实于镜头特定的文本提示。最近的自回归方法重用先前生成的帧作为记忆。然而,全帧存储将持久实体信息与瞬态场景上下文纠缠在一起,导致无关信息泄漏和高计算成本。我们提出一种以实体为中心的记忆,形式为实体索引的潜在补丁库。我们引入与预训练模型兼容的稀疏令牌条件化,将自注意力限制在实体相关令牌上,降低计算成本。为此,我们引入一种结构化的多镜头脚本格式。我们还提出一种预算记忆更新策略,以维护紧凑且不断演化的记忆。最后,我们为实体表示配备噪声注入机制,实现细粒度外观控制,防止无关信息泄漏。我们的方法在保持主体一致性的同时,提高了提示遵循度和效率。

英文摘要

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

2605.23602 2026-05-25 cs.CV 版本更新

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

GlowGS: 夜间发光场景中用于3D高斯溅射的生成式语义特征学习

Beibei Lin, Xiao Cao, Jingyuan Guo, Robby T. Tan

发表机构 * National University of Singapore(新加坡国立大学)

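AI Summary Existing 3D Gaussian Splatting (3DGS) methods render high-quality novel views in clear daytime scenes but perform poorly in nighttime glow regions, mainly because structural features such as textures and edges are missing. This paper proposes GlowGS, which combines a diffusion model with a Vision Foundation Model (VFM). Through two key ideas, semantic feature generation and novel-view semantic learning, it produces high-quality implicit structural cues and optimizes rendered views without ground truth, significantly improving the semantic accuracy and visual quality of 3D reconstruction in nighttime glow scenes.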

Comments Accepted by CVPR Findings 2026

详情
AI中文摘要

现有的3DGS方法在晴朗场景中能有效渲染高质量的新视图。然而,它们在夜间场景中表现不佳,特别是在发光区域,因为缺乏纹理和边缘等结构特征,而这些特征是基于溅射重建的关键线索。为了解决这个问题,我们利用扩散模型和视觉基础模型(VFM)来补偿缺失的结构线索。我们的方法包含两个关键的新思想:语义特征生成和新视图语义学习。首先,语义特征生成为新视图生成高质量的语义特征作为隐式结构线索。具体来说,扩散模型从训练视图中合成具有未知相机姿态的新视图,而VFM评估其质量。一旦识别出高质量的新视图,VFM提取鲁棒特征以构建语义特征库。其次,新视图语义学习使3DGS能够优化渲染的新视图,而无需真实标签。它通过从渲染的新视图中提取语义特征,在特征库中搜索最相似的特征,并最小化它们的距离来实现。这个过程施加了隐式结构约束,确保语义一致、无伪影的渲染视图。大量实验证明了我们的GlowGS在生成语义准确的3D视图方面的有效性,显示出比现有方法显著的改进。

英文摘要

Existing 3DGS methods effectively render high-quality novel views in clear-day scenes. However, they struggle with night scenes, particularly in glow regions, due to the lack of structural features such as textures and edges, which are key cues for splatting-based reconstruction. To address this problem, we leverage a diffusion model and a Vision Foundation Model (VFM) to compensate for missing structural cues. Our method consists of two key novel ideas: semantic feature generation and novel-view semantic learning. First, semantic feature generation produces high-quality semantic features as implicit structural cues for novel views. Specifically, a diffusion model synthesizes novel views with unknown camera poses from training views, while a VFM evaluates their quality. Once high-quality novel views are identified, the VFM extracts robust features to construct the semantic feature bank. Second, novel-view semantic learning enables 3DGS to optimize rendered novel views without requiring ground truth. It achieves this by extracting semantic features from a rendered novel view, searching the feature bank for the most similar features, and minimizing their distance. This process enforces implicit structural constraints, ensuring semantically coherent, artifact-free rendered views. Extensive experiments demonstrate the effectiveness of our GlowGS in generating semantically accurate 3D views, showing significant improvements over existing methods.

2605.23580 2026-05-25 cs.CV 版本更新

Calibration-Informative Region Selection for Online LiDAR-Camera Calibration in Agricultural Environments

农业环境中在线LiDAR-相机标定的标定信息区域选择

Rajitha de Silva, Grzegorz Cielniak

发表机构 * Lincoln Institute for Agri-Food Technology, University of Lincoln, UK(林肯农业食品技术研究所,林肯大学,英国)

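AI Summary This paper studies the selection of calibration-informative regions for online LiDAR-camera calibration in agricultural environments. It proposes a support-map-driven approach to multi-modal calibration that decomposes the process into four blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. Using the target-less calibration method MDPCalib and the dense matching model CMRNext, the approach produces a dense calibration support map that identifies regions where calibration evidence is reliable. Experiments on the Bacchus Long-Term and KITTI datasets show that calibration evidence is spatially and semantically non-uniform; on KITTI, support-guided refinement improves calibration, mainly in translation accuracy.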

Comments Accepted to ICRA 2026 Workshop on Agricultural Robotics

详情
AI中文摘要

可靠的多模态标定需要识别哪些观测真正约束外参,哪些主要引入噪声或模糊性。本文提出一种基于支持图的多模态标定方法,解耦四个功能模块:初始标定、跨模态残差提取、支持图估计和支持感知精化。我们利用MDPCalib(一种基于运动和深度点对应的无目标LiDAR-相机标定方法)和CMRNext(一种预测光流状图像平面残差的密集LiDAR-相机匹配模型)实例化该公式用于在线LiDAR-相机标定。关键贡献是密集标定支持图,它聚合对齐观测上的跨模态一致性,并突出标定证据持续可靠的区域。在Bacchus Long-Term (BLT)数据集和KITTI上,我们表明标定证据在空间和语义上不均匀,表明某些语义区域为标定提供更强的线索。在KITTI上,支持引导的精化改善了标定性能,平移精度更好,而旋转增益仍然有限。

英文摘要

Reliable multi-modal calibration requires identifying which observations truly constrain the extrinsic parameters and which ones mainly add noise or ambiguity. In this paper, we propose a support-map-driven approach to multi-modal calibration that decouples four functional blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. We instantiate this formulation for online LiDAR-camera calibration using MDPCalib, a target-less LiDAR-camera calibration method based on motion and deep point correspondences, and CMRNext, a dense LiDAR-camera matching model that predicts optical-flow-like image-plane residuals. The key contribution is a dense calibration support map that aggregates cross-modal agreement over aligned observations and highlights where calibration evidence is consistently reliable. Across the Bacchus Long-Term (BLT) dataset and KITTI, we show that calibration evidence is spatially and semantically non-uniform, indicating that some semantic regions provide stronger cues for calibration than others. On KITTI, support-guided refinement improves calibration performance, with better translation accuracy, while rotational gains remain limited.
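One simple way to picture a support map that "aggregates cross-modal agreement over aligned observations" is as a per-cell agreement rate across frames. The sketch below assumes each frame provides a grid of residual magnitudes (in pixels) and counts how often each cell stays within a tolerance. The grid layout, the `tolerance` parameter, and the counting rule are assumptions for illustration; the paper's actual estimator may differ.

```python
# Hypothetical sketch: per-cell agreement rate across frames as a support map.

def support_map(residual_maps, tolerance):
    """Fraction of frames in which each cell's residual magnitude is within tolerance."""
    rows, cols = len(residual_maps[0]), len(residual_maps[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for frame in residual_maps:
        for i in range(rows):
            for j in range(cols):
                if frame[i][j] <= tolerance:
                    counts[i][j] += 1
    n_frames = len(residual_maps)
    return [[count / n_frames for count in row] for row in counts]

# Two toy frames of residual magnitudes on a 2x2 grid (assumed values).
frames = [
    [[0.5, 4.0], [1.0, 9.0]],
    [[0.8, 0.9], [1.2, 7.5]],
]
support = support_map(frames, tolerance=1.0)
```

Cells with support near 1 agree consistently across frames and would be weighted more heavily in refinement, while cells near 0 mostly contribute noise or ambiguity.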

2605.23559 2026-05-25 cs.CV cs.AI 版本更新

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

PathNavigate: 一种无需训练的病理学智能体,采用惊异引导扫描与共享切片记忆,用于全切片图像VQA

Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

发表机构 * School of Comp. Science & Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) Tencent Jarvis Lab(腾讯Jarvis实验室) University of Cambridge(剑桥大学)

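AI Summary PathNavigate is a training-free pathology agent for whole-slide image visual question answering (WSI-VQA), where decisive pathological evidence must be located efficiently under a limited inspection budget. It follows a scan-search-readout routine: a shared online memory module produces a pool of abnormal regions, and question-conditioned relevance then selects high-magnification targets within that pool, improving answer accuracy and interpretability. Experiments show that PathNavigate achieves higher efficiency and more interpretable evidence-selection trajectories while keeping its models frozen.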

详情
AI中文摘要

全切片图像视觉问答(WSI-VQA)将病理学视为极端上下文搜索问题:为了回答自由形式的临床查询,系统必须首先在严格的检查预算下导航千兆像素切片,以定位稀疏的高分辨率证据。现有方法主要分为两种范式:i)监督式病理学多模态大语言模型(MLLMs)和代理可以将定位和推理吸收到学习模块中,但它们通常将导航与任务特定的监督和重新训练耦合,限制了其实用性;ii)无需训练的病理学代理通过保持核心模型冻结来避免这种成本,但通常遵循问题优先的设计,主要从查询条件相关性构建初始候选集。这可能会遗漏问题中未提及的决定性形态,并迫使更重的推理时脚手架。为了解决这一挑战,我们引入了PathNavigate,一种无需训练的病理学代理,基于扫描-搜索-读出流程构建。在问题匹配之前,PathNavigate在低放大倍数下扫描当前切片,使用共享的在线记忆模块处理冻结的病理学特征,生成一个切片特定的惊喜场,标记异常区域池。然后,它仅在此池内应用问题条件的PLIP相关性,以选择高放大倍数的搜索目标。最后,它提取局部高放大倍数证据,并使用冻结的感知器-裁决器堆栈进行回答,利用相同的在线记忆作为切片级上下文。在WSI-VQA和SlideBench-BCNB上的实验表明,所提出的扫描-搜索-读出设计提高了答案准确性,并产生了更可解释的证据选择轨迹,且效率更高。代码已在线公开。

英文摘要

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23555 2026-05-25 cs.CV 版本更新

Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

生成器-精炼器-检验器:一种用于从单目视频学习3D人体虚拟形象的三模块数据增强框架

Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, Hao Wang

发表机构 * Nanyang Technological University(南洋理工大学)

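AI Summary This paper addresses the challenge of reconstructing photorealistic, animatable 3D human avatars from monocular video. To address the difficulty existing methods have in capturing detail when data is scarce, it proposes TrioMan, a tri-module data augmentation framework with three cooperating components: a Generator that creates diverse samples, a Refiner that improves the quality of generated data, and an Examiner that selects subject-consistent samples. Experiments show that it outperforms state-of-the-art methods on the X-Humans and NeuMan benchmarks.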

详情
AI中文摘要

本文解决了从单目视频重建逼真且可动画化的3D人体虚拟形象的挑战。现有方法依赖于将逐主体优化与通用人体先验相结合,但在训练帧数有限时往往难以捕捉细粒度细节。为了缓解数据稀缺问题,我们提出了TrioMan,一个用于增强3D虚拟形象学习的系统性三模块框架。我们的方法包含三个协同组件。生成器通过对姿态和相机施加高斯扰动来创建多样化的未见样本。精炼器通过由纹理和几何线索引导的一步扩散来提高生成数据的质量。检验器使用基于双分支注意力的相似性评估来选择与主体一致的样本。在X-Humans和NeuMan基准上的实验表明,TrioMan优于最先进的方法。

英文摘要

This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

2605.23523 2026-05-25 cs.CV 版本更新

ComPose: When to Trust Hands for Object Pose Tracking

ComPose:何时信任手部进行物体姿态跟踪

Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon

发表机构 * GIST(光州科学技术院) Yonsei Univ.(延世大学) DGIST(大邱庆北科学技术院)

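AI Summary This paper proposes ComPose, a 6DoF object pose tracking framework for robustly tracking hand-occluded objects from RGB video. Rather than treating the hand purely as an occluder, it uses hand motion as a complementary cue: within a unified tracking pipeline it combines object and hand cues, adaptively selects informative hand joints, fuses the cues, and refines the result with geometric evidence to estimate stable, accurate object trajectories. Experiments show that the method performs well under severe occlusion and geometric ambiguity and yields temporally consistent 3D trajectories without external smoothing, making it suitable for downstream tasks such as robot manipulation.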

Comments 22 pages, 10 figures

详情
AI中文摘要

从视频中重建物体运动是具身AI和机器人操作的关键组成部分。尽管已经研究了多种物体姿态跟踪方法,但它们严重依赖强大的外部先验(如深度数据或3D模板),并且即使使用显式掩码,仍然极易受到手部抓取造成的严重遮挡的影响。在这项工作中,我们提出了ComPose,一个6DoF物体跟踪框架,旨在从RGB视频中进行手部感知的物体姿态估计。我们的方法不是将手部纯粹视为遮挡物,而是将手部运动协调为物体跟踪的补充线索。具体来说,我们通过在一个统一的跟踪流程中结合来自基础模型的物体和手部线索,随时间恢复多种物体运动。在此,ComPose自适应地选择信息丰富的手部关节,结合物体和手部衍生的线索进行运动估计,并使用可见的几何证据和学习到的校正来细化所得的物体运动。我们进一步在旋转和平移上强制时间一致性,从而在没有外部平滑的情况下产生稳定的3D物体轨迹。大量实验表明,我们的方法在严重手部遮挡和几何模糊下准确、高效且鲁棒。此外,所得的轨迹还可以通过使机器人能够从在线视频中重建人类动作,有效地转移到下游机器人操作中。

英文摘要

Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a complementary cue for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

2605.23522 2026-05-25 cs.LG cs.AI cs.CV 版本更新

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Precise: 用于流匹配模型强化学习后训练的SDE一致随机采样

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

发表机构 * Peking University(北京大学) Tencent Hunyuan(腾讯混元)

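AI Summary This paper studies reinforcement learning (RL) post-training of flow-matching models to improve generation quality and prompt alignment. The central step is turning the deterministic sampling trajectory into a stochastic policy, using a sampler consistent with a stochastic differential equation (SDE) that balances exploration and stability. The proposed sampler, Precise, keeps the denoising trajectory SDE-consistent while reducing excess noise, and experiments show faster reward optimization and higher alignment scores than existing samplers.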

Abstract

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.
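
To make the ODE-to-SDE step concrete, here is a minimal Python sketch; it is not the Precise sampler itself. It contrasts a deterministic Euler update with a generic Euler-Maruyama update whose Gaussian transition has a closed-form log-probability, which is what lets an RL method treat each sampling step as a policy action. The drift `toy_drift`, the constant noise scale `sigma`, and all function names are illustrative assumptions, and the sketch omits the score-dependent drift correction that a marginal-preserving SDE would need.

```python
import math
import random


def euler_step(x, drift, t, dt):
    """Deterministic ODE update: x_next = x + drift(x, t) * dt."""
    return x + drift(x, t) * dt


def euler_maruyama_step(x, drift, sigma, t, dt, rng):
    """Stochastic update; returns (sample, transition mean, transition std)."""
    mean = x + drift(x, t) * dt
    std = sigma * math.sqrt(abs(dt))  # noise scales with sqrt of the step size
    return mean + std * rng.gauss(0.0, 1.0), mean, std


def gaussian_log_prob(x_next, mean, std):
    """Log-density of the Gaussian transition, usable as a per-step policy log-prob."""
    z = (x_next - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def toy_drift(x, t):
    # Hypothetical drift that pulls x toward 1.0 (stand-in for a learned velocity).
    return 1.0 - x


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.0
    for step in range(4):  # small step count, as in RL post-training
        x, mean, std = euler_maruyama_step(x, toy_drift, 0.3, step * 0.25, 0.25, rng)
        print(step, round(x, 4), round(gaussian_log_prob(x, mean, std), 4))
```

With `sigma = 0` the stochastic step reduces to the Euler step, so the amount of exploration is controlled entirely by the noise schedule.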

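2605.23518 2026-05-25 cs.CV version update

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai

Affiliations Nanjing University; vivo

AI summary This paper presents VINS-120K, a large-scale dataset of 120K instruction-based image-editing triplets in which every image exceeds 4K resolution, intended to advance ultra-high-resolution (UHR) image editing. It also proposes a high-frequency-aware post-adaptation strategy that lets existing models handle UHR images, and builds the VINS-4KEval benchmark for evaluating edits. Together these provide high-quality data and a method improvement for UHR image editing.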

Abstract

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge of modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution (≥4096 × 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

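2605.23508 2026-05-25 cs.GR cs.AI cs.CV cs.MM eess.IV version update

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai

Affiliations The University of Sydney; Charles Sturt University

AI summary DrawVideo is a sketch- and storyboard-driven framework for controllable long-video generation that produces structured, coherent long videos from user-provided black-and-white sketches, appearance descriptions, and motion prompts. It decomposes a video into independently controllable shots, each defined by a sketch, an appearance prompt, and a motion prompt, and uses a hierarchical strategy to generate reference keyframes and action-state keyframes before synthesizing the full video. The authors also build SketchLongVideo, the first dataset for sketch-guided long-video generation; experiments show strong structural control, appearance consistency, and visual stability.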

Comments 45 pages, 19 figures

Abstract

Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.

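2605.23507 2026-05-25 cs.CV version update

MDS-DETR: DETR with Masked Duplicate Suppressor

Chanho Lee, Seunghee Koh, Yunho Jeon, Junmo Kim

Affiliations Samsung Research; Korea Advanced Institute of Science and Technology (KAIST); Department of Artificial Intelligence Software, Hanbat National University

AI summary DETR is a powerful end-to-end object detector, but its one-to-one matching strategy converges slowly and yields low recall. MDS-DETR combines one-to-one and one-to-many supervision within a single decoder by introducing a Masked Duplicate Suppressor (MDS), which uses confidence-based causal masking to filter the duplicate predictions produced by one-to-many supervision, giving explainable, duplicate-free predictions without extra queries or auxiliary decoders. On COCO, MDS-DETR reaches higher detection accuracy than existing methods with only a small increase in training time.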

Comments code is available at https://github.com/DChoLee/MDS-DETR

Abstract

The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enabling explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, MR.DETR and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5% increase in training time, and outperforms the state-of-the-art MR.DETR by +0.3 mAP while training 20% faster. Our code and models are available at https://github.com/DChoLee/MDS-DETR.
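
The abstract does not give the masking rule, so the Python sketch below is only one plausible reading of confidence-based causal masking. Each query may attend to itself and to queries whose confidence is at least as high, so a low-confidence duplicate can see a stronger prediction while the stronger one is unaffected by it. The direction of the mask, the function names, and the toy scores are all assumptions, not the authors' implementation.

```python
import math


def confidence_causal_mask(conf):
    """mask[i][j] is True when query i may attend to query j (conf[j] >= conf[i])."""
    n = len(conf)
    return [[conf[j] >= conf[i] for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions; disallowed positions get weight 0."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, a in zip(row, allowed) if a]
        m = max(kept)  # every row keeps at least its own diagonal entry
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


if __name__ == "__main__":
    conf = [0.9, 0.2, 0.5]                 # hypothetical per-query confidences
    scores = [[0.0, 1.0, 2.0]] * 3         # hypothetical attention logits
    for row in masked_softmax(scores, confidence_causal_mask(conf)):
        print([round(w, 3) for w in row])
```

The mask is asymmetric by construction: the most confident query attends only to itself, while the least confident query attends to every query.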

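2605.23482 2026-05-25 cs.CV cs.AI version update

Multimodal Distribution Matching for Vision-Language Dataset Distillation

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

Affiliations Visual Intelligence Lab., KAIST

AI summary This work proposes Multimodal Distribution Matching (MDM), a multimodal dataset distillation method that produces compact synthetic datasets preserving vision-language semantics under limited compute and memory. MDM combines complementary components at the data, model, and loss levels to maintain cross-modal alignment and representation quality: it samples in the joint embedding space to initialize image-text pairs, builds a mixed teacher by weight-space interpolation of fine-tuned models, and matches joint distributions with a geometry-aware loss. Experiments on cross-architecture image-text retrieval show strong results, substantially lower distillation cost, and robustness across architectures.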

Comments Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm

Abstract

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
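
The loss-level description mentions symmetric contrastive learning on the unit hypersphere. As a rough illustration of that single ingredient, and not of MDM's full geometry-aware objective, the Python sketch below L2-normalizes paired image and text features and averages the image-to-text and text-to-image cross-entropy terms. The temperature of 0.07, the function names, and the toy features are assumptions.

```python
import math


def l2_normalize(v):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(images, [[2.0, 0.0], [0.0, 3.0]]))  # matched pairs
    print(symmetric_contrastive_loss(images, [[0.0, 3.0], [2.0, 0.0]]))  # mismatched pairs
```

Matched pairs give a loss near zero and mismatched pairs a large one, which is the signal that keeps synthetic image-text pairs aligned.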

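2605.23478 2026-05-25 cs.CV cs.AI version update

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

Affiliations School of Computer Science, The University of Sydney; School of Computer Science and Information Technology, Adelaide University; College of Mathematics and Computer Science, Wuhan Polytechnic University; School of Computing, La Trobe University; School of Science, Edith Cowan University

AI summary Accurate crop yield prediction is essential for sustainable agriculture and global food security. Existing methods mostly target a single crop, generalize poorly across crops, and do not account for crop-specific phenological responses to changing weather. PhenoYieldNet is a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling phenological responses; it comprises a Crop Phenology Bank and a Crop Phenology Attention module that capture temporal features across phenological stages, and it improves generalization with a pretrained encoder and a self-supervised adaptation strategy. Experiments on multi-crop datasets show it significantly outperforms existing methods.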

Comments Accepted by CVPR2026

Abstract

Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods are predominantly developed for single-crop prediction and often struggle to generalize across diverse crop types, as they do not address the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling crop responses to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

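2605.23472 2026-05-25 cs.CV version update

Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks

Mehdi Gharbage, Céline Teulière, Pierre Bouges, Thierry Chateau

Affiliations Michelin Tyres Manufacturer; Université Clermont Auvergne, CNRS, Institut Pascal

AI summary This paper examines how well modern vision foundation models transfer to industrial inspection, comparing ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 self-supervised distillation on RGB and X-ray inspection tasks. DINOv3 shows no clear advantage in frozen transfer, but under full fine-tuning on RGB tasks it provides a better initialization, with faster convergence and better final performance; on X-ray tasks, supervised ImageNet pretraining remains more effective. The results suggest that modern vision foundation models are promising for industrial RGB inspection, but that their transfer depends strongly on downstream adaptation and data modality.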

Comments Accepted to the CVPR 2026 Workshop on Vision Foundation Models for Industrial Inspection (VISION'26)

Abstract

Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.

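2605.23458 2026-05-25 cs.CV cs.AI version update

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh

Affiliations Tsinghua University; UCLA

AI summary This paper proposes One-Forcing, a method addressing stability and quality problems in one-step autoregressive video generation. It augments the DMD objective with an auxiliary generative adversarial network (GAN) loss to achieve high-quality, efficient one-step video generation. Experiments show that One-Forcing achieves state-of-the-art VBench performance among one-step causal video generation methods, and that stable framewise autoregressive generation needs only one-third of the training cost of the chunkwise model.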

Comments Work in Progress. Project Page: https://aurora-edu.github.io/one-forcing/, Code: https://github.com/Aurora-edu/One-Forcing

Abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.

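2605.23451 2026-05-25 cs.CV version update

Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention

Bingtian Qiao, Yue Shi, Yingjie Zhou, Yong Guo, Guangtao Zhai, Jiezhang Cao

Affiliations Shanghai Jiao Tong University; Fuzhou University; Shanghai AI Laboratory

AI summary Existing methods for real-world image super-resolution are computationally heavy, memory-hungry, and slow at inference. This paper proposes SANA-SR, an efficient one-step restoration framework that uses a deep compression autoencoder with a 32x compression ratio to sharply reduce redundant latent tokens, and combines linear attention with LoRA fine-tuning to restore high-resolution images at linear complexity. Experiments show strong quantitative results on multiple benchmark datasets with a small parameter count and fast inference, indicating good potential for practical deployment.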

Abstract

Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.
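
The claim of linear-complexity token mixing can be illustrated with a generic kernelized attention sketch in Python. It is not SANA-SR's actual layer: the ReLU feature map, the `eps` stabilizer, and all names are assumptions. The key point is that the key-value summary is accumulated once over all tokens, so the cost grows linearly with the token count instead of quadratically; a quadratic reference is included only to show that both forms give the same output.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear_attention(queries, keys, values, eps=1e-6):
    """O(N) attention: accumulate sum_j phi(k_j) v_j^T once, then reuse it per query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # running sum of phi(k_j)
    for k, v in zip(keys, values):
        fk = relu(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = relu(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k)) + eps
        out.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / denom for b in range(d_v)])
    return out


def quadratic_reference(queries, keys, values, eps=1e-6):
    """Same kernel attention computed with explicit N x N weights, for comparison."""
    out = []
    for q in queries:
        fq = relu(q)
        w = [sum(a * b for a, b in zip(fq, relu(k))) for k in keys]
        denom = sum(w) + eps
        out.append([sum(wj * v[b] for wj, v in zip(w, values)) / denom
                    for b in range(len(values[0]))])
    return out


if __name__ == "__main__":
    q = [[1.0, -0.5], [0.3, 0.8]]
    k = [[0.5, 1.0], [-1.0, 0.2]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(linear_attention(q, k, v))
    print(quadratic_reference(q, k, v))
```

Combined with a 32x-compressed latent, the token count itself is also far smaller, so both factors reduce cost at high resolution.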

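2605.23449 2026-05-25 cs.LG cs.CV math.AG version update

Commutator-Induced Uncertainty in VAEs

Tahereh Dehdarirad, Michael Felsberg, Gabriel Eilertsen, Ziliang Xiong

Affiliations Computer Vision and Learning Systems (CVL), Linköping University, Sweden; Department of Science and Technology, Linköping University, Sweden

AI summary Variational autoencoders (VAEs) often struggle to represent non-commutative structure in their latent spaces. This paper proposes a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty and separates discrete generative factors from continuous geometric transformations. By diagnosing algebraic non-commutativity and aligning the decoder's order sensitivity with it, the method improves reconstruction quality and the consistency of latent structure, with better reconstructions and latent traversals on several benchmark datasets.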

Abstract

Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
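
The two diagnostics in the abstract, algebraic non-commutativity and order-swap sensitivity, can be related on a toy example. The Python sketch below is an assumption-laden illustration rather than the paper's measurement: it compares the Frobenius norm of the commutator [A, B] with the gap between exp(A)exp(B) and exp(B)exp(A), which agree to leading order for small generators. The Taylor-series `expm`, the function names, and the so(3) test generators are all hypothetical choices.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def add(a, b, scale=1.0):
    """Entrywise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def expm(a, terms=20):
    """Matrix exponential by truncated Taylor series (adequate for small matrices)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = add(result, term)
    return result


def frob(a):
    return sum(x * x for row in a for x in row) ** 0.5


def commutator(a, b):
    """Algebraic non-commutativity: [a, b] = ab - ba."""
    return add(matmul(a, b), matmul(b, a), scale=-1.0)


def order_swap_gap(a, b):
    """Group-level order sensitivity: ||exp(a)exp(b) - exp(b)exp(a)||_F."""
    ea, eb = expm(a), expm(b)
    return frob(add(matmul(ea, eb), matmul(eb, ea), scale=-1.0))


if __name__ == "__main__":
    e = 1e-3  # small rotation angle
    lx = [[0.0, 0.0, 0.0], [0.0, 0.0, -e], [0.0, e, 0.0]]
    ly = [[0.0, 0.0, e], [0.0, 0.0, 0.0], [-e, 0.0, 0.0]]
    print(frob(commutator(lx, ly)), order_swap_gap(lx, ly))
```

For commuting generators both quantities vanish, so any mismatch between them in a trained model points at the decoder rather than the algebra.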

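2605.23445 2026-05-25 cs.CV version update

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

Affiliations Peking University

AI summary This paper proposes DFSAttn, a dynamic fine-grained sparse attention mechanism that makes diffusion transformers more efficient for video generation. To address the quality loss of existing block sparse attention at high sparsity ratios, the authors derive a theoretical lower bound on attention recall and design a training-free sparse attention framework with three core modules: Hilbert curve-based token reordering, hierarchical block scoring, and adaptive sparse mask caching. Experiments show up to 2.1x end-to-end speedup while maintaining high generation quality.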

Comments ICML 2026; 17 pages, 8 figures

Abstract

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
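
Of the three designs, Hilbert curve-based token reordering is the easiest to show in isolation. The Python sketch below uses the standard 2D Hilbert index on a square power-of-two grid, whereas the paper works with spatiotemporal 3D video tokens, so the 2D setting, the grid size, and the function names are simplifying assumptions. The useful property is that tokens adjacent in the reordered sequence are also adjacent in the grid, so contiguous blocks of the sequence cover compact spatial regions.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve is oriented correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All (row, col) cells of the grid, sorted by their Hilbert index."""
    cells = [(y, x) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[1], c[0]))


def reorder_tokens(tokens, n):
    """Permute a row-major token list into Hilbert order."""
    return [tokens[y * n + x] for y, x in hilbert_order(n)]


if __name__ == "__main__":
    n = 4
    tokens = list(range(n * n))       # row-major token ids
    print(reorder_tokens(tokens, n))  # same ids, visited along the curve
```

A block of consecutive reordered tokens then maps to a compact patch, which is what lets block-level sparsity approximate fine-grained sparsity.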

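2605.23428 2026-05-25 cs.CV cs.MM version update

FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

Kakia Panagidi, Stathes Hadjieftymiadis

Affiliations Department of Informatics & Telecommunications, National and Kapodistrian University of Athens, Athens, Greece

AI summary In resource-constrained IoT video analysis, block motion estimation (ME) remains a computational bottleneck for video compression and understanding. This paper proposes an Optimal Stopping Theory (OST) algorithm based on the assessment of spatiotemporal differences, and combines it with foundation models (FMs) in a semantic-aware motion estimation framework that fuses semantic attention scores from vision models with traditional distortion metrics to judge motion magnitude and semantic importance jointly, substantially reducing computation with minimal accuracy loss. Experiments on several benchmark datasets show efficient performance with good semantic coverage.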

Abstract

In modern multimedia systems, efficient video processing is critical, especially in resource-constrained environments such as IoT-based camera networks, autonomous platforms, and wireless sensor multimedia systems. A key bottleneck in video compression and understanding is block motion estimation (ME), a process that remains computationally expensive despite the development of fast search techniques. This work introduces an Optimal Stopping Theory (OST) algorithm for block motion estimation based on the assessment of spatiotemporal differences within and across video frames. It also proposes a semantic-aware motion estimation framework that integrates Foundation Models (FMs) with the OST-based decision process. By leveraging pretrained visual models such as Vision Transformers (ViT) and the Segment Anything Model (SAM), the framework extracts semantic attention scores that indicate the importance of motion within specific spatial regions. These scores are fused with traditional distortion-based metrics, such as the Sum of Absolute Differences (SAD), to guide a hybrid stopping criterion that jointly considers motion magnitude and semantic relevance. The resulting adaptive algorithm stops early in redundant regions while continuing the search in areas where motion is semantically significant. Experiments compare the proposed solution with widely used approaches from the literature on benchmark and multimodal video datasets. The proposed method achieves a significant reduction in computation with minimal accuracy loss and improved semantic coverage. The results highlight the benefits of bridging low-level motion analysis with high-level semantic reasoning, offering a promising direction for efficient multimodal video understanding in next-generation smart systems.
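
The hybrid stopping idea can be sketched as a block-matching search that ends early once the best Sum of Absolute Differences (SAD) falls below a threshold, with the threshold tightened where a semantic score marks the region as important. This Python sketch is a simplified stand-in: the fixed threshold rule replaces the paper's OST-derived criterion, and the scaling `base_threshold * (1 - semantic_score)`, the search order, and all names are assumptions.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]


def search_block(cur, ref, top, left, size, radius, semantic_score, base_threshold):
    """Return ((dy, dx), best SAD, candidates evaluated) for one block of `cur`."""
    # Important regions (score near 1) get a threshold near 0 and are searched longer.
    threshold = base_threshold * (1.0 - semantic_score)
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    candidates = sorted(
        ((dy, dx) for dy in range(-radius, radius + 1) for dx in range(-radius, radius + 1)),
        key=lambda d: abs(d[0]) + abs(d[1]),  # try small motions first
    )
    best, best_cost, evaluated = (0, 0), float("inf"), 0
    for dy, dx in candidates:
        y, x = top + dy, left + dx
        if y < 0 or x < 0 or y + size > height or x + size > width:
            continue  # candidate block falls outside the reference frame
        cost = sad(target, get_block(ref, y, x, size))
        evaluated += 1
        if cost < best_cost:
            best, best_cost = (dy, dx), cost
        if best_cost <= threshold:
            break  # good enough for this region: stop early
    return best, best_cost, evaluated


if __name__ == "__main__":
    ref = [[10 * y * y + x * x for x in range(8)] for y in range(8)]
    cur = [[ref[y + 1][x + 1] if y < 7 and x < 7 else 0 for x in range(8)] for y in range(8)]
    print(search_block(cur, ref, 2, 2, 2, 2, semantic_score=1.0, base_threshold=1e9))
    print(search_block(cur, ref, 2, 2, 2, 2, semantic_score=0.0, base_threshold=1e9))
```

In the toy run, the semantically important block is searched until the exact match is found, while the unimportant block stops after a single candidate.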

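2605.23411 2026-05-25 cs.LG cs.CR cs.CV version update

Sample-wise Targeted Adversarial Attacks on Test-time Adaptation

Phuc Duc Nguyen, Quang Duc Nguyen

Affiliations College of Computing and Data Science, Nanyang Technological University

AI summary This paper studies sample-wise targeted adversarial attacks on test-time adaptation (TTA), which aim to misclassify specific samples without causing distributional anomalies. Because existing attacks on batch-based TTA inflate the frequency of the target label, the authors propose a meta-learning-based attack with a priority-aware gradient alignment strategy that secures attack success while keeping the overall label distribution unchanged. Experiments on multiple datasets show high success rates, low detectability, and strong robustness against existing defenses.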

Comments 32 pages, 17 figures

Abstract

Test-time adaptation (TTA) effectively counters distribution shifts but exposes models to adversarial manipulation via the unlabeled test stream. Existing class-wise targeted attacks remain impractical for stealthy exploitation in this setting: since TTA operates on batches, forcing a subset of samples toward a target label unintentionally pulls similar benign samples along, resulting in a conspicuously high frequency of the target label that is easy to detect. To capture a more realistic threat, we introduce a sample-wise targeted attack. Unlike prior approaches, the attacker aims to misclassify only inputs carrying an attacker-chosen trigger, while preserving the global label distribution of benign queries to evade detection. To achieve this, we propose a meta-learning-based attack with a novel priority-aware gradient alignment strategy that explicitly prioritizes attack success. The strategy formulates the gradient update as an ellipsoidal trust-region problem, mitigating the misalignment between attack success and distributional stealth, while providing theoretical guarantees for effective optimization of the attack objective in the presence of gradient misalignment. Extensive experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across TTA protocols demonstrate that our method achieves high targeted success rates while maintaining a label distribution that is consistent with the no-attack baseline, making it difficult to detect in unlabeled TTA deployment scenarios. Furthermore, we demonstrate that our attack shows strong robustness against existing defenses.

2605.23410 2026-05-25 cs.LG cs.CV 版本更新

What Linear Probes Miss: Multi-View Probing for Weight-Space Learning

线性探测的盲区:面向权重空间学习的多视角探测

Eunwoo Heo, Kyeongkook Seo, Jaejun Yoo

发表机构 * Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology(蔚山科学技术院人工智能研究生院)

AI总结 随着开源模型库的快速增长,如何高效识别和分析模型参数成为重要问题。现有基于探针的方法虽轻量,但受限于单一视角设计,难以捕捉参数间的高阶交互信息。本文提出多视角探针框架 MVProbe,结合一阶结构与基于格拉姆矩阵的交互感知视角,理论分析表明其能更全面地表征模型参数,实验显示其在多种架构上均优于现有方法。

Comments Accepted at ICML 2026. Code: https://github.com/AI-hew-math/MVProbe ; Project page: https://ai-hew-math.github.io/MVProbe/

详情
AI中文摘要

开源模型库的爆炸式增长催生了“模型丛林”,其中检查点经常在缺乏充分文档或元数据的情况下共享。虽然权重空间学习提供了一种直接从参数识别和分析这些模型的途径,但处理全尺度权重在计算上成本高昂。基于探测的方法作为一种轻量级替代方案出现,通过可学习的探测向量提取置换等变表示。然而,现有探测方法受限于单视角设计:它们捕获一阶结构,但未能编码行-列交互中固有的丰富高阶相关模式。为弥补这一差距,我们引入MVProbe,一个多视角探测框架,它综合了一阶信号与交互感知(基于Gram)的视角。我们的方法有理论依据;我们分析了不同探测阶数的缩放定律,以推导出原则性的标准化和融合策略,确保所有分支的贡献平衡。在Model Jungle基准上,MVProbe在多种架构上持续优于最先进的ProbeX,包括判别式骨干网络(ResNet、SupViT、MAE、DINO)和大规模生成式LoRA适配器(Stable Diffusion LoRA)。

英文摘要

The explosive growth of open-source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row-column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large-scale generative LoRA adapters (Stable Diffusion LoRA).
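As a rough illustration of the two kinds of view the abstract contrasts, the sketch below probes a small weight matrix `W` with a probe vector `u` in two ways: a first-order view `W u`, and a Gram-based view `(WᵀW) u`. The function names and the exact form of both views are assumptions made for illustration; the abstract does not specify MVProbe's actual probes, standardization, or fusion. The property shown is standard linear algebra: permuting the rows of `W` permutes the first-order response but leaves the Gram view unchanged.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def first_order_view(w, probe):
    """Hypothetical first-order probe response: W u (one value per row of W)."""
    return [sum(wij * pj for wij, pj in zip(row, probe)) for row in w]


def gram_view(w, probe):
    """Hypothetical interaction-aware response: (W^T W) u."""
    gram = matmul(transpose(w), w)  # column-column interactions of W
    return [sum(gij * pj for gij, pj in zip(row, probe)) for row in gram]


W = [[1.0, 2.0], [3.0, 4.0]]
W_permuted = [W[1], W[0]]  # swap the two rows (e.g. two hidden units)
u = [1.0, 0.0]

print(first_order_view(W, u), first_order_view(W_permuted, u))
print(gram_view(W, u), gram_view(W_permuted, u))
```

The Gram view carries second-order information that a single `W u` product does not expose, which is the general motivation the abstract gives for combining views.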

2605.23409 2026-05-25 cs.CV cs.AI 版本更新

Online Hand Gesture Recognition Using 3D Convolutional Neural Networks

使用3D卷积神经网络的在线手势识别

Yinghao Qin, Tijana Timotijevic

发表机构 * School of Electronic Engineering and Computer Science(电子工程与计算机科学学院) Queen Mary, University of London(伦敦玛丽女王大学)

AI总结 本文提出了一种基于3D卷积神经网络的在线手部手势识别系统,旨在实现实时视频流中手势的定位与分类。为提高系统鲁棒性,采用滑动窗口方法对多窗口结果进行优化。该系统在Jester数据集上训练,检测和分类准确率分别达到98%以上和90%以上,在自制数据集上达到37.5%的Levenshtein准确率,且响应时间在三秒以内。

Comments Master's dissertation work written in Autumn 2020

详情
AI中文摘要

在人机交互中,动态手势的实时检测与分类具有挑战性,因为:1) 系统必须在实时视频流中运行,且执行手势后响应无明显延迟;2) 不同人执行手势的方式差异较大,使得识别更加困难。本文提出一种在线手势识别系统,能够定位实时视频流中的手势并识别其类别。为提高系统鲁棒性,采用滑动窗口方法对多个窗口的结果进行优化。项目中的所有模型均在Jester数据库上训练,检测器准确率达到98%以上,分类器准确率达到90%以上。在系统整体性能方面,最佳组可在三秒内响应,并在自制数据集上达到37.5%的Levenshtein准确率。本工作使用的项目代码已公开。

英文摘要

In human computer interaction, real-time detection and classification of dynamic hand gestures is challenging as: 1) the system must run in a real-time video stream and there is no noticeable lag in response after performing a gesture; 2) there is a large difference in how people perform gestures, making recognition more difficult. In this paper, an online hand gesture recognition system is proposed, which is able to localize gestures in real-time video stream and recognize what these gestures are. To improve the robustness of the system, the sliding window approach is used to refine results from multiple windows. All of the models in my project are trained on Jester database, achieving 98+% accuracy for detector and 90+% accuracy for classifier. For the overall performance of the system, the best group can respond within three seconds and reach 37.5% Levenshtein accuracy on the homemade dataset. The project codes used in this work are publicly available.
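The abstract reports a "Levenshtein accuracy" without defining it. A minimal sketch follows, assuming the common definition of one minus the edit distance between the predicted and ground-truth gesture sequences, normalized by the ground-truth length and clipped at zero. The function names and the gesture labels in the example are hypothetical, and the dissertation may normalize differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_accuracy(predicted, truth):
    """Assumed definition: 1 - distance / len(truth), clipped to [0, 1]."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, truth) / len(truth))


truth = ["swipe_left", "thumb_up", "stop", "zoom_in"]
predicted = ["swipe_left", "stop"]  # two gestures missed
print(levenshtein_accuracy(predicted, truth))
```

Because the metric compares whole sequences, a missed or spurious detection in a stream is penalized even when every recognized gesture is classified correctly, which may explain why the system-level score is far below the per-clip classifier accuracy.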

2605.23406 2026-05-25 cs.CV 版本更新

RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations

RS2AD-LiDAR:基于路侧传感器观测的端到端自动驾驶LiDAR数据生成

Runyi Huang, Ni Ding, Ruidan Xing, Yuheng Shi, Lei He, Keqiang Li

发表机构 * State Key Laboratory of Intelligent Green Vehicle and Mobility(智能绿色车辆与移动国家重点实验室) School of Vehicle and Mobility(车辆与移动学院) College of Artificial Intelligence(人工智能学院) Logic & Silicon AI Studio School of Instrumentation and Optoelectronic Engineering(仪器科学与光电工程学院)

AI总结 本文提出了一种名为RS2AD-LiDAR的全新框架,用于从路边传感器观测数据中重建和生成车载激光雷达数据,以解决当前自动驾驶系统在数据采集和标注成本高、场景稀缺等问题。该方法通过坐标转换、虚拟激光雷达建模和点云重采样技术,生成高保真的车载激光雷达数据,并构建了专门用于评估的R2V-LiDAR数据集。实验表明,生成数据在语义相似性和目标检测性能上均表现出良好的效果,有效提升了自动驾驶模型的感知能力。

详情
AI中文摘要

端到端自动驾驶解决方案直接处理多模态传感器数据并输出细粒度控制命令,随着自动驾驶技术的发展已逐渐成为主流方向。然而,当前此类方法依赖单车数据采集进行模型训练和优化,面临采集和标注成本高、有价值场景稀缺以及数据孤岛等问题。为解决这些挑战,我们提出RS2AD-LiDAR,一种从路侧传感器观测重建和生成车载LiDAR数据的新框架。由于目前没有公开数据集提供路侧与车载LiDAR传感器之间高度重叠的感知覆盖(这对于研究路侧到车辆数据生成至关重要),我们构建了专用数据集R2V-LiDAR,仅用于本文的评估。具体而言,我们的方法将路侧LiDAR点云变换到车载LiDAR坐标系,并通过虚拟LiDAR建模和点云重采样技术合成高保真车载数据。据我们所知,这是首个从路侧传感器输入重建车载LiDAR数据的方法。大量实验比较表明,生成数据与真实数据具有语义相似性。此外,目标检测实验显示,将生成数据融入真实数据用于模型训练,可同时提升鸟瞰图(BEV)和3D检测精度,从而验证了所提方法的有效性。

英文摘要

End-to-end autonomous driving solutions, which directly process multimodal sensory data and output fine-grained control commands, have gradually become a mainstream direction with the development of autonomous driving technology. However, current methods in this category rely on single-vehicle data collection for model training and optimization, which suffers from high acquisition and annotation costs, scarcity of valuable scenarios, and data silos. To address these challenges, we propose RS2AD-LiDAR, a novel framework for reconstructing and generating vehicle-mounted LiDAR data from roadside sensor observations. Since no public dataset currently provides highly overlapping perception coverage between roadside and vehicle-mounted LiDAR sensors, which is essential for studying roadside-to-vehicle data generation, we constructed a dedicated dataset named R2V-LiDAR which is used solely for evaluation in this work. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system, and synthesizes high-fidelity vehicle-mounted data via virtual LiDAR modeling and point cloud resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental comparisons demonstrate the semantic similarity between the generated data and real data. Furthermore, object detection experiments show that incorporating the generated data into real data for model training improves both Bird's Eye View (BEV) and 3D detection accuracy, thereby validating the effectiveness of the proposed method.
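The first step the abstract describes, moving roadside points into the vehicle-mounted LiDAR frame, is a rigid transform. The sketch below assumes the relative pose is already known as a rotation matrix and a translation vector (here a yaw-only rotation for brevity); the function names are hypothetical, and the virtual LiDAR modeling and resampling stages are not covered.

```python
import math


def yaw_rotation(yaw):
    """3x3 rotation about the z-axis by `yaw` radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform_points(points, rotation, translation):
    """Apply p_vehicle = R @ p_roadside + t to every (x, y, z) point."""
    out = []
    for p in points:
        out.append(tuple(
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        ))
    return out


roadside_points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]
R = yaw_rotation(math.pi / 2)  # assumed relative heading
t = (1.0, 0.0, 0.0)            # assumed relative offset in metres
vehicle_points = transform_points(roadside_points, R, t)
print(vehicle_points)
```

In practice the pose would come from calibration or localization, and the transformed cloud would still differ from a real vehicle scan in occlusion and beam pattern, which is presumably what the later resampling stage addresses.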

2605.23397 2026-05-25 cs.CV 版本更新

Joint Target-Less Intrinsic and Extrinsic Camera-LiDAR Calibration using Deep Point Correspondences

基于深度点对应的无靶标联合相机-激光雷达内参和外参标定

Simon Bultmann, Daniele Cattaneo, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系)

AI总结 本文研究了无需标定目标的相机-激光雷达联合标定问题,提出了一种基于深度点对应关系的全新方法,能够同时估计相机的内参(包括径向-切向畸变)和外参。该方法通过运动恢复结构(SfM)自动初始化内参,扩展了对未知畸变图像的匹配能力,并将点对应估计与内、外参的联合非线性优化紧密耦合,实现了更准确的标定效果。实验表明,该方法在KITTI数据集上表现出优越的外参精度和内参恢复能力。

Comments presented at 2nd German Robotics Conference (GRC)

详情
AI中文摘要

精确的相机-激光雷达标定是机器人多模态感知鲁棒性的前提。最近基于深度点对应的无靶标方法在外参标定中取得了显著性能,但假设图像已校正且内参已知。本文克服了这一限制,提出了首个完全无靶标的流程,通过深度像素-点对应联合估计相机内参(径向-切向畸变的针孔模型)和相机-激光雷达外参。我们的方法通过以下方式扩展了基于深度对应的标定:(i) 通过运动恢复结构(SfM)自动初始化内参,(ii) 将相机-激光雷达匹配推广到包含未知畸变的原始图像,(iii) 将对应估计与内参和外参的联合非线性优化紧密耦合。我们在KITTI数据集上使用未见过的相机-激光雷达对评估了该方法,并证明联合标定在恢复精确内参的同时提高了外参精度。

英文摘要

Accurate camera-LiDAR calibration is a prerequisite for robust multi-modal perception in robotics. Recent target-less approaches based on deep point correspondences achieve remarkable performance for extrinsic calibration but assume rectified images with known intrinsics. In this work, we overcome this limitation and present the first fully target-less pipeline that jointly estimates camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics with deep pixel-point correspondences. Our approach extends deep correspondence-based calibration by (i) automatic intrinsic initialization via structure-from-motion, (ii) generalizing camera-LiDAR matching to raw images with unknown intrinsics including distortion, and (iii) tightly coupling correspondence estimation with joint nonlinear optimization over both intrinsics and extrinsics. We evaluate our method on the KITTI dataset with unseen camera-LiDAR pairs and demonstrate that joint calibration achieves improved extrinsic accuracy while additionally recovering accurate intrinsics.
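For readers unfamiliar with the camera model named in the abstract, the sketch below projects a 3D point (already expressed in the camera frame) through a pinhole model with radial-tangential distortion. It follows the widely used Brown-Conrady convention with coefficients `k1, k2, p1, p2`; whether the paper uses exactly this parameterization is an assumption, and the numeric intrinsics are made up for illustration.

```python
def project_radtan(point, fx, fy, cx, cy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project a camera-frame point (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    x, y = X / Z, Y / Z                       # normalized image coordinates
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2     # radial distortion factor
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return fx * x_d + cx, fy * y_d + cy       # apply focal lengths and principal point


# Hypothetical intrinsics, with and without a small radial term.
print(project_radtan((1.0, 2.0, 4.0), 100.0, 100.0, 50.0, 60.0))
print(project_radtan((1.0, 2.0, 4.0), 100.0, 100.0, 50.0, 60.0, k1=0.1))
```

A joint calibration would treat `fx, fy, cx, cy`, the distortion coefficients, and the camera-LiDAR pose as unknowns and minimize the reprojection error of matched pixel-point pairs under this projection.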

2605.23381 2026-05-25 cs.CV 版本更新

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

VDE: 通过速度分解与估计实现无训练加速整流流模型

Junwen Tan, Jinglin Liang, Hongyuan Chen, Shuangping Huang

发表机构 * South China University of Technology(华南理工大学)

AI总结 尽管rectified flow模型在图像、视频和3D生成中表现出色,但其推理速度较慢限制了实际应用。本文提出了一种无需训练的加速方法VDE,通过速度分解与估计将传统缓存复用的范式转变为分解估计,提升了输入适应性与输出质量。VDE将模型速度分解为与输入平行和正交的分量,并利用其时间可预测性和方向稳定性进行精确估计,同时通过定期全前向传播防止误差累积,实验表明该方法在保持视觉质量的同时显著提升了生成效率。

Comments Accepted by CVPR 2026

详情
AI中文摘要

尽管整流流模型在图像、视频和3D生成中取得了显著性能,但其实际部署受到推理速度慢的挑战。先前的加速方法重用前一步的缓存特征,忽略了静态缓存与不断变化的输入之间日益增长的失配,导致输出保真度下降。本文提出速度分解与估计(VDE),一种无训练加速方法,将范式从缓存重用转变为分解估计。具体而言,VDE将模型的速度分解为与输入平行和正交的分量,利用它们的时间可预测性和方向稳定性进行精确的输入自适应估计。为防止误差累积,它通过完整前向传播定期锚定模型状态。在图像和视频生成任务上的大量实验表明,VDE在视觉质量损失极小的情况下实现了显著加速。值得注意的是,VDE将Flux加速3.22倍,并在Qwen-Image上实现了0.069的LPIPS,比最佳基线降低了52.2%。

英文摘要

Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22 times and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction.
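The decomposition the abstract refers to can be pictured as a plain orthogonal projection. The sketch below splits a velocity vector `v` into a component parallel to the input `x` and a remainder orthogonal to it. Treating tensors as flat vectors and using the standard projection formula is an assumption; how VDE then estimates each component across timesteps is not described in the abstract and is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose_velocity(v, x):
    """Return (v_parallel, v_orthogonal) with respect to the input x."""
    xx = dot(x, x)
    if xx == 0.0:
        raise ValueError("input x must be non-zero")
    scale = dot(v, x) / xx                    # scalar coefficient along x
    v_par = [scale * xi for xi in x]
    v_orth = [vi - pi for vi, pi in zip(v, v_par)]
    return v_par, v_orth


x = [1.0, 0.0]
v = [3.0, 4.0]
v_par, v_orth = decompose_velocity(v, x)
print(v_par, v_orth)
```

The parallel part is described by a single scalar per step, so if that scalar and the direction of the orthogonal part change smoothly over time, both could plausibly be extrapolated between full forward passes.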

2605.23355 2026-05-25 cs.CV cs.LG cs.MM 版本更新

Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization

解耦时空适配器用于细粒度羽毛球动作定位

Tianyu Wang, Junjie Wu, Jingquan Gao, Shishuo Li

发表机构 * School of Economics and Management, Beihang University(北京航空航天大学经济管理学院) Key Laboratory of Data Intelligence and Management, Beihang University, Ministry of Industry and Information Technology(工业和信息化部数据智能与管理重点实验室,北京航空航天大学)

AI总结 本文研究了专业羽毛球视频中的细粒度时序动作定位问题,针对其复杂的时空动态特性,提出了一种解耦时空适配器(DSTA),通过将运动表示分解为三个并行分支,分别捕捉时间动态以及垂直和水平方向的空间变化,从而更有效地建模细粒度动作的细微差异。同时,作者构建了一个包含31场比赛、29类细粒度击球动作的Fine-Badminton数据集,并在该数据集和ShuttleSet基准上验证了方法的有效性,取得了最先进的性能,且计算和参数开销增加有限。

Comments 11 pages, 11 figures

详情
AI中文摘要

时间动作定位(TAL)在通用视频理解中已被广泛研究,而由于复杂微妙的时空动态,专业羽毛球等细粒度体育场景仍未被充分探索。本文聚焦于专业羽毛球视频中的细粒度TAL,并引入一个新的基准数据集Fine-Badminton,包含31场比赛、29个细粒度击球类别,涵盖2104个回合和27597个标注动作。为了有效捕捉此类场景中的复杂运动模式,我们提出解耦时空适配器(DSTA),能够在参数高效框架内高效建模时空特征。具体而言,DSTA将运动表示分解为三个并行分支,分别捕捉时间动态以及垂直和水平空间变化。该设计使模型能够更好地区分细粒度动作之间的细微差异。在Fine-Badminton数据集和ShuttleSet基准上的大量实验表明,所提方法在仅增加微小计算和参数成本的情况下实现了最先进性能。这些结果验证了所提方法在细粒度时间动作定位中的有效性和效率。

英文摘要

Temporal Action Localization (TAL) has been extensively studied in generic video understanding, while fine-grained sports scenarios, such as professional badminton, remain underexplored due to their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To effectively capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which enables efficient modeling of spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches, capturing temporal dynamics as well as vertical and horizontal spatial variations. The design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.

2605.23344 2026-05-25 cs.CV cs.AI 版本更新

CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

CHASD:面向LVLMs中幻觉的语言增量校准对比解码

Xiaoyi Huang, Kejia Zhang, Zhiming Luo

发表机构 * Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能学院) Department of Artificial Intelligence, Xiamen University(厦门大学人工智能系)

AI总结 本文研究了大型视觉-语言模型(LVLMs)在语言先验主导下容易产生物体幻觉的问题,提出了一种无需训练的对比解码方法CHASD。该方法通过注意力引导的局部视觉扰动构建负样本分支,并在生成过程中仅对低置信度的词元进行对比校准,从而在保证推理效率的同时有效抑制幻觉。实验表明,CHASD在多个基准数据集上显著提升了相关指标,优于现有的无训练基线方法。

详情
AI中文摘要

大型视觉-语言模型展现了强大的多模态推理能力,但当语言先验主导不足或错位的视觉证据时,它们仍然容易产生对象幻觉。无训练对比解码方法通过比较原始和扰动视觉输入的预测来缓解此问题,但现有方法要么应用可能改变有用视觉证据的全局扰动,要么在每个解码步骤调用额外的负分支。在本文中,我们观察到幻觉风险是瞬态且特定于token的:视觉注意力在生成的token间转移,而一些功能token以高置信度产生,不需要对比校准。基于这一观察,我们提出面向大型视觉-语言模型的对比幻觉感知逐步解码(CHASD),一种“按需校准”的推理时框架。CHASD使用不确定性驱动的置信门控,仅当下一token的最大概率低于阈值时激活对比分支,并通过注意力引导的局部扰动构建负分支,扰动当前显著的视觉token。这种设计减少了不必要的负分支前向传播,同时保留了高置信度步骤的原始分布。在POPE、AMBER、MME、MMHal-Bench和CHAIR上的实验表明,CHASD在强无训练基线上改进了幻觉相关指标,并具有有竞争力的推理效率。

英文摘要

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
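The confidence gate described in the abstract can be sketched as a single decoding step: the negative branch is evaluated only when the maximum next-token probability falls below a threshold. The contrast rule used here, `(1 + alpha) * original - alpha * negative` on logits, is borrowed from earlier contrastive decoding work and is an assumption, as are the function names, the threshold, and `alpha`. The attention-guided perturbation that produces the negative logits is represented by a placeholder callable.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(logits_orig, negative_logits_fn, threshold=0.9, alpha=1.0):
    """Return (next-token distribution, whether the contrastive branch ran)."""
    probs = softmax(logits_orig)
    if max(probs) >= threshold:
        return probs, False                   # confident: keep original distribution
    logits_neg = negative_logits_fn()         # extra forward pass only when needed
    contrast = [(1.0 + alpha) * lo - alpha * ln
                for lo, ln in zip(logits_orig, logits_neg)]
    return softmax(contrast), True


calls = []


def fake_negative_branch():
    """Placeholder for a forward pass on the perturbed visual input."""
    calls.append(1)
    return [0.0, 0.9, 0.0]


print(gated_contrastive_step([10.0, 0.0, 0.0], fake_negative_branch))
print(gated_contrastive_step([1.0, 0.9, 0.0], fake_negative_branch))
```

Passing the negative branch as a callable makes the saving explicit: for confident steps the perturbed forward pass is never executed.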

2605.23324 2026-05-25 cs.CV quant-ph 版本更新

Enhancing Blood Cells Classification using Hybrid Quantum Neural Networks

使用混合量子神经网络增强血细胞分类

Guilherme Cruz, Nouhaila Innan, Alberto Marchisio, Gabriel Falcao, Muhammad Shafique

发表机构 * Center for Quantum and Topological Systems, NYUAD Research Institute, New York University Abu Dhabi, UAE(阿布扎比纽约大学NYUAD研究机构量子与拓扑系统中心) Science Division, New York University Abu Dhabi, UAE(阿布扎比纽约大学科学学院)

AI总结 本文研究了如何利用混合量子-经典神经网络(HQNN)提升显微血细胞分类的准确性。作者提出了一种模块化架构,结合预训练的ResNet-50主干网络、低维潜在瓶颈和变分量子电路,以比较量子增强与纯经典变换机制的效果。实验结果表明,HQNN在两个公开血细胞数据集上均表现出更优或更均衡的分类性能,尤其在高难度的8类分类任务中,F1分数提升了0.15个百分点,并在IBM量子硬件上验证了模型对噪声的鲁棒性。

Comments 11 pages, 13 figures

详情
AI中文摘要

显微镜血细胞的准确分类仍然是医学图像分析中的关键任务,其中微小的变化和有限的数据可能挑战传统的深度学习模型。因此,在这项工作中,我们研究了混合量子-经典神经网络(HQNN)在该领域中增强特征表示和改善分类性能的潜力。我们提出了一种模块化架构,结合了预训练的ResNet-50骨干网络、低维潜在瓶颈和变分量子电路,使得量子增强和纯经典变换机制之间能够进行直接比较。为了隔离量子组件的贡献,我们评估了三种架构:HQNN模型、具有可比容量的额外非线性变换层的经典匹配模型,以及没有中间变换阶段的基线模型。在两个公开的血细胞数据集(即血细胞图像数据集和PBC数据集)上进行的实验表明,HQNN在评估指标上始终实现更优或更平衡的性能。在血细胞图像数据集中,与经典基线相比,所提出的方法将宏F1分数提高了高达3.7%,而在更具挑战性的8类场景中,F1分数从98.54%提高到98.69%,性能接近饱和。在IBM量子硬件上的额外评估表明,该模型在噪声下仍然保持鲁棒性,与模拟结果相比仅出现适度的性能下降。这些结果表明,量子特征变换可以增强判别表示,特别是在具有挑战性的分类场景中,并突显了HQNN模型在医学成像任务中的实际潜力。

英文摘要

Accurate classification of microscopic blood cells is still a critical task in medical image analysis, where subtle variations and limited data can challenge conventional deep learning models. As such, we investigate in this work the potential of Hybrid Quantum-Classical Neural Networks (HQNNs) to enhance feature representation and improve classification performance in this domain. We propose a modular architecture combining a pre-trained ResNet-50 backbone with a low-dimensional latent bottleneck and a variational quantum circuit, enabling a direct comparison between quantum-enhanced and purely classical transformation mechanisms. To isolate the contribution of the quantum component, we evaluate three architectures: a HQNN model, a Classical Matched Model with an additional nonlinear transformation layer of comparable capacity, and a baseline model without an intermediate transformation stage. Experiments conducted on two publicly available blood cell datasets, namely the Blood Cell Images dataset and the PBC dataset, demonstrate that HQNNs consistently achieve superior or more balanced performance across evaluation metrics. In the Blood Cell Images Dataset, the proposed approach improves macro F1-score by up to 3.7% compared to classical baselines, while improving the F1-score from 98.54% to 98.69% in the more challenging 8-class scenario with near-saturated performance. Additional evaluation on IBM quantum hardware shows that the model remains robust under noise, with only a modest performance degradation relative to simulated results. These results indicate that quantum feature transformations can enhance discriminative representations, particularly in challenging classification scenarios, and highlight the practical potential of HQNN models for medical imaging tasks.

2605.23323 2026-05-25 eess.IV cs.CV 版本更新

Efficient Learned Image Compression without Entropy Coding

无需熵编码的高效学习图像压缩

Hao Cao, Wenqi Guo, Zhijin Qin, Jungong Han

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Department of Automation, Tsinghua University(清华大学自动化系) State Key Laboratory of Space Network(空间网络与通信国家重点实验室) Beijing National Research Center for Information Science(北京信息科学国家研究中心)

AI总结 本文提出了一种无需熵编码的高效学习图像压缩方法EF-LIC,旨在解决传统方法中熵编码导致的编码延迟瓶颈问题。该方法通过引入无约束向量量化和上下文条件自回归变换,有效去除统计冗余和相关性冗余,实现了与传统方法相当的压缩性能。实验表明,EF-LIC在保持高质量的同时,显著提升了编码和解码速度。

Comments Accepted by ICML 2026

详情
AI中文摘要

熵编码在典型的学习图像压缩(LIC)中被广泛使用,它将潜在变量转换为紧凑的比特流。然而,熵编码通常是顺序执行的,成为编码延迟的瓶颈。为了克服这一问题,我们提出了无需熵编码的学习图像压缩(EF-LIC),这是一个多速率框架,通过去除统计冗余和相关冗余,以低编码延迟生成紧凑表示。首先,我们引入无约束向量量化,并证明其索引分布接近最大熵界,从而产生最小的统计冗余。其次,我们提出了一种上下文条件自回归变换,直接重新参数化潜在变量以减少相互依赖性。理论分析表明,EF-LIC可以像带有熵编码的典型LIC一样有效地去除相关冗余,从而实现相当的压缩性能。实验表明,在Kodak数据集上使用LPIPS度量,EF-LIC相比MS-ILLM实现了高达67.86%的比特率降低。消融研究进一步表明,EF-LIC在匹配基于熵编码的变体的压缩性能的同时,实现了超过3倍的编码加速和超过5倍的解码加速。

英文摘要

Entropy coding is widely used in typical learned image compression (LIC) that converts latents into a compact bitstream. However, entropy coding is typically sequential and becomes the coding latency bottleneck. To overcome it, we present Entropy-Coding Free Learned Image Compression (EF-LIC), a multi-rate framework that generates compact representation by removing statistical and correlation redundancy with low coding latency. First, we introduce unconstrained vector quantization and prove that its index distribution approaches the maximum-entropy bound, yielding minimal statistical redundancy. Second, we propose a context-conditioned autoregressive transform that directly reparameterizes the latents to reduce inter-dependency. Theoretical analysis shows that EF-LIC can remove correlation redundancy as effectively as typical LIC with entropy coding, leading to comparable compression performance. Experiments show EF-LIC achieves up to 67.86% bitrate reduction over MS-ILLM on Kodak with LPIPS. Ablation studies further show EF-LIC matches the compression performance of its entropy-coding based variant while achieving over $3\times$ faster encoding and $5\times$ faster decoding.
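The claim about the index distribution can be made concrete with a small calculation. Sending each vector-quantization index as a fixed-length code costs log2(K) bits for a codebook of size K, while an ideal entropy coder would need about H bits, the empirical entropy of the indices. The gap log2(K) - H is what skipping entropy coding gives up, and it approaches zero as the distribution approaches uniform. The helper names below are hypothetical and the index lists are toy data, not results from the paper.

```python
import math
from collections import Counter


def index_entropy_bits(indices):
    """Empirical Shannon entropy (bits per index) of a list of codebook indices."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def redundancy_bits(indices, codebook_size):
    """Bits per index lost by fixed-length coding instead of ideal entropy coding."""
    return math.log2(codebook_size) - index_entropy_bits(indices)


uniform = [0, 1, 2, 3]   # every codeword used equally often
skewed = [0, 0, 0, 1]    # one codeword dominates
print(redundancy_bits(uniform, 4))
print(redundancy_bits(skewed, 2))
```

A near-uniform index distribution therefore removes the statistical redundancy that an entropy coder would otherwise exploit, which is the first of the two redundancies the abstract lists.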

2605.23304 2026-05-25 cs.CV 版本更新

General Hazard Detection

通用危险检测

Stephanie Ng, CP Lim, SueJen Looi, Hendrik Zurlinden, David Nguyen, Lei Wei, Saeid Nahavandi, Hailing Zhou

发表机构 * Swinburne University of Technology(斯威本科技大学) National Transport Research Organisation(国家交通运输研究组织) Google Cloud(谷歌云) Deakin University(迪肯大学)

AI总结 本文研究了如何检测抽象概念的“危害”,并提出了一种基于语言规则而非具体图像示例的通用危害检测方法。为了解决现有系统在数据稀疏性、定义动态变化和泛化能力方面的不足,作者构建了CompliVision数据集,并设计了一个结合视觉与语言模型的框架,通过权威规范定义多领域危害概念,实现对安全合规性的有效评估。该方法引入主动学习机制,提升模型在复杂场景下的鲁棒性和适应性。

Comments 20 pages, 7 figures and 4 tables

详情
AI中文摘要

危险作为一个抽象概念,通常通过认知层面的逻辑推理而非具体示例来定义。相比之下,现有的危险检测系统依赖于预定义的危险类别,并需要在检测或分类架构中密集收集标注示例。这种方法在处理抽象安全概念时面临三个基本挑战:(1) 噪声大且稀疏的训练数据,(2) 随上下文和时间动态演变的定义,以及(3) 对未见或新颖场景的泛化能力有限。为了解决这些局限性,我们提出了CompliVision数据集,这是第一个专为基于规则的合规评估设计的通用危险数据集,同时提供了一个用于危险评估的基线框架。我们的关键创新在于通过基于语言的规则表达安全要求,从而将危险概念与基于图像的示例解耦。我们将方法建立在权威领域法规和ISO标准之上,以定义跨多个领域的多样化危险概念。CompliVision数据集包含跨越交通、建筑和仓库环境的3,006张图像,每张图像都根据特定安全规则进行了合规性标注,并附有突出显示支持性视觉证据的自然语言解释。为了实现稳健的泛化,我们开发了一个主动学习框架,以更有效地指导和优化视觉语言模型在危险合规评估中的表现。尽管最先进的VLM表现出强大的能力,但在准确安全评估所需的细粒度、上下文相关解释方面仍存在困难。我们提出了一个通用危险检测框架来解决这一局限性,该框架结合了基于LLaVA的视觉推理与人在回路反馈。

英文摘要

Hazard, as an abstract concept, is typically defined through cognitive-level logical reasoning rather than concrete examples. In contrast, existing hazard detection systems rely on predefined hazard categories and require intensive collection of labelled examples within detection or classification architectures. This approach faces three fundamental challenges when addressing abstract safety concepts: (1) noisy and sparse training data, (2) dynamically evolving definitions that change across contexts and time, and (3) limited generalisation to unseen or novel scenarios. To address these limitations, we present the CompliVision dataset, the first general-purpose hazard dataset designed for rule-based compliance assessment, along with a baseline framework for hazard evaluation. Our key innovation is decoupling the hazard concept from image-based examples by expressing safety requirements through language-based rules. We ground our approach in authoritative domain regulations and ISO standards to define diverse hazard concepts across multiple domains. The CompliVision dataset comprises 3,006 images spanning traffic, construction, and warehouse environments, with each image annotated for compliance against specific safety rules, accompanied by natural language explanations highlighting the supporting visual evidence. To achieve robust generalisation, we develop an active learning framework to more effectively guide and refine vision-language models in assessing hazard compliance. While state-of-the-art VLMs demonstrate strong capabilities, they struggle with the fine-grained, context-dependent interpretation required for accurate safety assessment. We propose a general hazard detection framework to address this limitation, which combines LLaVA-based visual reasoning with human-in-the-loop feedback.

2605.23288 2026-05-25 cs.CV 版本更新

Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition

时空相似性体积聚合用于开放词汇动作识别

Yerim So, Jiyeong Kim, Jiwon Yoon, Dongbo Min

发表机构 * Ewha Womans University(梨花女子大学)

AI总结 本文提出了一种名为SimVA的框架,用于解决开放词汇动作识别中的细粒度时空信息丢失问题。该方法通过构建局部视频块与动作类别之间的密集四维时空相似性体积,保留了局部视觉-文本对齐信息,并结合类采样、空间聚合和运动感知调制等技术,提升了模型对时空动态变化的建模能力。实验表明,SimVA能够有效将CLIP模型迁移至视频动作识别任务,在零样本、少样本及基础到新类别的多个基准测试中均取得具有竞争力的性能。

详情
AI中文摘要

最近的开放词汇动作识别(OVAR)方法通常在计算文本对齐之前将视觉特征聚合为全局表示,这一过程掩盖了局部补丁信息和细粒度的时空线索。我们提出了相似性体积聚合(SimVA)框架,该框架从补丁级别的视觉-文本相似性构建密集的4D时空相似性体积。SimVA在局部视频令牌和动作类别上构建时空相似性体积,并采用类别采样确保相似性聚合可扩展到大型词汇表。通过空间聚合对相似性体积进行细化,将局部相似性模式上下文化以提高帧内一致性。运动感知调制进一步注入帧间变化线索,突出动态变化区域。基于Mamba的时序聚合则建模类别条件相似性模式在帧间的演化。通过保持密集的视觉-文本对应关系,SimVA有效地将CLIP迁移到视频动作识别,在零样本、少样本和基类到新类基准测试中均取得了竞争性性能。

英文摘要

Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA constructs a spatio-temporal similarity volume over local video tokens and action classes, and employs class sampling to ensure similarity aggregation scalable to large vocabularies. The similarity volume is refined by spatial aggregation, which contextualizes local similarity patterns to improve intra-frame consistency. Motion-aware modulation further injects inter-frame variation cues, highlighting dynamically changing regions. Mamba-based temporal aggregation then models the evolution of class-conditioned similarity patterns across frames. By maintaining dense visual-text correspondence, SimVA effectively transfers CLIP to video action recognition, achieving competitive performance across zero-shot, few-shot, and base-to-novel benchmarks.

2605.23287 2026-05-25 cs.CV 版本更新

LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

LangFlash:从稀疏无位姿图像进行前馈式3D语言高斯泼溅

Yilong Liu, Wanhua Li, Chen Zhu-Tian, Hanspeter Pfister

发表机构 * Harvard University(哈佛大学) Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学) University of Minnesota - Twin Cities(明尼苏达大学-双城分校)

AI总结 本文提出 LangFlash,一种基于前馈网络的 3D 语言高斯泼溅框架,能够从稀疏无位姿的多视角图像中直接重建带有语言对齐语义特征的 3D 场景。与基于优化的 3D 方法不同,LangFlash 在一次前向传播中同时预测几何结构和语义信息,实现了低延迟的 3D 重建与语言一致的场景理解。通过引入稀疏语义编码方案和增强的语义监督数据集,LangFlash 在新视角合成和语义一致性方面优于现有方法,为无位姿、语言驱动的 3D 场景重建提供了新范式。

Comments CVPRF 2026

详情
AI中文摘要

我们提出LangFlash,一种用于3D语言高斯泼溅的前馈框架,它从稀疏无位姿多视图图像重建由高斯原语参数化的3D场景,这些原语富含语言对齐的语义特征。与基于优化的3D方法不同,LangFlash在单次前向传播中直接预测几何和语义,实现低延迟3D重建和语言一致的场景理解。为了支持大规模训练,我们为RealEstate10k数据集丰富了连贯且密集的语义信息,用于3D语义监督。此外,我们提出了一种稀疏语义编码方案,该方案将全局语义字典与局部变化的每个原语权重相结合,在保留高级语言信息的同时降低表示复杂度。实验结果表明,与先前方法相比,LangFlash在新视图合成和语义一致性方面表现更优。本研究为无位姿、语言基础的3D场景重建建立了新范式,推动了可泛化3D视觉和多模态场景理解的发展。演示地址:https://liylo.github.io/langflash.github.io/。

英文摘要

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts the geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enriched the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information, while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. Demo is available at https://liylo.github.io/langflash.github.io/.

2605.23282 2026-05-25 eess.IV cs.CV cs.LG 版本更新

Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring

病理学离焦去模糊的间断伽辽金神经算子

Shaoqing Duan, Haofei Song, Xintian Mao, Qingli Li, Yan Wang

发表机构 * Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China(上海多维信息处理关键实验室,华东师范大学,上海,中国)

AI总结 病理学显微镜中的离焦去模糊因光学模糊的空间变化和局部不连续性而具有挑战性。现有深度学习方法受限于平移不变性假设和可解释性不足,难以处理这种异质性模糊模式。本文提出了一种基于间断伽辽金格式的神经算子(DGNO),通过单元局部体积算子和界面数值通量参数化积分核,有效建模了异质且局部不连续的模糊模式,在保持光学成像物理特性的前提下,实现了更优的去模糊效果,并在高分辨率场景下表现出良好的性能。

Comments 17 pages, 9 figures. Accepted by ICML 2026

详情
AI中文摘要

病理显微镜中的离焦去模糊仍然具有挑战性,因为由位置相关的积分成像过程引起的光学模糊具有空间变化和局部不连续的特性。现有的深度学习方法受限于平移不变性假设和有限的可解释性,不太适合这种异质模糊模式。神经算子通过直接将离焦形成建模为积分算子,提供了一种原则性的替代方案,为离焦去模糊提供了新的视角。然而,大多数现有的用于低级视觉的神经算子架构依赖于全局参数化核,这些核假设平滑性和平稳性,限制了它们建模异质和局部不连续模糊模式的能力。为了解决这一限制,我们提出了间断伽辽金神经算子(DGNO),它使用具有单元局部体积算子和界面数值通量的间断伽辽金公式来参数化积分核。DGNO 提供了局部性、异质性建模和全局一致性的原则性组合,同时保留了光学图像形成的底层物理。广泛且深入的实验表明,DGNO 超越了现有技术,提供了更清晰的图像重建、对空间变化模糊的鲁棒处理以及可扩展的高分辨率性能。代码将在 https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur 发布。

英文摘要

Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a position-dependent integral imaging process. Existing deep learning methods, constrained by shift-invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low-level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element-local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive and insightful experiments demonstrate that DGNO surpasses state-of-the-arts, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high-resolution performance. The code will be released at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.

2605.23281 2026-05-25 cs.CV 版本更新

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

DepthAgent: 通过样本级专家选择实现更好的通用深度估计

Jie Zhu, Girish Chandar Ganesan, Xiaoming Liu

发表机构 * Michigan State University(密歇根州立大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出了一种名为 DepthAgent 的视觉语言智能体,用于自适应单目深度估计。该方法通过分析场景和相机特性,选择或融合多个预训练深度模型的预测结果,从而提升在不同视角、鱼眼和全景图像等多样化相机设置下的深度估计性能。研究发现,不同模型在不同输入域上的表现存在显著差异,通过样本级专家选择与融合可以显著提升难样本的估计精度,实验表明 DepthAgent 在多个基准测试中均优于单一模型及固定融合方法。

详情
AI中文摘要

单目度量深度估计通过大规模训练和通用相机建模取得了显著进展,但在不同相机设置(如透视、鱼眼和全景图像)下的鲁棒部署仍然具有挑战性。现有方法通常依赖单一深度估计器,忽略了不同模型编码不同的相机假设并在不同输入域下表现最佳。本文中,我们展示了深度专家在样本级上具有强互补性:模型偏好与相机几何高度相关,多模型融合在单个专家不可靠的困难样本上带来最大收益。受这些观察启发,我们提出了DepthAgent,一种用于自适应单目深度估计的视觉语言智能体。DepthAgent将现有深度模型视为冻结工具,学习分析场景和相机线索,通过多轮工具调用调用合适的专家,并为每个输入选择或融合它们的预测。为了优化这种离散决策以实现密集几何质量,我们设计了一种多奖励强化学习微调方案,共同鼓励有效的工具执行、相机/场景分析、专家选择质量和推理效率。在透视、鱼眼和全景基准上的大量实验表明,DepthAgent一致优于单个专家、固定模型融合和不同选择策略,在困难样本上取得了显著改进,突显了专家选择和融合的关键作用。代码和模型将在发表后发布。

英文摘要

Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.

2605.23274 2026-05-25 cs.CV 版本更新

U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

U-CESE:面向AI挑战赛胡志明市2025的统一基于片段的事件搜索引擎

Duc-Nhuan Le, Hoang-Phuc Nguyen, Thanh-Duy Lam, Minh-Nhut Dang, Minh-Hoang Le

发表机构 * Faculty of Information Technology, University of Science, VNU-HCM(胡志明市国家大学自然科学大学信息技术学院) Vietnam National University, Ho Chi Minh City, Vietnam(越南胡志明市国家大学)

AI总结 本文提出U-CESE,一种统一的基于片段的事件搜索引擎,用于AI Challenge HCMC 2025中的多模态事件检索任务。U-CESE整合了原有CESE的三个模块,形成统一框架,支持跨多种视频源的一致事件检索。其核心方法包括统一剪辑算法、基于JPEG文件大小变化的无训练关键帧提取方法DAKE,以及受循环神经网络启发的时序一致字幕生成框架ReCap,有效提升了大规模多模态事件检索的效率与准确性。

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

详情
AI中文摘要

从大规模视频数据集中检索事件由于复杂的时空和多模态信息而具有挑战性。本文介绍了U-CESE,这是我们对AI挑战赛胡志明市2025的解决方案,一个统一的基于片段的事件搜索引擎,用于跨多种视频源的多模态事件检索。在CESE的基础上,U-CESE将其三个模块集成到一个统一的框架中,确保跨查询类型的一致处理和检索。核心组件是统一剪辑算法,它将单独的剪辑算法合并为一个高效的流水线。为了处理大规模数据,我们提出了DAKE,一种轻量级、无需训练的关键帧提取方法,利用JPEG文件大小变化来识别显著的场景变化。最后,我们引入了ReCap,一个受循环神经网络启发的时序一致字幕生成框架,生成详细且上下文感知的文本描述。实验表明,U-CESE在大规模多模态事件检索中提供了稳健、一致且高效的性能。

英文摘要

Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method that uses JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by recurrent neural networks, which generates detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
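The idea behind DAKE (using JPEG file size variation as a cheap scene-change signal) can be sketched as follows. This is only an illustration: the abstract does not state the actual decision rule, so the relative-change criterion, the `rel_threshold` parameter, and the function name are assumptions.

```python
from typing import List


def keyframes_by_jpeg_size(sizes: List[int], rel_threshold: float = 0.3) -> List[int]:
    """Pick keyframe indices from per-frame JPEG byte sizes.

    A frame becomes a keyframe when its encoded size differs from the
    last kept keyframe by more than rel_threshold (relative change).
    The rule and threshold are hypothetical, not taken from the paper.
    """
    if not sizes:
        return []
    keep = [0]        # always keep the first frame
    ref = sizes[0]    # size of the most recent keyframe
    for i in range(1, len(sizes)):
        change = abs(sizes[i] - ref) / max(ref, 1)
        if change > rel_threshold:
            keep.append(i)
            ref = sizes[i]
    return keep
```

Because the method only reads file sizes, it needs no decoding or model inference, which is consistent with the "lightweight, training-free" description.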

2605.23271 2026-05-25 cs.CV cs.AI 版本更新

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

EvalVerse:面向专业电影级视频生成的流水线感知与专家校准基准测试

Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯) Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Film Academy(北京电影学院) Stanford University(斯坦福大学) The Chinese University of Hong Kong(香港中文大学) Singapore Institute of Technology(新加坡理工大学)

AI总结 随着生成式视频基础模型的快速发展,影视级视频生成成为研究热点,但现有的评估方法多关注生成内容是否符合提示,而忽视了其艺术质量、表演和美学表现。为解决这一问题,本文提出 EvalVerse,一个流程感知且由专家校准的评估框架,通过构建专业影视制作流程的评估体系、收集大规模专家标注数据,并结合专家校准的微调策略提升视觉语言模型的推理能力,从而实现对视频生成质量的全面评估,为未来奖励模型和评估代理的研究提供了基础支撑。

详情
AI中文摘要

生成式视频基础模型的快速发展推动该领域向专业级电影合成迈进。为达到如此苛刻的质量,社区正转向强化学习和智能体工作流。然而,可靠的评估已成为关键瓶颈。现有基准主要评估“是否正确”(基本提示遵循),而从根本上忽略了“是否优良”(电影质量、表演和美学)。此外,当前的自动指标缺乏提供可信信号所需的领域特异性,在人类审美感知与机器评分之间造成了严重的可信度差距。为弥合这一差距,我们引入了EvalVerse,一个全面、流水线感知且专家校准的评估框架。我们将视频生成评估不仅视为一项工程任务,而是作为一个核心科学问题:主观电影专业知识的系统数字化。首先,我们将领域知识组织成与专业电影制作工作流(前期制作、制作和后期制作)一致的评估分类法。其次,我们将人类专家判断提炼为带有大规模人工标注的精选数据集。第三,我们通过专家校准的微调策略将这些知识注入视觉语言模型,使VLM能够执行显式的思维链推理。与先前工作相比,EvalVerse不仅保持与基础“正确性”指标的兼容性,还显著扩展了“优良性”标准,并将任务覆盖范围拓宽到复杂的多镜头序列和视听整合。因此,通过提供细粒度的诊断信号,EvalVerse超越了静态排行榜,为未来工作(如奖励模型和评估智能体)建立了基础基础设施。

英文摘要

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.

2605.23270 2026-05-25 cs.CV cs.AI cs.RO 版本更新

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

ChainFlow-VLA: 基于视觉语言模型的因果流规划

Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang

发表机构 * Afari Intelligent Drive(阿法瑞智能驾驶) Tianjin University(天津大学) University of Macau(澳门大学)

AI总结 当前端到端自动驾驶系统在时间因果推理与全局轨迹一致性之间存在根本性矛盾。为解决这一问题,本文提出 ChainFlow-VLA,通过统一因果生成与全局优化的联合概率框架,将因果推理与全局轨迹修正相结合。该方法利用视觉语言模型作为语义先验,在保留因果结构的基础上进行轨迹修正,实验表明其在复杂场景中表现出色,达到了与人类相当的高水平性能。

详情
AI中文摘要

当前的端到端自动驾驶系统从根本上受到时间因果推理与全局轨迹一致性之间不匹配的限制。自回归(AR)模型通过因果分解捕获交互感知的时间依赖性,但其逐步解码导致误差累积和次优的全局结构。相比之下,扩散模型全局优化轨迹但缺乏显式因果约束,使其在交互和关键安全场景中不可靠。这种二分法揭示了一个更深层次的问题:现有方法将因果建模和全局优化视为分离的范式,没有原则性的方式将它们统一在单个轨迹分布中。为了解决这个问题,我们提出了ChainFlow-VLA,它在统一的概率框架内统一了因果生成和全局细化。我们将规划公式化为AR诱导模式的混合,并学习这些模式上的视觉语言模型(VLM)条件残差分布。自回归生成器(Chain)生成一组离散的因果轨迹模式,随后基于扩散的细化器(Flow)利用VLM隐藏状态作为语义先验,在残差空间中执行模式条件校正,同时保持因果结构。这种直接的调节将高层场景理解无缝注入到细粒度的轨迹调整中。实验表明,ChainFlow-VLA在模糊和长尾场景中实现了鲁棒的规划,在NAVSIM v1排行榜上取得了94.85的最新分数,匹配人类水平(94.8)。代码将在https://github.com/AFARI-Research/ChainFlow-VLA提供。

英文摘要

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which combines causal generation and global refinement within a single probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This direct conditioning injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA plans robustly in ambiguous and long-tail scenarios, reaching a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, on par with human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.

2605.23264 2026-05-25 cs.CV cs.AI 版本更新

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

着色噪声:用于忠实图像超分辨率的对抗性Sobolev对齐

Hongbo Wang, Huaibo Huang, Pin Wang, Jinhua Hao, Chao Zhou, Ran He

发表机构 * MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI总结 图像超分辨率生成中,生成先验常导致还原不够忠实,本文认为这是由于各向同性目标与自然图像内在流形之间存在基本的谱不匹配。为解决这一问题,研究提出了一种基于Sobolev诱导黎曼几何的ASASR框架,通过显式地对噪声转移核进行谱色处理,使其更符合自然图像的谱衰减特性,并引入基于Riesz表示定理的参数化对抗网络,生成针对性的负样本以引导优化方向。实验表明,该方法在保持谱一致性和结构保真度方面优于现有生成方法,有效减少了伪影。

Comments Accepted to ICML 2026

详情
AI中文摘要

图像超分辨率(SR)中的生成先验常常损害忠实重建,我们将这一限制归因于各向同性目标与内在自然图像流形之间的基本光谱失配。虽然直接偏好优化提供了一条对齐路径,但其对光谱平坦高斯噪声的依赖无法区分真实高频细节与幻觉。为了弥合这一几何差距,我们提出了ASASR,一个理论基础的框架,通过显式着色噪声转移核以镜像自然光谱衰减,将生成流重铸为Sobolev诱导的黎曼几何。驱动这一几何对齐,我们集成一个基于Riesz表示定理的参数化对抗器,该对抗器合成目标负样本,等效于最坏情况下的Sobolev梯度,以沿着可能结构失效的切空间引导优化。大量评估表明,ASASR优于领先的生成基线,特别是在保持光谱一致性和结构保真度方面,提供了一种有效缓解伪影的鲁棒解决方案。

英文摘要

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. To drive this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.

2605.23257 2026-05-25 cs.RO cs.CV 版本更新

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

将适应转化为资产:面向在线视觉语言导航的跨域桥接

Zixuan Hu, Xuantuo Huang, Yancheng Li, Yichun Hu, Shengyong Xu, Ling-Yu Duan

发表机构 * School of Computer Science, Peking University, Beijing, China(北京大学计算机学院) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) School of Electronics, Peking University, Beijing, China(北京大学电子学院)

AI总结 本文研究了视觉语言导航(VLN)代理在非平稳环境下的适应问题,提出了一种新的测试时适应(TTA)框架IDEA,通过将在线适应转化为知识资产的积累与组合,有效解决了现有方法中的灾难性遗忘和负迁移问题。IDEA引入了基于Fisher指导的软提示优化机制,并结合领域坐标构建动态资产库,利用历史知识构建跨领域桥梁,实现无需训练的适应。实验表明,该方法在多个基准测试中表现优异,展示了其在实际应用中的有效性。

Comments Accepted by ICML 2026

详情
AI中文摘要

在非平稳环境变化下导航对部署在野外的视觉语言导航(VLN)智能体构成了关键挑战。然而,现有的 VLN 测试时适应(TTA)方法大多将在线适应视为瞬时的、孤立的更新,导致灾难性遗忘和负迁移。为了克服这些问题,我们提出了 IDEA(Inter-Domain BridgE with Historical Assets),一种新颖的 TTA 框架,将适应转化为资产的积累和组合。具体来说,IDEA 引入了通过 Fisher 引导的加权方案优化的软提示,以捕获可迁移的知识。然后,这些优化后的提示与域坐标相结合,形成动态资产库。利用该库,IDEA 通过将目标域投影到历史知识的凸包上来构建跨域桥接。这些设计形成了一个互补循环:不断演化的库支撑桥接构建,而桥接提供优越的初始化以加速资产优化。在 REVERIE、R2R 和 R2R-CE 基准上的大量实验表明,IDEA 相对于现有方法具有一致的优越性,展示了其通过资产共享实现无需训练的适应的能力。

英文摘要

Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.
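The step "projecting the target domain onto the convex hull of historical knowledge" can be illustrated with a small sketch. It assumes each historical asset is summarized by a domain-coordinate vector and solves for convex weights with Frank-Wolfe and exact line search; the solver choice, the function name `project_to_hull`, and the use of plain Euclidean distance are assumptions made for illustration, not details given in the abstract.

```python
from typing import List, Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_to_hull(
    assets: List[Sequence[float]],
    target: Sequence[float],
    iters: int = 200,
    tol: float = 1e-12,
) -> Tuple[List[float], List[float]]:
    """Approximate the closest point to `target` in the convex hull of `assets`.

    Returns (weights, point): weights are non-negative and sum to 1,
    and point = sum_i weights[i] * assets[i].
    """
    k = len(assets)
    w = [0.0] * k
    w[0] = 1.0                      # start at the first asset
    x = list(assets[0])
    for _ in range(iters):
        r = [xi - ti for xi, ti in zip(x, target)]        # residual x - target
        scores = [_dot(r, a) for a in assets]              # linearized objective
        j = min(range(k), key=scores.__getitem__)          # best vertex
        d = [aj - xi for aj, xi in zip(assets[j], x)]      # move direction
        dd = _dot(d, d)
        if dd <= tol:
            break
        gamma = max(0.0, min(1.0, -_dot(r, d) / dd))       # exact line search
        if gamma <= tol:
            break
        w = [(1.0 - gamma) * wi for wi in w]
        w[j] += gamma
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return w, x
```

The returned weights could then be used to blend stored prompts into an initialization for the new domain, which is one plausible reading of the "bridge provides superior initialization" statement.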

2605.23254 2026-05-25 cs.CV 版本更新

CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels

CARE: 面向长尾噪声标签可靠学习的类别自适应专家共识

Mengke Li, Haiquan Ling, Lihao Chen, Yang Lu, Yiqun Zhang, Hui Huang

发表机构 * College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China.(深圳大学计算机与软件学院,中国深圳) Guangming Laboratory, Shenzhen, China.(光明实验室,中国深圳) School of Informatics, Xiamen University, Xiamen, China.(厦门大学信息学院,中国厦门) School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China(广东工业大学计算机科学与技术学院,中国广州)

AI总结 在现实数据学习中,长尾类别分布和噪声标签的复合挑战常常导致模型性能下降。为了解决这一问题,本文提出了一种参数高效的框架CARE,通过结合视觉-语言模型的三种互补监督源,引入类自适应专家共识机制,根据不同类别的频率调整标签校正的严格程度,从而更有效地过滤噪声并重新校准类别分布。实验表明,CARE在多个合成和真实数据集上均优于现有方法,性能提升最高达3.0%。

Comments poster in ICML 2026

详情
AI中文摘要

从现实世界数据中学习常常受到长尾类别分布和噪声标注的双重挑战。现有方法部分解决了这些问题,但通常忽略了标签噪声在不同类别上的非均匀影响,导致对尾部类的修正无效,对头部类的过度正则化。为了解决这个问题,我们提出了类别自适应专家修正(CARE),一个参数高效的框架,利用来自视觉语言模型(VLM)的三种互补监督源:观察到的噪声标签、VLM文本嵌入和视觉特征。CARE引入了一种类别自适应专家共识机制,根据类别频率对尾部类施加更严格的一致性,对头部类施加更宽松的一致性。通过聚合这些来源的高置信度预测,CARE过滤不可靠信号并重新校准类别分布,从而在长尾分布下实现更可靠的修正。在合成和真实世界基准上的大量实验表明,CARE始终优于最先进的方法,实现了高达3.0%的性能提升。源代码可在https://github.com/qwq123-study/CARE获取。

英文摘要

Learning from real-world data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLMs): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0% performance gains. The source code is available at https://github.com/qwq123-study/CARE.
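A class-adaptive consensus rule of the kind described above can be sketched in a few lines. The sketch assumes three hard votes per sample (the observed label, a text-embedding prediction, and a visual-feature prediction) and a hypothetical frequency cutoff separating head from tail classes; the voting thresholds, the `tail_cutoff` parameter, and the omission of confidence scores are simplifications, not the rule implemented in the CARE repository.

```python
from collections import Counter
from typing import Dict, Optional


def required_votes(class_count: int, tail_cutoff: int) -> int:
    """Tail classes need unanimous agreement; head classes need a majority."""
    return 3 if class_count < tail_cutoff else 2


def rectify_label(
    noisy_label: int,
    text_pred: int,
    visual_pred: int,
    class_counts: Dict[int, int],
    tail_cutoff: int = 100,
) -> Optional[int]:
    """Return the consensus label, or None when agreement is too weak."""
    votes = Counter([noisy_label, text_pred, visual_pred])
    label, n_votes = votes.most_common(1)[0]
    if n_votes >= required_votes(class_counts[label], tail_cutoff):
        return label
    return None  # treat the sample as unreliable
```

Samples that return `None` could be dropped or down-weighted; which of the two a real pipeline does is a design choice this sketch leaves open.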

2605.23245 2026-05-25 cs.CV cs.AI 版本更新

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

SimInsert: 通过区域稀疏注意力融合实现无缝视频对象插入

Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术全国重点实验室) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院) JIUTIAN Research(JIUTIAN研究机构) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Zhejiang Sci-Tech University(浙江理工大学) The University of British Columbia(不列颠哥伦比亚大学)

AI总结 SimInsert 是一种无需训练的视频对象插入方法,旨在解决现有方法依赖显式运动工程或耗时重训练的问题,提升灵活性和泛化能力。该方法通过区域稀疏注意力融合,将任务分解为单帧编辑和语义运动描述,利用图像到视频扩散模型的生成先验,实现编辑内容在时间上的自然传播,并保持背景不变性与交互真实感。实验表明,SimInsert 在多项指标上显著优于现有方法,为高保真视频编辑提供了高效解决方案。

Comments Accepted by ICME2026

详情
AI中文摘要

视频对象插入需要确保时空连贯性和交互真实感,远不止简单的内容放置。然而,当前方法通常受限于对显式运动工程或资源密集型重新训练的依赖,限制了其灵活性和泛化能力。为弥补这一差距,我们提出了 SimInsert,一种无需训练的新范式,将任务高效地分解为直观的单帧编辑和语义运动描述。通过利用图像到视频扩散模型的强大生成先验,SimInsert 在时间上传播编辑,严格保持背景不变性,同时实现插入对象与动态环境之间合理的、文本驱动的交互。我们的方法依赖于非侵入式引导机制,这些机制强制执行结构一致性,促进无缝边界融合,并抵消在去噪轨迹中通常累积的保真度漂移。大量定量实验验证了该方法的有效性:SimInsert 在 PSNR 上超越最先进方法 18.8%,在 SSIM 上超越 20.1%,在 LPIPS 上降低 44.1%,为高保真视频编辑提供了简洁高效的解决方案。

英文摘要

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate its efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

2605.23237 2026-05-25 cs.CV 版本更新

StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

StereoGenBench:一种用于受控基线条件下立体生成的合成多相机基准

Yangzhi Cui, Feng Qiao, Nathan Jacobs

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学)

AI总结 StereoGenBench 是一个基于 Unreal Engine 的合成多相机基准数据集,旨在为立体生成、几何估计和可控视角合成提供精确可控的多基线配对数据。该数据集通过固定场景下六相机阵列的渲染,生成包含多基线、内参、深度、相机位姿等信息的高质量配对视图,支持对不同基线范围下的生成模型进行评估。该工作填补了现有数据集在多基线配对和可控参数方面的不足,为立体生成研究提供了标准化的测试平台。

详情
AI中文摘要

立体图像和视频生成、立体几何估计以及条件控制视图合成需要配对数据,其中决定双目几何的变量——相机基线、内参、场景深度和相机运动——是已知且可控的。现有的立体资源提供了这些变量的子集,但据我们所知,常用于立体生成评估的资源并未在单一受控源中提供场景配对的、校准的多基线右视图真值,以及联合记录的内参、密集度量深度和每帧姿态。我们引入了StereoGenBench,一个合成的Unreal Engine基准,旨在使基线灵敏度与目标相机一致性在匹配的场景内容下可测量。每个场景使用刚性六相机横向阵列渲染,产生多达15个校准视图对;相邻基线从瞳孔间到宽基线范围采样;焦距独立采样;每个视图发布RGB、度量深度、内参、每对基线和每帧姿态。数据集划分包括窄基线和宽基线两个评估族,以及一个仅训练族用于更广泛的全对覆盖。我们发布了数据集、评估代码、参考结果、Croissant元数据以及用于扩展的生成代码/配置(兼容资产)。数据集可在https://huggingface.co/datasets/stereo-dataset/stereo-dataset获取。

英文摘要

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset
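The "six cameras, up to 15 view pairs" figure follows from choosing 2 of 6 cameras. The sketch below enumerates those pairs and derives each pair's baseline from the adjacent spacings, assuming the cameras lie on a single straight line (a reading of "lateral array"); the function name and the 0.065 m example spacing are illustrative assumptions, not values from the dataset.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_baselines(adjacent: List[float]) -> Dict[Tuple[int, int], float]:
    """Map every camera pair (i, j), i < j, to its baseline length.

    adjacent[k] is the spacing between camera k and camera k + 1,
    so N spacings describe N + 1 collinear cameras.
    """
    positions = [0.0]
    for spacing in adjacent:
        positions.append(positions[-1] + spacing)
    return {
        (i, j): positions[j] - positions[i]
        for i, j in combinations(range(len(positions)), 2)
    }
```

Five adjacent spacings give six cameras and therefore 15 pairs, with the outermost pair spanning the sum of all spacings.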

2605.23216 2026-05-25 cs.CV 版本更新

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

CaST-Bench:面向视频问答的因果链时空推理基准

Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

发表机构 * Woven by Toyota(丰田旗下 Woven by Toyota 公司) The University of Tokyo(东京大学)

AI总结 CaST-Bench 是一个用于评估视频问答中因果链引导的时空推理能力的新基准,旨在解决现有模型在因果推理方面缺乏细致、可验证证据的问题。该基准通过人类与AI协作构建了包含2066个问题的高质量数据集,每个问题都附带有时间片段和边界框标注的因果链证据。研究还设计了新的评估指标,全面衡量模型在答案正确性和视觉证据推理方面的能力,揭示了当前视觉语言模型在构建精确因果链方面的不足,为未来模型改进指明了方向。

Comments CVPR 2026

详情
AI中文摘要

视频中的因果推理对视觉语言模型(VLM)是一个重大挑战,因为它需要超越表面感知,深入理解因果机制。然而,现有基准很少提供严格评估这一能力所需的细粒度、有依据的证据。为填补这一空白,我们引入了CaST-Bench,一个用于因果链时空视频推理的基准。CaST-Bench提出复杂的因果问题,要求模型识别并定位多个时空证据组成的链条。通过人机协作流程,我们构建了一个高质量数据集,包含1015个视频上的2066个问题,因果链由时间片段和边界框轨迹标注。此外,我们设计了一套全面的评估方案,包含新颖的指标,不仅评估答案正确性,还评估基于视觉证据的推理能力。这种证据基础对于通过减轻虚假相关性来提高准确性,以及通过使模型更透明来增强用户信任至关重要。我们的实验表明,当前的VLM在因果问题上表现不佳,主要原因是它们构建精确且有依据的因果链的能力有限。这为改进未来VLM指明了一个重要方向。

英文摘要

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO 版本更新

Lipschitz Optimization for Formal Verification of Homographies

单应性矩阵形式化验证的Lipschitz优化

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

发表机构 * Joby Aviation(Joby航空) Safe Intelligence

AI总结 本文研究了针对视觉神经网络在安全关键领域应用的正式鲁棒性验证问题,特别关注相机运动引起的3D扰动对图像生成过程的影响。作者提出了一种基于李普希茨优化和分段连续性分析的验证方法,建立了相机姿态到像素值的闭式映射,并推导出对扰动像素值的紧致线性界。该方法适用于具有平面结构的场景,如增强现实、自动驾驶和机器人操作等,并在多个基准测试中验证了其有效性,相比现有方法在速度和边界紧致性方面均有提升。

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

详情
AI中文摘要

在受监管行业中采用视觉神经网络需要形式化的鲁棒性保证,尤其是在医疗、自动驾驶和航空航天等安全关键领域。然而,当前方法局限于不完整的统计验证或对$\ell_p$范数和仿射变换的鲁棒性,仅覆盖了图像形成过程中一小部分扰动。特别是,对相机运动的鲁棒性仍然是一个开放问题,尽管它是部署许多视觉应用的关键。我们提出了一种形式化验证方法,针对捕获相机的3D运动扰动鲁棒性。我们首先建立了从相机位姿到像素值的闭式映射。通过分析所得单应性矩阵的连续性性质,我们展示了如何将最近关于Lipschitz优化和分段连续性的工作扩展到推导扰动像素值的紧线性边界。我们的方法适用于以平面结构为主的场景,例如增强现实中的地面、自动驾驶中的道路标记和交通标志,或机器人操作中的平面工作空间。这实现了对投影几何变换的首次形式化验证,无需复杂仿真、替代网络或显式图像形成模型。我们验证了实现,并展示了相比先前工作最高89%的加速和7%更紧的边界。然后,我们在VNN-COMP基准上评估了我们的方法,揭示了投影扰动的系统性弱点。最后,我们在一个安全关键的跑道分类器上进行了真实世界案例研究,突出了对相机运动的实际漏洞,并解决了学习模型认证中的一个关键挑战。数据和代码公开在https://github.com/jeangud/homography-verification。

英文摘要

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification.
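For planar scenes, the mapping from camera pose to pixel locations is a homography. The sketch below uses the standard textbook plane-induced homography H = K (R + t nᵀ / d) K⁻¹, under the convention that a point X1 in the first camera frame maps to X2 = R·X1 + t and the plane satisfies n·X1 = d. Whether the paper uses this exact parameterization is an assumption; the function names and the simple pinhole intrinsics (one focal length, no skew) are illustrative.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def _matmul3(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def plane_homography(f: float, cx: float, cy: float,
                     R: Matrix, t: Sequence[float],
                     n: Sequence[float], d: float) -> Matrix:
    """Homography mapping pixels of view 1 to view 2 for points on a plane.

    Convention (assumed): X2 = R @ X1 + t, plane n . X1 = d in camera-1 frame.
    """
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return _matmul3(_matmul3(K, M), K_inv)


def warp_pixel(H: Matrix, u: float, v: float) -> Tuple[float, float]:
    """Apply H to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w
```

As a sanity check, a pure sideways translation of 0.1 m in front of a fronto-parallel plane 2 m away shifts a pixel by f·0.1/2, which is the familiar disparity relation.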

2605.23187 2026-05-25 cs.CV cs.RO 版本更新

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

IntentionNav: 一种基于隐式人类指令的意图驱动目标导航基准

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

发表机构 * The University of Manchester(曼彻斯特大学) A*STAR Responsible AI Research Centre, Adelaide University(阿德莱德大学负责任人工智能研究中心) University of Bedfordshire(贝德福德郡大学)

AI总结 IntentionNav 是一个用于意图驱动对象导航的新基准,旨在评估智能体从隐含人类指令中推断目标物体并完成导航任务的能力。该基准不直接提供目标物体名称,而是通过自然语言指令隐含表达需求,要求智能体理解意图、识别目标并完成导航。研究引入了四种意图模式和多种指令风格,支持对目标推理、语言鲁棒性及导航成功率的细致分析,揭示了当前视觉语言模型在理解隐含意图和完成精准导航任务方面仍面临挑战。

Comments preprint

详情
AI中文摘要

现有的目标导航基准通常告诉具身智能体要找到哪个物体类别,例如微波炉或椅子。面向人类的具身AI收到的请求往往没有这么直接:“我需要热一下这个食物”或“房间感觉很闷”。智能体必须推断出能够满足需求的物体,在场景中找到对应的实例,并决定是否已达到目标。我们将这一设定作为意图驱动的目标导航来研究,并引入IntentionNav,一个用于从隐式人类指令进行主动目标搜索的诊断基准。每个episode提供一个自由文本意图、RGB-D观测和位姿,但隐藏目标物体名称。IntentionNav包含176个Isaac Sim场景和64个目标类别上的500个意图。每个意图以四种受控指令风格重写,并标注四种意图模式之一,在几何条件保持一致的前提下将表面措辞与语义线索类型分离。这种配对设计支持对目标推断、语言鲁棒性、邻域可达性和终端成功(而非仅聚合成功)的分析。我们使用一个固定的主动导航智能体评估了三个VLM。模型在48.3%的episode中识别出预期目标,在68.7%中进入其2米邻域,但仅在24.9%中成功终止,并在5.5%中达到有依据的1米成功。事件脚本意图的成功率最高(28.7%),而物理状态和可供性意图的成功率较低(分别为19.2%和18.5%),表明间接人类意图仍然是主动具身搜索中目标选择、视觉验证和终端定位的瓶颈。

英文摘要

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.

2605.23183 2026-05-25 eess.IV cs.CV 版本更新

GMENet: Generative Mixture of Experts Network for Multi-Center Glioma Diagnosis with Incomplete Imaging Sequences

GMENet: 用于多中心胶质瘤诊断的生成式专家混合网络(不完整成像序列)

Pengfei Song, Fangjin Liu, Wenwen Zeng, Yonghuang Wu, Chengqian Zhao, Feiyu Yin, Xuan Xie, Jinhua Yu

发表机构 * School of Biomedical Engineering and Technology Innovation, Fudan University(复旦大学生物医学工程与技术创新学院) Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University(复旦大学类脑智能科学与技术研究院) Intelligent Diagnosis and Treatment Laboratory for Brain Diseases, Joint Laboratory of Neurosurgery Department of Huashan Hospital and School of Information Science and Technology, Fudan University(脑疾病智能诊疗实验室,华山医院神经外科与复旦大学信息科学与技术学院联合实验室)

AI总结 当前胶质瘤诊断通常结合分子特征与组织病理学信息进行临床决策,但在实际应用中,不同中心的影像协议不统一,导致MRI序列不完整,限制了现有模型的临床适用性。为此,本文提出GMENet,一种用于多中心胶质瘤诊断的生成专家混合网络。该方法通过跨注意力门控生成模块合成缺失的影像特征,并引入动态加权专家融合模块实现多任务预测,有效提升了模型在不完整数据下的诊断性能和跨中心适应能力。

Comments IJCAI Accept

详情
AI中文摘要

当代胶质瘤诊断将分子特征与组织病理学相结合以指导临床决策。然而,在临床环境中,不同的成像协议导致MRI序列不完整,从而带来两个主要挑战:迫使现有框架在训练期间丢弃大量临床数据,并因此限制了其临床适用性。为解决这些限制,我们提出了GMENet,一种用于不完整成像序列的多中心胶质瘤诊断的生成式专家混合网络。首先,我们设计了一个基于交叉注意力的门控生成模块,该模块通过交叉注意力和动态门控机制从可用序列合成缺失序列特征,并引入循环一致性损失以保持语义完整性。其次,我们引入了一个动态加权专家融合模块,该模块对原始和合成的双序列特征进行专家混合交互和置信度感知融合,以进行多任务预测。我们在一个包含来自四个内部数据集和两个公共存储库的1241名受试者的多中心队列上评估了GMENet。实验表明,相对于仅完整序列的数据,GMENet将临床可用的训练数据扩大了97%。此外,它始终优于在完整数据上训练的最先进方法,在跨中心分布偏移下表现出更强的鲁棒性。

英文摘要

Contemporary glioma diagnosis integrates molecular features with histopathology to guide clinical decision-making. However, in clinical settings, divergent imaging protocols result in incomplete MRI sequences, which forces existing frameworks to discard a large portion of clinical data during training and consequently limits their clinical applicability. To address these limitations, we propose GMENet, a Generative Mixture of Experts Network for multi-center glioma diagnosis with incomplete imaging sequences. First, we design a Cross-attention-based Gated Generation Module that synthesizes missing sequence features from available sequences via cross-attention and dynamic gating mechanisms, incorporating a cycle-consistency loss to preserve semantic integrity. Second, we introduce a Dynamically Weighted Experts Fusion Module that performs mixture-of-experts interaction and confidence-aware fusion over original and synthesized dual-sequence features for multi-task prediction. We evaluate GMENet on a multi-center cohort of 1,241 subjects from four in-house datasets and two public repositories. Experiments show that GMENet expands clinically usable training data by 97% relative to complete-sequence-only data. Furthermore, it consistently outperforms state-of-the-art methods trained on complete data, demonstrating improved robustness under cross-center distribution shifts.

2605.23178 2026-05-25 cs.CV 版本更新

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

将人物组合在一起:面向多人交互场景的迭代姿态-图像生成

Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

发表机构 * Cornell University(康奈尔大学)

AI总结 现有文本到图像模型在生成多人互动场景时仍面临语义多样性不足和构图准确性低的问题,常导致布局重复、姿势刻板和互动不自然。本文提出一种双重姿态-图像表示方法,将以人为中心的结构先验引入预训练的扩散 Transformer,通过联合预测二维姿态图和对应的RGB图像,使结构与外观在学习过程中协同演化。核心方法采用跨模态对齐方案,将文本、姿态和图像表示进行绑定,确保多模态一致性,并设计迭代场景生成策略,逐步构建复杂的多人互动场景,有效分解整体生成复杂度。实验表明,该方法显著提升了多人图像生成的提示对齐度和场景多样性。

Comments Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: https://cornell-vailab.github.io/PeopleComposer/

详情
AI中文摘要

尽管近期取得了进展,文本到图像模型仍然难以生成语义多样且组合准确的多人交互场景,常常陷入重复布局、刻板姿态和交互基础薄弱的问题。在这项工作中,我们通过引入一种双姿态-图像表示来弥合这一差距,该表示将人物中心的结构先验引入预训练扩散Transformer。我们的模型联合预测2D姿态可视化图像及其对应的RGB图像,使得结构和外观在学习过程中共同演化。其核心是一种跨模态对齐方案,将文本、姿态和图像表示绑定在一起,确保跨模态的一致性基础。此外,我们设计了一种迭代场景构建方案,逐步生成复杂的多人交互,同时有效分解整体生成复杂性。大量实验表明,我们的方法在多人图像生成中显著提高了提示对齐度和场景多样性。

英文摘要

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

2605.23174 2026-05-25 cs.CV 版本更新

LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement

LQ-rPPG:一种用于远程生理测量的标签量化粗到细学习框架

Jun Seong Lee, Samyeul Noh, Changki Sung, Hyun Myung

Affiliations * Electronics and Telecommunications Research Institute; School of Electrical Engineering, Korea Advanced Institute of Science and Technology

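AI summary Remote photoplethysmography (rPPG) measures physiological signals from facial video without contact and is promising for remote healthcare and daily health monitoring. Most existing deep learning-based rPPG methods, however, overlook the quality of the training labels and its effect on learning, which leaves models prone to overfitting label noise and variability and hurts generalization. This paper proposes LQ-rPPG, a framework based on label quantization and coarse-to-fine learning: continuous PPG signals are converted into multi-bit pseudo labels to reduce noise, and the rPPG estimate is refined progressively under hierarchical supervision, improving robustness and generalization. Experiments show strong results on multiple datasets, with 88% fewer parameters, 29% fewer multiply-accumulate operations, and 191% higher throughput.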

详情
AI中文摘要

远程光电容积描记(rPPG)技术能够从面部视频中非接触式测量生理信号,在远程医疗和日常健康监测方面具有巨大潜力。受此驱动,研究者提出了多种基于深度学习的rPPG方法以改进估计性能。然而,以往的深度学习方法很少关注训练标签的质量及其对模型学习的影响。用作训练标签的接触式PPG信号通常包含由运动伪影、传感器接触不一致和形态畸变引起的噪声和变异性。这种标签不一致性可能导致模型过拟合标签噪声和变异性,从而降低泛化性能。为解决此问题,我们提出LQ-rPPG,一种标签量化的粗到细学习框架,用于鲁棒的rPPG估计。LQ-rPPG包含一个标签量化模块和一个粗到细的rPPG估计模型。标签量化模块将连续PPG信号转换为多比特量化伪标签,以降低噪声和变异性。粗到细估计模型在多比特伪标签的分层监督下逐步细化rPPG信号。这种设计减轻了对标签特定变异性的过拟合,使模型能够学习结构化和一致的表示。因此,LQ-rPPG即使在挑战性条件下也能实现鲁棒且可泛化的rPPG估计。在多个基准数据集上的实验表明,LQ-rPPG在数据集内和跨数据集评估中均取得了强劲性能,同时参数和乘累加操作分别减少88%和29%,吞吐量提高191%。代码可在https://github.com/Anonymous-repo-code/LQ-rPPG获取。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at https://github.com/Anonymous-repo-code/LQ-rPPG.
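
The label quantization step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes uniform bins over a min-max normalized PPG segment, and the function name `quantize_ppg` and the toy signal are hypothetical.

```python
def quantize_ppg(signal, bits):
    """Map a continuous signal to integer levels in [0, 2**bits - 1].

    Assumption: uniform bins over the min-max normalized segment.
    """
    lo, hi = min(signal), max(signal)
    levels = 2 ** bits
    if hi == lo:
        # Flat segment: every sample falls into the lowest bin.
        return [0] * len(signal)
    quantized = []
    for x in signal:
        norm = (x - lo) / (hi - lo)               # normalize to [0, 1]
        q = min(int(norm * levels), levels - 1)   # keep the top edge in the last bin
        quantized.append(q)
    return quantized


# Toy PPG-like segment (hypothetical values).
ppg = [0.0, 0.2, 0.55, 1.0, 0.7, 0.1]
coarse = quantize_ppg(ppg, bits=1)  # 2 levels, coarse pseudo label
fine = quantize_ppg(ppg, bits=2)    # 4 levels, finer pseudo label
```

With uniform bins, the coarse label is the top bit of the fine label, which is one simple way to obtain a consistent hierarchy of targets for coarse-to-fine supervision.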

2605.23160 2026-05-25 cs.RO cs.CV 版本更新

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

语义感知引导的无人机探索:面向语言条件的三维室内建图

Nitin Vegesna, Avideh Zakhor

Affiliations * Department of Electrical Engineering and Computer Sciences

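AI summary This paper presents SAGE, a semantic-aware guided drone exploration system for open-vocabulary exploration of unknown 3D indoor environments; it keeps coverage-oriented behavior while using semantic cues to reprioritize exploration frontiers. Built on the FALCON volumetric explorer, SAGE integrates CLIP through four key components to plan jointly over semantic and geometric information. In Matterport3D-based simulation, SAGE outperforms FALCON and a semantic-only ablation in object discovery, and completes exploration faster and with higher volumetric throughput than FTU. In real-world flights, FALCON explores faster and flies shorter trajectories, while SAGE is better at object discovery.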

Comments 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)

详情
AI中文摘要

我们提出语义感知引导探索(SAGE),一个用于未知三维室内环境的开放词汇探索系统,该系统在保持覆盖导向行为的同时,允许语义提示重新优先化前沿选择。基于FALCON体积探索器,SAGE通过四个关键组件集成对比语言-图像预训练(CLIP):以物体为中心的嵌入存储、将最近观测投影到自由-未知边界的时间缓存、用于高相似度检测的物体前沿,以及统一的语义-几何规划成本。该成本函数限制了语义重新加权的影响,确保前沿被优先化而不牺牲总覆盖率。在基于Matterport3D的仿真中,SAGE在地图-查询对上的物体发现方面优于FALCON和纯语义消融。与Finding Things in the Unknown(FTU)相比,SAGE在九个共享地图-查询对上的探索速度提高了9.0到25.9倍,平均加速13.7倍。此外,SAGE的体积吞吐量显著高于FTU。最后,我们在Modal AI Starling 2四旋翼飞行器上,在两种环境中的五次真实飞行中部署了SAGE,配备机载感知和规划以及离板CLIP推理。比较SAGE和FALCON,我们发现虽然FALCON导致更快的探索和更短的建图轨迹,但SAGE在物体发现方面优于FALCON。

英文摘要

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.
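
To make the idea of a bounded semantic reweighting concrete, here is a minimal sketch. The abstract does not give the cost function, so the multiplicative form, the `max_discount` parameter, and the frontier values below are all assumptions for illustration only.

```python
def frontier_cost(geometric_cost, similarity, max_discount=0.5):
    """Discount a geometric cost by a clamped semantic similarity.

    Assumption: similarity is clamped to [0, 1], so the result always lies
    between (1 - max_discount) * geometric_cost and geometric_cost.
    """
    sim = max(0.0, min(1.0, similarity))
    return geometric_cost * (1.0 - max_discount * sim)


# Hypothetical frontiers: name -> (geometric cost, text-image similarity).
frontiers = {
    "hallway": (4.0, 0.1),
    "kitchen": (6.0, 0.9),
    "closet": (5.0, 0.0),
}
ranked = sorted(frontiers, key=lambda name: frontier_cost(*frontiers[name]))
```

Because the discount is capped, a semantically promising frontier can move up the ranking, but no frontier's cost ever drops to zero, so distant low-similarity frontiers are still visited eventually.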

2605.23144 2026-05-25 cs.CV 版本更新

SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection

SLIP-RS:面向遥感目标检测的结构化属性语言-图像预训练

Chenxu Wang, Yuxuan Li, Yunheng Li, Xiang Li, Jingyuan Xia, Qibin Hou

Affiliations * VCIP, CS, Nankai University; National University of Defense Technology

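AI summary Existing language-image pre-training for remote sensing object detection is limited by monolithic label learning, which relies on black-box data to enumerate open-set categories in order to obtain fine-grained representations; this does not suit the data scarcity of the remote sensing domain. The paper proposes SLIP-RS, a structured-attribute decoupling paradigm that maps the open-ended category space to a finite, physically meaningful attribute space and improves fine-grained discrimination through explicit structural logic. It rests on two techniques: Structured-Attribute Contrastive Learning, which decouples visual logic, and the Conformal Attribute Reliability Engine, which extracts high-quality supervision from noisy data. The authors report substantial gains in fine-grained detection and cross-domain generalization.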

详情
AI中文摘要

现有的遥感目标检测语言-图像预训练受限于单一标签学习,该方法通过黑盒数据穷举开放集类别以获取细粒度表示,这种依赖性与领域固有的数据稀缺性不兼容。为突破这一瓶颈,我们提出SLIP-RS,建立结构化属性解耦范式,将开放类别空间映射到有限且物理有意义的属性空间,通过显式结构逻辑解锁细粒度判别能力。该范式通过两个技术支柱实现:(1)结构化属性对比学习,通过组合属性增强强制学习解耦的内在视觉逻辑;(2)共形属性可靠性引擎,利用共形预测理论从噪声源中严格提取高保真监督,生成RS-Attribute-15M,这是最大的包含超过1500万属性标注的数据集。大量实验表明,SLIP-RS在细粒度检测和跨域泛化方面建立了前所未有的性能,验证了结构化属性作为遥感基础的重要性。代码:https://github.com/facias914/SLIP-RS。

英文摘要

Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: https://github.com/facias914/SLIP-RS.

2605.23141 2026-05-25 cs.CV 版本更新

VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

VisAnalog:自然图像上视觉概念迁移的诊断套件

Zhaonan Li, Kyle R. Chickering, Bangzheng Li, Jacob Dineen, Xiao Ye, Zhikun Xu, Shijie Lu, Yuxi Huang, Ming Shen, Bach Nguyen, Jaya Adithya Pavuluri, Mau Son Nguyen, Sanika Chavan, Ngoc Minh Thu Le, Muhao Chen, Ben Zhou

Affiliations * Arizona State University; Luma AI; UC Davis

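AI summary VisAnalog is a diagnostic dataset for evaluating visual concept transfer, testing whether a model can preserve and manipulate concept-level properties across scenes. Each sample takes the form "A:B::C:?", and the model must infer the target image from the given images and the transformation relating them. Experiments show that even strong vision-language models score far below oracle accuracy and degrade markedly as the number of transformation steps grows, while human performance stays near the ceiling. The dataset offers a practical tool for analyzing model failures in visual relation inference and transformation application.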

Comments Accepted to the Workshop on Visual Concepts at CVPR 2026 as a non-archival report

详情
AI中文摘要

视觉概念学习的一个有用测试不仅在于模型能否在单张图像中识别概念,还在于它能否在变换下保留和操作概念级属性并将其迁移到新场景。我们引入了VisAnalog,一个针对自然图像上这一场景的受控套件。每个示例实例化$A\!:\!B::C\!:\,?$:图像$B$和隐藏的目标图像$D$是通过对源图像$A$和$C$应用相同的确定性变换序列生成的。给定$A$、$B$和$C$,模型必须回答关于$D$的多选题。该基准包含617个人工验证的问题,涵盖一到四步变换,如缩放、象限交换、旋转、翻转和色调旋转。在强大的专有和开源视觉语言模型上,当直接显示$D$时,端到端准确率显著低于oracle准确率,并且随着变换深度的增加而急剧下降,而人类表现仍接近上限。程序条件评估进一步将关系推理失败与变换应用失败分开,表明从$A \rightarrow B$推断视觉关系是主要瓶颈,在更困难的多步案例中还会出现额外的应用错误。该数据集公开于https://huggingface.co/datasets/zli99/VisAnalog。

英文摘要

A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: images $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near the ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
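
The construction of an analogy item, where one deterministic transformation sequence is applied to both source images, can be sketched as follows. This is an illustrative toy on 2x2 integer grids, not the benchmark's generation code; the helper names are hypothetical.

```python
def hflip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rot90_cw(img):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def apply_program(img, program):
    """Apply a sequence of transformations in order."""
    for step in program:
        img = step(img)
    return img


# Toy "images" as 2x2 grids of pixel ids.
A = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
program = [hflip, rot90_cw]      # a two-step transformation

B = apply_program(A, program)    # shown to the model together with A and C
D = apply_program(C, program)    # hidden target the question asks about
```

Separating the program from the images mirrors the program-conditioned evaluation in the abstract: a model can fail either at inferring `program` from the pair (A, B) or at applying a known `program` to C.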

2605.23118 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

在临床医生验证的交互式病灶追踪中利用纵向上下文

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

Affiliations * German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany; HIDSS4Health (Helmholtz Information and Data Science School for Health), Karlsruhe/Heidelberg, Germany; Medical Faculty, Heidelberg University, Germany; University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany

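AI summary This paper studies how to exploit longitudinal imaging in clinician-verified interactive lesion tracking, with the aim of tracking tumors more accurately across serial CT scans. The authors propose a Verified Tracking paradigm in which a clinician verifies a registration-proposed prompt, and the model combines that prompt with the lesion's baseline appearance to resolve segmentation ambiguities. The method pairs early spatial prompt fusion with latent temporal difference weighting in a unified, longitudinally informed segmentation framework, and uses large-scale synthetic pretraining to overcome data scarcity, improving performance by up to 4.5 Dice points over training from scratch. Experiments show that it outperforms prior work in both the fully automatic and the verified tracking settings, and it took first place in the MICCAI autoPET IV challenge.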

Comments Accepted at MICCAI 2026

详情
AI中文摘要

在系列CT扫描中追踪肿瘤病灶对于肿瘤学反应评估至关重要。现有的自动化方法面临一个基本权衡:端到端追踪器实现高度自动化,但无法纠正无声的追踪失败;而解耦的配准-分割流程允许用户验证,却丢弃了病灶的先验外观,限制了在模糊情况下的准确性。在这项工作中,我们提出了一种验证追踪范式:临床医生验证配准提出的提示,模型利用该提示以及基线病灶外观来解决分割模糊性。我们提出了一个统一框架,结合早期空间提示融合与潜在时间差异加权,用于纵向信息感知的分割。为了解决数据稀缺问题,我们利用大规模合成预训练,证明这对于利用纵向上下文至关重要,相比从头训练性能提升高达4.5个Dice点。我们的方法在MICCAI autoPET IV挑战中获得第一名。我们进一步整理并发布了PanTrack,一个新的纵向胰腺癌基准,以评估分布外泛化能力。实验表明,我们的模型在全自动和所提出的验证追踪设置中均优于先前工作,在自动化与控制之间提供了一个临床安全的中间地带。代码、模型和数据集将在https://github.com/MIC-DKFZ/LongiSeg发布。

英文摘要

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context and improves performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg.

2605.23116 2026-05-25 cs.CV cs.AI 版本更新

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

CoReVAD: 一种无需训练的视频异常检测上下文推理框架

Hyeongmuk Lim, Youngbum Hur

Affiliations * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea

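AI summary Existing video anomaly detection methods usually depend on task-specific training, which brings strong domain dependency and high training cost, and most output only a scalar anomaly score without explaining why an event is anomalous. This paper proposes CoReVAD, a training-free contextual reasoning framework that uses a single frozen vision-language model to generate anomaly scores and temporal descriptions directly, with a Local Response Cleaning module and global temporal refinement to improve accuracy and interpretability. On UCF-Crime and XD-Violence, CoReVAD is competitive among training-free methods while providing interpretable explanations.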

Comments Accepted to ICPR 2026

详情
AI中文摘要

现有的视频异常检测方法通常依赖于任务特定的训练,导致强领域依赖性和高训练成本。此外,大多数现有方法仅输出标量异常分数,对特定事件为何被视为异常提供的洞察有限。视觉语言模型的最新进展使得异常检测和人类可解释推理成为可能。然而,许多基于视觉语言模型的方法仍然需要额外的训练步骤(例如,指令调优或口头化学习)或外部大型语言模型,从而带来进一步的训练成本和推理开销。为了解决这些挑战,我们提出了CoReVAD,一种用于无需训练的视频异常检测的上下文推理框架,该框架使用单个冻结的视觉语言模型运行。CoReVAD直接从视觉语言模型生成异常分数和时间描述。为了减轻生成输出中的噪声,我们引入了一个基于局部视觉-文本对齐的局部响应清理模块。此外,通过基于softmax的精炼、高斯平滑和位置加权,融入了全局时间上下文和进展。在UCF-Crime和XD-Violence上的实验表明,CoReVAD在无需训练的方法中取得了竞争性能,同时提供了可靠且可解释的解释。我们的官方代码可在https://github.com/Muk-00/CoReVAD获取。

英文摘要

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD
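
Of the temporal refinement steps listed above, Gaussian smoothing is the easiest to illustrate. The sketch below is an assumption-laden toy, not the released CoReVAD code: the kernel radius, `sigma`, border handling, and score values are all hypothetical, and the softmax refinement and position weighting are not shown.

```python
import math


def gaussian_smooth(scores, sigma=1.0, radius=2):
    """Smooth a 1D sequence of per-segment anomaly scores.

    Assumption: the kernel is truncated at `radius` and renormalized at the
    sequence borders so that constant inputs stay constant.
    """
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    n = len(scores)
    smoothed = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < n:
                num += w * scores[j]
                den += w
        smoothed.append(num / den)
    return smoothed


# Hypothetical raw scores with one isolated spike at index 2.
raw = [0.1, 0.1, 0.9, 0.1, 0.1]
smoothed = gaussian_smooth(raw)
```

Smoothing spreads an isolated high score over its temporal neighbors, which damps single-segment noise in generated scores while keeping the peak location.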

2605.23113 2026-05-25 cs.CV 版本更新

Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

不一致感知多模态薛定谔桥用于深度伪造定位

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue

Affiliations * Department of Computer Science and Technology, Huaqiao University; Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University; Tongji University; School of Cyber Science and Engineering, Wuhan University

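AI summary This paper proposes IaMSB, an inconsistency-aware multimodal Schrödinger Bridge for interval-level localization of audio-visual deepfakes. By jointly estimating cross-modal consistency and localizing temporal intervals, it suppresses the cross-modal noise propagation that arises with single-sided and asynchronous forgeries. IaMSB uses the Schrödinger Bridge framework to unify consistency estimation, cross-modal information selection, and bridge-step scheduling, which avoids unnecessary iterations and raises AP@0.95 by 3% to 10%, with the clearest gains on single-sided forgeries.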

Comments Accepted by CVPR2026

详情
AI中文摘要

音视频深度伪造定位需要区间级输出作为时间证据。尽管近期取得进展,但在单侧或异步伪造下的对称融合会传播跨模态噪声,降低高精度定位。我们提出IaMSB,一种不一致感知多模态薛定谔桥(SB),联合估计跨模态一致性并执行区间级定位。与扩散模型不同,SB最小化路径分布差异,无需显式噪声注入或去噪即可生成一致性分数。借助薛定谔桥(SB),IaMSB将一致性估计、跨模态信息选择和桥步调度统一在一个框架中。具体地,轻量级粗桥首先提出候选区间并估计跨模态一致性;这些统计量选择跨模态见证信号并跨模态非对称分配桥步。然后,精炼桥执行步调融合并输出精炼的时间对齐区间。IaMSB预判单侧和异步伪造,并通过带步分配的瓶颈跨模态交互抑制噪声转移,避免不必要的迭代。在多个基准上,IaMSB稳定了严格IoU边界精度,将AP@0.95提高了3%~10%,并实现了改进的高精度定位,特别是对于单侧伪造。

英文摘要

Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With the SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high-precision localization, particularly for single-sided forgeries.
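
To show why AP@0.95 is a strict criterion, the following sketch computes the temporal IoU between a predicted and a ground-truth interval. It is the standard interval IoU definition, not code from the paper; the interval values are made up.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 4-second forged interval and two predictions.
gt = (2.0, 6.0)
tight = temporal_iou((2.1, 6.0), gt)   # start is off by 0.1 s
loose = temporal_iou((2.5, 6.0), gt)   # start is off by 0.5 s

counts_at_095 = [iou >= 0.95 for iou in (tight, loose)]
```

On a 4-second interval, a boundary error of half a second already drops the IoU below 0.95, so only predictions with nearly exact boundaries count as hits at this threshold.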

2605.23094 2026-05-25 eess.IV cs.AI cs.CV 版本更新

Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025

合成脑部MRI能否可靠改善肿瘤分类?基于BRISC 2025的StyleGAN2-ADA类平面增强研究

José Rafael Noriega Cedeño

发表机构 * NVIDIA

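AI summary This study asks whether synthetic brain MRI images actually improve tumour classification, using StyleGAN2-ADA generators trained on the BRISC 2025 dataset and measuring their effect on three classifiers. The benefit of synthetic images varies with model architecture and with the real-to-synthetic ratio; for MobileViTV2, filtered 1:1 augmentation raised tumour classification accuracy by 1.02% absolute. The results indicate that the value of generative augmentation is not determined by visual quality alone but depends on architecture and data ratio.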

Comments 18 pages, 16 figures

详情
AI中文摘要

生成式增强常被提议作为小规模医学图像数据集的补救措施,但合成图像只有在改善下游任务性能时才有用。此处的“增强”指合成补充:将GAN生成的样本添加到真实训练池中,而非对现有图像进行几何或光度变换。我们在受限的BRISC 2025分区上训练了十二个类平面StyleGAN2-ADA生成器,以测试其输出(无论是否经过InceptionV3特征空间过滤)是否能改善三个分类器家族上的留出肿瘤分类:基于InceptionV3特征的随机森林(RF)、紧凑型双头卷积神经网络(CNN)以及移动混合卷积-Transformer MobileViTV2。每个分类器在1:1和1:2的真实与合成比例下进行评估。独立的GPT-5.5盲测在模型可读子集上将门控真实与合成辨别率定为57.73%(95%置信区间:54.48–60.92%),略高于随机水平。RF分类器未从合成MRI中获益。CNN显示出一致的均值增益,但未通过Holm校正。MobileViTV2显示出最清晰的益处:过滤后的1:1增强将肿瘤分类准确率绝对提高了1.02%(95%置信区间:0.54–1.54%;Holm校正后p=0.0104)。二次效率分析发现,每个增强的CNN条件比基线提前42–64%选择其检查点,而计算匹配的MobileViTV2运行在减少50–67%的真实数据epoch后达到选择。总体而言,增强效用被发现依赖于架构和比例,而非仅由视觉保真度保证。

英文摘要

Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they improve downstream task performance. "Augmentation" here means synthetic supplementation: GAN-generated samples added to the real training pool, not geometric or photometric transforms of existing images. Twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions to test whether their output, with or without InceptionV3 feature-space filtering, improves held-out tumour classification across three classifier families: a random forest (RF) on InceptionV3 features, a compact two-headed convolutional neural network (CNN), and MobileViTV2, a mobile hybrid convolutional-transformer. Each was evaluated at 1:1 and 1:2 real-to-synthetic ratios. An independent GPT-5.5 blind test placed gated real-versus-synthetic discrimination at 57.73% (95% CI: 54.48–60.92%) on the model-legible subset, modestly above chance. The RF classifier did not benefit from the synthetic MRIs. The CNN showed consistent mean gains that did not survive Holm correction. MobileViTV2 showed the clearest benefit: filtered 1:1 augmentation improved tumour classification accuracy by 1.02% absolute (95% CI: 0.54–1.54%; Holm-corrected p = 0.0104). A secondary efficiency analysis found that every augmented CNN condition selected its checkpoint 42–64% earlier than baseline, while compute-matched MobileViTV2 runs reached selection after 50–67% fewer real-data epochs. Overall, augmentation utility was found to be architecture- and ratio-dependent, not guaranteed by visual fidelity alone.
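
Since the conclusions above hinge on Holm-corrected p-values, a short sketch of the Holm-Bonferroni step-down adjustment may help. The p-values below are invented for illustration and are not taken from the study.

```python
def holm_adjust(pvalues):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons.
raw_p = [0.01, 0.04, 0.03]
adj_p = holm_adjust(raw_p)
significant = [p < 0.05 for p in adj_p]
```

In this toy case, two comparisons with raw p below 0.05 stop being significant after adjustment, which is the same kind of outcome the abstract reports for the CNN gains.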

2605.23070 2026-05-25 cs.CV 版本更新

Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models

Flow Mismatching: 通过流匹配模型中的速度差异进行无监督异常检测

Shengzhe Chen, Mehrdad Moradi, Kamran Paynabar, Hao Yan

Affiliations * Arizona State University; Georgia Institute of Technology

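AI summary This paper proposes Flow Mismatching, an unsupervised anomaly detection method that avoids the reconstruction-based paradigm and instead detects anomalies from velocity discrepancies in flow matching models. Along affine paths from Gaussian noise to the target image, it measures the disagreement between the model-predicted velocity and the geometric path velocity to identify anomalous regions. Experiments on MVTec-AD and VisA show that it outperforms state-of-the-art reconstruction-based and recent flow matching-based methods.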

详情
AI中文摘要

我们提出Flow Mismatching,一种无监督异常检测方法,有意避免基于重建的范式。相反,我们将流匹配视为几何动力学,并利用一个关键见解:异常发生在学习到的正常流与指向测试图像的几何路径不一致的地方。给定仅在正常图像上训练的流匹配模型,我们沿着从高斯噪声到目标图像的仿射路径探测其学习到的速度场。沿着每条路径,我们比较模型预测的速度(遵循正常生成动力学)与指向目标的速度(包含任何异常内容)。异常会导致这些速度之间的强烈局部不一致。聚合不同时间步和多条路径上的不匹配,产生像素级热图和图像级分数,无需测试时优化、特征记忆或额外校准。我们的分析表明,总体不匹配分解为一个不可约的降噪项和一个测试路径与正常路径得分函数之间的Fisher散度项,后者识别出驱动异常分离的得分差距成分,并解释了鲁棒路径聚合的有效性。在MVTec-AD和VisA上的大量实验表明,与最先进的基于重建和最近的基于流匹配的方法相比,性能优越。

英文摘要

We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
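
The velocity comparison described above can be sketched on a toy problem. Everything here is an assumption for illustration: images are flat lists of four pixels, the "trained model" is replaced by the closed-form velocity field for normal data concentrated at a single point `MU`, and the squared error and plain averaging stand in for whatever aggregation the paper uses.

```python
import random


def mismatch_map(x, velocity_fn, times=(0.2, 0.5, 0.8), n_paths=4, seed=0):
    """Average squared gap between model and geometric velocities per pixel."""
    rng = random.Random(seed)
    heat = [0.0] * len(x)
    for _ in range(n_paths):
        z = [rng.gauss(0.0, 1.0) for _ in x]                 # Gaussian start point
        for t in times:
            # Affine path from noise z (t = 0) to the test image x (t = 1).
            x_t = [(1 - t) * zi + t * xi for zi, xi in zip(z, x)]
            v_model = velocity_fn(x_t, t)
            for i, (vm, xi, zi) in enumerate(zip(v_model, x, z)):
                v_geom = xi - zi                             # velocity of the affine path
                heat[i] += (vm - v_geom) ** 2
    count = n_paths * len(times)
    return [h / count for h in heat]


MU = [0.5, 0.5, 0.5, 0.5]  # toy "normal" image


def toy_velocity(x_t, t):
    """Stand-in for a trained network: exact flow toward the single normal image."""
    return [(m - xi) / (1 - t) for m, xi in zip(MU, x_t)]


normal_heat = mismatch_map(MU, toy_velocity)
anomalous_heat = mismatch_map([0.5, 0.5, 2.0, 0.5], toy_velocity)
```

In this toy setting the mismatch at a pixel equals the squared deviation from the normal value divided by (1 - t)^2, so it is zero for the normal image and large only at the altered pixel, which is the behavior the heatmap relies on.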

2605.23068 2026-05-25 cs.CV 版本更新

RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering

RoboSurg-VQA:面向手术分割感知的视觉问答多模态基准

Chengyi Zhang, Zi Ye, Ziyang Wang

Affiliations * Swansea University, UK; Maynooth University, Ireland; Aston University, UK

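AI summary This paper presents RoboSurg-VQA, a multimodal benchmark for segmentation-aware visual question answering in surgical scenes. Built from public surgical segmentation datasets, it pairs every frame with a fixed set of clinically motivated questions covering procedure context, anatomy, imaging modality, surgical artefacts, image quality, and instrument visibility, and uses closed answer sets to keep evaluation consistent. Candidate answers are generated by constrained prompting and then audited by humans to improve plausibility and label consistency, with the aim of supporting more reliable visual understanding in robot-assisted surgery.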

详情
AI中文摘要

在机器人辅助和微创手术(RMIS/MIS)中,可靠的视觉理解不仅仅需要精确的掩膜:在临床实践中,临床医生会提出关于手术过程背景、可见性、伪影以及解剖结构和手术器械存在性的语言类问题,且通常是在由遮挡、烟雾、出血和镜面高光导致的退化视图下。我们提出了RoboSurg-VQA,这是一个基于共享模式重新利用公共手术分割数据集构建的分割感知视觉问答(VQA)基准。每帧图像与一组固定的临床驱动问题配对,涵盖手术过程背景、解剖结构(包括区域)、成像模态/视图、手术伪影、图像质量以及基本可见性和空间属性,并采用封闭答案集以实现一致的评估。为了扩展标注,我们通过约束提示生成候选答案,并自动进行有效性和一致性检查,随后进行人工审计以提高合理性和标签一致性。我们报告了基准统计信息、基线合理性以及在挑战性手术条件下的常见评估挑战。代码将在https://github.com/ziyangwang007/Robosurg-VQA上提供。

英文摘要

Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights. We present RoboSurg-VQA, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets under a shared schema. Each frame is paired with a fixed set of clinically motivated questions spanning procedure context, anatomy (including region), imaging modality/view, surgical artefacts, image quality, and basic visibility and spatial attributes, with closed answer sets to enable consistent evaluation. To scale annotation, we generate candidate answers via constrained prompting with automatic validity and consistency checks, followed by human auditing to improve plausibility and label consistency. We report benchmark statistics, sanity baselines, and common evaluation challenges under challenging surgical conditions. The code will be available on https://github.com/ziyangwang007/Robosurg-VQA.

2605.23065 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering

抖动防御:通过多级 Floyd-Steinberg 抖动实现视觉基础模型的对抗鲁棒性

Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy

Affiliations * Department of Computer Science, University of Geneva, Switzerland

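AI summary This study proposes a lightweight input transformation based on multi-level Floyd-Steinberg dithering to improve the robustness of vision foundation models under adversarial attack. Error-diffusion quantization of the input disrupts adversarial perturbations while preserving semantic content, and the approach applies across downstream tasks and model architectures. Experiments show that it matches or exceeds the tested baselines, including diffusion-based denoising, across several attacks, with less degradation on clean inputs.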

Comments Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

视觉基础模型被广泛用作许多下游任务中的冻结骨干,使其成为对抗攻击下的单点故障。我们研究了多级 Floyd-Steinberg 误差扩散抖动作为一种轻量级、模型无关的输入变换,它在保留语义内容的同时破坏对抗扰动。与先前局限于二值抖动、灰度 CIFAR-10 和从头训练的单个小模型的工作不同,我们在六个任务(分类、分割、深度估计、检索、字幕生成、视觉问答)、两个模型家族(DINOv2、PaliGemma)以及三种强度递增的攻击(PGD、MI-FGSM、SIA)上进行了评估,还包括使用直通估计器的自适应攻击者。我们的结果表明,在中间量化级别上的 Floyd-Steinberg 抖动,尤其是与后处理模糊相结合时,超过或匹配所有测试的基线(包括基于扩散的去噪),并且在干净输入上的退化显著更小。

英文摘要

Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
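
For readers unfamiliar with the transformation, here is a minimal sketch of multi-level Floyd-Steinberg error diffusion on a grayscale image stored as nested lists. It follows the classic 7/16, 3/16, 5/16, 1/16 kernel; the function name and test image are hypothetical, and the paper's preprocessing, color handling, and post-processing blur are not shown.

```python
def fs_dither(img, levels):
    """Quantize a 2D image with values in [0, 1] to `levels` gray levels.

    The quantization error of each pixel is pushed to its unvisited
    neighbors (right, below-left, below, below-right).
    """
    h, w = len(img), len(img[0])
    buf = [row[:] for row in img]          # work on a copy
    step = 1.0 / (levels - 1)
    for y in range(h):
        for x in range(w):
            old = buf[y][x]
            idx = min(levels - 1, max(0, round(old / step)))  # nearest valid level
            new = idx * step
            buf[y][x] = new
            err = old - new
            if x + 1 < w:
                buf[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    buf[y + 1][x - 1] += err * 3 / 16
                buf[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    buf[y + 1][x + 1] += err * 1 / 16
    return buf


# Uniform mid-gray 8x8 image, dithered to 2 and to 4 levels.
gray = [[0.5] * 8 for _ in range(8)]
binary = fs_dither(gray, levels=2)
four_level = fs_dither(gray, levels=4)
mean_binary = sum(sum(row) for row in binary) / 64
```

Error diffusion keeps the local average intensity close to the input while forcing every pixel onto a small set of levels, which is why fine-grained perturbations are disrupted but coarse image content survives.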

2605.23064 2026-05-25 cs.CV cs.LG 版本更新

Millimeter-wave Imaging for Anthropometric Body Measurement

毫米波成像用于人体测量

Miriam Senne, Benjamin D. Killeen, Christoph Baur, Nassir Navab, Azade Farshad

Affiliations * Chair for Computer Aided Medical Procedures; Technical University of Munich; Rohde & Schwarz GmbH & Co. KG; Munich Center for Machine Learning; ELLIS Unit Helsinki, Dept. Computer Science, Aalto University

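AI summary This study proposes a contactless method for anthropometric body measurement based on millimeter-wave radar, addressing the shortcomings of conventional tools in privacy, efficiency, and applicability. An optimization-based framework recovers 3D body shape from mmWave point clouds and extracts a comprehensive set of anthropometric measurements. The core contribution is a vertex-weighting strategy that, together with the parametric body model SMPL, gives robust surface alignment and noise suppression, enabling a fast, privacy-preserving workflow that needs no undressing and no cameras and suits clinical risk assessment across patient groups.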

详情
AI中文摘要

身体形状和围度是临床上用于风险分层的信息性生物标志物,包括腰臀比、肢体和躯干周长等指标,然而传统工具如手动卷尺和光学扫描仪通常需要脱衣和保持姿势。这些要求减缓了工作流程,损害了尊严,并且排除了许多老年人和行动不便者。为了实现快速无接触测量,我们利用毫米波雷达,它保护隐私并能穿透典型衣物,实现快速全身采集。在这项工作中,我们提出了一个新的基于优化的框架,从体积毫米波数据中恢复3D人体形状并提取一套全面的人体测量数据。我们的方法引入了一个加权配准流程,将参数化身体模型(SMPL)直接拟合到噪声毫米波点云上。我们贡献的核心是一种顶点加权策略,该策略调节Chamfer能量函数以实现可靠的表面对齐和噪声消除。我们通过加入脚-地面约束和姿态先验进一步稳定拟合,直接优化SMPL参数。这些组件共同实现了一个快速、保护隐私的工作流程,无需摄像头或脱衣,且只需最小程度的配合,即可通过衣物提供高保真度的身体形状和测量数据,支持在诊所和护理机构中对所有年龄和活动水平的患者进行频繁的风险导向评估。

英文摘要

Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy preserving workflow that delivers high fidelity body shape and measurements through clothing without cameras or disrobing and with minimal cooperation, supporting frequent risk oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
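
The role of per-vertex weights in a Chamfer-style energy can be illustrated with a small sketch. The abstract does not specify how the weights enter the energy, so the weighted mean on the model-to-scan term, the unweighted scan-to-model term, and all coordinates below are assumptions.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def weighted_chamfer(vertices, weights, points):
    """Chamfer-style energy between model vertices and a scan point cloud.

    Assumption: weights scale each vertex's nearest-point distance; the
    scan-to-model direction is left unweighted.
    """
    model_to_scan = sum(
        w * min(sqdist(v, p) for p in points)
        for v, w in zip(vertices, weights)
    ) / sum(weights)
    scan_to_model = sum(
        min(sqdist(p, v) for v in vertices) for p in points
    ) / len(points)
    return model_to_scan + scan_to_model


# Hypothetical data: the third vertex lies in a region the radar did not capture.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

uniform = weighted_chamfer(vertices, [1.0, 1.0, 1.0], scan)
downweighted = weighted_chamfer(vertices, [1.0, 1.0, 0.1], scan)
```

Lowering the weight of vertices in poorly observed regions keeps missing or noisy returns from dominating the fit, which is the intuition behind modulating the energy per vertex.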

2605.23045 2026-05-25 cs.CV cs.AI cs.LG Version update

The TIME Machine: On The Power of Motion for Efficient Perception

Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara

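Affiliations * School of Informatics, University of Edinburgh

AI summary This paper proposes a video representation learning method with motion as the central modality, aiming to address the limitations of existing video models in temporal understanding and training cost. By representing the motion in a video as point tracks and training a masked autoencoder in a self-supervised manner, the model learns more efficient, fine-grained video representations. The method does not rely on language annotations, greatly reduces the amount of training data required, and performs on par with current state-of-the-art models on multiple tasks, offering a new direction for building more efficient and temporally aware video models.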

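Details
AI abstract (translated from Chinese)

Video representation learning has made great progress in recent years. This has been driven by several factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the capability frontier of video models, they also bring their own limitations: first, scaling video models can become prohibitively expensive; second, learning from language restricts the range of learnable concepts to those found in captions. As a result, video models still struggle with temporal understanding. In this paper, we propose a novel approach that uses motion as the central modality for video representation. Specifically, given the motion in a video in the form of point tracks, we use a masked autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing ones. This lets us learn representations in a self-supervised manner. We show that representing videos with motion in fact addresses two core limitations of video technology. First, it lets us greatly reduce the scale of training data, because motion is inherently independent of appearance and therefore needs fewer samples to generalize well. Second, motion lets us bypass the language-dependent training paradigm and learn finer-grained concepts. The result is an embedding we call TIME (Temporally Informed Motion Embedding), a representation trained only on synthetic motion data. We test this embedding zero-shot on a wide range of tasks. We observe that, with no extra tricks, its performance is on par with state-of-the-art models while using up to 4 orders of magnitude less training data. This lays the foundation for a new paradigm of video models that are more temporally aware and more scalable.

English abstract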

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.
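The masked-track objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helper names `mask_tracks` and `masked_track_loss`, the mask ratio, and the mean-squared-error loss are hypothetical choices, and the autoencoder itself is left out.

```python
import random

def mask_tracks(num_tracks, mask_ratio, rng):
    """Return (visible, masked) track indices for one training example."""
    order = list(range(num_tracks))
    rng.shuffle(order)
    n_masked = round(mask_ratio * num_tracks)
    return sorted(order[n_masked:]), sorted(order[:n_masked])

def masked_track_loss(tracks, predicted, masked):
    """Mean squared error over the (x, y) points of the masked tracks only.

    tracks, predicted: list of tracks, each a list of (x, y) positions per frame.
    """
    total, count = 0.0, 0
    for i in masked:
        for (x, y), (px, py) in zip(tracks[i], predicted[i]):
            total += (x - px) ** 2 + (y - py) ** 2
            count += 1
    return total / count if count else 0.0
```

In a training loop, the encoder would see only the visible tracks and the decoder's predictions for the masked ones would be scored with the loss; restricting the loss to masked tracks is the usual masked-autoencoder convention.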

2605.23028 2026-05-25 cs.LG cs.CL cs.CV Version update

RADAR: Relative Angular Divergence Across Representations

Xavier Cadet, Mateusz Nowak, Peter Chin

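Affiliations * Dartmouth College

AI summary This paper proposes RADAR, a metric for assessing how well foundation models transfer across domains. Grounded in geometry, the method analyzes the angular alignment of each layer's representations and the changes in distance along layer-to-layer displacement trajectories, and compares the distributions of within-domain and cross-domain dynamics to estimate the feasibility of transfer between domains. Experiments show that RADAR performs well on tasks across multiple modalities, with stronger predictive power when domain transitions are smooth or clear-cut, and that its effectiveness depends on the geometry of the model's internal representation space.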

Comments 27 pages; 8 figures; 10 tables

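Details
AI abstract (translated from Chinese)

Machine learning methods depend on data. However, collecting suitable data can be challenging because of availability limits, cost, or the need for domain expertise. Extending a dataset with additional sources is a common response to limited data, but this practice does not always improve downstream performance and sometimes even degrades it, which is known as negative transfer. We propose RADAR, a simple, geometry-based metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-by-layer evolution of representations by measuring angular alignment and relative changes in distance along layer-to-layer displacement trajectories and by comparing the empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric on several modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across a range of settings, RADAR offers predictive performance competitive with existing transferability metrics on several vision and text benchmarks, particularly when domain transitions are smooth or cleanly separated. Our ablations further show that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.

English abstract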

Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.
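To make the quantities in the abstract concrete, here is a rough sketch of per-sample trajectory features and a distribution comparison. The abstract does not specify the exact formulas, so everything here is an assumption: the cosine between consecutive layer-to-layer displacements stands in for "angular alignment", the relative change in displacement norm for "relative changes in distance", and a histogram-based Jensen-Shannon divergence for the comparison of empirical distributions. Displacements are assumed to be non-zero.

```python
import math

def displacement_features(layers):
    """layers: per-layer vectors for one input. Returns (cosines, rel_changes)."""
    disp = [[b - a for a, b in zip(h0, h1)] for h0, h1 in zip(layers, layers[1:])]

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    cosines, rel_changes = [], []
    for d0, d1 in zip(disp, disp[1:]):
        n0, n1 = norm(d0), norm(d1)
        cosines.append(sum(a * b for a, b in zip(d0, d1)) / (n0 * n1))
        rel_changes.append((n1 - n0) / n0)
    return cosines, rel_changes

def js_divergence(xs, ys, bins=10, lo=-1.0, hi=1.0):
    """Jensen-Shannon divergence (in nats) between histograms of two samples."""
    def hist(vals):
        counts = [0] * bins
        for v in vals:
            i = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[max(i, 0)] += 1
        return [c / len(vals) for c in counts]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    p, q = hist(xs), hist(ys)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

One would collect `cosines` over within-domain inputs and over cross-domain inputs, then compare the two collections with `js_divergence`; a larger divergence would, under the paper's hypothesis, point to weaker transferability.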

2605.22997 2026-05-25 cs.CV Version update

Scene Reconstruction as Mapping Priors for 3D Detection

Yang Fu, Yuliang Zou, Hao Xiang, Xin Huang, Yijing Bai, Chen Song, Weijing Shi, Govind Thattai, Dragomir Anguelov, Mingxing Tan, Yingwei Li

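Affiliations * Waymo LLC; UC San Diego

AI summary In autonomous driving, maps are essential for motion planning, yet they remain underused in perception tasks such as 3D object detection. This paper proposes a scalable solution that automatically builds dense mapping priors and designs the MPA3D framework, which fuses them with multiple sensor modalities, to improve 3D detection performance. Experiments show that the method achieves new state-of-the-art results on the Waymo Open Dataset, confirming that scalable scene priors are effective for enhancing 3D detection.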

Comments Accepted to CVPR 2026

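Details
AI abstract (translated from Chinese)

In autonomous driving, mapping is critical for motion planning, yet it remains an underused resource for perception tasks such as 3D object detection. Maps can provide robust structural priors about the static environment, helping to resolve ambiguities and to compensate for sparse or noisy sensor data, especially for distant objects or in adverse weather. However, conventional high-definition (HD) maps are costly to acquire and maintain, which poses a challenge for efficient large-scale deployment. In this paper, we propose a scalable solution that systematically uses mapping to improve 3D detection by overcoming two main challenges. First, we introduce a pipeline that automatically builds dense mapping priors from aggregated sensor data, removing the need for human annotation. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to integrate mapping priors effectively with different sensor modalities. Extensive experiments on the Waymo Open Dataset show that our approach achieves new state-of-the-art results, demonstrating the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.

English abstract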

In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to effectively integrate mapping priors with different sensor modalities. Extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.

2605.22996 2026-05-25 cs.CV Version update

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

Adil Meric, Lin Geng Foo, Mert Kiray, Benjamin Busam, Rishabh Dabral, Christian Theobalt

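Affiliations * Technical University of Munich; Max Planck Institute for Informatics, Saarland Informatics Campus; Munich Center for Machine Learning (MCML); Obsphera

AI summary This paper proposes CoMoGen, a controllable video generation framework that generates videos with realistic interaction dynamics conditioned on an input image and a binary mask sequence. The method introduces a lightweight MaskAdapter module that encodes the mask sequence into a residual signal and injects it into the Multi Modal Diffusion Transformer (MMDiT) through a cosine-weighted schedule. Fine-tuning only the MMDiT layers responsible for motion generation with Low-Rank Adaptation (LoRA) focuses training on motion-critical components and reduces computational cost. Experiments show that CoMoGen outperforms existing methods in motion fidelity and perceptual realism, reaching state-of-the-art performance.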

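Details
AI abstract (translated from Chinese)

We present CoMoGen, a controllable video generation framework that generates realistic interaction dynamics from an input image and a single binary mask sequence. CoMoGen introduces a lightweight MaskAdapter that encodes the binary mask sequence into a latent residual signal and injects it into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT runs as a sequence of uniform Transformer blocks, so it is hard to determine which layers are responsible for motion generation. We therefore propose a novel way to identify the "Motion Layers" that operate in the MMDiT attention space. We fine-tune the Motion Layers with Low-Rank Adaptation (LoRA), without any architectural change to the MMDiT. This selective adaptation lets our method focus on motion-critical components and thereby reduces computational cost. Despite its simplicity, CoMoGen achieves precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and reaches state-of-the-art performance in motion fidelity and perceptual realism. Project page: mericadil.github.io/CoMoGen.

English abstract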

We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, which is injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for motion generation. Therefore, we propose a novel way to determine the "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by applying Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architecture change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, reducing computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: mericadil.github.io/CoMoGen.

2605.22962 2026-05-25 cs.CV cs.CE cs.HC cs.SE q-bio.NC Version update

GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

Iba Baig, Kevin Li, Yanbin Xu, Seiji Cattelain, Marie Hallo, Hayato Ono, Sho Tsuji, Ming Bo Cai

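Affiliations * Department of Psychology, University of Miami; Northeastern University; Ecole Normale Supérieure, PSL University, EHESS, CNRS; International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study

AI summary This study presents the GazeBehavior Annotation Toolkit (GBAT), an AI-based tool for automatically annotating egocentric eye-tracking and video data from child-caregiver interactions. Using deep learning, the tool provides post-hoc synchronization across multiple videos, semi-automatic annotation of gaze targets, and classification of participants' poses and hand actions, markedly improving the efficiency and scalability of data preprocessing and feature extraction. It offers important support for large-scale, long-term studies of attentional dynamics and naturalistic behavior in early human development.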

Comments submitted to IEEE International Conference on Development and Learning (ICDL), 2026

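Details
AI abstract (translated from Chinese)

Video recordings of child-caregiver interactions make it possible to study attentional dynamics in naturalistic behavior. Such multimodal recordings also let researchers examine in real time how attention interacts with action and language use. However, manually annotating such data is very time-consuming. Here we introduce the GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization of multiple videos, semi-automatic annotation of gaze target categories, and classification of participants' poses and hand actions. The toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. This improvement is essential for supporting large-scale, longitudinal studies of attentional dynamics and naturalistic behavior in early human development.

English abstract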

Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants' poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.

2605.22635 2026-05-25 cs.LG cs.CL cs.CV Version update

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

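Affiliations * School of Computer Science and Technology; Xinjiang University; Information Security Engineering Technology Research Center

AI summary In multi-task radiology report generation, existing linear scalarization strategies struggle to balance the hard constraints of clinical supervision against the smoothness requirements of report generation. This paper analyzes the problem from the perspective of gradient dynamics, shows that it amounts to a "Double Dilemma" of drift term deviation and diffusion term decay, and proposes CAME-Grad, a model-agnostic optimizer that ensures geometric validity and avoids local optima through conflict-averse direction rectification and magnitude-enhanced energy injection. Experiments show that the method significantly improves clinical efficacy across multiple tasks.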

Comments Accepted by ICML 2026

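Details
AI abstract (translated from Chinese)

Although automatic radiology report generation (RRG) based on multi-task learning is widely adopted to ensure clinical consistency, most studies focus on architectural design and remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision against the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, using a stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Building on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm both guarantees geometric validity and avoids local optima. An adaptive gradient fusion mechanism is then used to establish a dynamic balance between the theoretically optimal direction and task-specific inductive biases. Experiments show that, as a general plug-and-play optimizer, CAME-Grad brings significant and consistent improvements to eight different RRG methods, raising overall clinical efficacy by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.

English abstract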

While automatic radiology report generation (RRG) based on multi-task learning is widely adopted to ensure clinical consistency, most studies focus on architectural designs and remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity but also avoids local optima. An adaptive gradient fusion mechanism is then used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that, as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.
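The abstract does not give the CAME-Grad update rule, so the sketch below does not reproduce it. As a stand-in, it uses the well-known gradient-projection idea (in the style of PCGrad) to show what a conflict between task gradients looks like under linear scalarization and how a conflict-averse correction can change the combined direction. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out_conflict(g, other):
    """If g conflicts with `other` (negative dot product), drop g's component along it."""
    d = dot(g, other)
    if d >= 0:
        return list(g)
    scale = d / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]

def combine_conflict_averse(g1, g2):
    """Sum two task gradients after removing their mutually conflicting components."""
    p1 = project_out_conflict(g1, g2)
    p2 = project_out_conflict(g2, g1)
    return [a + b for a, b in zip(p1, p2)]
```

With `g1 = [1.0, 0.0]` and `g2 = [-1.0, 1.0]`, the plain sum is orthogonal to `g1` and makes no first-order progress on the first task, whereas the projected combination has a positive inner product with both gradients.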

2605.22423 2026-05-25 cs.CV Version update

Moment-Reenacting: Inverse Motion Degradation with Cross-shutter Guidance

Xiang Ji, Guixu Lin, Zhengwei Yin, Jiancheng Zhao, Yinqiang Zheng

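Affiliations * Graduate School of Information Science and Technology, The University of Tokyo

AI summary This paper studies how to invert motion degradation caused by fast motion or low light in computational imaging and proposes a unified framework that reconstructs moving scenes by combining the complementary characteristics of global shutter (GS) blur and rolling shutter (RS) distortion. The authors design a dual-shutter system that captures synchronized blur-RS image pairs and build a triaxial imaging system to collect a real-world dataset for training and evaluating the model. The proposed network separates the contextual and temporal characteristics of motion with a dual-stream module and achieves high-quality frame reconstruction, showing superiority and broad applicability in high-speed video reconstruction under complex motion degradation.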

Comments Accepted by TPAMI

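Details
AI abstract (translated from Chinese)

Motion degradation, which appears as blur in global shutter (GS) images or as distortion in rolling shutter (RS) images, remains a fundamental challenge for computational imaging under fast motion or low light. Prior work treats blur decomposition and RS temporal super-resolution as separate tasks and fails to exploit their intrinsic complementarity. This paper proposes a unified framework that inverts motion degradation and reenacts the imaging moment by jointly exploiting the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and show that this combination effectively resolves the temporal and spatial ambiguities inherent in both modalities. To allow a flexible trade-off between performance and cost, we further extend the dual-shutter setup to a narrow-baseline stereo blur-RS configuration. In addition, we build a triaxial imaging system and collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, supporting robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally sensitive representations through a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalization ability of our method and establish a new paradigm for realistic high-speed video reconstruction under complex motion degradation. Code and more resources are available at https://jixiang2016.github.io/dualBR_site/.

English abstract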

Motion degradation, manifested as blur in global shutter (GS) images or as distortion in rolling shutter (RS) images, remains a fundamental challenge in computational imaging, especially under fast motion or low-light conditions. While prior works have treated blur decomposition and RS temporal super-resolution as separate tasks, this separation fails to exploit their intrinsic complementarity. In this paper, we propose a unified framework to invert motion degradation and reenact the imaging moment by jointly leveraging the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and demonstrate that this combination effectively resolves temporal and spatial ambiguities inherent in both modalities. To allow flexible performance-cost trade-offs, we further extend this dual-shutter setup to a stereo blur-RS configuration with a narrow baseline. In addition, we construct a triaxial imaging system to collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, enabling robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally sensitive representations via a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalizability of our approach, establishing a new paradigm for realistic high-speed video reconstruction under complex motion degradations. Code and more resources are available at https://jixiang2016.github.io/dualBR_site/.

2605.22272 2026-05-25 cs.RO cs.CV Version update

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

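Affiliations * Zhejiang University; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-quality 3D data. Existing methods based on video generative priors suffer from representation misalignment because they rely on geometric priors (such as explicit CAD models), and from retargeting complexity because of the intricate morphological retargeting process. This paper proposes Imagine2Real, a geometry-free zero-shot HOI framework that resolves representation misalignment by unifying robot and object motion as 4D point trajectories, sidesteps retargeting error by tracking sparse keypoints, and produces natural motion by using the latent space of a behavior foundation model, ultimately achieving zero-shot physical deployment within a motion capture system.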

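Details
AI abstract (translated from Chinese)

Whole-body Humanoid-Object Interaction (HOI) is limited by the scarcity of high-fidelity 3D data. Although video generative priors offer a promising alternative, existing methods suffer from representation misalignment because they rely on geometric priors (such as explicit CAD models), and from retargeting complexity caused by intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve the misalignment, we unify robot and object motions as 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain a natural gait under these sparse signals, we use the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. With a progressive training strategy, Imagine2Real learns robust behaviors from simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

English abstract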

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

2605.22216 2026-05-25 cs.CV Version update

A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2+ Challenge Track 2

Jinming Chai, Libo Yan, Licheng Jiao, Fang Liu

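Affiliations * School of Artificial Intelligence, Xidian University

AI summary This paper presents a solution for Track 2 of the CVPR 2026 8th UG2+ Challenge (semantic segmentation in adverse weather), which targets the difficulty of segmenting images under bad weather conditions. We design a semi-supervised segmentation pipeline trained only on the WeatherProof dataset provided by the challenge, with no additional data. The method uses UniMatch V2 as the baseline model, treats all degraded-weather images as unlabeled data for semi-supervised learning, and applies test-time augmentation at inference to improve the robustness and accuracy of the segmentation results.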

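Details
AI abstract (translated from Chinese)

This report describes our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For semantic segmentation under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained only on the WeatherProof dataset and uses no additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, making full use of the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available: https://github.com/ylb888/weatherproof-challenge-unimatchv2.

English abstract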

This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: https://github.com/ylb888/weatherproof-challenge-unimatchv2.
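Test-time augmentation for segmentation can be illustrated with a small sketch. The report does not say which augmentations are used, so horizontal flipping is an assumption here, and `predict_probs` is a hypothetical stand-in for the trained model; images and probability maps are plain nested lists to keep the example dependency-free.

```python
def hflip(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]

def predict_with_flip_tta(predict_probs, image):
    """Average per-pixel class probabilities over an image and its mirror.

    predict_probs(image) -> list of per-class probability maps (C x H x W).
    Returns (averaged probability maps, arg-max label map).
    """
    direct = predict_probs(image)
    # Predict on the mirrored image, then mirror each class map back
    # so that it lines up with the original pixel grid.
    mirrored = [hflip(cmap) for cmap in predict_probs(hflip(image))]
    avg = [
        [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(direct, mirrored)
    ]
    h, w = len(image), len(image[0])
    labels = [
        [max(range(len(avg)), key=lambda c: avg[c][y][x]) for x in range(w)]
        for y in range(h)
    ]
    return avg, labels
```

Averaging probabilities before taking the arg-max (rather than voting on labels) keeps the model's confidence information, which is a common design choice for this kind of ensembling.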

2605.22020 2026-05-25 cs.CV Version update

ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

Yuke Li, Weihang Liu, Cheng Zhang, Yuefeng Zhang, Jiadi Cui, Zixuan Wang, Junran Ding, Haoyu Wu, Yujiao Shi, Jingyi Yu, Xin Lou

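Affiliations * ShanghaiTech University; GGU Technology Co., Ltd; Stereye

AI summary This paper proposes ForeSplat, an optimization-aware training framework for feed-forward 3D Gaussian Splatting that aims to improve reconstruction quality under limited network capacity. By introducing MetaGrad, ForeSplat shifts part of the scene-modeling work to the optimizer, so that the feed-forward model produces initializations better suited to subsequent optimization and reaches higher reconstruction accuracy in fewer optimization steps. Experiments show that the method improves reconstruction across a variety of network architectures, offering a practical path to lightweight, high-fidelity 3D reconstruction.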

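Details
AI abstract (translated from Chinese)

Feed-forward 3D Gaussian Splatting models enable fast single-pass reconstruction, but scaling them to match the quality of per-scene optimization is fundamentally limited by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, in which post-prediction optimization makes up for the limited capacity of the feed-forward network. However, standard feed-forward 3DGS is trained only for zero-step rendering error and ignores whether its output gives the downstream optimizer a good initialization. We propose ForeSplat, an optimization-aware training framework that enables feed-forward 3DGS models to produce initializations explicitly designed for fast, effective refinement. By shifting part of the scene-modeling burden to the optimizer, ForeSplat significantly eases the capacity pressure on the feed-forward model, achieving high-quality reconstruction even with compact networks. At its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that avoids costly higher-order differentiation through the 3DGS optimizer. MetaGrad unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates the aggregated first-order gradients to the prediction head as a surrogate optimization-aware signal. This fine-tuning adds no inference cost and achieves high-quality reconstruction within seconds after a few refinement steps. We instantiate ForeSplat on a variety of backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge deployment. Across all tested architectures, initializations trained with ForeSplat converge in fewer refinement steps and reach higher peak reconstruction quality than the original versions, even when fully converged. The framework consistently narrows the gap between amortized prediction and per-scene optimization, opening a practical path to lightweight, high-fidelity 3D reconstruction.

English abstract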

Feed-forward 3D Gaussian Splatting models offer fast single-pass reconstruction, but scaling them to match per-scene optimization quality is fundamentally hindered by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, where post-prediction optimization compensates for the limited capacity of the feed-forward network. However, standard feed-forward 3DGS is trained solely for zero-step rendering error, ignoring whether its output constitutes a good initialization for the downstream optimizer. We present ForeSplat, an optimization-aware training framework that equips feed-forward 3DGS models to produce initializations explicitly designed for rapid, effective refinement. By offloading part of the scene-modeling burden to the optimizer, ForeSplat substantially reduces the capacity pressure on the feed-forward model, making high-quality reconstruction feasible even with compact networks. At its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that bypasses costly higher-order differentiation through the 3DGS optimizer. MetaGrad unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates aggregated first-order gradients to the prediction head as a surrogate optimization-aware signal. This fine-tuning adds no inference cost and enables high-quality reconstruction within seconds after a few refinement steps. We instantiate ForeSplat on diverse backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge deployment. Across all tested architectures, a ForeSplat-trained initialization converges in fewer refinement steps and reaches a higher peak reconstruction quality than its vanilla counterpart, even fully converged. The framework consistently bridges the gap between amortized prediction and per-scene optimization, establishing a practical path toward lightweight, high-fidelity 3D reconstruction.

2605.21906 2026-05-25 cs.CV Version update

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

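Affiliations * Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University; Department of Radiation Oncology and Winship Cancer Institute, Emory University; Department of Electrical and Computer Engineering, Duke University; Department of Computer Science and Informatics, Emory University; Department of Materials Science & Engineering, Nuclear Engineering Program, University of Florida

AI summary This study presents FlexiCT, a CT foundation model trained by agglomerative continual pretraining on more than 260,000 CT scans from 56 public datasets, forming a large-scale resource for CT representation learning. The model is pretrained in three stages, covering two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment, and supports slice-level, volume-level, and vision-language analysis. Experiments show that FlexiCT performs well on multiple downstream tasks and that its embeddings reflect disease phenotype characteristics such as tumor stage, offering a new approach to general-purpose representation learning for CT imaging.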

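Details
AI abstract (translated from Chinese)

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains scattered across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 public datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses three-stage agglomerative pretraining: two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment. This training strategy supports slice-level, volume-level, and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding, and clinical retrieval), FlexiCT matches or exceeds prior task-specific methods on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with different tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

English abstract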

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment. This training strategy supports slice-level, volume-level, and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding, and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

2605.21605 2026-05-25 cs.CV Version update

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu

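Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Meituan; The Hong Kong University of Science and Technology; National University of Singapore

AI summary This paper proposes GenEvolve, a self-evolving image generation framework designed to meet increasingly complex and diverse image generation demands. Through tool-orchestrated visual experience distillation, the agent learns and refines its strategy on its own during generation, including key steps such as evidence gathering, reference selection, and prompt construction. By comparing different generation trajectories, GenEvolve distills structured visual experience and uses it to guide model training, markedly improving generation quality and efficiency and outperforming existing methods on multiple benchmarks.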

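Details
AI abstract (translated from Chinese)

Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across a variety of generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory in which the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that rely mainly on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts the best-worst differences into structured visual experience, which is provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision that helps the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show significant gains over strong baselines, reaching state-of-the-art performance among current image-generation frameworks. Our website: https://ephemeral182.github.io/GenEvolve/

English abstract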

Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that mainly rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision, helping the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. Our website is as follows: https://ephemeral182.github.io/GenEvolve/

2605.21489 2026-05-25 cs.LG cs.AI cs.CV stat.CO stat.ML Version update

Variance Reduction for Expectations with Diffusion Teachers

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

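Affiliations * NVIDIA; University of Toronto; Princeton University

AI summary This paper studies how to reduce the variance of gradient estimates when a pretrained diffusion model serves as a "teacher" for downstream tasks such as text-to-3D generation and single-step distillation. It proposes CARV, a compute-aware variance-control framework whose hierarchical Monte Carlo estimator pairs expensive upstream computation with cheap diffusion-noise resampling and combines timestep importance sampling with a stratified inverse-CDF construction, effectively reducing compute cost. Experiments show that CARV markedly improves compute efficiency without changing the objective, although in some tasks lower gradient variance does not improve generation quality, indicating that variance is no longer the performance bottleneck there.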

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

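Details
AI abstract (translated from Chinese)

Pretrained diffusion models serve as frozen teachers that support downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because every draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical Monte Carlo estimator: it amortizes expensive upstream computation over cheap diffusion-noise resamples and is strengthened by timestep importance sampling and a stratified inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers without changing the objective (mostly from amortized reuse; about 25% from IS plus stratification); in single-step distillation, the same techniques reduce gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.

English abstract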

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
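The three ingredients named in the abstract (amortized reuse, timestep importance sampling, stratified inverse-CDF sampling) can be sketched in a toy estimator. This is an illustration under stated assumptions rather than CARV itself: the target distribution over timesteps is assumed uniform, `upstream` and `integrand` are hypothetical callables standing in for the expensive upstream work and the per-draw teacher term, and the proposal `probs` is assumed strictly positive.

```python
import bisect
import random

def stratified_timesteps(probs, n, rng):
    """Draw n timestep indices from `probs`, one uniform per stratum of [0, 1)."""
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    draws = []
    for k in range(n):
        u = (k + rng.random()) / n * total  # stratified uniform, scaled to the CDF range
        draws.append(min(bisect.bisect_right(cdf, u), len(probs) - 1))
    return draws

def hierarchical_estimate(upstream, integrand, probs, n_outer, n_inner, rng):
    """Estimate E_x E_{t ~ uniform}[integrand(x, t, rng)] with importance-weighted inner draws."""
    num_t, z = len(probs), sum(probs)
    acc = 0.0
    for _ in range(n_outer):
        x = upstream(rng)  # expensive step, computed once and reused n_inner times
        for t in stratified_timesteps(probs, n_inner, rng):
            weight = (1.0 / num_t) / (probs[t] / z)  # uniform target over proposal
            acc += weight * integrand(x, t, rng)
    return acc / (n_outer * n_inner)
```

The outer loop pays for the upstream computation once per sample, while the inner loop spends only cheap resamples; the importance weights keep the estimate unbiased for the uniform-timestep objective even though timesteps are drawn from `probs`.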

2605.21487 2026-05-25 cs.CV Version update

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

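Affiliations * CUHK MMLab; Meituan; TJU; USTC

AI summary This paper proposes Uni-Edit, an intelligent image editing task that serves as a general task for tuning Unified Multimodal Models (UMMs). Unlike conventional mixed multi-task training, Uni-Edit improves a model's image understanding, generation, and editing abilities at once with a single task, a single training stage, and a single dataset. The study introduces an automated, scalable data synthesis method that turns diverse visual question answering data into complex, effective editing instructions, which markedly improves editing performance, and validates the gains across multimodal abilities on multiple benchmarks.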

Comments Project Page: https://zhengdian1.github.io/Uni-Edit-proj/ Code: https://github.com/zhengdian1/Uni-Edit

AI中文摘要

目前,增强统一多模态模型(UMMs)的图像理解、生成和编辑能力主要依赖于混合多任务训练。由于固有的任务冲突,这种策略需要复杂的多阶段流水线、大量数据混合和平衡技巧,仅能实现性能折衷而非真正的相互增强。为了打破这一范式,我们提出Uni-Edit,一种智能图像编辑任务,作为UMM调优的第一个通用任务。与复杂的混合流水线不同,Uni-Edit仅使用一个任务、一个训练阶段和一个数据集,即可同时提升所有三种能力。具体来说,我们首先识别出图像编辑本质上是一个理想的通用任务,因为它自然需要视觉理解和生成。然而,现有的编辑数据依赖于过于简单的指令,严重低估了模型的理解能力。为解决这一问题,我们引入了第一个自动化且可扩展的智能编辑数据合成流水线,将多样化的VQA数据转化为复杂且有效的编辑指令,其中嵌入了问题和嵌套逻辑。由此产生了Uni-Edit-148k数据集,将多样化的推理密集型指令与高质量编辑图像配对。在BAGEL和Janus-Pro上的大量实验表明,仅对Uni-Edit进行调优即可在所有三种能力上实现全面增强,无需任何辅助操作。

英文摘要

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

2605.21139 2026-05-25 cs.CV cs.LG 版本更新

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏思考,预见行动:面向自动驾驶的认知-物理强化学习

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie

发表机构 * NJU(南京大学) SJTU(上海交通大学) FDU(复旦大学)

AI总结 当前端到端自动驾驶模型受到模仿学习行为克隆天花板的限制,为此,本文提出CoPhy认知-物理强化学习框架,通过将视觉语言模型知识蒸馏到鸟瞰图编码器中,实现零推理成本的认知能力,并构建自回归的鸟瞰图世界模型以预测候选动作的未来语义地图,从而在物理环境层面预见行动后果。该方法结合物理奖励和认知奖励优化驾驶策略,不仅在NAVSIM基准上取得最优性能,还支持通过用户定义的语言指令实现更安全、更灵活的驾驶控制。

AI中文摘要

当前的端到端自动驾驶模型从根本上受到模仿学习的行为克隆上限的限制。虽然强化学习提供了更智能自主性的路径,但它需要两个缺失的基础设施:(1)理解交通语义和驾驶意图的认知基础,以及(2)能够预见候选行动后果的前瞻性物理环境。为此,我们提出了CoPhy,一个用于自动驾驶的认知-物理强化学习框架。为了蒸馏思考,我们将VLM知识蒸馏到BEV编码器中,然后完全丢弃VLM,以零推理成本保留认知能力,同时将认知通道作为可插拔接口释放,用于可选的人类语言命令。为了预见行动,我们构建了一个自回归BEV世界模型,该模型明确预测以候选行动为条件的未来语义地图,作为一个可解释的物理沙盒,从中直接推导出安全指标。基于这一双重基础设施,我们通过GRPO优化驾驶策略,采用新颖的双奖励机制:从BEV rollout导出的物理奖励强制执行硬安全约束,而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明,CoPhy不仅在NAVSIM v1和v2基准上取得了最先进的结果,而且通过认知信息化的场景合规性和通过用户定义的语言指令实现的灵活意图控制,实现了更安全的驾驶。

英文摘要

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.
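
As a rough illustration of the GRPO step mentioned in the abstract above, the sketch below standardizes a combined reward within one group of rollouts sampled for the same scene. The additive combination, the weight `w_cognitive`, and the function name are assumptions; the abstract does not say how the physical and cognitive rewards are merged.

```python
import statistics


def group_relative_advantages(physical, cognitive, w_cognitive=0.5, eps=1e-8):
    """Combine two per-rollout rewards and standardize them within one group."""
    # Assumed combination: physical reward plus a weighted cognitive reward.
    rewards = [p + w_cognitive * c for p, c in zip(physical, cognitive)]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    # Rollouts better than the group average get positive advantages.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative to the group, no learned value function is needed; a group whose rollouts all score the same contributes zero gradient.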

2605.18329 2026-05-25 cs.CV cs.LG 版本更新

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

迷失在折叠中:当交叉验证不是用于不确定性估计的深度集成时

Tristan Kirscher, Markus Bujotzek, Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Kim-Celine Kahl, Balint Kovacs, Klaus Maier-Hein

发表机构 * ICube Laboratory, CNRS UMR-7357, University of Strasbourg, Strasbourg, France(ICube实验室,法国斯特拉斯堡大学) CLCC Institut-Strauss, Strasbourg, France(CLCC施特劳斯研究所,法国斯特拉斯堡) German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing(海德堡德国癌症研究中心(DKFZ)医学影像计算部门) Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany(海德堡医学院,海德堡大学) Faculty of Mathematics and Computer Science, University of Heidelberg, Germany(海德堡大学数学与计算机科学学院) Helmholtz Imaging, German Cancer Research Center, Heidelberg, Germany(海德堡德国癌症研究中心Helmholtz成像部门) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany(海德堡大学医院放射肿瘤学部模式分析与学习小组)

AI总结 在医学图像分割中,集成模型的分歧常被用作认识论不确定性的代理,但许多研究通过K折交叉验证(CV)构建集成模型,却称之为“深度集成”(DE),导致术语与实现不一致。本文对比了标准5折CV集成与5成员DE在三个多标注分割数据集上的表现,发现DE在保持分割精度的同时,提升了校准和失败检测能力,而CV集成有时与标注者间差异相关性更强。研究指出,应根据研究目标选择集成构建方式:DE适用于可靠性导向任务(如选择性转诊),CV集成则更适合作为模糊性代理。

Comments Accepted for publication at MICCAI 2026

Journal ref
29th International Conference On Medical Image Computing And Computer Assisted Intervention, Sep 2026, Strasbourg, France
AI中文摘要

集成不一致性被广泛用作医学图像分割中认知不确定性的代理。在实践中,许多研究通过K折交叉验证(CV)形成集成,却称之为“深度集成”(DE)。由于CV成员在不同的数据子集上训练,它们的不一致性混合了种子驱动变异和数据暴露效应,这可能改变不确定性的解释方式。我们审查了最近的分割不确定性研究,发现术语与实现不匹配很常见。然后,我们在涵盖三种模态的三个多标注者分割数据集上,在其他配置完全相同的条件下比较了标准5折CV集成与5成员DE(固定训练集,不同随机种子)。我们从校准、故障检测、歧义建模以及分布偏移下的鲁棒性四个方面评估不确定性。DE在匹配分割精度的同时改善了校准和故障检测,而CV集成在研究数据集上有时与标注者间变异性相关性更强。因此,应选择与研究问题匹配的集成构建方式:DE用于可靠性导向的使用(如选择性转诊/故障检测),CV集成作为歧义的代理。我们提供了一个轻量级的nnU-Net修改,使得在默认流程内能够进行DE训练。

英文摘要

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
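
The distinction the abstract draws can be made concrete with a small sketch of how the two kinds of ensemble members are set up. This is only an illustration under assumed names (`cv_ensemble_plan`, `deep_ensemble_plan`) and an assumed interleaved fold split; it is not the nnU-Net modification the authors provide.

```python
def cv_ensemble_plan(case_ids, k=5, base_seed=0):
    """K-fold CV ensemble: member i holds out fold i, so training sets differ."""
    folds = [case_ids[i::k] for i in range(k)]  # simple interleaved split (assumed)
    plan = []
    for i in range(k):
        held_out = set(folds[i])
        train = [c for c in case_ids if c not in held_out]
        # Both the seed and the training data vary from member to member.
        plan.append({"seed": base_seed + i, "train": train})
    return plan


def deep_ensemble_plan(case_ids, m=5, base_seed=0):
    """Deep ensemble: every member sees the full training set; only the seed changes."""
    return [{"seed": base_seed + i, "train": list(case_ids)} for i in range(m)]
```

In the CV plan, disagreement between members can come from either the seed or the held-out data; in the deep-ensemble plan, the seed is the only source.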

2605.15828 2026-05-25 cs.CV 版本更新

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

并非所有任务量化平等:面向视觉几何Transformer的Fisher引导量化

Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang, Zhulin An, Yongjun Xu, Wei Zhang

发表机构 * Department of Electronic and Computer Engineering, HKUST(香港科技大学电子与计算机工程系) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院人工智能安全国家重点实验室) University of Chinese Academy of Sciences(中国科学院大学) School of Data Science and Engineering, South China Normal University(华南师范大学数据科学与工程学院)

AI总结 本文研究了如何在视觉几何变换器(VGGT)等前馈3D重建模型中进行有效的量化,以降低模型的内存和计算开销。针对不同任务、块和通道对量化误差的敏感性差异,作者提出了一种基于Fisher信息矩阵的引导量化方法(FGQ),通过量化不同组件对任务的重要性,在校准过程中动态调整仿射变换,从而更有效地保留关键信息。实验表明,FGQ在多个3D视觉任务中显著优于现有方法,在4位量化下相对提升了高达39%的性能。

AI中文摘要

以视觉几何基础Transformer(VGGT)为代表的前馈3D重建模型,在单次前向传播中联合预测多个视觉几何任务,如深度估计、相机姿态预测和点云重建。它们已广泛应用于3D视觉应用,但其十亿级参数带来了巨大的内存和计算开销,给设备端部署带来挑战。训练后量化(PTQ)是减少这种开销的有效技术。现有的前馈3D模型PTQ方法主要关注处理重尾激活分布和构建多样化的校准数据集。然而,我们观察到前馈3D模型通过共享骨干网络预测多个几何属性,其中不同的Transformer块和隐藏通道对每个任务的贡献不同,导致不同任务、块和通道对量化误差的敏感性差异显著。因此,平等对待所有任务会过度强调不敏感的任务,并导致敏感任务上的显著精度损失。为解决此问题,我们提出面向前馈3D重建模型的Fisher引导量化(FGQ)。具体地,FGQ使用对角Fisher信息矩阵来量化不同任务、块和通道的敏感性,并在校准期间将这些敏感性纳入可学习仿射变换中,以更好地保留对每个任务最关键的通道和块。在相机姿态估计、点云重建和深度估计上的大量实验表明,FGQ在VGGT上始终优于最先进的量化基线,在4比特量化下实现了高达39%的相对改进。代码可在https://github.com/ypzhng/FGQ获取。

英文摘要

Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under 4-bit quantization. Code is available at https://github.com/ypzhng/FGQ.
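
For readers unfamiliar with the diagonal Fisher information used above, a common empirical estimate is the mean of squared per-sample gradients. The sketch below computes that estimate per task and normalizes it per channel; the function names and the normalization are assumptions, and how FGQ feeds these values into its learnable affine transformation is not reproduced here.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared gradients, one value per channel."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[j] ** 2 for g in per_sample_grads) / n for j in range(dim)]


def channel_sensitivity(task_grads, eps=1e-12):
    """Normalize each task's diagonal Fisher so channels are comparable within a task."""
    out = {}
    for task, grads in task_grads.items():
        fisher = diagonal_fisher(grads)
        total = sum(fisher) + eps  # eps avoids division by zero for all-zero gradients
        out[task] = [f / total for f in fisher]
    return out
```

A channel with a large share for one task and a small share for another is the kind of task-dependent sensitivity the abstract describes.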

2605.11596 2026-05-25 cs.CV 版本更新

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

HorizonDrive: 用于长时域驾驶仿真的自纠正自回归世界模型

Conglang Zhang, Yifan Zhan, Qingjie Wang, Zhanpeng Ouyang, Yu Li, Zihao Yang, Xiaoyang Guo, Weiqiang Ren, Qian Zhang, Zhen Dong, Yinqiang Zheng, Wei Yin, Zhengqing Chen

发表机构 * Wuhan University(武汉大学) The University of Tokyo(东京大学) Horizon Robotics(地平线) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出HorizonDrive,一种用于长时域驾驶模拟的自纠正自回归世界模型。该方法通过引入计划性滚转恢复(SRR)机制,使教师模型能够在长序列自回归滚转中保持稳定,并利用教师自身的自回归滚转提供无限时域的监督,从而在有限内存下实现分钟级的滚转。实验表明,HorizonDrive在多项指标上显著优于现有方法,提升了驾驶模拟的质量与效率。

Comments 22 pages, 14 figures. Project page: https://zcliangyue.github.io/HorizonDrive Code: https://github.com/zcliangyue/HorizonDrive

AI中文摘要

闭环驾驶仿真需要超越短时离线片段的实时交互,推动当前驾驶世界模型向自回归(AR)滚转发展。现有的AR蒸馏方法通常依赖于帧沉或学生端退化训练。前者由于快速的自我运动和场景变化,难以迁移到驾驶场景;后者受限于教师单次输出长度,仅提供有限的监督时域。一个自然的问题是:能否通过AR滚转扩展教师本身,以有限的内存成本提供无限时域的监督?关键困难在于标准教师会在自身预测下漂移,污染其提供的监督。我们的关键见解是使教师具备滚转能力,确保从其自身的AR滚转中获得可靠监督。这实例化为HorizonDrive,一个用于AR驾驶仿真的抗漂移训练与蒸馏框架。首先,计划性滚转恢复(SRR)训练基础模型从预测损坏的历史中重建真实未来片段,得到一个在长AR滚转中保持稳定的教师。其次,通过AR滚转扩展具备滚转能力的教师,在有限内存下提供长时域分布匹配监督,同时短窗口学生通过教师滚转DMD(TRD)与之对齐,以实现高效的实时部署。HorizonDrive原生支持在有限内存下的分钟级AR滚转;在nuScenes上,与最强的长时域流式基线相比,HorizonDrive将FID降低52%,FVD降低37%,并将ARE和DTW分别降低21%和9%,同时与单次驾驶视频生成器保持竞争力。

英文摘要

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.

2605.07590 2026-05-25 cs.CV 版本更新

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

超越防御:面向内在3D点云鲁棒性的流形对齐正则化

Pedro Alonso, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(计算机与人工智能学院,西南交通大学) Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Southwest Jiaotong University(可持续城市智能交通工程研究中心,教育部,西南交通大学)

AI总结 尽管点云鲁棒性研究已取得进展,但现有方法多依赖数据增强或防御机制,忽视了对抗脆弱性的几何本质。本文提出一种基于流形对齐的正则化方法,认为3D网络的对抗脆弱性源于模型学习的潜在几何结构与点云表面内在几何之间的不匹配。通过引入Manifold-Aligned Point Recognition(MAPR)框架,在不依赖对抗训练或额外数据的情况下,有效提升了模型在多个数据集上的鲁棒性。

AI中文摘要

尽管点云鲁棒性研究取得了广泛进展,现有方法主要依赖增强策略或防御机制,却忽视了对抗脆弱性的几何本质。我们假设3D网络中的对抗脆弱性源于模型学习的潜在几何与底层表面的内在几何之间的流形错位。沿输入流形的微小几何保持扰动往往在特征空间中引起不成比例的扭曲,可能导致误分类。我们通过建立3D鲁棒性的几何解释来形式化这一现象,将经典对抗理论与点云的内在结构联系起来。受此分析启发,我们提出了流形对齐点识别(MAPR),该框架通过跨内在扰动对齐预测来正则化潜在几何。MAPR为每个点云增强捕获局部曲率和扩散结构的内在特征,并应用保持内在几何保持扰动不变性的一致性损失。在不依赖对抗训练或额外数据的情况下,MAPR在多个数据集上持续提升对多种对抗攻击的鲁棒性,在ModelNet40和ScanObjectNN上分别比原始模型平均提高+20.02和+8.83个百分点的鲁棒性。

英文摘要

Despite extensive progress in point cloud robustness, existing methods primarily rely on augmentation strategies or defense mechanisms while overlooking the geometric nature of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, potentially leading to misclassifications. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness under multiple adversarial attacks across several datasets, achieving average robustness gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.
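
One generic way to express a prediction-consistency loss of the kind the abstract mentions is a KL divergence between class distributions for a clean and a perturbed input. The sketch below is that generic form only; the choice of KL, the function names, and the use of raw logits are assumptions, and MAPR's intrinsic features and perturbations are not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(logits_clean, logits_perturbed, eps=1e-12):
    """KL(p_clean || p_perturbed): zero when both predictions agree."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Minimizing this term pushes the model to give the same prediction for a shape and for its geometry-preserving perturbation.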

2605.06094 2026-05-25 cs.CV cs.AI 版本更新

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD: 通过结构化自蒸馏增强视频推理

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

发表机构 * HUST(华中科技大学) Wuhan University(武汉大学) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 本文提出VISD,一种用于增强视频推理的结构化自蒸馏框架,旨在解决视频大语言模型在复杂推理任务中因稀疏奖励和细粒度信用分配不足而导致的学习效率低下的问题。VISD引入了一个视频感知的评判模型,将推理质量分解为答案正确性、逻辑一致性和时空定位等多个维度,并利用结构化反馈指导教师策略进行细粒度的标记级监督。通过方向与幅度解耦机制,VISD稳定地将密集监督与强化学习结合,显著提升了推理准确性和训练效率。实验表明,VISD在多个基准测试中均优于现有方法,且收敛速度更快。

AI中文摘要

训练视频大语言模型进行复杂推理仍然具有挑战性,原因在于稀疏的序列级奖励以及缺乏对长时间、时间上接地推理轨迹的细粒度信用分配。虽然具有可验证奖励的强化学习提供了可靠的监督,但它无法捕捉令牌级贡献,导致学习效率低下。相反,现有的自蒸馏方法提供密集监督,但缺乏结构和诊断特异性,并且通常与强化学习交互不稳定。在这项工作中,我们提出了VISD,一个结构化自蒸馏框架,为视频推理引入诊断上有意义的特权信息。VISD采用视频感知判断模型,将推理质量分解为多个维度,包括答案正确性、逻辑一致性和时空接地性,并使用这种结构化反馈指导教师策略进行令牌级监督。为了将密集监督与强化学习稳定集成,我们引入了方向-幅度解耦机制,其中由奖励计算的展开级优势决定更新方向,而结构化特权信号调节令牌级更新幅度。这种设计实现了语义对齐和细粒度的信用分配,提高了推理忠实度和训练效率。此外,VISD结合了课程调度和基于指数移动平均的教师稳定化,以支持长视频序列上的鲁棒优化。在多个基准上的实验表明,VISD始终优于强基线,提高了答案准确性和时空接地质量。值得注意的是,VISD在优化步骤中实现了近2倍的收敛速度,突出了结构化自监督在提高视频大语言模型性能和样本效率方面的有效性。

英文摘要

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
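
The direction-magnitude decoupling and the EMA teacher described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the paper's formulation: the floor, the mean-one normalization, and both function names are assumptions.

```python
def token_coefficients(advantage, privileged_scores, floor=0.1):
    """Per-token update coefficients for one rollout.

    The sign comes only from the rollout-level advantage; the non-negative
    privileged scores only rescale the size of each token's update.
    """
    mags = [max(floor, s) for s in privileged_scores]  # keep magnitudes positive
    mean = sum(mags) / len(mags)
    # Normalizing to mean 1 keeps the average coefficient equal to the advantage.
    return [advantage * m / mean for m in mags]


def ema_update(teacher, student, decay=0.99):
    """Move teacher parameters slowly toward the student (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

Since every magnitude is positive, a token can be emphasized or de-emphasized but never pushed against the direction set by the reward.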

2605.06088 2026-05-25 cs.CV 版本更新

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

OpenGaFF: 基于码本注意力的开放词汇高斯特征场

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

发表机构 * Technical University of Munich(慕尼黑技术大学) Google(谷歌) Munich Center for Machine Learning(慕尼黑机器学习中心) Visualais

AI总结 本文提出了一种名为 OpenGaFF 的新型框架,用于实现开放词汇的3D场景理解。该方法基于3D高斯泼溅(3D Gaussian Splatting)技术,通过引入高斯特征场,将语义建模为高斯几何和外观的连续函数,从而增强几何与语义之间的关联性,提升3D空间中语义的一致性。此外,作者设计了一个结构化码本和基于码本引导的注意力机制,以实现对开放词汇的鲁棒推理,并减少物体内部特征的差异。实验表明,该方法在多个标准2D和3D开放词汇基准测试中均优于现有方法,取得了更优的分割质量与更强的3D语义一致性。

AI中文摘要

理解基于高斯表示的开放词汇3D场景仍然具有挑战性,因为多视角观测下的语义预测碎片化且空间不一致。在本文中,我们提出了OpenGaFF,一个基于3D高斯泼溅构建的开放词汇3D场景理解新框架。我们方法的核心是一个高斯特征场,它将语义建模为高斯几何和外观的连续函数。通过显式地将语义预测条件于几何结构,该公式加强了几何与语义之间的耦合,从而在3D空间中相似结构上实现了更好的空间一致性。为了进一步强制执行对象级语义一致性,我们引入了一个结构化码本,作为一组共享的语义基元。此外,提出了一种码本引导的注意力机制,通过查询嵌入与学习到的码本条目之间的相似性匹配来检索语言特征,从而实现鲁棒的开放词汇推理,同时减少对象内特征方差。在标准2D和3D开放词汇基准上的大量实验表明,我们的方法持续优于先前的方法,实现了改进的分割质量、更强的3D语义一致性以及一个语义可解释的码本,为学习到的表示提供了洞察。

英文摘要

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
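
A hedged sketch of what similarity-based retrieval from a codebook can look like: a query embedding is compared with each codebook key by cosine similarity, and the associated language features are blended with softmax weights. The soft (rather than hard) matching, the temperature, and all names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def codebook_attention(query, keys, values, temperature=0.1):
    """Blend codebook values with softmax weights over query-key similarity."""
    sims = [cosine(query, k) / temperature for k in keys]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]  # subtract max for stability
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights
```

Because every Gaussian draws its feature from the same small set of entries, points on one object tend to receive near-identical features, which is the variance reduction the abstract refers to.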

2605.05997 2026-05-25 cs.CV 版本更新

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

4DThinker: 用4D图像进行动态空间理解的思考

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

发表机构 * Tsinghua University, SIGS(清华大学 SIGS) Meituan(美团) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学) LMMs-Lab(LMMs实验室) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出了一种名为4DThinker的新型框架,旨在通过动态的潜空间心理图像使视觉语言模型(VLMs)具备四维(4D)动态空间推理能力。该方法引入了无需标注的数据生成流程和动态图像微调(DIFT)技术,结合文本与4D潜变量进行联合监督,从而增强模型对动态视觉语义的理解。此外,基于奖励的4D强化学习(4DRL)进一步提升了模型在复杂推理任务中的表现,实验表明该方法在多个动态空间推理基准测试中均优于现有方法。

Comments 21 pages, 16 figures

AI中文摘要

从单目视频中进行动态空间推理对于连接视觉智能与物理世界至关重要,但对视觉语言模型(VLM)仍然具有挑战性。先前的方法要么将时空推理完全表述为文本,这对于复杂动态来说本质上是冗长且不精确的,要么依赖外部几何模块,这增加了推理复杂性而不培养内在模型能力。在本文中,我们提出了4DThinker,这是第一个使VLM能够通过动态潜在心理图像(即在连续隐藏空间内模拟场景如何演化)进行“4D思考”的框架。具体来说,我们首先引入了一个可扩展的、无需标注的数据生成流程,从原始视频中合成4D推理数据。然后我们提出了动态图像微调(DIFT),它联合监督文本令牌和4D潜在变量,将模型锚定在动态视觉语义中。在此基础上,4D强化学习(4DRL)通过基于结果的奖励进一步处理复杂推理任务,将策略梯度限制在文本令牌上以确保稳定优化。在多个动态空间推理基准上的大量实验表明,4DThinker始终优于强基线,并为VLM中的4D推理提供了新视角。我们的代码可在https://github.com/zhangquanchen/4DThinker获取。

英文摘要

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.

2605.01018 2026-05-25 cs.CV 版本更新

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

WildTableBench:在真实场景中评估多模态基础模型的表格理解能力

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

发表机构 * The University of Queensland(昆士兰大学) Stanford University(斯坦福大学) The Australian National University(澳大利亚国立大学) Zhejiang University(浙江大学) Murdoch University(莫道克大学) The University of Adelaide(阿德莱德大学)

AI总结 WildTableBench 是一个用于评估多模态基础模型在真实场景下理解表格图像能力的基准测试。该研究引入了包含402张来自不同领域的真实表格图像和928个手动标注问题的数据集,用于测试模型在结构感知和数值推理方面的能力。实验表明,目前主流的多模态模型在该基准上的表现普遍较低,仅有一款模型准确率超过50%,揭示了当前模型在处理复杂表格图像时仍存在显著不足。

AI中文摘要

使用多模态基础模型分析表格图像是消费和企业场景中高价值但具有挑战性的应用。尽管其重要性,当前评估主要依赖于结构化文本表格或干净渲染的图像,忽视了真实世界表格图像的视觉复杂性。这些图像具有多样的布局和领域,需要复杂的结构感知和数值推理。为弥补这一差距,我们引入了WildTableBench,这是第一个针对真实世界设置中自然出现的表格图像的问答基准。WildTableBench包含从跨领域在线论坛和网站收集的402张高信息密度表格图像,以及928个手动标注和验证的问题,涵盖五个类别的17个子类型。我们在此基准上评估了21个前沿专有和开源多模态基础模型。仅有一个模型准确率超过50%,其余模型准确率在4.1%至49.9%之间。我们进一步进行诊断分析以表征模型失败,并揭示结构感知和推理方面的持续弱点。这些结果和分析为当前模型能力提供了有用的见解,并将WildTableBench建立为表格图像理解的有价值的诊断基准。数据集:https://huggingface.co/datasets/jzhuang/WildTableBench 代码:https://github.com/hjzhe/WildTableBench 排行榜:https://hjzhe.github.io/WildTableBench

英文摘要

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding. Dataset: https://huggingface.co/datasets/jzhuang/WildTableBench Code: https://github.com/hjzhe/WildTableBench Leaderboard: https://hjzhe.github.io/WildTableBench

2604.27247 2026-05-25 cs.CV 版本更新

Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

面向地球观测数据中树篱与线性木本特征的可泛化映射:德国国家产品

Thorsten Hoeser, Verena Huber-Garcia, Sarah Asam, Ursula Gessner, Claudia Kuenzer

发表机构 * Earth Observation Center (EOC), German Aerospace Center (DLR)(地球观测中心(EOC),德国航空航天中心(DLR))

AI总结 本文旨在从地球观测数据中生成适用于全国范围的可推广的树篱和线性木本特征地图,以支持生态管理和保护。研究提出了一种模块化的工作流程,包含一个灵活的数据接口和一个深度神经网络,分别用于生成木本植被掩膜和区分线性与非线性结构。该方法在德国全国范围内应用了三种不同分辨率的数据源,无需重新训练模型即可生成线性木本特征地图,并在各评估区域取得了具有竞争力的结果。

Comments 33 pages, 17 figures

AI中文摘要

树篱和其他线性木本特征在集约化管理的农业景观中提供宝贵的生态系统服务。它们是气候适应和生物多样性的关键要素,不仅因为其高度变化的植物区系,还作为许多动物和昆虫(包括有价值的传粉者)的觅食、休息和筑巢场所。因此,它们需要专门的管理、保护和关注。从地球观测数据中对这些特征进行系统化和大规模制图具有重要意义。然而,考虑到传感器类型、空间分辨率、数据采集条件以及研究区域复杂的景观变异性,可转移和可复用的线性木本特征制图工作流仍然是一个关键的方法论挑战。我们引入了一个模块化工作流,围绕两个独立可优化的组件构建。首先,一个灵活的输入数据接口,将异构的地球观测数据整合为二值木本植被掩膜;其次,一个深度神经网络,训练用于区分这些掩膜中的线性形状和非线性形状。我们通过使用单个训练模型(无需重新训练)从三个输入源(空间分辨率分别为0.73米、1米和3米)推导出覆盖整个德国的三个全国尺度线性木本特征图来演示该工作流。与来自四个联邦州生物群落制图活动的精细参考数据进行的评估,以及与两个现有线性木本特征图的比较表明,该工作流在全国所有评估站点均产生具有竞争力的结果。其模块化设计及其在全国尺度上的适用性为超越德国的可扩展和可泛化线性木本特征制图提供了基础。

英文摘要

Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity, among other functions, not only because of their highly varied flora but also as feeding, resting, and nesting places for many animals and insects, including valuable pollinators. They therefore require dedicated management, preservation, and attention, which makes systematic, large-scale mapping of these features from Earth observation data highly important. However, transferable and reusable workflows for linear woody feature mapping remain a key methodological challenge, given the diversity of sensor types, spatial resolutions, data acquisition conditions, and complex landscape variability encountered across study areas. We introduce a modular workflow built around two independently optimizable components. The first is a flexible input data interface that consolidates heterogeneous Earth observation data into a binary woody vegetation mask; the second is a deep neural network trained to separate linear from non-linear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps for all of Germany from three input sources with 0.73 m, 1 m and 3 m spatial resolution, respectively, using a single trained model without retraining. Evaluation against refined reference data from four federal state biotope mapping campaigns and comparison with two existing linear woody feature maps demonstrate that the workflow produces competitive results across all evaluation sites on a national level. The modular design and its demonstrated applicability at national scale provide a foundation for scalable and generalizable linear woody feature mapping beyond Germany.

2604.25755 2026-05-25 quant-ph cs.CV physics.comp-ph 版本更新

Quantum-Inspired Robust and Scalable SAR Object Classification

量子启发的鲁棒可扩展SAR目标分类

Maximilian Scharf, Marco Trenti, Felix Bock, Padraig Davidson, Tobias Brosch, Benjamin Rodrigues de Miranda, Sigurd Huber, Timo Felser

发表机构 * Tensor AI Solutions GmbH(Tensor AI解决方案有限公司) Ulm University(乌尔姆大学) Institute for Complex Quantum Systems(复杂量子系统研究所) Hensoldt Sensors GmbH(亨索尔特传感器有限公司) German Aerospace Center (DLR)(德国航空航天中心(DLR)) Microwaves and Radar Institute(微波与雷达研究所)

AI总结 本文研究了合成孔径雷达(SAR)图像分类中面对噪声干扰和动态范围大的挑战,以及在边缘设备上部署时对模型鲁棒性与效率的平衡需求。研究探索了张量网络在提升分类鲁棒性及模型压缩方面的潜力,特别是其对数据中毒攻击的抵御能力。与以往基于传统神经网络的方法不同,本文聚焦于张量网络在目标分类中的鲁棒性与模型简化能力,表明其在应对复杂环境和资源限制方面具有显著优势,为雷达应用和深度学习方法提供了新的见解。

Comments 6 pages, 6 figures, EUSAR 2026 conference

AI中文摘要

SAR图像分类自然需要处理巨大的噪声和高动态范围,特别需要鲁棒的分类模型。此外,这些模型在边缘设备(如无人机和军用飞机)上的部署需要在模型大小和分类精度之间仔细平衡。本研究探索了张量网络满足这些鲁棒性要求的潜力,特别评估了它们对数据投毒的抵抗力。与以往专注于传统神经网络进行SAR目标检测的工作不同,本研究聚焦于张量网络在目标分类中的鲁棒性和模型缩减能力。我们的发现表明,张量网络擅长同时解决鲁棒性挑战和模型效率需求,从而为雷达应用和深度学习方法的持续讨论贡献了有价值的见解。

英文摘要

SAR image classification inherently has to deal with heavy noise and a high dynamic range, which calls for particularly robust classification models. Additionally, the deployment of these models on edge devices, such as drones and military aircraft, requires a careful balance between model size and classification accuracy. This study explores the potential of tensor networks to meet these robustness requirements, specifically evaluating their resilience to data poisoning. Unlike previous works that concentrated on conventional neural networks for SAR object detection, this research focuses on the robustness and model reduction capabilities of tensor networks in object classification. Our findings indicate that tensor networks are adept at addressing both the challenges of robustness and the need for model efficiency, thereby contributing valuable insights to the ongoing discourse in radar applications and deep learning methodologies in general.

2604.13596 2026-05-25 cs.CV version update

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

VGGT-Segmentor: 几何增强的跨视角分割

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

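Affiliations: Hangzhou International Innovation Institute of Beihang University; Beihang University

AI summary: This paper proposes VGGT-Segmentor (VGGT-S), a geometry-enhanced cross-view segmentation framework for instance-level object segmentation from first-person to third-person views. It builds on the strong cross-view feature representation of VGGT and adds a new Union Segmentation Head that produces pixel-accurate masks through multi-stage processing. A single-image self-supervised training strategy removes the need for paired annotations while generalizing well, and the method outperforms existing approaches on the Ego-Exo4D benchmark.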

Details
AI-generated Chinese abstract

跨不同自我中心和外部中心视图的实例级对象分割是视觉理解中的基本挑战,对于具身AI和远程协作应用至关重要。由于尺度、视角和遮挡的剧烈变化,直接像素级匹配变得不稳定,使得该任务异常困难。尽管像VGGT这样的最新几何感知模型为特征对齐提供了坚实基础,但我们发现,即使其内部对象级注意力保持一致,它们在密集预测任务中常常因显著的像素级投影漂移而失败。为弥合这一差距,我们引入了VGGT-Segmentor(VGGT-S),一个将鲁棒几何建模与像素精确语义分割统一的框架。VGGT-S利用VGGT强大的跨视图特征表示,并引入了一种新颖的Union分割头。该分割头分三个阶段运行:掩码提示融合、点引导预测和迭代掩码细化,有效地将高级特征对齐转化为精确的分割掩码。此外,我们提出了一种单图像自监督训练策略,消除了对配对标注的需求,并实现了强大的泛化能力。在Ego-Exo4D基准上,VGGT-S在Ego到Exo和Exo到Ego任务中分别实现了67.7%和68.0%的平均IoU,显著优于先前方法。值得注意的是,我们的无对应预训练模型超越了大多数全监督基线,证明了我们方法的有效性和可扩展性。代码公开于:https://github.com/buaa-colalab/VGGT-S。

English abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.

2604.11679 2026-05-25 cs.CV version update

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

面向临床的大脑MRI基础模型:来自FOMO25挑战赛的发现

Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

Affiliations: Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Pioneer Centre for AI, Copenhagen, Denmark; Copenhagen Research Centre for Biological Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, Capital Region of Denmark, Copenhagen, Denmark; Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital Harvard Medical School, Boston, Massachusetts, USA; Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, Massachusetts, USA; Johns Hopkins University, Baltimore, Maryland, USA; Radiological AI Testcenter (RAIT), Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Rigshospitalet, Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Bispebjerg & Frederiksberg Hospital, Capital Region of Denmark, Copenhagen, Denmark; Department of Clinical Medicine, Faculty of Health Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany; University of British Columbia, Vancouver, British Columbia, Canada; Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom; Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom; Department of Applied Mathematics, Technical Medical Centre, University of Twente, Enschede, Netherlands; IISLAB, Technical University of Košice, Košice, Slovakia; 2nd Department of Internal Medicine, Pavol Jozef Safarik University L Pasteur University Hospital, Košice, Slovakia; Fudan University, Shanghai, China; Shenzhen Technology University, Shenzhen, China; Department of Radiology, Lausanne University Hospital University of Lausanne, Lausanne, Switzerland; Louvain Neuroinflammation Imaging Lab (NIL), Université Catholique de Louvain, Brussels, Belgium; University of Applied Sciences; CIBM Center for Biomedical Imaging, Lausanne, Switzerland; Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands; Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; McGill University Mila - Quebec AI Institute, Montreal, Canada; Hotchkiss Brain Institute Department of Radiology, University of Calgary, Calgary, Alberta, Canada; Department of Radiology, University of Calgary, Calgary, Alberta, Canada; Alberta Children's Hospital Research Institute, Department of Clinical Neuroscience, University of Calgary, Calgary, Alberta, Canada; The Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech Emory University, Atlanta, Georgia, USA; SGGS College of Engineering; Seoul National University, Seoul, South Korea; The D-Lab, Department of Precision Medicine, GROW Research Institute for Oncology Reproduction, Maastricht University, Maastricht, The Netherlands; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, Massachusetts, USA; Nuclear Medicine, CARIM & GROW, Maastricht University, Maastricht, The Netherlands; Department of Radiation Oncology, Dana-Farber Cancer Institute, Brigham Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA; Learning Group, Heidelberg University Hospital, Heidelberg, Germany

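AI summary: Clinical deployment of automated brain MRI analysis is hampered by highly heterogeneous data and costly labels. By organizing the FOMO25 challenge, the authors provided a large pretraining dataset, FOMO60K, and evaluated models on real clinical data in few-shot and out-of-domain settings. The study finds that self-supervised pretraining improves generalization to out-of-domain data, that different pretraining objectives suit different tasks, and that small pretrained models already perform well, while further scaling of model size and training time brought no reliable gains.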

Details
AI-generated Chinese abstract

自动化脑MRI分析的临床部署面临一个基本挑战:临床数据异质且有噪声,高质量标签的获取成本高得令人望而却步。自监督学习(SSL)可以通过利用临床工作流程中产生的大量未标记数据来训练鲁棒的基础模型,这些模型在最小监督下适应域外场景。然而,脑MRI基础模型的发展一直受到小规模预训练数据集和专注于高质量研究级数据的域内基准测试的限制。为解决这一差距,我们组织了FOMO25挑战赛,作为MICCAI 2025的卫星活动。FOMO25为参与者提供了一个大型预训练数据集FOMO60K,并在少样本和域外设置下,直接使用来自临床工作流程的数据评估模型。任务涵盖梗死分类、脑膜瘤分割和脑年龄回归,并考虑了在FOMO60K上训练的模型(方法赛道)和任何数据上训练的模型(开放赛道)。来自16个团队的19个基础模型使用标准化容器化流程进行了评估。结果表明:(a) 自监督预训练提升了域迁移下临床数据的泛化能力,最强的域外训练模型超越了域内训练的有监督基线。(b) 没有单一的预训练目标对所有任务都有利:MAE有利于分割,混合重建-对比目标有利于分类,以及(c) 小型预训练模型取得了强劲性能,而扩大模型规模和训练时长并未带来可靠收益。

English abstract

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and models trained on any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) small pretrained models achieved strong performance, and scaling model size and training duration did not yield reliable improvements.

2604.09349 2026-05-25 cs.CV cs.AI cs.CL version update

Visually-Guided Policy Optimization for Multimodal Reasoning

视觉引导的多模态推理策略优化

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

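Affiliations: AMAP, Alibaba Group; SYSU; BUPT

AI summary: This study addresses insufficient visual attention in vision-language models during multimodal reasoning and proposes Visually-Guided Policy Optimization (VGPO), a framework that strengthens visual focus during reasoning through a visual attention compensation mechanism and a dual-grained advantage re-weighting strategy. Experiments show that VGPO improves performance on mathematical multimodal reasoning and visual-dependent tasks and makes markedly better use of visual information.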

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

Details
AI-generated Chinese abstract

基于可验证奖励的强化学习(RLVR)显著提升了视觉语言模型(VLM)的推理能力。然而,VLM固有的文本主导特性常导致视觉忠实度不足,表现为对视觉标记的注意力激活稀疏。更重要的是,我们的实证分析揭示,推理步骤中的时序视觉遗忘加剧了这一缺陷。为弥补这一差距,我们提出视觉引导策略优化(VGPO),一种在策略优化期间强化视觉聚焦的新框架。具体而言,VGPO首先引入视觉注意力补偿机制,利用视觉相似性定位并放大视觉线索,同时在后续步骤中逐步提升视觉期望以对抗视觉遗忘。基于此机制,我们实施双粒度优势重加权策略:轨迹内层级突出显示具有相对较高视觉激活的标记,而轨迹间层级优先选择表现出优越视觉累积的轨迹。大量实验表明,VGPO在数学多模态推理和视觉依赖任务中实现了更好的视觉激活和优越性能。代码已发布于https://github.com/wzb-bupt/VGPO。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

2604.06885 2026-05-25 cs.CV version update

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

基于FDG-PET/CT的非小细胞肺癌时间驱动生存分析

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

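Affiliations: Radiology, Department of Surgical Sciences; Antaros Medical; Molecular Imaging and Medical Physics, Department of Surgical Sciences

AI summary: This study proposes a deep regression framework based on FDG-PET/CT images for predicting overall survival (OS) in patients with non-small cell lung cancer, taking a time variable as an additional input to enable time-driven survival analysis. The method extracts image features with ResNet-50 and fuses them with the time input to produce survival probabilities that vary with time. Experiments show that it outperforms the baseline in AUC, and an ensemble combining clinical and imaging features performs best, supporting the complementary value of multimodal data for survival prediction.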

Comments Under review

Details
Journal ref: Ann Biomed Eng (2026)
AI-generated Chinese abstract

目的:基于医学图像的临床结果(如总生存期,OS)自动预测在改善患者预后和个性化治疗计划方面具有巨大潜力。我们开发了一个深度回归框架,使用组织FDG-PET/CT投影作为输入,以及一个表示标量时间范围(以天为单位)的时间输入,来预测非小细胞肺癌(NSCLC)患者的OS。方法:所提出的框架采用ResNet-50骨干网络处理输入图像并生成相应的图像嵌入。然后将嵌入与时间数据结合,生成作为时间函数的OS概率,从而有效地基于时间参数化预测。整体框架使用U-CAN队列(n=556)开发,并在测试集(n=292)上与基线方法进行比较评估。基线使用ResNet-50架构,仅处理图像作为输入,并在预定义的时间间隔(如2年或5年)提供OS预测。结果:将时间数据与图像嵌入相结合在预测OS方面显示出优势,优于基线方法,AUC提高了4.3%。使用临床+IDP特征的模型取得了强劲性能,而成像与临床+IDP模型的集成取得了最佳整体性能(0.788),突显了多模态输入的互补价值。所提出的方法还能够将患者风险分层为不同类别(高风险与低风险)。显著性分析的热图突出显示了肿瘤区域作为预测的关键结构。结论:我们的方法提供了一个自动化的框架来预测作为时间函数的OS,并展示了结合成像和表格数据以改善生存预测的潜力。

English abstract

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days), to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2 or 5 years. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provides an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.
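
The idea of conditioning a survival prediction on a time horizon can be illustrated with a small sketch. It is not the authors' model: the sigmoid head, the function name `survival_probability`, and all parameter values are assumptions chosen for illustration, and the image embedding is a made-up two-element vector rather than a ResNet-50 output.

```python
import math

def survival_probability(embedding, horizon_days, weights, time_weight, bias):
    """Sigmoid head over an image embedding plus a scalar time horizon.

    All parameters are hypothetical; a trained model would learn them.
    """
    t = horizon_days / 365.0  # scale days to years so the time term stays modest
    logit = sum(w * e for w, e in zip(weights, embedding)) + time_weight * t + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up embedding and parameters, queried at 1, 2, and 5 years.
embedding = [0.2, -0.1]
weights = [1.0, 0.5]
curve = [survival_probability(embedding, d, weights, time_weight=-0.6, bias=1.5)
         for d in (365, 730, 1825)]
print(curve)
```

Because time is an input rather than a fixed output slot, the same head can be queried at any horizon. In this toy version the negative time weight is what makes the predicted probability fall as the horizon grows.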

2603.24985 2026-05-25 cs.CV version update

Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

基于元学习的3D LGE MRI左心房壁少样本分割

Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan

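Affiliations: Department of Systems and Computer Engineering, Carleton University; Department of Radiology, Radiation Oncology, and Medical Physics, University of Ottawa; Division of Cardiology, Department of Medicine, University of Ottawa Heart Institute

AI summary: This study addresses left atrial (LA) wall segmentation in 3D late gadolinium enhancement MRI (LGE-MRI) and proposes a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net for few-shot segmentation (5, 10, or 20 samples). Meta-training on LA wall tasks together with auxiliary left and right atrial cavity tasks, combined with a boundary-aware composite loss, improves delineation of thin structures. Experiments show that the method outperforms conventional fine-tuning in the few-shot setting and remains robust across data domains, which may help reduce the annotation burden for atrial remodeling assessment.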

Comments Accepted to IEEE EMBC 2026

Details
AI-generated Chinese abstract

从晚期钆增强磁共振成像(LGE-MRI)中分割左心房(LA)壁因其薄几何结构、低对比度和有限的专家标注而具有挑战性。我们提出了一种基于模型无关元学习(MAML)的框架,采用3D残差U-Net骨干网络,用于K-shot(K=5, 10, 20)左心房壁分割。该框架在左心房壁任务以及辅助的左心房和右心房(RA)腔任务上进行元训练,并使用边界感知复合损失来改善薄结构描绘。我们在一个保留的干净测试集上评估了MAML,并在未见过的合成域偏移和本地队列上评估了其鲁棒性。在保留的干净测试集上,MAML在5-shot下优于少样本微调基线,Dice系数(DSC)=0.54对比0.48,豪斯多夫距离(HD95)=4.60对比6.40毫米。在20-shot下,MAML接近从头训练的完全监督模型,DSC=0.59对比0.61。在未见过的偏移下,性能相对于干净测试有所下降,但随K增加而持续改善。在5-shot下,MAML在未见过的合成偏移下达到DSC=0.52和HD95=5.02毫米,在本地队列上达到DSC=0.50和HD95=5.43毫米。这些结果表明,元学习可以改善低样本适应中的薄壁描绘,并可能减少心房重构评估的标注负担。

English abstract

Segmenting the left atrial (LA) wall from late gadolinium enhancement magnetic resonance imaging (LGE-MRI) is challenging because of its thin geometry, low contrast, and limited expert annotations. We propose a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net backbone for K-shot (K = 5, 10, 20) LA wall segmentation. The framework is meta-trained on LA wall tasks together with auxiliary LA and right atrial (RA) cavity tasks and uses a boundary-aware composite loss to improve thin-structure delineation. We evaluated MAML on a held-out clean test set and assessed its robustness under an unseen synthetic domain shift and on a local cohort. On the held-out clean test set, MAML outperformed the K-shot fine-tuning baseline at 5-shot, achieving Dice coefficient (DSC) = 0.54 versus 0.48 and Hausdorff distance (HD95) = 4.60 versus 6.40 mm. At 20-shot, MAML approached the fully supervised model trained from scratch, with DSC = 0.59 versus 0.61. Under unseen shifts, performance decreased relative to clean testing but improved consistently as K increased. At 5-shot, MAML achieved DSC = 0.52 and HD95 = 5.02 mm under the unseen synthetic shift, and DSC = 0.50 and HD95 = 5.43 mm on the local cohort. These results suggest that meta-learning can improve thin-wall delineation in low-shot adaptation and may reduce the annotation burden for atrial remodeling assessment.
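
For readers unfamiliar with the Dice coefficient (DSC) reported above, here is a minimal sketch of how it is computed on binary masks. The flat 0/1 lists stand in for voxel masks, and treating two empty masks as a perfect match is an assumed convention, not something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# A three-voxel structure predicted one voxel off from the reference.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred, target))
```

In this toy case a one-voxel shift of a three-voxel structure already lowers the score to two thirds, which illustrates why overlap metrics are demanding for thin structures.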

2603.17879 2026-05-25 cs.CV cs.AI version update

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

解剖引导的视觉-语言学习与角度原型分离用于类别不平衡下的多标签视频胶囊内镜分类

Podakanti Satyajith Chary, Nagarajan Ganapathy

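Affiliations: Department of Engineering Science, IIT Hyderabad; Department of Biomedical Engineering, IIT Hyderabad

AI summary: This paper presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that tackles the severe class imbalance of the Galar dataset through two core contributions, an Angular Separation Loss and a Biological State Machine decoder. Built on BiomedCLIP, the framework fuses consecutive frames with a Local Differencing Attention module to amplify pathological signals, and an Anatomy Context Head conditions pathological predictions on soft anatomical activations. Experiments show markedly improved detection performance and higher mean average precision on the RARE-VISION test set.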

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

Details
AI-generated Chinese abstract

本文提出一个多标签时间事件检测框架用于视频胶囊内镜(VCE),通过结合两个主要贡献来解决Galar数据集固有的极端类别不平衡问题:类原型上的角度分离损失和生物状态机时间解码器。主干网络保持为BiomedCLIP,一个生物医学视觉-语言基础模型。三个连续帧通过局部差分注意力模块融合,该模块通过抑制静态时间冗余来放大瞬态病理信号。然后,解剖上下文头将病理预测条件化于软解剖激活上,利用已知的胃肠道发现空间共现结构。可学习的文本特征提示和基于原型的logit增强与角度分离损失一起训练,该损失惩罚类原型之间的非对角线余弦相似度,防止在极端不平衡下影响罕见类的原型崩溃。为抵消倾斜的标签分布,训练方案结合了非对称焦点损失、逆频率加权采样、时间混合、指数移动平均和每类阈值校准。生物状态机解码器用基于解剖标签的生理学基础前向状态转换替代朴素间隙合并,消除了先前方法中每视频产生数百个虚假解剖事件的碎片化伪影,并将每视频解剖输出减少到2-3个临床现实事件。在包含三个NaviCam检查(161,025帧)的保留RARE-VISION测试集上,更新后的管道实现了整体时间mAP@0.5为0.3597,mAP@0.95为0.3399,相比先前提交分别相对提升46%和44%,总推理时间在单个GPU上约21分钟完成。

English abstract

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2--3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
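
The abstract describes the Angular Separation Loss only as a penalty on off-diagonal cosine similarity between class prototypes. The sketch below shows one plausible form, the mean squared cosine similarity over distinct prototype pairs. The exact formulation in the paper may differ, and the function names and toy prototypes are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_separation_penalty(prototypes):
    """Mean squared cosine similarity over all distinct prototype pairs (assumed form)."""
    n = len(prototypes)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(cosine(prototypes[i], prototypes[j]) ** 2 for i, j in pairs) / len(pairs)

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # well-separated prototypes
collapsed = [[1.0, 0.0], [1.0, 0.01]]   # nearly identical prototypes
print(angular_separation_penalty(orthogonal))
print(angular_separation_penalty(collapsed))
```

The penalty is zero for mutually orthogonal prototypes and approaches one as prototypes collapse onto the same direction, so minimizing it pushes class prototypes apart.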

2603.10688 2026-05-25 cs.RO cs.CV version update

MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

MapGCLR: 用于在线矢量化高清地图构建的地理空间对比学习表示

Jonas Merkert, Alexander Blumberg, Jan-Hendrik Pauls, Christoph Stiller

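Affiliations: Institute of Measurement and Control Systems, Karlsruhe Institute of Technology (KIT)

AI summary: This paper proposes MapGCLR, a method for improving the bird's-eye-view (BEV) feature grid representation in online vectorized HD map construction. It adds a geospatial consistency constraint to the contrastive loss so that features of overlapping regions agree, and combines this with a multi-traversal dataset splitting strategy to obtain a semi-supervised training framework. Experiments show that the method outperforms the purely supervised baseline both on the vectorized map perception task and in feature-space visualizations.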

Details
AI-generated Chinese abstract

自动驾驶汽车依赖地图信息来理解周围环境。然而,离线高清地图的创建和维护成本仍然很高。一种更具可扩展性的替代方案是在线高清地图构建,它仅在训练时需要地图标注。为了进一步减少标注大量训练标签的需求,自监督训练提供了一种替代方案。本文通过在地理空间上强制重叠的鸟瞰图特征网格之间的一致性作为对比损失函数的一部分,专注于改进矢量化在线高清地图构建模型中的潜在鸟瞰图特征网格表示。为了确保对比对的地理空间重叠,我们引入了一种方法来分析给定数据集中遍历之间的重叠,并根据可调整的多遍历要求生成子数据集划分。我们使用减少的单遍历标注数据对同一模型进行监督训练,并在更广泛的未标注数据集上根据我们的多遍历要求进行自监督训练,有效实现了半监督方法。我们的方法在各个方面都优于监督基线,无论是在下游任务矢量化地图感知性能的定量评估上,还是在鸟瞰图特征空间的主成分分析可视化的分割定性评估上。

English abstract

Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. Self-supervised training can further reduce the need to annotate large amounts of training data. This work focuses on improving the latent bird's-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model with supervision on a reduced set of single-traversal labeled data and with self-supervision on a broader unlabeled set that follows our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream task's vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
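
The abstract does not state which contrastive loss is used. As an illustration only, the sketch below uses the common InfoNCE form: a BEV cell feature from one traversal is the anchor, the cell covering the same world location in another traversal is the positive, and other cells act as negatives. The function names, temperature, and feature values are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive's similarity among all candidates."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]

# Toy features: the first candidate is the geospatially matching cell.
anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(info_nce(anchor, candidates, positive_index=0))
```

The loss is small when the geospatially matching cell is also the most similar one in feature space, which is the consistency the abstract describes.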

2603.02897 2026-05-25 cs.CV version update

ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

ProGIC:基于残差向量量化的渐进式轻量级生成图像压缩

Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

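Affiliations: Tsinghua University; State Key Laboratory of Space Network and Communications

AI summary: This paper proposes ProGIC, a progressive and lightweight generative image compression method built on residual vector quantization (RVQ) that compresses more efficiently while preserving perceptual quality. Multi-stage residual coding yields a progressive bitstream that supports previews from partial data, and lightweight depthwise-separable convolutions with small attention modules make deployment on low-compute devices more practical. Experiments on the Kodak dataset show substantial bitrate savings over existing methods together with clearly faster encoding and decoding.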

Comments Accepted by CVPR 2026 Findings

Details
AI-generated Chinese abstract

生成图像压缩(GIC)的最新进展在感知质量上取得了显著提升。然而,许多GIC依赖于大规模且刚性的模型,严重限制了其在低比特率场景下灵活传输和实际部署的实用性。为解决这些问题,我们提出了渐进式生成图像压缩(ProGIC),一种基于残差向量量化(RVQ)的紧凑编解码器。在RVQ中,一系列向量量化器逐级编码残差,每个量化器拥有自己的码本。生成的码字累加实现从粗到细的重建和渐进比特流,从而支持从部分数据预览。我们将其与基于深度可分离卷积和小型注意力模块的轻量级骨干网络配对,使得在GPU和仅CPU设备上均可实际部署。实验结果表明,ProGIC在压缩性能上与先前方法相当。在Kodak数据集上,与MS-ILLM相比,它在DISTS上节省高达57.57%的比特率,在LPIPS上节省58.83%。除了感知质量,ProGIC还支持渐进传输以提高灵活性,并且在GPU上编码解码速度比MS-ILLM快10倍以上。

English abstract

Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility and delivers over 10 times faster encoding and decoding than MS-ILLM on GPUs.
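
The stage-by-stage residual coding described in the abstract can be sketched in a few lines. This is a generic illustration of residual vector quantization on a single 2-D vector, not ProGIC's codec: the codebooks are tiny hand-written examples, whereas a real codec learns them and applies them to latent feature maps.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual left by earlier stages."""
    residual, indices = list(x), []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of the indices gives a coarser preview."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out

# Hypothetical toy codebooks: a coarse stage followed by a finer correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
x = [1.2, 0.9]
indices = rvq_encode(x, codebooks)
preview = rvq_decode(indices[:1], codebooks)  # first stage only
full = rvq_decode(indices, codebooks)         # all stages
print(indices, preview, full)
```

Decoding only the first index gives the coarse preview, and adding the second index refines it. This is the property that lets a receiver show an image before the whole bitstream has arrived.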

2602.07801 2026-05-25 cs.CV cs.AI version update

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3:在智能视频思考中协调时间定位与视频理解

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song

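Affiliations: Shandong University; Institute of Automation, Chinese Academy of Sciences; Beihang University; Southern University of Science and Technology

AI summary: In long-video understanding, conventional uniform sampling struggles to capture key visual evidence, which degrades performance and increases hallucinations. This paper proposes VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering, markedly improving localization accuracy and reasoning efficiency. The method introduces a unified masking mechanism and dedicated rewards, supports on-demand clipping and refinement of localizations, and comes with a high-quality long-video grounded QA dataset and an evaluation benchmark. Experiments show strong results on both long-video understanding and grounding.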

Comments ICML 2026

详情
AI中文摘要

在长视频理解中,传统的均匀帧采样通常无法捕捉关键视觉证据,导致性能下降和幻觉增加。为了解决这个问题,最近出现了智能视频思考范式,采用定位-裁剪-回答流程,模型主动识别相关视频片段,在这些片段内进行密集采样,然后生成答案。然而,现有方法仍然效率低下,定位能力弱,且遵循僵化的工作流。为了解决这些问题,我们提出了VideoTemp-o3,一个统一的智能视频思考框架,联合建模视频定位和问答。VideoTemp-o3展现出强大的定位能力,支持按需裁剪,并能修正不准确的定位。具体来说,在监督微调阶段,我们设计了一个统一的掩码机制,在鼓励探索的同时防止噪声。对于强化学习,我们引入了专用奖励以减轻奖励黑客。此外,从数据角度,我们开发了一个有效的流程来构建高质量的长视频定位问答数据,以及一个相应的基准,用于在不同视频时长上进行系统评估。实验结果表明,我们的方法在长视频理解和定位方面均取得了显著性能。

English abstract

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

2602.07399 2026-05-25 cs.AI cs.CV version update

VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

VGAS: 价值引导的动作块选择用于少样本视觉-语言-动作适应

Changhua Xu, En Yu, Junyu Xuan, Jie Lu

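Affiliations: Australian Artificial Intelligence Institute (AAII)

AI summary: Vision-language-action (VLA) models combine multimodal reasoning with physical control, but adapting them to a task from only a few demonstrations remains unreliable. This paper proposes VGAS, a framework that takes a generation-selection view and selects action chunks that are both semantically faithful and geometrically precise, addressing execution errors caused by geometric ambiguity. It uses a fine-tuned VLA model as a high-recall proposal generator, adds Q-Chunk-Former, a geometrically grounded Transformer critic, and applies Explicit Geometric Regularization (EGR), improving task success rates and robustness under few demonstrations and distribution shift.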

Comments Preprint

Details
AI-generated Chinese abstract

视觉-语言-动作(VLA)模型桥接了多模态推理与物理控制,但将其适应于新任务且仅有少量演示时仍不可靠。虽然微调后的VLA策略通常能产生语义上合理的轨迹,但失败往往源于未解决的几何歧义,其中接近正确的动作在有限监督下会导致不同的执行结果。我们从生成-选择的角度研究少样本VLA适应,并提出一个新颖的框架VGAS(价值引导的动作块选择)。它在推理时执行最佳N选1,以识别既语义忠实又几何精确的动作块。具体来说,VGAS使用微调的VLA作为高召回率提议生成器,并引入Q-Chunk-Former,一个基于几何的Transformer评论家,以解决细粒度的几何歧义。此外,我们提出了显式几何正则化(EGR),它塑造了一个判别性的价值景观,以保持接近正确候选之间的动作排序分辨率,同时减轻在稀缺监督下的价值不稳定性。实验和理论分析表明,VGAS在有限演示和分布偏移下持续提高了成功率和鲁棒性。我们的代码可在https://github.com/Jyugo-15/VGAS获取。

English abstract

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss actions lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
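
Inference-time best-of-N selection, as named in the abstract, amounts to sampling several candidates and keeping the one a critic scores highest. The sketch below shows only that control flow. The proposal function and critic are toy stand-ins (hypothetical, and unrelated to the actual VLA policy or Q-Chunk-Former), and every name in it is illustrative.

```python
import random

def select_best_chunk(observation, propose, critic, n=8, seed=0):
    """Sample n candidate action chunks and return the one the critic scores highest."""
    rng = random.Random(seed)
    candidates = [propose(observation, rng) for _ in range(n)]
    scores = [critic(observation, chunk) for chunk in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy stand-ins: a chunk is four 1-D positions; the critic prefers ending near the goal.
def toy_propose(obs, rng):
    return [obs["start"] + rng.uniform(-1.0, 1.0) for _ in range(4)]

def toy_critic(obs, chunk):
    return -abs(chunk[-1] - obs["goal"])

obs = {"start": 0.0, "goal": 0.5}
chunk, score = select_best_chunk(obs, toy_propose, toy_critic)
print(chunk, score)
```

The quality of the result depends on the critic's ability to rank near-identical candidates, which is the part the abstract says the geometric regularization is meant to protect.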

2602.05202 2026-05-25 cs.CV 版本更新

GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

GT-SVJ: 基于生成式Transformer的自监督视频评判器用于高效视频奖励建模

Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Adobe Inc.(Adobe公司)

AI总结 该研究提出了一种基于生成式变换器的自监督视频评估模型GT-SVJ,旨在更高效地建模视频奖励,以对齐视频生成模型与人类偏好。不同于依赖视觉语言模型的方法,GT-SVJ通过将先进的视频生成模型重新设计为能量基模型,从而捕捉视频中的细微时序动态并精确判断视频质量。通过构造具有可控退化特性的合成负样本,模型被引导学习有意义的时空特征,实验表明其在仅使用30K人工标注数据的情况下,在多个基准测试中取得了优于现有方法的性能。

AI中文摘要

将视频生成模型与人类偏好对齐仍然具有挑战性:当前方法依赖视觉语言模型(VLM)进行奖励建模,但这些模型难以捕捉细微的时间动态。我们提出了一种根本不同的方法:将天生设计用于建模时间结构的视频生成模型重新用作奖励模型。我们提出了基于生成式Transformer的自监督视频评判器(GT-SVJ),这是一种新颖的评估模型,将最先进的视频生成模型转化为强大的时间感知奖励模型。我们的关键洞察是,生成模型可以重新表述为基于能量的模型(EBM),该模型为高质量视频分配低能量,为退化视频分配高能量,从而在通过对比目标训练时能够以惊人的精度区分视频质量。为了防止模型利用真实视频与生成视频之间的表面差异,我们通过受控的潜在空间扰动设计了具有挑战性的合成负样本:时间切片、特征交换和帧洗牌,这些模拟了真实但细微的视觉退化。这迫使模型学习有意义的时空特征,而不是琐碎的伪影。GT-SVJ在GenAI-Bench和MonteBench上仅使用30K人工标注就达到了最先进的性能:比现有的基于VLM的方法少6倍到65倍。

英文摘要

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
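As a rough illustration of two ingredients named in this abstract, the sketch below builds a frame-shuffled negative and evaluates a pairwise contrastive loss on scalar energies. Both functions are assumptions for illustration: the abstract does not specify the exact loss, so a softplus of the energy gap is used as one common choice, and the shuffle acts on a plain list of per-frame items rather than on real latents.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_frames(frames: Sequence[T], rng: random.Random) -> List[T]:
    """Build a synthetic negative by permuting frame order (never the identity)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to shuffle")
    identity = list(range(len(frames)))
    perm = identity[:]
    while perm == identity:
        rng.shuffle(perm)
    return [frames[i] for i in perm]


def contrastive_energy_loss(e_pos: float, e_neg: float) -> float:
    """softplus(e_pos - e_neg): small when the clean video has the lower energy."""
    x = e_pos - e_neg
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Minimizing this loss pushes the energy of the clean clip below that of its perturbed counterpart, which matches the low-energy-for-high-quality convention stated in the abstract.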

2602.05126 2026-05-25 cs.CV 版本更新

CLEAR-HPV: Interpretable concept discovery for human-papillomavirus-associated morphology in whole-slide histology

CLEAR-HPV: 全切片组织学中人乳头瘤病毒相关形态的可解释概念发现

Weiyi Qin, Yingci Liu-Swetz, Shiwei Tan, Hao Wang

发表机构 * Department of Computer Science, Rutgers University(罗格斯大学计算机科学系) Rutgers Health(罗格斯健康) Rutgers University(罗格斯大学)

AI总结 CLEAR-HPV 是一种用于宫颈癌和头颈癌病理切片中HPV相关形态分析的可解释性框架,旨在解决基于注意力机制的多重实例学习(MIL)模型在形态学解释性方面的不足。该方法通过重构MIL的潜在空间,无需概念标签即可自动发现如角化、基底样和间质等关键形态概念,并生成对应的空间概念图与简洁的概念分数向量,从而在保持预测性能的同时实现高度可解释的特征表示。

AI中文摘要

人乳头瘤病毒(HPV)状态是头颈癌和宫颈癌预后及治疗反应的关键决定因素。尽管基于注意力的多实例学习(MIL)在HPV相关的全切片组织病理学中实现了强切片级预测,但其形态学可解释性有限。为解决这一局限,我们引入了CLEAR-HPV(Concept-Level Explainable Attention-guided Representation for HPV),该框架利用注意力重构MIL潜在空间,从而在训练过程中无需概念标签即可实现概念发现。在注意力加权的潜在空间中运行,CLEAR-HPV自动发现角化、基底样和间质形态概念,生成空间概念图,并使用紧凑的概念分数向量表示每个切片。CLEAR-HPV的概念分数向量保留了原始MIL嵌入的预测信息,同时将高维特征空间(例如1536维)减少到仅10个可解释概念。CLEAR-HPV在TCGA-HNSCC、TCGA-CESC和CPTAC-HNSCC上一致地泛化,通过一个通用的、与骨干网络无关的框架,为基于注意力的全切片组织病理学MIL模型提供紧凑的概念级可解释性。

英文摘要

Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.

2601.19117 2026-05-25 eess.IV cs.CV stat.AP 版本更新

Optimized $k$-means color quantization of digital images in machine-based and human perception-based colorspaces

基于机器和基于人类感知的色彩空间中数字图像的优化 $k$-均值颜色量化

Ranjan Maitra

发表机构 * Department of Statistics, Iowa State University(统计学系,爱荷华州立大学)

AI总结 该研究探讨了在不同颜色空间中使用 $k$-means 算法进行数字图像颜色量化的效果,比较了 RGB、CIE-XYZ 和 CIE-LUV/CIE-HCL 等颜色空间在不同量化级别下的表现。通过视觉信息保真度(VIF)指标评估图像质量,发现 $k$-means 在 RGB 空间中表现最佳的情况约占一半,而在较高量化级别时,CIE-XYZ 空间通常表现更优,部分低量化级别情况下 CIE-LUV 空间效果更佳。研究还分析了色调、色度和亮度分布对颜色空间选择的影响,为不同场景下的颜色量化提供了更细致的指导。

Comments 25 pages, 11 figures, 5 tables, accepted in the Journal of Electronic Imaging

Journal ref
Journal of Electronic Imaging, Vol. 35, Issue 2, 023002 (Mar 2026)
AI中文摘要

颜色量化使用原始颜色数量的一小部分来表示图像,同时仅最小程度地损失视觉质量。$k$-均值算法在此背景下常用,但主要应用于由三原色组成的基于机器的RGB色彩空间。然而,最近一些研究表明其在基于人类感知的色彩空间中性能有所提升。我们研究了在RGB、CIE-XYZ和CIE-LUV/CIE-HCL色彩空间中,$k$-均值颜色量化在四个量化级别下对148张涵盖广泛场景、主题和设置的多样化数字图像的性能。视觉信息保真度(VIF)度量数值上评估了量化图像的质量,并显示在大约一半的情况下,$k$-均值颜色量化在RGB空间中最佳,而在其他时候,特别是对于更高的量化级别($k$),CIE-XYZ色彩空间通常表现更好。也有一些情况,尤其是在较低的$k$下,最佳性能在CIE-LUV色彩空间中获得。进一步根据图像中色调、色度和亮度分布对性能的分析,为每个色彩空间更适合$k$-均值颜色量化的图像提供了细致的视角和特征描述。

英文摘要

Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, $k$-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels ($k$), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for $k$-means color quantization.
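For readers unfamiliar with the basic procedure, here is a minimal pure-Python sketch of $k$-means color quantization using plain Lloyd iterations with random initialization. It is not the optimized variant studied in the paper, `kmeans_quantize` is a hypothetical name, and pixels are assumed to be already expressed as 3-tuples in whichever colorspace is under test (the conversion between RGB, CIE-XYZ, and CIE-LUV is not shown).

```python
import random
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(pixel: Color, centers: List[Color]) -> int:
    return min(range(len(centers)), key=lambda j: sq_dist(pixel, centers[j]))


def kmeans_quantize(
    pixels: List[Color], k: int, iters: int = 20, seed: int = 0
) -> Tuple[List[Color], List[Color]]:
    """Return (quantized pixels, palette) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers: List[Color] = rng.sample(pixels, k)  # requires k <= len(pixels)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in pixels]
        for j in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(ch) / len(members) for ch in zip(*members))
    labels = [nearest(p, centers) for p in pixels]
    return [centers[lab] for lab in labels], centers
```

Because the distance is Euclidean in the chosen coordinates, running the same routine on the same image in different colorspaces can produce different palettes, which is the effect the study quantifies with VIF.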

2601.15224 2026-05-25 cs.CV cs.CL 版本更新

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

PROGRESSLM: 迈向视觉-语言模型中的进度推理

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

发表机构 * Northwestern University(西北大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 该论文提出ProgressLM,旨在解决视觉语言模型在任务进展推理方面的能力不足问题。研究引入了Progress-Bench基准,用于系统评估模型对任务进展的推理能力,并提出了一种受人类启发的两阶段推理范式,通过训练无关的提示和基于数据集ProgressLM-45K的训练方法进行探索。实验表明,大多数现有模型在任务进展估计上表现有限,而基于训练的ProgressLM-3B即使在小规模下也取得了稳定提升,显示出良好的泛化能力。

Comments ACL 2026 Camera Ready Version

AI中文摘要

估计任务进度需要对长期动态进行推理,而非仅识别静态视觉内容。尽管现代视觉-语言模型(VLM)擅长描述可见内容,但它们能否从部分观测中推断任务进展程度仍不清楚。为此,我们引入了Progress-Bench,一个用于系统评估VLM进度推理能力的基准。除基准测试外,我们进一步探索了一种受人类启发的两阶段进度推理范式,包括基于无训练提示和基于训练的方法,后者基于精心策划的数据集ProgressLM-45K。对14个VLM的实验表明,大多数模型尚未准备好进行任务进度估计,对演示模态和视角变化敏感,且难以处理不可回答的情况。虽然强制结构化进度推理的无训练提示仅带来有限且依赖模型的改进,但基于训练的ProgressLM-3B即使在小型模型规模下也取得了一致的改进,尽管其训练任务集与评估任务完全不相交。进一步分析揭示了特征性错误模式,并阐明了进度推理成功或失败的时间与原因。网站:https://progresslm.github.io/ProgressLM/

英文摘要

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

2601.14821 2026-05-25 cs.CV 版本更新

POTR: Post-Training 3DGS Compression

POTR:训练后3DGS压缩

Bert Ramlot, Martijn Courteaux, Peter Lambert, Glenn Van Wallendael

发表机构 * IDLab-MEDIA research group(IDLab-MEDIA研究组) Ghent University(根特大学) imec

AI总结 本文提出了一种名为POTR的训练后3D高斯泼溅(3DGS)压缩方法,旨在解决3DGS存储需求过高的问题。该方法引入了一种高效的剪枝技术,通过修改后的3DGS光栅化器同时计算每个泼溅的移除影响,显著减少了泼溅数量并提升了推理速度;同时,提出了一种无需训练即可重新计算光照系数的新方法,大幅降低了其熵值并提高了稀疏性。实验表明,POTR在率失真性能与推理速度方面均优于其他训练后压缩方法。

Comments 15 pages, 12 figures. Submitted to IEEE TCSVT, under review

Journal ref
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026
AI中文摘要

3D高斯泼溅(3DGS)最近在3D场景重建和实时新视角合成中成为神经辐射场(NeRF)的有力竞争者。3DGS在训练和推理速度上优于NeRF,但存储需求显著更高。为解决这一缺点,我们提出POTR,一种基于两项新技术的训练后3DGS编解码器。首先,POTR引入一种新颖的剪枝方法,使用修改后的3DGS光栅化器同时高效计算每个泼溅的单独移除效果。该技术相比其他训练后剪枝技术减少2-4倍的泼溅数量,并因此显著加速推理,实验表明其推理速度比其他压缩模型快1.5-2倍。其次,我们提出一种重新计算光照系数的新方法,在不使用任何训练的情况下显著降低其熵。我们的快速且高度并行的方案特别增加了AC光照系数的稀疏性,实验表明稀疏性从70%提升到97%,且质量损失极小。最后,我们通过简单的微调方案扩展POTR,以进一步增强剪枝、推理和率失真性能。实验表明,即使没有微调,POTR在率失真性能和推理速度上始终优于所有其他训练后压缩技术。

英文摘要

3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel pruning approach that uses a modified 3DGS rasterizer to efficiently calculate every splat's individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training pruning techniques and, as a result, significantly accelerates inference: experiments demonstrate 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient sparsity, with experiments demonstrating increases from 70% to 97% at minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance pruning, inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training compression techniques in both rate-distortion performance and inference speed.

2512.18363 2026-05-25 cs.CV 版本更新

Enhancing 3D Semantic Scene Completion with a Refinement Module

使用精化模块增强3D语义场景补全

Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song

发表机构 * National Science Center for Earthquake Engineering, Tianjin University(天津大学国家地震工程科学中心) School of Civil Engineering, Tianjin University(天津大学土木工程学院) Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich(慕尼黑工业大学机器人、人工智能与实时系统教席)

AI总结 本文提出了一种名为ESSC-RM的增强型语义场景补全框架,该框架通过一个可插拔的细化模块,能够无缝集成到现有的语义场景补全模型中。该方法采用两阶段策略,首先由基础网络生成粗粒度体素预测,再通过基于3D U-Net的预测噪声感知模块和体素级局部几何模块进行多尺度监督下的细化。实验表明,ESSC-RM在SemanticKITTI数据集上显著提升了语义预测性能,验证了其作为通用细化框架的广泛适用性。

Comments 19 pages, 8 figures

AI中文摘要

我们提出ESSC-RM,一种即插即用的增强框架,用于带有精化模块的语义场景补全,可以无缝集成到现有的SSC模型中。ESSC-RM分两个阶段运行:基线SSC网络首先生成粗体素预测,随后由基于3D U-Net的预测噪声感知模块(PNAM)和体素级局部几何模块(VLGM)在多尺度监督下进行精化。在SemanticKITTI上的实验表明,ESSC-RM持续改善语义预测性能。当集成到CGFormer和MonoScene中时,平均IoU分别从16.87%提升到17.27%,以及从11.08%提升到11.51%。这些结果表明ESSC-RM作为一个通用的精化框架,可适用于广泛的SSC模型。

英文摘要

We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.
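The mean IoU figures quoted in this abstract are per-class intersection-over-union values averaged over classes. The sketch below shows that computation on flattened voxel labels. It is an assumption-laden illustration: `mean_iou` and `ignore_label` are hypothetical names, classes absent from both prediction and ground truth are skipped, and the official SemanticKITTI evaluation may differ in such details.

```python
from typing import List, Optional, Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: Optional[int] = None,
) -> float:
    """Average per-class IoU over classes that occur in pred or target."""
    ious: List[float] = []
    for cls in range(num_classes):
        inter = union = 0
        for p, t in zip(pred, target):
            if ignore_label is not None and t == ignore_label:
                continue  # voxel without valid ground truth
            p_hit, t_hit = p == cls, t == cls
            inter += p_hit and t_hit
            union += p_hit or t_hit
        if union:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class present in prediction or target")
    return sum(ious) / len(ious)
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1.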

2511.17171 2026-05-25 cs.CV cs.LG 版本更新

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

FireScope: 基于思维链预言机的野火风险栅格预测

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 该论文提出了一种名为FireScope的框架,用于预测野火风险栅格图,通过结合视觉、气候和地理信息进行因果推理。研究引入了FireScope-Bench数据集,整合了Sentinel-2卫星图像、气候数据和专家定义的风险图,用于跨大陆评估。FireScope基于视觉语言模型,结合强化学习和视觉监督,生成带有推理轨迹的风险图,显著提升了模型在不同大陆间的泛化能力和可解释性。该工作首次展示了基于语言的推理在视觉生成中的泛化提升作用,并提出了首个可跨大陆应用的高分辨率野火风险模型。

Comments CVPR 2026, Project Page: https://firescope.ai/research

AI中文摘要

预测野火风险是一个推理密集型的空间问题,需要整合视觉、气候和地理因素来推断连续的风险地图。现有方法缺乏可靠泛化所需的因果推理和多模态理解。我们引入了FireScope-Bench,一个大规模数据集和基准,将Sentinel-2图像和气候数据与专家定义的全美风险栅格以及欧洲的真实野火事件配对,用于跨大陆评估。基于此数据集,我们提出了FireScope,一个基于VLM的推理到生成框架,从强化学习和视觉监督中学习,通过互补的推理轨迹预测风险栅格。当在美国训练并在欧洲测试时,FireScope取得了显著的性能提升,而专家反馈和自动化分析证实其推理轨迹是忠实且有语义意义的。我们的发现表明,推理可以支撑栅格预测模型,提高泛化性和可解释性。据我们所知,这是第一个(1)证明基于语言的推理可以改善视觉生成泛化性的框架,(2)提出一个可跨大陆应用的高分辨率野火风险模型,以及(3)能够系统研究多模态火灾风险模型稳健跨大陆泛化的框架。我们相信FireScope-Bench有潜力成为推动推理驱动、可解释和可泛化空间建模的基础。数据和源代码将公开提供。

英文摘要

Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

2511.14286 2026-05-25 cs.CV 版本更新

NeuralBoneReg: An Instance-Specific Label-Free Point Cloud-Based Method for Multi-Modal Bone Surface Registration

NeuralBoneReg:一种用于多模态骨表面注册的实例特定无标签点云方法

Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Yunke Ao, Roman Flepp, Aidana Massalimova, Lilian Calvet, Philipp Fürnstahl

发表机构 * Research in Orthopedic Computer Science, Balgrist University Hospital, University of Zurich(骨科计算机科学研究所,巴尔格里斯大学医院,苏黎世大学) AI Center, ETH Zurich(人工智能中心,苏黎世联邦理工学院)

AI总结 在计算机辅助骨科手术中,术前影像与术中数据的精确配准对手术规划至关重要。本文提出了一种无需标注的点云为基础的神经配准方法NeuralBoneReg,通过隐式神经网络学习术前骨模型,并结合多层感知机进行全局初始化与局部优化,实现了跨模态骨表面的鲁棒配准。该方法无需跨受试者训练数据,实验表明其在多个公开数据集上表现优异,具有良好的解剖结构与模态泛化能力。

AI中文摘要

在计算机和机器人辅助骨科手术(CAOS)中,基于术前影像的患者特定手术计划定义了目标位置和植入物轨迹。在手术过程中,这些计划必须准确传递,这依赖于术前和术中数据之间的精确配准。然而,不同成像模态之间的显著异质性使得这种配准具有挑战性且容易出错。因此,鲁棒、自动且与模态无关的骨表面配准在临床上非常重要。我们提出了NeuralBoneReg,一个自监督的基于表面的框架,使用3D点云作为与模态无关的表示来配准骨表面。NeuralBoneReg包括两个模块:一个学习术前骨模型的隐式神经无符号距离场(UDF),以及一个基于MLP的配准模块,后者通过生成变换假设将术中点云与神经UDF对齐,从而执行全局初始化和局部细化。与最先进的监督方法不同,NeuralBoneReg以自监督方式运行,无需跨受试者的训练数据。我们在两个公开的多模态数据集上评估了NeuralBoneReg与基线方法的性能:一个腓骨和胫骨的CT-超声数据集(UltraBones100k)和一个脊柱椎骨的CT-RGB-D数据集(SpineDepth)。评估还包括一个新引入的尸体标本CT-超声数据集(UltraBones-Hip),其中包含股骨和骨盆,该数据集将公开提供。NeuralBoneReg在所有数据集上匹配或超越现有方法,在UltraBones100k上平均RRE/RTE为1.83°/2.02 mm,在UltraBones-Hip上为1.90°/1.56 mm,在SpineDepth上为3.78°/2.80 mm。这些结果证明了跨解剖结构和模态的强泛化能力,为CAOS提供了鲁棒且准确的跨模态对齐。

英文摘要

In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike state-of-the-art supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.83°/2.02 mm on UltraBones100k, 1.90°/1.56 mm on UltraBones-Hip, and 3.78°/2.80 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
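The RRE/RTE figures in this abstract are registration errors in degrees and millimetres. The sketch below uses the definitions most commonly found in the registration literature: the geodesic angle between the ground-truth and estimated rotations, and the Euclidean distance between the translations. It is an assumption that the paper uses exactly these definitions, and the function names are illustrative.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def relative_rotation_error_deg(r_gt: Matrix, r_est: Matrix) -> float:
    """Angle of R_gt^T R_est in degrees, via trace(R_gt^T R_est) = sum_ij R_gt[i][j] * R_est[i][j]."""
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.degrees(math.acos(cos_angle))


def relative_translation_error(t_gt: Sequence[float], t_est: Sequence[float]) -> float:
    """Euclidean distance between translations, in the units of the inputs (e.g. mm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_gt, t_est)))
```

A pure rotation of 10° about any axis relative to the ground truth therefore yields an RRE of 10°, regardless of the axis.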

2511.11051 2026-05-25 cs.CV 版本更新

NP-LoRA: Null Space Projection for Subject-Style LoRA Fusion

NP-LoRA: 用于主题-风格LoRA融合的零空间投影

Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院)

AI总结 本文提出了一种名为NP-LoRA的无需训练的方法,用于融合主题和风格的LoRA表示,以实现可控生成。该方法从几何角度出发,将内容和风格LoRA视为在共享参数空间中重叠的非正交低秩子空间,并通过投影操作显式调控它们之间的交互。NP-LoRA利用风格LoRA的主方向定义投影子空间,将内容LoRA投影到风格子空间的补空间中,从而在抑制主导风格方向干扰的同时保留互补信息,实验表明该方法在多个预训练LoRA对上取得了更平衡的内容-风格组合效果。

AI中文摘要

低秩适配(LoRA)融合能够组合主题和风格表示以实现可控生成,无需重新训练。然而,现有方法主要通过权重级合并操作,没有明确建模独立训练的LoRA如何在共享参数空间中交互。我们从几何角度看待LoRA融合,将内容和风格LoRA解释为占据重叠、非正交的低秩子空间,这种重叠可能导致参数更新冲突,影响生成质量。这一观察促使我们将LoRA融合重新表述为控制重叠子空间更新如何组合的问题,而不仅仅是参数组合。基于这一见解,我们提出零空间投影LoRA(NP-LoRA),一种无需训练的框架,采用投影作为融合算子来显式调节跨LoRA交互。具体而言,NP-LoRA使用风格LoRA的主方向定义投影子空间,并将内容LoRA投影到补子空间(即风格LoRA的零空间),从而抑制沿主导风格方向的干扰,同时保留互补信息。为避免硬投影的过度激进抑制,我们进一步将软投影表述为一个正则化优化问题,平衡内容保留与风格子空间抑制。该目标具有闭式解,产生一个由单一参数控制的投影算子,该参数在线性合并和硬投影之间连续插值。在多个预训练LoRA对上的大量实验表明,与强基线相比,NP-LoRA实现了更平衡的内容-风格组合,且无需重新训练。

英文摘要

Low-Rank Adaptation (LoRA) fusion enables the composition of subject and style representations for controllable generation without retraining. However, existing approaches primarily operate through weight-level merging, without explicitly modeling how independently trained LoRAs interact in the shared parameter space. We adopt a geometric perspective on LoRA fusion, interpreting content and style LoRAs as occupying overlapping, non-orthogonal low-rank subspaces, where such overlap can lead to conflicting parameter updates that affect generation quality. This observation motivates us to reformulate LoRA fusion not merely as parameter combination, but as a problem of controlling how updates from overlapping subspaces are combined. Based on this insight, we propose Null Space Projection LoRA (NP-LoRA), a training-free framework that employs projection as a fusion operator to explicitly modulate cross-LoRA interactions. Specifically, NP-LoRA uses principal directions of the style LoRA to define a projection subspace and projects the content LoRA onto the complementary subspace (i.e., the null space of the style LoRA), suppressing interference along dominant style directions while preserving complementary information. To avoid the overly aggressive suppression of hard projection, we further formulate soft projection as a regularized optimization problem that balances content preservation against style-subspace suppression. This objective admits a closed-form solution, yielding a projection operator controlled by a single parameter that continuously interpolates between linear merging and hard projection. Extensive experiments across multiple pretrained LoRA pairs show that NP-LoRA achieves more balanced content-style composition compared to strong baselines, without requiring retraining.
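The hard and soft projections described in this abstract can be written out in one plausible form. The notation is an assumption rather than the paper's own: $\Delta W_c$ and $\Delta W_s$ denote the content and style LoRA updates, $V_k$ holds the top-$k$ right singular vectors of $\Delta W_s$ (so $V_k^\top V_k = I$), and $\mu \ge 0$ is the single trade-off parameter. Whether the projection acts on the right or the left of $\Delta W_c$ is also an assumption.

```latex
% Hard projection: remove the components of the content update that lie
% along the dominant style directions.
P = I - V_k V_k^\top, \qquad
\Delta W_{\text{merged}} = \Delta W_s + \Delta W_c P .

% Soft projection as a regularized problem balancing content preservation
% against style-subspace suppression:
\Delta W^\star = \arg\min_{\Delta W}\;
  \lVert \Delta W - \Delta W_c \rVert_F^2 + \mu \lVert \Delta W V_k \rVert_F^2 .

% Setting the gradient to zero gives
% \Delta W^\star (I + \mu V_k V_k^\top) = \Delta W_c, hence
\Delta W^\star = \Delta W_c \Bigl( I - \tfrac{\mu}{1+\mu} V_k V_k^\top \Bigr).
```

Under these assumptions $\mu = 0$ recovers plain linear merging and $\mu \to \infty$ recovers the hard projection $P$, which is consistent with the interpolation behaviour the abstract describes.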

2510.21270 2026-05-25 cs.CL cs.AI cs.CV 版本更新

Sparser Block-Sparse Attention via Token Permutation

通过令牌置换实现更稀疏的块稀疏注意力

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 随着大语言模型上下文长度的增加,计算成本显著上升,主要瓶颈来自自注意力机制的二次复杂度。为此,本文提出了一种名为Permuted Block-Sparse Attention(PBS-Attn)的新型稀疏注意力方法,通过重新排列token顺序以提升块级稀疏性,从而在保持模型精度的同时显著提高计算效率。实验表明,该方法在多个长上下文数据集上优于现有块稀疏注意力方法,并在端到端推理速度上实现了最高2.75倍的加速。

Comments ICML 2026

AI中文摘要

扩展大语言模型(LLM)的上下文长度带来了显著的好处,但计算成本高昂。这种成本主要源于自注意力机制,其相对于序列长度的$O(N^2)$复杂度在内存和延迟方面构成了主要瓶颈。幸运的是,注意力矩阵通常是稀疏的,尤其是对于长序列,这为优化提供了机会。块稀疏注意力已成为一种有前景的解决方案,它将序列划分为块并跳过其中一部分块的计算。然而,该方法的有效性高度依赖于底层的注意力模式,这可能导致次优的块级稀疏性。例如,单个块内查询的重要键令牌可能分散在许多其他块中,导致计算冗余。在这项工作中,我们提出了置换块稀疏注意力(PBS-Attn),这是一种即插即用的方法,利用注意力的置换性质来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面实验,结果表明PBS-Attn在模型精度上始终优于现有的块稀疏注意力方法,并紧密匹配全注意力基线。借助我们自定义的permuted-FlashAttention内核,PBS-Attn在长上下文预填充中实现了高达2.75倍的端到端加速,证实了其实用性。代码可在https://github.com/xinghaow99/pbs-attn获取。

英文摘要

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
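The "permutation properties of attention" that the method relies on can be checked numerically. The sketch below implements unmasked softmax attention in plain Python. It shows that reordering keys and values with the same permutation leaves every output row unchanged, while reordering queries simply reorders the output rows. Causal masking, which a real LLM prefill needs, is deliberately left out. How PBS-Attn chooses its permutations is not shown and is not implied by this example.

```python
import math
from typing import List, Sequence

Rows = Sequence[Sequence[float]]


def attention(q: Rows, k: Rows, v: Rows) -> List[List[float]]:
    """Unmasked scaled dot-product attention, one output row per query."""
    out: List[List[float]] = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(len(qi)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) / norm for d in range(len(v[0]))]
        )
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
perm = [2, 0, 1]

base = attention(q, k, v)
# Same permutation applied to keys and values: outputs match up to round-off.
kv_permuted = attention(q, [k[i] for i in perm], [v[i] for i in perm])
# Permuting queries permutes the output rows in the same way.
q_permuted = attention([q[1], q[0]], k, v)
```

This freedom to reorder tokens is what allows important keys for a query block to be grouped into fewer key blocks before a block-sparse kernel is applied.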

2510.16335 2026-05-25 cs.CV 版本更新

On the Provable Importance of Gradients for Language-Assisted Image Clustering

关于梯度在语言辅助图像聚类中可证明的重要性

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

发表机构 * University of Technology Sydney(悉尼科技大学)

AI总结 本文研究了语言辅助图像聚类(LaIC)问题,旨在利用文本语义提升图像表示的可区分性,从而改善图像聚类效果。由于缺乏真实的类别名称,如何从未标注的语料库中筛选出与图像语义相近的正名词是核心挑战。为此,作者提出了一种基于梯度的框架 GradNorm,通过反向传播的交叉熵梯度大小衡量名词的正相关性,并提供了理论误差界以保证其有效性,同时证明该方法能涵盖现有筛选策略。实验表明,GradNorm 在多个基准数据集上取得了最先进的聚类性能。

Comments revised and extended version of ICCV2025

AI中文摘要

本文研究了最近出现的语言辅助图像聚类(LaIC)问题,其中利用文本语义来改善视觉表示的可区分性以促进图像聚类。由于真实类别名称不可用,LaIC的核心挑战之一在于如何从未标记的野生语料数据中筛选正名词,即那些与感兴趣图像语义接近的名词。现有的筛选策略主要基于CLIP学习的现成特征空间;然而,尽管直观,这些策略缺乏严格的理论基础。为了填补这一空白,我们提出了一种新颖的基于梯度的框架,称为GradNorm,该框架具有理论保证并表现出强大的实证性能。特别地,我们根据从预测目标分布与softmax输出之间的交叉熵反向传播的梯度大小来衡量每个名词的正性。理论上,我们提供了严格的误差界来量化GradNorm对正名词的可分离性,并证明GradNorm自然地将现有筛选策略作为其极端特例。实证上,大量实验表明GradNorm在各种基准测试上达到了最先进的聚类性能。代码公开于https://github.com/60pen9/On-the-Provable-Importance-of-Gradients-for-Language-Assisted-Image-Clustering。

英文摘要

This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks. Code is publicly available at https://github.com/60pen9/On-the-Provable-Importance-of-Gradients-for-Language-Assisted-Image-Clustering.
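To make the gradient-magnitude idea concrete, the sketch below computes the gradient of a cross-entropy between a target distribution and a softmax output with respect to the logits, then takes its L1 norm. Two points are assumptions: that the gradient is taken with respect to the logits rather than network weights, and that the L1 norm is used. The abstract specifies neither, nor how the resulting score is thresholded or ranked. Function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad_wrt_logits(logits: Sequence[float], target: Sequence[float]) -> List[float]:
    """Gradient of H(target, softmax(logits)) w.r.t. logits.

    For a target that sums to 1 this is softmax(logits) - target.
    """
    return [p - t for p, t in zip(softmax(logits), target)]


def grad_norm_score(logits: Sequence[float], target: Sequence[float]) -> float:
    """L1 norm of the logit gradient (hypothetical scoring choice)."""
    return sum(abs(g) for g in ce_grad_wrt_logits(logits, target))
```

The score is zero exactly when the softmax output already equals the target distribution and grows as the two diverge.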

2510.09450 2026-05-25 cs.CV 版本更新

Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement Under Extreme Noise

基于动态权重的极端噪声下低光视频增强的时间聚合

Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai

发表机构 * Visual Information Laboratory, University of Bristol, United Kingdom(布里斯托大学视觉信息实验室)

AI总结 本文研究了在极端噪声环境下低光视频增强(LLVE)的问题,针对现有基于学习的方法在处理真实场景中严重噪声时效果不佳的问题,提出了一种新型的基于深度学习的递归框架DWTA-Net。该方法采用两阶段架构,第一阶段通过多帧对齐实现时序一致的Mamba增强,第二阶段利用动态权重引导的光流驱动的时序聚合进行递归细化,有效提升了视频的视觉质量。实验表明,DWTA-Net在噪声抑制和细节保留方面优于现有先进方法。

AI中文摘要

低光视频增强(LLVE)由于噪声、低对比度和颜色退化而具有挑战性。虽然基于学习的方法能够实现快速推理,但由于未能充分利用长期时间线索,它们在严重的真实噪声下常常失败。我们提出了DWTA-Net,一种采用递归设计的新型深度学习LLVE框架。DWTA-Net采用集成的两阶段架构:第一阶段通过多帧对齐恢复局部结构和颜色,实现时间一致的基于Mamba的增强;第二阶段使用新颖的基于动态权重的时间聚合(由光流引导)进行递归细化,作为适应运动的递归去噪器。我们进一步引入了一种纹理自适应损失,在保留纹理区域细节的同时抑制均匀区域中的噪声。在真实低光视频上的实验表明,DWTA-Net实现了更强的噪声抑制和更少的伪影,与最先进的方法相比,提供了优越的视觉质量。

英文摘要

Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradation. While learning-based methods enable fast inference, they often fail under heavy real-world noise because they do not sufficiently exploit long-term temporal cues. We propose DWTA-Net, a novel deep-learning LLVE framework with a recurrent design. DWTA-Net adopts an integrated two-stage architecture: Stage I restores local structure and color via multi-frame alignment for temporally consistent Mamba-based enhancement, while Stage II performs recurrent refinement using a novel dynamic weight-based temporal aggregation guided by optical flow, functioning as a recurrent denoiser that adapts to motion. We further introduce a texture-adaptive loss that preserves fine details in textured regions while suppressing noise in homogeneous areas. Experiments on real-world low-light footage show that DWTA-Net achieves stronger noise suppression and fewer artifacts, delivering superior visual quality compared with state-of-the-art methods.

2510.00948 2026-05-25 cs.CV 版本更新

InfVSR: Toward Consistency-Driven Streaming Generative Video Super-Resolution

InfVSR:迈向一致性驱动的流式生成视频超分辨率

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出了一种名为InfVSR的生成式视频超分辨率方法,旨在解决处理长序列视频时效率低和时序一致性差的问题。该方法将视频超分辨率重构为自回归单步扩散框架,通过因果结构的预训练DiT模型和滚动键值缓存机制,实现了流式推理并保持局部与全局的一致性。此外,研究还引入了块级像素监督和跨块分布匹配技术,显著提升了处理效率,并构建了一个针对长视频的评估基准,推动了长序列视频超分辨率领域的发展。

Comments Code and model are available at https://github.com/Kai-Liu001/InfVSR

AI中文摘要

真实世界视频通常包含数千帧。然而,现有的生成式视频超分辨率(VSR)方法在处理长序列时面临两个持续挑战:(1)由于对全长序列进行多步去噪的高成本导致的低效率;(2)时间分解导致伪影和不连续性,阻碍了良好的一致性。为突破这些限制,我们提出InfVSR,将VSR重新构建为自回归单步扩散范式,并利用视频扩散先验实现流式推理。首先,我们将预训练的DiT适配为因果结构,通过滚动KV缓存和联合视觉引导保持局部和全局一致性。其次,我们通过逐块像素监督和跨块分布匹配,高效地将扩散过程蒸馏为单步。为填补长视频评估的空白,我们构建了一个针对扩展序列的新基准,并引入语义级指标以全面评估时间一致性。我们的方法推动了长视频VSR的前沿,实现了具有增强语义一致性的最先进质量,并相比现有方法(如MGLD-VSR)提供了高达58倍的加速。我们的代码和模型可在https://github.com/Kai-Liu001/InfVSR获取。

英文摘要

Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency caused by temporal decomposition, which introduces artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.
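A rolling KV-cache, as mentioned in this abstract, keeps attention context bounded while chunks stream in. The sketch below shows only the bookkeeping, under stated assumptions: `RollingKVCache` is a hypothetical class, keys and values are opaque per-chunk objects, and the window is measured in chunks. The actual InfVSR cache layout and eviction policy may differ.

```python
from collections import deque
from typing import Deque, Generic, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class RollingKVCache(Generic[K, V]):
    """Keep the keys/values of only the most recent `max_chunks` chunks."""

    def __init__(self, max_chunks: int) -> None:
        if max_chunks < 1:
            raise ValueError("max_chunks must be at least 1")
        # deque with maxlen drops the oldest entry automatically on overflow.
        self._entries: Deque[Tuple[K, V]] = deque(maxlen=max_chunks)

    def append(self, keys: K, values: V) -> None:
        self._entries.append((keys, values))

    def context(self) -> Tuple[List[K], List[V]]:
        """Return cached keys and values, oldest chunk first."""
        return [k for k, _ in self._entries], [v for _, v in self._entries]
```

With a fixed window, memory and per-chunk attention cost stay constant however long the video runs, which is the property streaming inference depends on.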

2508.18958 2026-05-25 cs.CV cs.AI Version update

A drone-based framework for coral habitat mapping via weakly supervised segmentation

基于弱监督分割的无人机珊瑚栖息地制图框架

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

Affiliations: IFREMER Délégation Océan Indien (DOI); INRIA, LIRMM, Université de Montpellier, CNRS; UMR Marbec, IRD, Université de Montpellier, CNRS, Ifremer; CNRS, LIRMM, Université de Montpellier

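AI summary This paper proposes a drone-based weakly supervised segmentation framework for mapping coral habitats. By combining fine-grained multi-label predictions from underwater images with broad-coverage aerial data, the method trains high-resolution segmentation models without pixel-level annotations. Validated on coral reef imagery, it segments coral morphotypes over large areas and reaches 86.07% pixel accuracy and 52.23% mean Intersection over Union, showing its efficiency and suitability for ecological monitoring.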

Comments Extended journal version of "The Point is the Mask: Scaling coral reef segmentation with weak supervision"

Details
AI Chinese abstract

在大空间范围内获取像素级标注仍然是机器学习在生态应用中部署的主要瓶颈。本文提出了一种多尺度弱监督语义分割(WSSS)框架,能够利用密集的、基于分类的输出训练高分辨率分割模型。我们的方法将来自水下图像的细粒度多标签预测与广覆盖的航空数据相结合。将这些点级分类转换为粗监督掩码,用于训练无人机(UAV)正射影像上的语义分割模型。然后使用模型自身的细化预测进行第二步训练,以进一步提高空间精度,无需额外标注。我们在珊瑚礁图像上展示了该方法,实现了珊瑚形态类型的大面积分割,并展示了其整合新类别的灵活性。最终模型在人工标注的礁区上达到86.07%的像素精度和52.23%的平均交并比(mIoU),表明无需像素级标注即可获得准确的大规模珊瑚分割。通过跨尺度和跨模态连接图像分类与分割,该方法为标注不可用场景下部署分割模型提供了高效解决方案,并为生态学及其他领域的可扩展、高效监测开辟了机会。

English abstract

Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi-scale weakly supervised semantic segmentation (WSSS) framework that enables training high-resolution segmentation models from dense, classification-based outputs. Our method combines fine-scale, multi-label predictions from underwater imagery with broad-coverage aerial data. We convert these point-level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large-area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large-scale coral segmentation can be obtained without pixel-level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
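
The two reported metrics, pixel accuracy and mIoU, can be computed as in the sketch below. It uses the standard definitions and is not taken from the authors' code; the function name and the flat-list input format are assumptions made for illustration.

```python
def segmentation_metrics(pred, truth, num_classes):
    """Return (pixel_accuracy, mean_iou) for two equal-length flat label lists."""
    correct = sum(p == t for p, t in zip(pred, truth))
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return correct / len(truth), sum(ious) / len(ious)
```

Pixel accuracy rewards getting the dominant class right, while mIoU averages over classes, which is one reason the two numbers can differ as widely as they do in the abstract.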

2507.23372 2026-05-25 cs.CV Version update

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

UniEmo: 利用可学习专家查询统一情感理解与生成

Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie

Affiliations: Harbin Institute of Technology, Shenzhen; Great Bay University; School of Computing and Information Technology; Dongguan Key Laboratory for Intelligence and Information Technology; Shenzhen Loop Area Institute; Macao Polytechnic University

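AI summary This paper proposes UniEmo, a unified framework for emotional understanding and generation that couples the two tasks through learnable expert queries. A hierarchical emotional understanding chain progressively extracts multi-scale emotional features, which then guide a diffusion model to generate emotionally expressive images; an emotional correlation coefficient and an emotional condition loss are introduced to improve the diversity and fidelity of the generated images. Experiments show that UniEmo outperforms existing state-of-the-art methods on both emotional understanding and generation tasks.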

Comments Accepted to TIP 2026

Details
Journal ref
IEEE Transactions on Image Processing, vol. 35, pp. 5165-5180, 2026
AI Chinese abstract

情感理解和生成通常被视为独立的任务,然而它们本质上是互补的,可以相互增强。在本文中,我们提出UniEmo,一个无缝集成这两个任务的统一框架。关键挑战在于情感的抽象性质,需要提取对两个任务都有益的视觉表示。为此,我们提出一个带有可学习专家查询的分层情感理解链,逐步提取多尺度情感特征,从而作为统一的基础步骤。同时,我们融合这些专家查询和情感表示,以指导扩散模型生成引发情感反应的图像。为了增强生成情感图像的多样性和保真度,我们进一步在融合过程中引入情感相关系数和情感条件损失。这一步骤促进了由理解引导的情感生成的融合与对齐。反过来,我们证明联合训练允许生成部分向理解部分提供隐式反馈。此外,我们提出一种新颖的数据过滤算法,以选择由训练良好的模型生成的高质量和多样化的情感图像,这些图像显式地反馈到理解部分。这些生成驱动的双重反馈过程共同增强了模型的理解能力。大量实验表明,UniEmo在情感理解和生成任务上均显著优于现有方法。所提出方法的代码可在 https://github.com/JiuTian-VL/UniEmo 获取。

English abstract

Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

2507.12455 2026-05-25 cs.CV Version update

Mitigating Object Hallucinations via Sentence-Level Early Intervention

通过句子级早期干预缓解对象幻觉

Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian

Affiliations: Harbin Institute of Technology, Shenzhen; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen

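AI summary Multimodal large language models have made notable progress in cross-modal understanding, but their outputs often contain hallucinations that contradict the visual input. This paper proposes SENTINEL, a solution based on sentence-level early intervention: it iteratively samples and verifies model outputs to build high-quality in-domain preference data, then trains with a context-aware preference loss to suppress hallucinations. Experiments show that the method greatly reduces hallucinations on several benchmarks while preserving the model's general capabilities.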

Details
AI Chinese abstract

多模态大语言模型(MLLMs)彻底改变了跨模态理解,但仍难以应对幻觉——即与视觉输入相矛盾的虚构内容。现有的幻觉缓解方法要么计算成本过高,要么在训练数据和模型输出之间引入分布不匹配。我们识别出一个关键见解:幻觉主要出现在文本生成的早期阶段,并通过后续输出传播。为解决此问题,我们提出了SENTINEL(通过领域内偏好学习的句子级早期干预),一个消除对人工标注依赖的框架。具体来说,我们首先通过迭代采样模型输出、通过与两个开放词汇检测器的交叉验证来验证对象存在性,并将句子分类为幻觉/非幻觉类别,从而引导出高质量的领域内偏好对。随后,我们使用上下文连贯的正样本和幻觉负样本迭代构建上下文感知的偏好数据。最后,我们使用上下文感知偏好损失(C-DPO)训练模型,该损失在幻觉最初显现的句子级别强调判别性学习。实验结果表明,与原始模型相比,SENTINEL可将幻觉减少超过90%,并在幻觉基准和通用能力基准上均优于先前的最先进方法,展示了其优越性和泛化能力。模型、数据集和代码可在 https://github.com/pspdada/SENTINEL 获取。

English abstract

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
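
The cross-checking step can be sketched as follows. The abstract states only that object existence is validated with two open-vocabulary detectors; the agreement rule used here (both detectors must agree, and disagreement yields an "uncertain" label) is an assumption for illustration, and the detector outputs are represented as plain sets of object names.

```python
def classify_object(name, detector_a, detector_b):
    """Label one mentioned object from the agreement of two detector outputs."""
    in_a, in_b = name in detector_a, name in detector_b
    if in_a and in_b:
        return "present"
    if not in_a and not in_b:
        return "hallucinated"
    return "uncertain"  # detectors disagree (assumed rule, not from the paper)


def classify_sentence(mentioned, detector_a, detector_b):
    """A sentence is hallucinated if any object is; clean only if all are confirmed."""
    labels = [classify_object(m, detector_a, detector_b) for m in mentioned]
    if "hallucinated" in labels:
        return "hallucinated"
    if labels and all(label == "present" for label in labels):
        return "non-hallucinated"
    return "uncertain"
```

Under this assumed rule, sentences labeled "uncertain" would simply be left out when building preference pairs, so that only confidently labeled sentences serve as positives or negatives.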

2506.03530 2026-05-25 cs.MM cs.CL cs.CV Version update

How Far Are We from Generating Missing Modalities with Foundation Models?

我们距离用基础模型生成缺失模态还有多远?

Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

Affiliations: Institute of Data Science and Intelligent Decision Support, Beijing Jiaotong University; School of Computing and Information Systems, Singapore Management University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Computer Science and Technology, Harbin Institute of Technology

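AI summary This study examines the potential and limits of foundation models for generating missing modalities. It proposes three paradigms for missing modality reconstruction and systematically evaluates 42 model variants. The study finds that current foundation models fall short in fine-grained semantic extraction and in robust validation of generated modalities, which leads to suboptimal generations. The authors therefore propose an agentic framework that uses dynamic, modality-aware mining strategies and a self-refinement mechanism to improve reconstruction quality; experiments report at least a 14% reduction in FID for missing image reconstruction and at least a 10% reduction in MER for missing text reconstruction.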

Comments T-PAMI

Details
AI Chinese abstract

多模态基础模型在各种任务中展现了令人印象深刻的能力。然而,它们作为缺失模态重建的即插即用解决方案的潜力尚未被充分探索。为弥补这一差距,我们识别并形式化了三种可能的缺失模态重建范式,并跨这些范式进行了全面评估,覆盖了42个模型变体在重建准确性和下游任务适应性方面的表现。我们的分析表明,当前基础模型在两个关键方面往往表现不佳:(i) 从可用模态中提取细粒度语义,以及(ii) 对生成模态的稳健验证。这些限制导致了次优甚至有时不匹配的生成。为解决这些挑战,我们提出了一个专为缺失模态重建设计的智能框架。该框架根据输入上下文动态制定模态感知的挖掘策略,促进提取更丰富、更具判别性的语义特征。此外,我们引入了一种自精炼机制,通过内部反馈迭代验证和提升生成模态的质量。实验结果表明,与基线相比,我们的方法在缺失图像重建上FID降低了至少14%,在缺失文本重建上MER降低了至少10%。代码已发布在:https://github.com/Guanzhou-Ke/AFM2。

English abstract

Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.

2506.00560 2026-05-25 cs.RO cs.CV Version update

Using Ensemble Diffusion to Estimate Uncertainty for End-to-End Autonomous Driving

使用集成扩散估计端到端自动驾驶的不确定性

Florian Wintel, Sigmund H. Høeg, Gabriel Kiss, Frank Lindseth

Affiliations: Norwegian University of Science and Technology

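AI summary This paper proposes EnDfuser, an end-to-end autonomous driving system based on ensemble diffusion that estimates uncertainty in trajectory planning. By combining attention pooling and trajectory planning in a single diffusion transformer module, it fuses multi-source perception such as camera and LiDAR features and generates a set of candidate trajectories (128 in total) from a single perception frame, making the uncertain space of future trajectories interpretable. Experiments show that a simple safety rule built on this information improves the driving score by 1.7% on the LAV benchmark, indicating that ensemble diffusion can model the uncertainty of the posterior trajectory distribution in end-to-end driving policies.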

Comments Accepted at NLDL 2026

Details
AI Chinese abstract

端到端自动驾驶规划系统正在快速改进,特别是在CARLA等闭环模拟环境中。许多此类驾驶系统要么不考虑规划本身的不确定性,要么通过使用不泛化的专用表示来获取不确定性。在本文中,我们提出了EnDfuser,一个使用扩散模型作为轨迹规划器的端到端驾驶系统。EnDfuser通过将注意力池化和轨迹规划结合到一个单一的扩散变换器模块中,有效利用复杂的感知信息,如融合的相机和激光雷达特征。EnDfuser不承诺单一规划,而是通过集成扩散从单一感知帧生成候选轨迹分布(在我们的情况下为128个)。通过观察完整的候选轨迹集,EnDfuser为不确定的多模态未来轨迹空间提供了可解释性。利用这些信息,我们设计了一个简单的安全规则,在LAV基准上将系统的驾驶评分提高了1.7%。我们的发现表明,集成扩散作为传统点估计轨迹规划模块的直接替代品,可以通过建模后验轨迹分布的不确定性,为端到端驾驶策略中的不确定性感知决策过程做出贡献。

English abstract

End-to-end planning systems for autonomous driving are rapidly improving, especially in closed-loop simulation environments like CARLA. Many such driving systems either do not consider uncertainty as part of the plan itself or obtain it by using specialized representations that do not generalize. In this paper, we propose EnDfuser, an end-to-end driving system that uses a diffusion model as the trajectory planner. EnDfuser effectively leverages complex perception information, such as fused camera and LiDAR features, by combining attention pooling and trajectory planning into a single diffusion transformer module. Instead of committing to a single plan, EnDfuser produces a distribution of candidate trajectories (128 in our case) from a single perception frame through ensemble diffusion. By observing the full set of candidate trajectories, EnDfuser provides interpretability for uncertain, multimodal future trajectory spaces. Using this information, we design a simple safety rule that improves the system's driving score by 1.7% on the LAV benchmark. Our findings suggest that ensemble diffusion, used as a drop-in replacement for traditional point-estimate trajectory planning modules, can contribute to an uncertainty-aware decision-making process in end-to-end driving policies by modeling the uncertainty of the posterior trajectory distribution.
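
One way an ensemble of candidate trajectories could feed a safety rule is sketched below. The abstract does not specify the rule EnDfuser uses, so the spread measure, the threshold, and the speed-halving response are all hypothetical; trajectories are lists of (x, y) waypoints.

```python
import math


def endpoint_spread(trajectories):
    """Mean distance of each final waypoint from the mean final waypoint."""
    ends = [traj[-1] for traj in trajectories]
    mean_x = sum(x for x, _ in ends) / len(ends)
    mean_y = sum(y for _, y in ends) / len(ends)
    return sum(math.hypot(x - mean_x, y - mean_y) for x, y in ends) / len(ends)


def target_speed(trajectories, nominal_speed, spread_limit):
    """Halve the speed when candidates disagree by more than spread_limit (hypothetical rule)."""
    if endpoint_spread(trajectories) > spread_limit:
        return nominal_speed * 0.5
    return nominal_speed
```

The point of the sketch is only that a set of samples exposes disagreement, which a single point-estimate plan cannot do.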

2506.00474 2026-05-25 eess.IV cs.CV Version update

A European Multi-Center Breast Cancer MRI Dataset

欧洲多中心乳腺癌MRI数据集

Gustav Müller-Franzes, Lorena Escudero Sánchez, Nicholas Payne, Alexandra Athanasiou, Michael Kalogeropoulos, Aitor Lopez, Alfredo Miguel Soro Busto, Julia Camps Herrero, Nika Rasoolzadeh, Tianyu Zhang, Ritse Mann, Debora Jutz, Maike Bode, Christiane Kuhl, Yuan Gao, Wouter Veldhuis, Oliver Lester Saldanha, JieFu Zhu, Jakob Nikolas Kather, Daniel Truhn, Fiona J. Gilbert

Affiliations: University of Cambridge; MITERA Hospital; Ribera Salud Group; Radboud University Medical Center; University Hospital RWTH Aachen; University Medical Center Utrecht; University Hospital Carl Gustav Carus; EKFZ Technical University Dresden

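AI summary This study presents a publicly available European multi-center breast cancer MRI dataset, addressing the lack of large, diverse data for AI-assisted breast MRI diagnosis. The dataset contains 741 breast MRI examinations from six clinical institutions in five European countries, covers malignant, benign, and non-lesion cases, and was acquired with different scanners and parameters, reflecting real clinical variability. The study also runs baseline benchmarks with a transformer-based model to illustrate potential uses of the dataset and to provide reference performance for future method comparisons.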

Details
AI Chinese abstract

早期检测乳腺癌对于改善患者预后至关重要。虽然乳腺X线摄影仍是主要筛查手段,但磁共振成像(MRI)越来越多地被推荐作为乳腺组织致密女性及高风险女性的补充工具。然而,多参数乳腺MRI的采集和解读耗时且需要专业知识,限制了其在临床实践中的可扩展性。人工智能(AI)方法在支持乳腺MRI解读方面显示出潜力,但其发展受到大型、多样化和公开可访问数据集可用性有限的阻碍。为弥补这一差距,我们提供了一个公开可用的多中心乳腺MRI数据集,该数据集收集自五个欧洲国家的六个临床机构。该数据集包含741例接受筛查或诊断性乳腺MRI的女性检查,包括恶性、良性和非病灶病例。数据使用异构扫描仪、场强和采集协议获取,反映了真实世界的临床变异性。此外,我们报告了使用基于Transformer模型的基线基准实验,以说明该数据集的潜在用例,并为未来的方法比较提供参考性能。

English abstract

Early detection of breast cancer is critical for improving patient outcomes. While mammography remains the primary screening modality, magnetic resonance imaging (MRI) is increasingly recommended as a supplemental tool for women with dense breast tissue and those at elevated risk. However, the acquisition and interpretation of multiparametric breast MRI are time-consuming and require specialized expertise, limiting scalability in clinical practice. Artificial intelligence (AI) methods have shown promise in supporting breast MRI interpretation, but their development is hindered by the limited availability of large, diverse, and publicly accessible datasets. To address this gap, we present a publicly available, multi-centre breast MRI dataset collected across six clinical institutions in five European countries. The dataset comprises 741 examinations from women undergoing screening or diagnostic breast MRI and includes malignant, benign, and non-lesion cases. Data were acquired using heterogeneous scanners, field strengths, and acquisition protocols, reflecting real-world clinical variability. In addition, we report baseline benchmark experiments using a transformer-based model to illustrate potential use cases of the dataset and to provide reference performance for future methodological comparisons.

2505.17015 2026-05-25 cs.CV cs.CL Version update

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-SpatialMLLM: 多模态大语言模型的多帧空间理解

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

Affiliations: FAIR, Meta; The Chinese University of Hong Kong

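AI summary This paper proposes Multi-SpatialMLLM, a multi-modal large language model framework that strengthens spatial understanding of multi-frame scenes. By integrating fundamental spatial skills such as depth perception, visual correspondence, and dynamic perception, and by building the MultiSPA dataset of more than 27 million samples, the method markedly improves performance on multi-frame spatial tasks. Experiments show that Multi-SpatialMLLM outperforms baseline models and proprietary systems on a range of spatial tasks, exhibits generalization and multi-task benefits in challenging scenarios, and can serve as a multi-frame reward annotator for robotics.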

Comments CVPR 2026 Camera Ready. 27 pages. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

Details
AI Chinese abstract

多模态大语言模型(MLLMs)在视觉任务中取得了快速进展,但其空间理解仍局限于单张图像,使其不适合需要多帧推理的物理世界应用。在本文中,我们提出一个框架,通过整合基本空间技能(包括深度感知、视觉对应和动态感知)来赋予MLLMs多帧空间理解能力。我们设计了一个新颖的数据管道,并收集了包含超过2700万个样本的MultiSPA数据集,涵盖多样的3D和4D场景,以支持训练。除了MultiSPA,我们还引入了一个全面的基准测试,在统一的度量标准下测试广泛的空间任务。我们的最终模型Multi-SpatialMLLM在基线和专有系统上取得了显著提升,展示了可扩展和可泛化的多帧感知能力。我们进一步观察到多任务收益和在挑战性场景中的新兴空间能力,并展示了我们的模型如何作为机器人学的多帧奖励标注器。

English abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

2503.12868 2026-05-25 cs.CV Version update

UniReg: A Universal Model for Controllable CT Image Registration

UniReg: 一种用于可控CT图像配准的通用模型

Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Cheng Chen, Dakai Jin

Affiliations: The University of Hong Kong; DAMO Academy, Alibaba Group; The First Affiliated Hospital of Zhejiang University

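AI summary This paper proposes UniReg, a universal, controllable CT image registration model that addresses the poor generalization of existing methods across clinical scenarios and the need to train a separate network per task. UniReg combines the accuracy of task-specific learning methods with the generalization of traditional optimization methods in a unified registration framework that adaptively estimates deformation fields conditioned on anatomical structure priors, registration type constraints, and instance-specific features. Experiments on multiple CT/MR registration datasets show better average registration accuracy than existing state-of-the-art learning-based methods, with substantially lower model redundancy and training cost.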

Details
AI Chinese abstract

基于学习的医学图像配准在匹配传统方法精度的同时,提供了优越的计算效率。然而,现有方法在不同临床场景中泛化能力差,需要为特定配准任务(如个体间/个体内配准或解剖区域特定对齐)开发多个孤立的网络,导致开发流程繁琐。为克服这一局限,我们提出了UniReg,首个用于多场景CT图像配准的条件统一模型,它结合了任务特定学习方法的精度优势和传统优化方法的泛化能力。我们的关键创新是一个统一的配准框架,该框架根据以下条件自适应估计变形场:(1)解剖结构先验,(2)配准类型约束(个体间/个体内),以及(3)实例特定特征,从而在单个模型中实现跨异构场景的最优对齐。通过在多个CT/MR配准数据集上的全面实验,UniReg相比当前最先进的基于学习方法取得了更优的平均配准精度,同时展现出强大的跨场景泛化能力。此外,通过用一个紧凑的统一模型替代多个孤立的任务特定模型,UniReg显著降低了总体训练负担,包括总训练成本和模型冗余。

English abstract

Learning-based medical image registration has matched the accuracy of conventional methods while offering superior computational efficiency. However, existing approaches suffer from poor generalization across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or anatomical region-specific alignment, leading to cumbersome development pipelines. To overcome this limitation, we propose UniReg, the first conditional unified model for multi-scenario CT image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified registration framework that adaptively estimates deformation fields conditioned on: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling optimal alignment across heterogeneous scenarios within a single model. Through comprehensive experiments on multiple CT/MR registration datasets, UniReg achieves superior average registration accuracy compared with current state-of-the-art learning-based methods while exhibiting strong cross-scenario generalization. Moreover, by replacing multiple isolated task-specific models with a compact unified model, UniReg substantially reduces the overall training burden in terms of total training cost and model redundancy.

2503.06684 2026-05-25 cs.CV Version update

PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

PixelPonder: 动态补丁自适应增强多条件文本到图像生成

Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang

Affiliations: Fudan University; Tencent Youtu Lab; Western University; University of Chinese Academy of Sciences

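AI summary This paper proposes PixelPonder, a unified control framework for multi-conditional text-to-image generation that reconciles semantic fidelity and visual quality across multiple heterogeneous control signals. A patch-level adaptive condition selection mechanism dynamically prioritizes spatially relevant control signals at the sub-region level, giving precise local guidance without global interference; a time-aware control injection scheme adjusts condition influence according to the denoising timestep, moving gradually from structure preservation to texture refinement. Experiments show that PixelPonder outperforms existing methods on several benchmark datasets in both spatial alignment accuracy and textual semantic consistency.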

Details
AI Chinese abstract

最近基于扩散的文本到图像生成通过视觉条件控制取得了有希望的结果。然而,现有的ControlNet类方法在处理组合视觉条件时存在困难——在多个异质控制信号之间同时保持语义保真度,同时维持高视觉质量,它们采用独立的控制分支,在去噪过程中常常引入冲突的引导,导致生成图像中出现结构失真和伪影。为了解决这个问题,我们提出了PixelPonder,一种新颖的统一控制框架,允许在单一控制结构下有效控制多个视觉条件。具体来说,我们设计了一种补丁级自适应条件选择机制,在子区域级别动态优先考虑空间相关的控制信号,实现精确的局部引导而无需全局干扰。此外,部署了一种时间感知控制注入方案,根据去噪时间步调节条件影响,逐步从结构保留过渡到纹理细化,充分利用不同类别的控制信息以促进更和谐的图像生成。大量实验表明,PixelPonder在多个基准数据集上超越了先前的方法,在保持高文本语义一致性的同时,在空间对齐精度上显示出优越的提升。

English abstract

Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning, that is, simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. They employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework that allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.
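
The two mechanisms named in the abstract can be pictured with a toy sketch. Neither function reflects PixelPonder's actual design: the per-patch argmax, the linear schedule, and the convention that step 0 is the noisiest step are assumptions chosen to keep the example small.

```python
def select_conditions(relevance):
    """relevance[patch][cond] -> index of the most relevant condition for each patch."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in relevance]


def control_weight(step, total_steps, kind):
    """Linear schedule: structural conditions dominate early steps, texture ones late steps."""
    progress = step / (total_steps - 1)
    return 1.0 - progress if kind == "structure" else progress
```

Choosing one condition per patch avoids summing conflicting guidance from separate branches, and the schedule shifts influence from structure toward texture as denoising proceeds.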

2503.05534 2026-05-25 cs.CV Version update

S4M: 4-points to Segment Anything

S4M: 4点分割一切

Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy

Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; Department of General Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan

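AI summary This paper proposes S4M, which addresses the loss of segmentation accuracy that ambiguous point prompts cause for the Segment Anything Model (SAM) in medical image segmentation. It introduces a structured four-point prompting strategy that uses extreme points or major/minor axis endpoints as an instance-level shape description. By expanding the prompt space and adding an auxiliary "Canvas" pretext task, S4M improves the model's geometric reasoning; experiments on several ultrasound and surgical endoscopy datasets show improved segmentation performance and reduced clinical annotation effort.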

Details
AI Chinese abstract

目的:Segment Anything Model (SAM) 有望缓解医学分割中的标注瓶颈,但重叠解剖结构和模糊边界使其点提示存在歧义,导致需要反复手动细化才能获得精确掩膜。需要更好的提示策略。方法:我们提出一种结构化提示策略,使用4个点作为紧凑的实例级形状描述。受超声测量实践启发,我们研究了两种4点变体:极值点和提出的长短轴端点。SAM无法充分利用此类结构化提示,因为它将所有点等同对待,缺乏几何感知推理。为解决此问题,我们引入S4M(4点分割一切),它增强SAM以将4点解释为关系线索而非孤立点击。S4M通过角色特定嵌入扩展提示空间,并添加辅助“画布”前置任务,直接从提示草绘粗略掩膜,促进几何感知推理。结果:在超声和手术内镜的八个数据集上,在相同提示预算下,S4M比强SAM基线提升+3.42 mIoU。与三位临床医生的标注研究进一步表明,长短轴提示可实现更快的标注。结论:S4M提高了性能,减少了标注工作量,并使提示与临床实践对齐,从而在医学影像中实现更可扩展的数据集开发。我们在https://github.com/CAMMA-public/S4M发布代码和预训练模型。

English abstract

Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging. We release our code and pretrained models at https://github.com/CAMMA-public/S4M.
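
Of the two 4-point variants, extreme points are the simpler to derive from a ground-truth mask, for example when simulating prompts during training. The sketch below is an illustration under that assumption, not code from the S4M repository; the mask is a nested list of 0/1 values and points are (row, col) tuples.

```python
def extreme_points(mask):
    """Return (left, right, top, bottom) foreground pixels as (row, col) tuples."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask has no foreground pixels")
    left = min(pixels, key=lambda p: p[1])
    right = max(pixels, key=lambda p: p[1])
    top = min(pixels, key=lambda p: p[0])
    bottom = max(pixels, key=lambda p: p[0])
    return left, right, top, bottom
```

Each of the four points carries a distinct role, which is the kind of relational information that role-specific embeddings are meant to preserve.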

2502.04415 2026-05-25 cs.CV cs.AI Version update

TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

TerraQ:卫星图像档案的时空问答

Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis

Affiliations: Dept. of Informatics and Telecommunications; National and Kapodistrian University of Athens; Archimedes/Athena RC

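AI summary TerraQ is a spatiotemporal question-answering system for satellite image archives that retrieves satellite images matching a natural-language request. It combines natural language processing with a specialized knowledge base and supports requests that refer to image metadata and geographic entities. Its main contribution is making Earth Observation data more accessible.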

Details
AI Chinese abstract

TerraQ是一个用于卫星图像档案的时空问答引擎。它是一个自然语言处理系统,旨在处理满足特定条件的卫星图像请求。这些请求可以引用图像元数据和来自专门知识库(例如,艾米利亚-罗马涅大区)的实体。通过它,用户可以提出诸如“给我一百张法国港口附近河流的图像,雪覆盖率低于20%,云覆盖率高于10%”之类的请求,从而使地球观测数据更易于访问,符合当前数字助手的趋势。

English abstract

TerraQ is a spatiotemporal question-answering engine for satellite image archives. It is a natural language processing system that is built to process requests for satellite images satisfying certain criteria. The requests can refer to image metadata and entities from a specialized knowledge base (e.g., the Emilia-Romagna region). With it, users can make requests like "Give me a hundred images of rivers near ports in France, with less than 20% snow coverage and more than 10% cloud coverage", thus making Earth Observation data more easily accessible, in line with the current landscape of digital assistants.

2407.03535 2026-05-25 cs.CV Version update

BVI-RLV: A Fully Registered Dataset for Low-Light Video Enhancement

BVI-RLV:一个完全配准的低光视频增强数据集

Ruirui Lin, Guoxi Huang, Joanne Lin, Qi Sun, Alexandra Malyugina, David R Bull, Nantheera Anantrasirichai

Affiliations: Visual Information Laboratory, Bristol Vision Institute (BVI), University of Bristol

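AI summary Low-light videos often contain spatiotemporally incoherent noise that reduces visibility and degrades computer vision performance. To address the lack of high-quality aligned training data for deep-learning-based enhancement, this paper presents the BVI-RLV dataset, with more than 30k paired low-light and normal-light frames from 40 scenes and sub-pixel registration. The dataset covers dynamic motion scenarios and comes with baseline implementations for several model families; experiments show that registration matters for supervised learning and that models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluation.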

Comments arXiv admin note: text overlap with arXiv:2402.01970

Details
AI Chinese abstract

低光视频通常表现出时空不连贯的噪声,损害可见性并降低计算机视觉应用的性能。使用深度学习增强此类内容的一个主要挑战在于缺乏像素对齐的高质量训练数据。我们引入了BVI-RLV,一个完全配准的低光视频数据集,包含来自40个不同场景的超过3万对帧,在两种低光条件下,每个帧都与正常光照的真实值对齐。与依赖中性密度(ND)滤波器或存在未对齐问题的现有数据集不同,BVI-RLV通过使用电动滑轨和基于图像的优化,在动态运动场景下实现了全高清分辨率下99.24%数据的亚像素配准。该数据集涵盖了广泛的运动类型和真实的时间噪声。我们还提供了使用四种代表性架构的基线实现:卷积神经网络(CNN)、Transformer、状态空间模型(Mamba)和扩散模型(DM)。实验表明,配准对于监督学习至关重要,与未配准训练相比,PSNR提升高达5.85 dB。在跨数据集评估中,基于BVI-RLV训练的模型优于基于现有数据集训练的模型,即使在真实户外场景中也取得了优越性能。我们的数据集公开于https://doi.org/10.21227/mzny-8c77。

English abstract

Low-light videos often exhibit spatiotemporally incoherent noise, compromising visibility and degrading performance in computer vision applications. A major challenge for enhancing such content using deep learning lies in the scarcity of pixel-aligned, high-quality training data. We introduce BVI-RLV, a fully registered low-light video dataset comprising over 30k paired frames from 40 diverse scenes under two low-light conditions, each aligned with normal-light ground truth. Unlike existing datasets that rely on neutral density (ND) filters or suffer from misalignment issues, BVI-RLV achieves sub-pixel registration for 99.24% of data at full HD resolution across dynamic motion scenarios using a motorized dolly and image-based refinement. The dataset covers a wide range of motion types and realistic temporal noise. We also provide baseline implementations using four representative architectures: Convolutional Neural Network (CNN), Transformer, State Space Model (Mamba), and Diffusion Model (DM). Experiments demonstrate that registration is crucial for supervised learning, yielding up to 5.85 dB PSNR improvement compared to unregistered training. Models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluations, achieving superior performance even in real-world outdoor scenes. Our dataset is publicly available at https://doi.org/10.21227/mzny-8c77.
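
The PSNR figure quoted above follows the standard definition, sketched here for flat lists of pixel values. The function name and the 8-bit peak value of 255 are assumptions; the abstract does not state the evaluation settings.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 5.85 dB at a fixed peak value corresponds to an MSE that is roughly 3.8 times lower.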

2605.22942 2026-05-25 cs.CV Version update

Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

改进的视觉到图表浮标关联:学习世界到图像投影

Borja Carrillo-Perez

发表机构 * Arquimea Research Center(阿基米德研究中心)

AI总结 本文针对MaCVi 2026视觉-海图数据关联挑战,提出了一种对基于DETR的融合变压器基线的轻量改进方法。通过引入一个专门的多层感知机(QueryMLP),该方法能够从海图测量和IMU姿态数据中显式预测浮标在图像中的水线接触点,从而为每个浮标提供直接的空间先验信息,减轻了变压器解码器的几何推理负担。该方法在测试集上取得了总体得分为0.7386(F1=0.8055,mIoU=0.6718)的优异性能,位列挑战赛提交结果的第二名。

Comments 5 pages, 3 figures. Technical report for the MaCVi 2026 Vision-to-Chart Data Association Challenge at the CVPR 2026 Workshop; 2nd place submission. Code: https://github.com/bcarrpe/macvi26-visionmap-querymlp

Details
AI Chinese abstract

本报告提出了对基于DETR的融合变压器基线的一种轻量级修改,用于MaCVi 2026视觉到图表数据关联挑战。挑战基线解码器接收每个浮标的查询,编码世界空间距离和方位,迫使变压器隐式学习从世界坐标到图像像素的复杂几何投影。相反,本工作训练了一个额外的专用MLP(QueryMLP),以根据图表测量和IMU方向数据显式预测浮标在水线处的接触点图像坐标。预测的像素坐标被附加到基线解码器查询向量中,为每个浮标提供直接的空间先验,并减轻变压器解码器的几何推理负担。在挑战排行榜上,所提出的方法在保留测试集上取得了Overall 0.7386、F1 0.8055、mIoU 0.6718的成绩,在所有提交中排名第二。

English abstract

This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.
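
For intuition about the world-to-image mapping involved, the sketch below projects a buoy's waterline point with an idealized pinhole camera. It is not the report's QueryMLP, which is learned and also takes IMU orientation as input; here the camera is assumed level (zero pitch and roll), and the function name, parameters, and intrinsics are all hypothetical.

```python
import math


def project_waterline_point(distance_m, rel_bearing_rad, cam_height_m, fx, fy, cx, cy):
    """Project a point on the water surface into pixel coordinates (level pinhole camera)."""
    x = distance_m * math.sin(rel_bearing_rad)  # metres to the right of the optical axis
    z = distance_m * math.cos(rel_bearing_rad)  # metres along the optical axis
    if z <= 0:
        return None  # point is beside or behind the camera
    u = cx + fx * x / z
    v = cy + fy * cam_height_m / z  # water surface lies cam_height_m below the camera
    return u, v
```

A closed-form projection like this breaks down when the vessel pitches and rolls or the calibration is imperfect, which is a plausible motivation for learning the mapping from data instead.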

2605.22907 2026-05-25 cs.CV Version update

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

Haichen He, Jiayi Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang, Kaiyang Zhou

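Affiliations: Hong Kong Baptist University; S-Lab, Nanyang Technological University; GVC Lab, Great Bay University

AI summary: VideoOdyssey is a new benchmark for ultra-long-context, multimodal video understanding that evaluates how well models sustain tracking, information integration, and memory retention over long videos. It combines extreme video durations, diverse scenarios, and multi-level continuous certificates to measure model performance under different cognitive loads. The study finds that current multimodal large language models still face significant challenges in ultra-long-context reasoning, fine-grained perception, and non-verbal multimodal understanding.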

Abstract

Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this cognitive load, we emphasize continuous certificate length, defined as the video length a human must continuously watch to definitively answer a given question. Driven by this metric, we introduce VideoOdyssey, a benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey is characterized by three key features: 1) Extreme video duration and diversity: spanning 11 domains and 54 subcategories with an average video duration of 109 minutes; 2) Comprehensive evaluation scenarios: offering two subsets to address different research focuses, i.e., VideoOdyssey-V for probing the limits of visual understanding in MLLMs, and VideoOdyssey-AV for evaluating synchronized audio-visual understanding for omni-modal models; 3) Ultra-long and multi-level continuous certificates: extending the average continuous certificate to 16 minutes for VideoOdyssey-V and 12.8 minutes for VideoOdyssey-AV. Crucially, we design 5 granular levels from seconds to hours, providing a comprehensive diagnostic tool to evaluate models across varying context lengths and cognitive loads. Extensive evaluations show that bottlenecks of current MLLMs extend beyond simple retrieval to include struggles with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding.

2605.22903 2026-05-25 cs.CV cs.AI cs.CL Updated version

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

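Affiliations: University of Chicago; Stony Brook University; Toyota Technological Institute at Chicago

AI summary: This study questions whether current benchmarks for vision-language models (VLMs) truly measure how much models rely on visual evidence. Through a systematic behavioral analysis of several open-source models, it finds that although VLMs do use visual input, their predictions are less sensitive to the loss of fine-grained visual information than standard accuracy implies. A representation-level analysis shows visual features becoming increasingly similar in deeper layers, which offers a possible explanation and suggests that existing benchmarks may not effectively evaluate fine-grained visual understanding.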

Comments Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. Accepted version.

Abstract

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by the surprising observation that removing a substantial fraction of image tokens degrades model performance only slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, covering global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.
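One way to quantify "increasing similarity among visual tokens" is the mean pairwise cosine similarity of the token vectors at a given layer. The abstract does not name the exact statistic, so cosine similarity is an assumption here, and the toy vectors below are illustrative rather than taken from any model.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all unordered token pairs."""
    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)

# Toy stand-ins for visual token embeddings at two depths (illustrative only).
shallow_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
deep_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]

shallow_sim = mean_pairwise_cosine(shallow_layer)
deep_sim = mean_pairwise_cosine(deep_layer)
```

If tokens collapse toward a common direction in deep layers, removing a fraction of them discards little information that the remaining tokens do not already carry, which is consistent with the small accuracy drop the abstract reports.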

2605.22890 2026-05-25 cs.RO cs.CV Updated version

Extending Deep Event Visual Odometry with Sparse Point-Cloud Export

Alireza Safdari, Sajad Ashraf

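Affiliations: none listed (the source field holds only the author labels "1st Sajad Ashraf" and "2nd Alireza Safdari")

AI summary: This work extends the Deep Event Visual Odometry (DEVO) system for event-camera visual odometry under high-speed motion and difficult lighting by adding a sparse point-cloud output module. It extracts the 3D structure that DEVO estimates internally and converts it into an explicit point cloud for visualization and downstream processing, leaving the original odometry pipeline intact. Experiments show that the resulting sparse point cloud is locally consistent and reaches high precision, while also exposing limitations in density, completeness, and sensitivity to accumulated odometry noise.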

Comments 9 pages, 4 figures, 5 tables

Abstract

Event cameras are well suited for visual odometry under high-speed motion and challenging lighting conditions due to their low latency, high temporal resolution, and high dynamic range. Deep Event Visual Odometry (DEVO) demonstrated that monocular event-only odometry can achieve strong performance by combining sparse patch tracking, learned patch selection, recurrent correspondence refinement, and differentiable bundle adjustment. In this project, we extend DEVO with a sparse point-cloud export pipeline. Rather than modifying the core odometry formulation, our approach exposes the internal 3D structure already estimated by DEVO and converts it into an explicit point-cloud representation for visualization and further processing. In addition, we implement a practical workflow for data export, format conversion, and point-cloud cleanup. The resulting system preserves the original visual odometry pipeline while enabling sparse geometric scene output. Experiments on the BOARD SLOW sequence show that the exported sparse cloud is locally consistent with EMVS reconstructions, achieving high precision at a 5 cm threshold, while also highlighting the expected limitations in density, completeness, and sensitivity to accumulated odometry noise.
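Precision at a distance threshold is commonly computed as the fraction of exported points that lie within the threshold of their nearest reference point. The abstract does not spell out the evaluation protocol, so the following is a minimal sketch under that common definition; the function name and the toy coordinates are hypothetical, and a real evaluation would use a spatial index rather than brute-force search.

```python
import math

def precision_at_threshold(exported, reference, threshold_m=0.05):
    """Fraction of exported points within threshold_m of the reference cloud."""
    if not exported or not reference:
        raise ValueError("both clouds must be non-empty")
    hits = 0
    for p in exported:
        # Brute-force nearest neighbour; adequate for a small sparse cloud.
        nearest = min(math.dist(p, q) for q in reference)
        if nearest <= threshold_m:
            hits += 1
    return hits / len(exported)

# Toy clouds in metres (illustrative only).
reference_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
exported_cloud = [(0.0, 0.0, 0.01), (1.0, 0.04, 0.0), (0.5, 0.0, 0.0)]
precision = precision_at_threshold(exported_cloud, reference_cloud)
```

Precision of this kind says nothing about completeness: a very sparse cloud can score highly while covering little of the scene, which matches the density and completeness limitations the abstract notes.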

2605.22872 2026-05-25 cs.LG cs.AI cs.CV Updated version

MedExpMem: Adapting Experience Memory for Differential Diagnosis

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou

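Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University

AI summary: This paper proposes MedExpMem, an experience memory framework that aims to improve the differential diagnosis ability of diagnostic agents built on vision-language models. The method records the model's own diagnostic failures, generates pairwise differential notes containing key discriminators, decision rules, and reasoning error patterns, and uses a two-phase construction process that mirrors how physicians learn. Experiments on a radiology benchmark spanning multiple subspecialties show improved diagnostic accuracy, supporting its value for medical adaptation.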

Comments MICCAI 2026 Early Accept. Submission Version

Abstract

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability: their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes it as pairwise differential notes encoding key discriminators, actionable decision rules, and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements of up to 7.0% across diverse models and scales. Analytical experiments validate experience quality and robustness, showing that MedExpMem is a competitive method that addresses medical adaptation needs beyond the reach of parametric learning.
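The pairwise differential notes described above suggest a memory keyed by an unordered pair of confusable diagnoses. The following is a minimal sketch of such a structure, assuming exact-match lookup on candidate pairs; the class names, fields, and retrieval rule are hypothetical and do not reproduce MedExpMem's actual storage or retrieval, and the note contents are placeholders rather than clinical facts.

```python
from dataclasses import dataclass, field

@dataclass
class DifferentialNote:
    """One note for a pair of confusable diagnoses (hypothetical schema)."""
    discriminators: list
    decision_rules: list
    error_patterns: list

@dataclass
class ExperienceMemory:
    notes: dict = field(default_factory=dict)

    @staticmethod
    def _key(dx_a, dx_b):
        # Unordered pair, so (A, B) and (B, A) share one note.
        return frozenset((dx_a, dx_b))

    def add(self, dx_a, dx_b, note):
        self.notes[self._key(dx_a, dx_b)] = note

    def retrieve(self, candidates):
        """Return stored notes for every pair among the candidate diagnoses."""
        found = []
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                note = self.notes.get(self._key(a, b))
                if note is not None:
                    found.append(((a, b), note))
        return found

memory = ExperienceMemory()
memory.add("condition A", "condition B", DifferentialNote(
    discriminators=["placeholder imaging feature"],
    decision_rules=["placeholder rule"],
    error_patterns=["placeholder error pattern"],
))
hits = memory.retrieve(["condition B", "condition C", "condition A"])
```

Keying on the pair rather than on a single disease is what separates this from an encyclopedic lookup: the stored content is about telling two conditions apart.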

2605.19859 2026-05-25 cs.CV Updated version

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez

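Affiliations: Idiap Research Institute

AI summary: This paper presents EyeVLM, a systematic framework for evaluating gaze understanding in vision-language models (VLMs), focused on two tasks: gaze following and social gaze prediction. It compares several state-of-the-art VLMs under different prompting strategies in a zero-shot setting and after fine-tuning, and systematically compares them with purely visual models. The study finds that current VLMs still fall clearly short of precise gaze understanding and that further improvements to models and training strategies are needed.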

Comments Under review

Abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

2604.19995 2026-05-25 cs.CV Updated version

A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement

Haoning Xue, Jingwen Zhang, Xiaohui Wang, Diane Dagyong Kim, Yunya Song

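Affiliations: Department of Communication, University of Utah; Department of Communication, University of California, Davis; Department of Media and Communication, City University of Hong Kong; Division of Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology

AI summary: This paper proposes a model that computes the Message Sensation Value (MSV) of short videos from multimodal features in order to predict viewers' sensory and behavioral engagement. The model combines multimodal feature analysis with human evaluation of 1,200 short videos and is validated on unseen data from three short video platforms (combined N = 14,492). MSV is positively associated with sensory engagement but has an inverted U-shaped relationship with behavioral engagement. The study deepens the theoretical understanding of short video engagement and provides a robust computational tool for short video research.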

Abstract

The contemporary media landscape is characterized by sensational short videos. While prior research examines the effects of individual multimodal features, the collective impact of multimodal features on viewer engagement with short videos remains unknown. Grounded in the theoretical framework of Message Sensation Value (MSV), this study develops and tests a computational model of MSV with multimodal feature analysis and human evaluation of 1,200 short videos. The model predicts sensory and behavioral engagement and was further validated across two unseen datasets from three short video platforms (combined N = 14,492). While MSV is positively associated with sensory engagement, it shows an inverted U-shaped relationship with behavioral engagement: higher MSV elicits stronger sensory stimulation, but moderate MSV optimizes behavioral engagement. This research advances the theoretical understanding of short video engagement and introduces a robust computational tool for short video research.
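An inverted U-shaped relationship is often modeled with a quadratic term whose coefficient is negative, in which case the engagement-maximizing MSV sits at the vertex of the parabola. The abstract does not state the study's model specification, so the quadratic form and every coefficient below are assumptions made purely for illustration.

```python
def quadratic_engagement(msv, b0, b1, b2):
    """Predicted engagement under an assumed quadratic model."""
    return b0 + b1 * msv + b2 * msv ** 2

def peak_msv(b1, b2):
    """MSV at which the quadratic model is maximal (requires b2 < 0)."""
    if b2 >= 0:
        raise ValueError("no interior maximum: b2 must be negative")
    return -b1 / (2 * b2)

# Hypothetical coefficients, not estimates from the study.
b0, b1, b2 = 0.0, 4.0, -1.0
best = peak_msv(b1, b2)
at_peak = quadratic_engagement(best, b0, b1, b2)
```

With these placeholder coefficients the predicted engagement rises up to the vertex and falls beyond it, which is the shape the abstract describes for behavioral engagement.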

2503.20066 2026-05-25 cs.RO cs.CV Updated version

Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

Zhirui Dai, Hojoon Shin, Yulun Tian, Ki Myung Brian Lee, Nikolay Atanasov

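Affiliations: Department of Electrical and Computer Engineering, University of California San Diego; Brain Corporation; Robotics Department, University of Michigan

AI summary: This paper proposes a new neural implicit representation, the signed directional distance function (SDDF), to address efficiency and accuracy problems in 3D reconstruction and differentiable rendering. SDDF takes a position and a viewing direction as input and directly outputs the distance to the surface, enabling efficient and accurate geometric reconstruction. To improve learning efficiency, the authors combine explicit ellipsoid priors with implicit neural residuals in a differentiable hybrid representation that handles distance discontinuities at obstacle boundaries and compares favorably with existing methods on several metrics.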

Journal ref 2026 IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

Dense reconstruction and differentiable rendering are fundamental, tightly connected operations in 3D vision and computer graphics. Recent neural implicit representations demonstrate compelling advantages in reconstruction fidelity and differentiability over conventional discrete representations such as meshes, point clouds, and voxels. However, many neural implicit models, such as neural radiance fields (NeRF) and signed distance function (SDF) networks, are inefficient in rendering due to the need to perform multiple queries along each camera ray. Moreover, NeRF and Gaussian Splatting methods offer impressive photometric reconstruction but often require careful supervision to achieve accurate geometric reconstruction. To address these challenges, we propose a novel representation called signed directional distance function (SDDF). Unlike SDF and similar to NeRF, SDDF takes a position and a viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides the distance to the observed surface rather than integrating along the view ray. As a result, SDDF achieves accurate geometric reconstruction and efficient differentiable directional distance prediction. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This allows the model to handle distance discontinuities around obstacle boundaries effectively while preserving the ability for dense high-fidelity distance prediction. Through extensive evaluation against state-of-the-art representations, we show that SDDF achieves (i) competitive SDDF prediction accuracy, (ii) faster prediction speed than SDF and NeRF, and (iii) superior geometric consistency compared to NeRF and Gaussian Splatting.
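To make the directional distance concrete, the sketch below computes the distance from a point along a unit viewing direction to an axis-aligned ellipsoid in closed form, the kind of quantity an explicit ellipsoid prior could supply in a single query. It is an illustration only: the axis-aligned restriction, the function name, and the choice to return infinity for a miss are assumptions, and neither the paper's sign convention nor its neural residual is reproduced.

```python
import math

def ray_ellipsoid_distance(p, v, center, radii):
    """Distance from p along unit direction v to an axis-aligned ellipsoid.

    Returns math.inf if the ray never reaches the surface.
    """
    # Scale space so the ellipsoid becomes a unit sphere. The ray parameter t
    # is unchanged because position and direction are scaled the same way.
    ps = [(pi - ci) / ri for pi, ci, ri in zip(p, center, radii)]
    vs = [vi / ri for vi, ri in zip(v, radii)]
    a = sum(x * x for x in vs)
    b = 2.0 * sum(x * y for x, y in zip(ps, vs))
    c = sum(x * x for x in ps) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf  # the ray line misses the ellipsoid
    root = math.sqrt(disc)
    # Roots in increasing order; take the first one in front of the point.
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0.0:
            return t
    return math.inf  # the ellipsoid lies entirely behind the point

radii = (1.0, 2.0, 3.0)
origin = (0.0, 0.0, 0.0)
d_x = ray_ellipsoid_distance((-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), origin, radii)
d_y = ray_ellipsoid_distance((0.0, -5.0, 0.0), (0.0, 1.0, 0.0), origin, radii)
d_miss = ray_ellipsoid_distance((-5.0, 5.0, 0.0), (1.0, 0.0, 0.0), origin, radii)
```

The jump from a finite distance to infinity as a ray slides off the ellipsoid is a simple instance of the distance discontinuities around obstacle boundaries that the abstract mentions.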