arXivDaily: daily arXiv digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (18 papers)

2606.19534 2026-06-19 cs.CV cs.AI cs.CL 新提交

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM:基于多模态扩散语言模型的并行区域感知

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

发表机构 * Peking University(北京大学) MSALab ByteDance(字节跳动)

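AI summary: Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models, using efficient prompting and structured attention masking to perceive multiple regions in parallel and markedly improve inference efficiency; also builds the ParaDLC-Bench benchmark for evaluation.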

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,现有大多数MLLMs依赖自回归生成,这限制了它们在需要描述多个区域的感知任务中的效率。在这项工作中,我们提出PerceptionDLM,一种针对高效并行区域感知优化的多模态扩散语言模型。基于PerceptionDLM-Base(一个在开源扩散MLLMs中达到最先进性能的强基础基线),我们的架构充分利用了DLMs的并行解码特性。具体来说,我们引入了高效提示和结构化注意力掩码,以实现对多个掩码区域的同步感知,使模型能够在序列和token级别并行生成区域描述。与现有顺序处理区域的方法相比,这种设计显著提高了推理效率。为了系统评估DLMs视觉感知能力的并行性,我们通过将DLC-Bench扩展为每张图像包含多个区域掩码,构建了一个新的并行详细局部描述基准(ParaDLC-Bench),从而能够联合评估描述质量和推理效率。实验表明,PerceptionDLM在区域描述中保持竞争性能,同时在多区域感知任务中实现了显著的加速。我们的结果凸显了多模态扩散语言模型在高效并行视觉感知中的潜力。据我们所知,我们是首个利用扩散语言模型优势实现并行区域描述和感知的工作。代码、模型和数据集已发布。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
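
The abstract does not spell out the structured attention mask, so the following is only a minimal sketch of one plausible pattern: every token may attend to the shared image tokens, while each region's caption tokens attend only within their own region block. The function name `build_region_mask`, the token layout (image tokens first, then one block per region), and the bidirectional attention inside a block are all assumptions made for illustration, not the authors' design.

```python
def build_region_mask(num_image_tokens, region_lengths):
    """Return a square boolean mask; mask[i][j] is True when token i may attend to token j.

    Assumed layout (hypothetical): image tokens first, then one block per region.
    """
    # owner[i] is None for shared image tokens, otherwise the region index.
    owner = [None] * num_image_tokens
    for region_id, length in enumerate(region_lengths):
        owner.extend([region_id] * length)

    total = len(owner)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if owner[j] is None:
                # Shared image tokens are visible to every token.
                mask[i][j] = True
            elif owner[i] == owner[j]:
                # Region tokens see only tokens of the same region.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in build_region_mask(2, [2, 3]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed layout the region blocks never see each other, which is what would let several region descriptions be decoded in one pass instead of one pass per region.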

2606.19584 2026-06-19 cs.CV 新提交

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

语言引导的视觉嵌入用于可控且可泛化的感知

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

发表机构 * Google(谷歌)

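AI summary: Proposes Language-Instructed Vision Embeddings (LIVE), which uses language to dynamically guide the vision encoder into producing task-centric embeddings without task-specific retraining, reducing visual hallucinations and improving generalization.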

Journal ref Published as a conference paper at ICLR 2026

AI中文摘要

视觉基础模型通常被训练为静态特征提取器,将任务适应的负担转移到大型下游模型上。我们提出另一种范式:不是仅将视觉特征输入语言模型,而是使用语言本身动态引导视觉编码器。我们的方法,语言引导视觉嵌入(LIVE),利用语言作为高层指导在推理时生成以任务为中心的嵌入,消除了任务特定重训练的需要。这使得编码器能够关注输入中上下文相关的方面,产生更可控和可泛化的表示。实验上,LIVE减少了视觉幻觉(在MMVP上提升34分),在视觉问答上超越了参数数量大几个数量级的视觉语言模型,并泛化到未见过的指令和任务——为自适应的、指令驱动的视觉智能提供了直接路径。

英文摘要

Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks -- offering a direct path toward adaptive, instruction-driven visual intelligence.

2606.19828 2026-06-19 cs.CV 新提交

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

3D-PLOT-LLM: 用于三维大语言模型的部件级对象标记

Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C. -C. Jay Kuo

发表机构 * University of Southern California(南加州大学) Ohio State University(俄亥俄州立大学)

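AI summary: Proposes 3D-PLOT-LLM, which reorganizes the input token stream so that parts are directly addressable through the LLM's vocabulary, with no segmentation decoder or bounding boxes, and outperforms prior methods on most reported part-level metrics.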

AI中文摘要

三维多模态大语言模型(3D MLLMs)将3D对象作为一个整体进行描述,但无法处理、命名或推理其部件。先前的部件感知尝试增加了分割解码器、更重的3D编码器或边界框语法,导致参数成本大幅增加。我们采取了一条根本不同的路径:重新组织输入标记流,使得部件通过LLM自身的词汇变得可直接寻址。我们的模型3D-PLOT-LLM将冻结的点编码器的块分割成K个局部一致的区域,并在每个区域的块标记之前插入一个可学习的每区域标记和一个保留词汇标记<part_k>;然后,一个标记空间精化(MSR)模块根据每个区域的空间统计信息和邻接邻居对该标记进行条件化。因此,模型在其输出中引用部件,并遵循通过标记引用部件的提示,这是先前对象级3D MLLMs所不具备的能力。为了探究这一接口,我们构建了PartVerse-QA,一个基于PartVerse网格注释改编的词汇级部件问答基准(77K训练对和588个保留查询,基于不相交的对象划分),在该基准上,3D-PLOT-LLM达到了描述到槽的Jaccard指数0.459和精确匹配率13.78%,槽到描述的GPT-4o评判得分为44.68。在3DCoMPaT-GrIn部件感知接地描述基准上,3D-PLOT-LLM在所有文本输出指标上优于PointLLM、Kestrel、PARIS3D和SegPoint,并在4项指标中的3项上优于ShapeLLM,相比PointLLM的GPT-4o评判得分最高提升+3.03。在Objaverse整体对象描述中,在第二阶段添加PartVerse-QA使得相比PointLLM的SBERT得分提升+0.65,GPT-4o得分提升+1.85,并且在5项传统指标中的4项(SBERT、SimCSE、BLEU-1、METEOR)上超过PointLLM-PiSA,尽管其目标是不同的(部件接地)目标。所有这些仅需在冻结的点编码器上增加不到100万个可训练参数,比先前的部件感知3D MLLMs低一个数量级,且无需分割解码器或边界框头。

英文摘要

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token <part_k>; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and no segmentation decoder or bounding-box head.
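
The token-stream reorganization described above can be sketched in a few lines. This is an illustration only: the function name `build_part_token_stream`, the string placeholders standing in for learnable marker embeddings, the zero-based part index, and the marker-before-token order are assumptions, and the Marker-Space Refinement module is not modeled.

```python
def build_part_token_stream(region_patches):
    """Interleave per-region markers and reserved part tokens with patch tokens.

    region_patches: list of lists; each inner list holds the patch tokens
    (here plain strings) that belong to one region.
    """
    stream = []
    for k, patches in enumerate(region_patches):
        # Placeholder for the learnable per-region marker (hypothetical label).
        stream.append(f"[marker_{k}]")
        # Reserved vocabulary token that the LLM can read and emit.
        stream.append(f"<part_{k}>")
        # The region's own patch tokens follow its marker and part token.
        stream.extend(patches)
    return stream


if __name__ == "__main__":
    print(build_part_token_stream([["p0", "p1"], ["p2"]]))
```

Because `<part_k>` lives in the vocabulary, a prompt or an output can name a part by token, which is the addressing interface the abstract describes.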

2606.19882 2026-06-19 cs.CV cs.LG 新提交

Multimodal Concept Bottleneck Models

多模态概念瓶颈模型

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

发表机构 * UC San Diego(加州大学圣地亚哥分校)

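AI summary: Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings into interpretable features, enabling interpretable zero-shot classification and image retrieval, with up to 51.26% average accuracy improvement across four benchmarks.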

Comments Presented at NeurIPS 2025 Mechanistic Interpretability Workshop

AI中文摘要

概念瓶颈模型(CBM)通过将图像提取的特征与自然概念对齐,增强了深度学习网络的可解释性。然而,现有的CBM在泛化到固定预定义类别集之外的能力以及非概念信息泄露的风险方面受到限制,其中预期概念之外的预测信号被无意中利用。在本文中,我们提出了多模态概念瓶颈模型(MM-CBM)来解决这些问题,并将CBM扩展到CLIP。MM-CBM利用双概念瓶颈层(CBL)将图像和文本嵌入对齐为可解释的特征。这使我们能够以可解释的方式执行新的视觉任务,如零样本分类或图像检索。与现有方法相比,MM-CBM在四个标准基准上平均准确率提升高达51.26%。我们的方法保持高准确率,在黑盒性能的约5%以内,同时提供更高的可解释性。

英文摘要

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2606.19915 2026-06-19 cs.CV 新提交

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

SpatialSV: 通过任务导向的视觉监督在多模态大语言模型中内化可解释的3D空间感知

Jiayu Tang, Yuchen Zhou, Chao Gou

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能工程学院)

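AI summary: Proposes the SpatialSV framework, which uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, point clouds), internalizing interpretable 3D spatial awareness without external tools and generalizing well in semi-supervised settings.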

Comments Accepted by IJCAI 2026

AI中文摘要

解锁多模态大语言模型(MLLMs)的空间智能对于理解和与3D世界交互至关重要。当前主流方法通常通过外部工具注入空间先验,这会带来显著的推理开销,或依赖潜在特征蒸馏,后者缺乏可解释性和细粒度几何约束。为解决这些问题,我们提出SpatialSV,一个旨在将鲁棒的3D空间感知内化到MLLMs中,同时提供内在可解释性的框架。与被动特征模仿不同,SpatialSV采用任务导向的视觉监督,迫使模型主动将其2D视觉特征提升为显式3D表示,包括深度图、相机姿态和点云。关键的是,这个2D到3D的提升过程为模型的表示提供了一个透明窗口:生成的3D重建作为可视化和诊断模型内在空间知识质量的直观代理。跨多个模型和基准的广泛实验证明了SpatialSV在增强和解释MLLMs空间智能方面的有效性。此外,该框架在半监督设置中展现出强泛化能力,验证了其利用未标记视觉数据进行可扩展、可解释空间表示学习的潜力。

英文摘要

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

2606.19944 2026-06-19 cs.CV 新提交

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Timage: 一种用于微调视觉语言模型的文本嵌入图像生成范式

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

发表机构 * Fudan University(复旦大学) Shenzhen University of Advanced Technology(深圳理工大学) Tencent Jarvis Lab(腾讯天衍实验室) Southern University of Science and Technology(南方科技大学)

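AI summary: Proposes the Timage paradigm, which uses a Constrained Schrödinger Bridge to render the query text onto the image as a typeset overlay, giving the model explicit spatial anchors and improving fine-grained spatial reasoning without eroding the backbone's general capabilities.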

Comments ECCV

AI中文摘要

多模态大语言模型(MLLMs)在细粒度空间推理中常丢失正确图像区域,因为文本查询很少携带明确的几何锚点进入像素域。现有补救方法要么重新调整模型权重,要么用冗长指令填充提示,但都无法在不侵蚀骨干通用能力的情况下可靠地将语言定位到正确的视觉坐标。我们提出Timage,一种将多模态理解重新定义为输入层面对齐问题的范式:查询被绘制为排版覆盖层直接叠加在图像上。该覆盖层的放置和外观由约束薛定谔桥(cSB)生成,这是一种熵最优传输采样器,将布局合成分解为两个耦合的随机阶段。第一阶段——区域搜索,将噪声向查询对齐的图像区域传输,同时遵守硬遮挡屏障以保护显著前景内容;第二阶段——外观塑造,通过“墨水预算”正则化调整字形大小,使渲染文本保持可读和视觉平衡。生成的覆盖层作为显式注意力信标,引导模型沿空间语义聚焦。在VMCBench基准上,Timage搭配7B骨干模型明显超越更大的专有系统和参数调优基线。该研究将审慎的输入重构定位为一种强大的、架构中立的杠杆,以增强多模态推理。

英文摘要

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.20077 2026-06-19 cs.CV cs.AI 新提交

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

发表机构 * Surrey Institute for People-Centred AI, University of Surrey(萨里大学以人为本人工智能研究所) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心)

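AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context injection versus layer-wise injection), revealing their internal evolution and its effect on performance.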

AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型(LLM)。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示,还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成,目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式,在单图像、多图像和视频基准上进行公平比较。在此过程中,我们揭示了一个隐藏的演化:视觉令牌作为伪装的视觉上下文(缺乏语言结构的原始表示)进入LLM,但根据集成范式逐渐被重塑,每种范式捕捉视觉信号的不同频率特征。我们表明,LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐,以及最终每种范式在不同任务上的表现。我们进一步证明,仅关注注意力分配是不够的,性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20177 2026-06-19 cs.CV cs.AI 新提交

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

评估与增强遥感多模态大语言模型的否定理解能力

Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

发表机构 * Peng Cheng Laboratory(鹏城实验室) Tsinghua University(清华大学) Central South University(中南大学)

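AI summary: Proposes the RS-Neg benchmark for evaluating negation understanding in remote sensing MLLMs, and designs NeFo, a test-time learning method that significantly improves model performance using about 5% of the test samples, unlabeled.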

Comments ECCV 2026 Accepted

AI中文摘要

多模态大语言模型(MLLMs)在各种遥感(RS)任务中取得了显著成功。然而,它们理解否定的能力仍未得到充分探索,限制了在现实应用中的部署,其中模型必须明确识别什么是错误的或不存在的,例如,应急响应人员需要定位非洪水路线进行疏散。为了全面研究这一局限性,我们引入了RS-Neg,这是第一个从区域级到场景级任务评估否定理解的基准。具体来说,我们为遥感图像设计了一个自动数据生成流程,使用LLMs合成多样化的否定查询,并引入了一个动态视觉焦点模块进行验证。我们的评估表明,先进的遥感MLLMs在否定理解上存在困难,表现出幻觉和显著的性能下降。为了弥补这一差距,我们提出了NeFo,一种新颖的测试时学习方法,将否定的逻辑角色明确纳入模型优化。值得注意的是,使用约5%的未标注测试样本,NeFo显著提升了模型的否定理解能力,并展现出对未见任务的强泛化能力。代码和数据将在接收后发布。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.20244 2026-06-19 cs.CV cs.AI 新提交

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E:基于视觉聚光灯的冻结VLM测试时熵整形

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

发表机构 * National University of Singapore(新加坡国立大学) Fudan University(复旦大学) Technical University of Munich(慕尼黑工业大学) Sagenic Tech Zhejiang University(浙江大学) vivo Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

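AI summary: Proposes SPOT-E, which uses test-time entropy shaping and visual spotlights to address VLMs' poor performance on evidence-intensive tasks caused by overlooking small, localized evidence, improving grounding and robustness without retraining.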

AI中文摘要

视觉语言模型(VLM)在证据密集型任务中通常表现不佳,因为决定性视觉证据往往微小、局部且容易被忽略,导致即使高层推理完好,证据读取也会失败。先前的推理时视觉干预可以在不重新训练的情况下改善定位,但大多是开环的,缺乏验证高亮证据是否实际使用的机制。我们研究答案跨度预测熵作为模型内部反馈信号,并表明朴素熵最小化具有歧义性,因为低熵可能源于证据支持的置信度或捷径坍塌。为解决这一歧义,我们引入低熵锚点和熵整形目标,在减少答案不确定性的同时保留基线高置信度标记。我们将这一原理实例化为SPOT-E,一种即插即用的测试时方法,生成问题条件聚光灯,并通过基于组相对策略优化(GRPO)的轻量级调优对每个实例进行优化。在所有基准测试和不同VLM家族中,SPOT-E在视觉损坏下均取得一致增益和改进的鲁棒性。代码公开于:https://github.com/YinBo0927/SPOT-E

英文摘要

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
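
To make the feedback signal concrete, here is a minimal sketch of answer-span prediction entropy: the Shannon entropy of each answer token's predicted distribution, averaged over the span. Averaging (rather than summing) and the function names are assumptions; the low-entropy anchors and the entropy-shaping objective itself are not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_span_entropy(span_distributions):
    """Mean per-token entropy over the answer span (assumed aggregation)."""
    entropies = [token_entropy(dist) for dist in span_distributions]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident = [[0.97, 0.01, 0.01, 0.01]]
    uncertain = [[0.25, 0.25, 0.25, 0.25]]
    print(answer_span_entropy(confident), answer_span_entropy(uncertain))
```

A low value alone cannot distinguish evidence-grounded confidence from shortcut collapse, which is the ambiguity the abstract says the shaping objective is meant to resolve.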

2606.20419 2026-06-19 cs.CV 新提交

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

谱查询-键乘积权重引导用于免训练VLM幻觉缓解

Karn Tiwari, Varnith Chordia, Prathosh A P

发表机构 * Indian Institute of Science, Bengaluru(印度科学理工学院,班加罗尔) Snap Research(Snap 研究院)

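AI summary: Proposes QK Product Steering, a data-free, training-free, zero-inference-cost weight edit that reduces object hallucination by suppressing dominant singular modes in middle layers, achieving an average relative CHAIR$_s$ reduction of 4.0% across three GQA-based VLMs.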

Comments Under Review

AI中文摘要

视觉语言模型(VLM)通常生成流畅但视觉上无依据的描述,尤其是提及图像中不存在的对象。我们提出QK乘积引导,一种无数据、免训练、零推理成本的权重编辑方法,用于减少对象幻觉。该方法通过抑制选定中间层中少量主导奇异模式,直接编辑每头的查询-键乘积(即产生softmax前注意力logits的算子)。然后,通过封闭形式的仅查询更新将编辑后的乘积映射回查询权重,同时保持共享的键权重固定,使编辑兼容分组查询注意力。我们进一步将QK乘积分解为对称和反对称分量,以区分相互内容相似性模式与方向性注意力模式。在三个基于GQA的VLM上,QK乘积引导实现了平均相对CHAIR$_s$降低4.0%,而匹配的随机模式控制显示可忽略的变化。可解释性消融表明,幻觉信号特定于主导QK模式,并主要定位于对称相互注意力通道。总体而言,QK乘积引导提供了一种解码时缓解的简单替代方案,无需额外数据、微调或推理时开销,同时基本保持多模态能力。

英文摘要

Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of 4.0%, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
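
The symmetric/antisymmetric split mentioned above is the standard identity M = (M + Mᵀ)/2 + (M − Mᵀ)/2 for any square matrix. The sketch below applies it to a small stand-in matrix; the name `split_symmetric` is hypothetical, and the singular-mode suppression and the closed-form query-only update are not shown because the abstract does not give their formulas.

```python
def split_symmetric(matrix):
    """Split a square matrix into symmetric and antisymmetric parts.

    Returns (sym, anti) with sym[i][j] == sym[j][i], anti[i][j] == -anti[j][i],
    and sym + anti equal to the input.
    """
    n = len(matrix)
    sym = [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(matrix[i][j] - matrix[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


if __name__ == "__main__":
    # Stand-in for a per-head query-key product (illustrative values only).
    qk = [[1.0, 2.0], [4.0, 3.0]]
    sym, anti = split_symmetric(qk)
    print(sym)
    print(anti)
```

In the abstract's reading, the symmetric part carries mutual content-similarity patterns and the antisymmetric part carries directional attention patterns.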

2606.19646 2026-06-19 cs.IR cs.CV 交叉投稿

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

SAFE-Cascade: 面向图表问答的成本自适应视觉语言路由

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

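AI summary: Proposes the SAFE-Cascade system, which first answers with OCR and a lightweight language model and then lets a learned router decide whether to invoke the VLM; on ChartQA it reaches 69.1% accuracy at a 73.1% VLM invocation rate, cutting VLM calls by 26.9% and estimated cost by 9.3%.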

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

AI中文摘要

视觉语言模型(VLM)在图表问答中表现出色,但若每个查询都调用VLM,当许多问题可通过OCR文本和轻量语言推理回答时,成本会不必要地高昂。我们展示了SAFE-Cascade,一个用于成本自适应图表问答的交互系统。给定图表图像和自然语言问题,SAFE-Cascade首先通过OCR提取图表文本,从纯文本语言模型获得临时答案,然后使用学习路由器决定接受文本答案还是升级到VLM。该演示向用户展示这一决策过程:OCR证据、纯文本答案、路由概率、升级决策、最终答案、估计成本和估计延迟并排显示。SAFE-Cascade被设计为一个透明界面,用于理解何时实际需要视觉基础。用户可以上传或选择图表、提问、检查每条路径使用的证据、比较纯文本和VLM答案,并调整升级阈值以探索准确率-成本边界。该系统使用Azure Document Intelligence进行OCR,gpt-5-mini作为纯文本模型,gemini-2.5-flash-image作为VLM,以及基于推理时特征训练的随机森林路由器。在从2500个样本实验中留出的375个ChartQA测试集上,SAFE-Cascade实现了69.1%的统一准确率和73.1%的VLM调用率,而全VLM基线为67.7%准确率和100% VLM调用率。观察到的+1.4个百分点差异在统计上不确定,因此我们将SAFE-Cascade解释为匹配全VLM性能,同时减少26.9%的VLM调用和9.3%的估计成本。该演示展示了选择性模态路由如何使多模态知识系统更加透明、可调优和成本感知。

英文摘要

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
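
The accept-or-escalate flow described above can be sketched as follows. Everything here is a stand-in: the callables `text_model`, `vlm`, and `router`, their signatures, and the reading of the routing probability as "probability that escalation is needed" are assumptions for illustration, not the system's actual interfaces.

```python
def cascade_answer(question, chart_text, chart_image,
                   text_model, vlm, router, threshold=0.5):
    """Answer from OCR text first; escalate to the VLM only if the router says so.

    Returns (answer, pathway, escalation_probability).
    """
    # Provisional answer from the text-only model over the OCR output.
    provisional = text_model(question, chart_text)
    # Hypothetical router: probability that the text answer is not good enough.
    p_escalate = router(question, chart_text, provisional)
    if p_escalate >= threshold:
        return vlm(question, chart_image), "vlm", p_escalate
    return provisional, "text", p_escalate


if __name__ == "__main__":
    answer, pathway, prob = cascade_answer(
        "What is the 2020 value?", "2019: 40  2020: 42", None,
        text_model=lambda q, t: "42",
        vlm=lambda q, img: "42 (from image)",
        router=lambda q, t, a: 0.2,
    )
    print(answer, pathway, prob)
```

Raising `threshold` sends fewer questions to the VLM, which is the accuracy-cost trade-off the demo lets users explore. The reported 26.9% reduction in VLM calls is simply 100% minus the 73.1% invocation rate.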

2305.14985 2026-06-19 cs.CV cs.CL 版本更新

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT: 通过大型语言模型迭代分解视觉与语言推理

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) HKUST(香港科技大学) University of California, Los Angeles(加州大学洛杉矶分校)

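AI summary: Proposes the IdealGPT framework, which uses large language models to iteratively decompose vision-language reasoning through a loop of sub-question generation, sub-answer retrieval, and final-answer reasoning, markedly improving multi-step reasoning in the zero-shot setting.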

Comments 13 pages, 5 figures

AI中文摘要

视觉与语言(VL)理解领域通过端到端的大型预训练VL模型(VLM)取得了前所未有的进展。然而,它们在需要多步推理的零样本推理任务中仍存在不足。为了实现这一目标,先前的工作采用了分而治之的流程。本文认为,先前的工作存在几个固有的缺点:1)它们依赖于特定领域的子问题分解模型。2)即使子问题或子答案提供的信息不足,它们也强制模型预测最终答案。我们通过IdealGPT框架解决了这些局限性,该框架利用大型语言模型(LLM)迭代分解VL推理。具体来说,IdealGPT使用一个LLM生成子问题,一个VLM提供相应的子答案,另一个LLM进行推理以得出最终答案。这三个模块迭代地执行分而治之的过程,直到模型对主问题的最终答案有信心。我们在零样本设置下对多个具有挑战性的VL推理任务评估了IdealGPT。特别是,我们的IdealGPT在VCR上比现有最好的GPT-4类模型绝对提高了10%,在SNLI-VE上提高了15%。代码可在以下网址获取:https://github.com/Hxyou/IdealGPT

英文摘要

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
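
The three-module loop described above might look like the sketch below. The callables `questioner`, `answerer`, and `reasoner`, their signatures, the `max_rounds` cap, and the choice to return the last answer when confidence is never reached are all assumptions made for illustration; the released code at the URL above is the authoritative implementation.

```python
def iterative_decompose(main_question, image, questioner, answerer, reasoner,
                        max_rounds=4):
    """Ask sub-questions, collect sub-answers, and reason until confident.

    questioner(main_question, evidence) -> list of sub-questions   (LLM stand-in)
    answerer(image, sub_question)       -> sub-answer              (VLM stand-in)
    reasoner(main_question, evidence)   -> (answer, is_confident)  (LLM stand-in)
    """
    evidence = []
    answer = None
    for _ in range(max_rounds):
        # Generate new sub-questions conditioned on what is already known.
        for sub_question in questioner(main_question, evidence):
            evidence.append((sub_question, answerer(image, sub_question)))
        # Try to conclude; stop as soon as the reasoner is confident.
        answer, confident = reasoner(main_question, evidence)
        if confident:
            break
    return answer, evidence


if __name__ == "__main__":
    result = iterative_decompose(
        "Is the person cooking?", image=None,
        questioner=lambda q, ev: [f"sub-question {len(ev) + 1}"],
        answerer=lambda img, sq: "a kitchen with a stove",
        reasoner=lambda q, ev: ("yes", len(ev) >= 2),
    )
    print(result)
```

The early exit is the point of the design: the reasoner is not forced to commit while the gathered sub-answers are still insufficient.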

2504.11171 2026-06-19 cs.CV cs.AI 版本更新

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

发表机构 * IBM Research – Europe(IBM欧洲研究院) ETH Zurich(苏黎世联邦理工学院) Forschungszentrum Jülich(尤利希研究中心) European Space Agency(欧洲航天局) Φ-Lab(Φ实验室) NASA IMPACT University of Iceland(冰岛大学)

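AI summary: Proposes TerraMind, the first any-to-any generative multimodal foundation model for Earth observation, pretrained on dual-scale (token-level and pixel-level) representations; it enables zero-shot and few-shot applications, introduces "Thinking-in-Modalities", and reaches leading performance on benchmarks such as PANGAEA.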

Comments Accepted at ICCV'25

AI中文摘要

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

英文摘要

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

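AI summary: Proposes Vero, a family of open vision-language models built with the 600K-sample Vero-600K dataset and task-routed rewards; across 30 benchmarks the variants gain 2.9-5.4 points on average, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points.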

Comments Project page: https://vero-reasoning.github.io/

AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.
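
One way to picture "task-routed rewards that handle heterogeneous answers" is a dispatch table from task type to a scoring function. The sketch below is purely illustrative: the task-type names, the two reward functions, and the 1% tolerance are hypothetical and are not taken from the Vero release.

```python
def exact_match_reward(prediction, target):
    """1.0 when the normalized strings match, else 0.0."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def numeric_reward(prediction, target, rel_tol=0.01):
    """1.0 when both parse as numbers and agree within a relative tolerance."""
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        return 0.0
    return 1.0 if abs(pred - gold) <= rel_tol * max(1.0, abs(gold)) else 0.0


# Hypothetical routing table: task type -> reward function.
REWARD_ROUTES = {
    "multiple_choice": exact_match_reward,
    "numeric": numeric_reward,
}


def routed_reward(task_type, prediction, target):
    """Score a prediction with the reward function registered for its task type."""
    return REWARD_ROUTES[task_type](prediction, target)


if __name__ == "__main__":
    print(routed_reward("multiple_choice", " B", "b"))
    print(routed_reward("numeric", "3.14", "3.1400"))
```

Routing keeps each answer format scored by a checker suited to it, so a single RL run can mix chart, science, and spatial tasks without one ill-fitting metric.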

2605.20448 2026-06-19 cs.CV cs.LG 版本更新

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是真正理解3D场景,还是仅仅罗列物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

发表机构 * Deccan AI(德克南人工智能)

AI总结 本文通过一个包含3034个样本的人工整理基准,探讨了视觉-语言模型对空间理解的深度有序遮挡、光学几何推断和体积重新安排规划能力,发现模型在重新安排可见布局时表现优异,但在遮挡和反射推断上表现较差。

详情
AI中文摘要

视觉-语言模型能够可靠地说出场景中的物体名称,但它们是否表征了这些物体所处的3D布局?我们引入了一个包含3,034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化方式进行探测)、基于可见反射的光学几何推断,以及体积重排规划。六个前沿和开放权重的VLM共产生18,204个响应,全部由经过培训的标注员评分,未使用LLM作为评判,结果显示出明显的分离:模型在可见布局上规划重排时准确率为53-97%,且很少违反碰撞约束,但在遮挡任务上降至6-45%,在反射任务上低于7%。一个具身推理模型重现了相同的表现模式。对Qwen3-VL-8B-Thinking的白盒分析将失败定位于视觉标记合并器:在整个视觉编码器中可恢复的空间信息在标记压缩后变得不可访问,只有将干净的合并后激活修补(patch)到语言解码器中时才重新稳定。

英文摘要

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints fall to 6-45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2606.05833 2026-06-19 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.16615 2026-06-19 cs.CV 版本更新

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

AI总结 提出SUP-MCRL框架,通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器,解决多模态对比学习中语义一致性和主体选择性问题,在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.
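
To make the idea of an "EMA-updated pseudo-feature pool" concrete, here is a minimal sketch of an exponential-moving-average prototype store keyed by class. How PPA actually builds and uses its pool is not stated in the abstract. The function name `ema_update`, the per-class keying, and the momentum value of 0.9 are assumptions for illustration only.

```python
from typing import Dict, Hashable, List, Sequence

FeaturePool = Dict[Hashable, List[float]]


def ema_update(
    pool: FeaturePool,
    class_id: Hashable,
    feature: Sequence[float],
    momentum: float = 0.9,  # assumed value, not from the paper
) -> List[float]:
    """Blend a new feature into the stored prototype for `class_id`.

    The first feature seen for a class initialises its prototype; later
    features move it by a factor of (1 - momentum).
    """
    if class_id not in pool:
        pool[class_id] = list(feature)
    else:
        old = pool[class_id]
        pool[class_id] = [
            momentum * o + (1.0 - momentum) * f for o, f in zip(old, feature)
        ]
    return pool[class_id]
```

Because each prototype changes slowly, it can act as a stable extra target during contrastive training. This is one common way such a pool is used to discourage collapse, though the paper's exact mechanism may differ.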

2606.18249 2026-06-19 cs.CV 版本更新

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

统一多模态自回归建模:共享上下文-视觉分词器是实现统一的关键

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(可信具身AI研究院,复旦大学) Shanghai Innovation Institute(上海创新研究院) Qwen Team, Alibaba Inc.(通义实验室,阿里公司)

AI总结 提出UniAR框架,通过单一离散视觉分词器桥接视觉理解与生成,采用并行位预测和扩散解码,在图像生成和编辑上达到最优,同时保持多模态理解竞争力。

Comments ICML2026. Project page https://sharelab-sii.github.io/uniar-web

详情
AI中文摘要

统一多模态建模旨在将视觉理解和生成集成到单个系统中。然而,现有方法通常依赖两个不同的视觉分词器,这分割了表示空间并阻碍了真正的统一建模。我们提出UniAR,一个统一的自回归框架,其中单个离散视觉分词器作为理解和生成之间的关键桥梁,使得模型能够直接解释其自身生成的视觉标记而无需额外的重新编码,从而实现共享上下文。UniAR采用预训练的视觉编码器,结合多级特征融合和无查找的逐位量化方案,在保留高层语义和低层细节的同时,以最小代价扩展有效视觉词汇。在此基础上,统一自回归模型采用并行逐位预测来联合预测空间分组的多级视觉编码,大幅减少视觉序列长度并加速生成。最后,基于扩散的视觉解码器对离散视觉标记进行操作,以解码高保真图像。通过大规模预训练,随后进行监督微调和强化学习,UniAR在图像生成和图像编辑上达到了最先进的性能,同时在多模态理解基准上保持竞争力。项目页面可在 https://sharelab-sii.github.io/uniar-web 获取。

英文摘要

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
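
The phrase "lookup-free bitwise quantization" can be illustrated with the generic sign-based scheme below: each latent dimension is quantized to one bit, and the bits together form the token index, so no codebook has to be stored or searched. This is a sketch of the general technique, not UniAR's quantizer. The function name and the choice to map zero to bit 0 are assumptions.

```python
from typing import List, Sequence, Tuple


def bitwise_quantize(latent: Sequence[float]) -> Tuple[List[int], List[float], int]:
    """Quantize each latent dimension to one bit by its sign.

    Returns (bits, code, index):
      bits  - 1 where the value is > 0, else 0
      code  - the quantized vector in {-1.0, +1.0}
      index - the bits read as a binary integer (first dimension is the MSB)
    """
    bits = [1 if value > 0 else 0 for value in latent]
    code = [1.0 if bit else -1.0 for bit in bits]
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return bits, code, index
```

With `n` latent dimensions the implicit vocabulary has `2 ** n` entries, which is why such schemes can scale the effective vocabulary cheaply. Predicting the bits in parallel, rather than one index out of `2 ** n`, keeps the output layer small.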

2. 具身智能、机器人与自动驾驶 17 篇

2606.19531 2026-06-19 cs.CV cs.RO 新提交

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

ImageWAM:世界动作模型真的需要视频生成,还是只需要图像编辑?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工学院) Tencent Robotics X(腾讯机器人X) Tsinghua University(清华大学) Zhongguancun Academy(中关村学院)

AI总结 提出ImageWAM框架,利用预训练图像编辑模型替代视频生成进行机器人动作预测,通过编辑去噪的KV缓存作为世界动作上下文,在多个模拟和真实实验中优于基线,计算量降至1/6,延迟降至1/4。

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

详情
AI中文摘要

世界动作模型(WAMs)通常依赖视频生成来桥接视觉世界建模和机器人控制。然而,基于视频的WAMs面临三个耦合的限制:密集的多帧未来令牌使得推理成本高昂,完整的视频预测将容量花费在与动作无关的时间和外观细节上,以及长期未来想象可能引入误导动作预测的错误。这些问题提出了一个简单的问题:世界动作模型真的需要视频生成吗?我们提出ImageWAM,一个简单的WAM框架,将预训练的图像编辑模型重新用于机器人动作预测。与视频生成相比,图像编辑提供了更匹配的先验:它只需要建模目标帧变换,关注与动作相关的当前到目标视觉差异,并通过编辑预训练将任务指令接地到局部视觉变化。在实践中,ImageWAM在推理时不解码目标帧;相反,它根据图像编辑去噪产生的KV缓存条件化一个流匹配动作专家,将其用作紧凑的世界动作上下文。ImageWAM在多个模拟和真实世界实验中优于标准VLA基线和匹配的竞争性WAM,且无需额外的策略预训练。它还将FLOPs降低到基于视频的WAMs的1/6,延迟降低到1/4。注意力分析进一步表明,编辑缓存聚焦于任务相关的变化区域,支持图像编辑作为基于视频的世界动作建模的有效替代方案。

英文摘要

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

2606.20045 2026-06-19 cs.CV cs.AI 新提交

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

See-and-Reach:面向无人机的视场内精确视觉语言导航

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

发表机构 * School of Information Science and Engineering, Shandong University(山东大学信息科学与工程学院) Faculty of Engineering and Information Technology, University of Technology Sydney(悉尼科技大学工程与信息技术学院) School of Computer Science and Technology, Shandong University(山东大学计算机科学与技术学院) School of Artificial Intelligence, Shandong University(山东大学人工智能学院) School of Computer Science and Artificial Intelligence, Shandong Normal University(山东师范大学计算机科学与人工智能学院) Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University(山东师范大学通用人工智能跨学科研究中心)

AI总结 针对无人机视觉语言导航中目标可见后精确到达能力评估不足的问题,提出UAV-VLN-FOV任务和3DG-VLN框架,通过动态3D方向线索增强细粒度视觉定位与空间对齐,在基准和真实实验中显著提升成功率。

Comments 12 pages, 7 figures

详情
AI中文摘要

无人机视觉语言导航(UAV-VLN)通常被形式化为一个整体的搜索与到达问题,其中远程目标发现和最终目标接近被联合优化和评估。这种表述使得评估空中具身代理的关键能力变得困难,即一旦目标进入其视场,无人机能否准确地将可见目标定位并将视觉语言证据转化为精确的3D运动。为了解决这一局限性,我们引入了UAV-VLN-FOV,一个目标可见的导航任务,它隔离了“看到并到达”阶段,并能够对终端到达能力进行更具诊断性的评估。我们进一步提出了3DG-VLN,一种由动态3D方向线索引导的视觉语言航点预测框架,以增强细粒度视觉定位和空间方向对齐,从而实现精确的目标到达。具体来说,3DG-VLN自适应地处理高分辨率的前视和下视观测,以保留用于目标定位的细粒度视觉和几何细节。它还在闭环导航过程中在线更新目标相对方向,使代理能够保持与目标的空间对齐并减少累积的方向漂移。为了支持该任务,我们构建了一个专用的高分辨率基准,包含2,717条轨迹,带有面向目标的高级指令、高分辨率的前视和下视自我中心观测以及连续的3D航点注释。实验表明,3DG-VLN优于具有竞争力的UAV-VLN基线,成功率提高了13.82%。真实世界试验进一步展示了3DG-VLN在实际“看到并到达”导航中的潜力。源代码和基准可在 https://github.com/xuefanfu/3DG-VLN 获取。

英文摘要

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.

2606.20092 2026-06-19 cs.CV 新提交

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

EventVLA: 面向长程视觉-语言-动作策略的事件驱动视觉证据记忆

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai AI Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学) Dalian University of Technology(大连理工大学) Huawei Technologies Co., Ltd.(华为技术有限公司) The University of Hong Kong(香港大学) Tsinghua University(清华大学) Peking University(北京大学)

AI总结 针对长程机器人操作中记忆瓶颈问题,提出EventVLA框架,通过动态关键帧证据记忆模块自主捕获任务关键视觉事件,在17个模拟和4个真实任务中平均成功率提升40%。

详情
AI中文摘要

记忆仍然是长程机器人操作的关键瓶颈,因为标准的视觉-语言-动作(VLA)策略在任务相关线索随时间变得遮挡或不可观测时常常失败。虽然现有的记忆增强方法利用历史上下文,但它们要么遭受严重的信息瓶颈,通过解耦的双系统引入高延迟,要么依赖积累大量视觉冗余的无选择性缓冲区。为了解决这些限制,我们引入了EventVLA,一个基于稀疏视觉证据记忆概念的端到端框架,包含两个核心组件:用于保留初始和短期上下文的基础视觉锚点,以及动态关键帧证据记忆(KEM)模块。具体来说,KEM直接从VLA的潜在嵌入中预测未来关键帧概率,以自主捕获和存储稀疏的、任务关键的视觉事件。这种前瞻驱动的机制使策略能够动态评估当前观测的未来因果效用,在瞬态视觉证据变得不可观测之前将其保留。此外,我们提出了RoboTwin-MeM,一个专门设计用于评估具有交互式视觉证据的非马尔可夫操作任务的诊断基准。大量评估表明,在17个需要记忆的模拟任务和4个真实世界双臂任务中,EventVLA相比最先进的记忆增强VLA实现了平均成功率提升+40%。

英文摘要

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
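
One way to picture a sparse keyframe memory is the small buffer below, which stores a frame only when its predicted keyframe probability is high enough and evicts the weakest entry when full. In EventVLA the probabilities are predicted from the VLA's latent embeddings; here they are simply passed in. The class name, the 0.5 threshold, the capacity, and the eviction rule are hypothetical choices, not details from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class KeyframeMemory:
    """Fixed-capacity store of (frame_index, keyframe_probability) pairs."""

    capacity: int = 4        # assumed buffer size
    threshold: float = 0.5   # assumed admission threshold
    slots: List[Tuple[int, float]] = field(default_factory=list)

    def observe(self, frame_idx: int, keyframe_prob: float) -> bool:
        """Offer a frame to the memory; return True if it is retained."""
        if keyframe_prob < self.threshold:
            return False
        self.slots.append((frame_idx, keyframe_prob))
        if len(self.slots) > self.capacity:
            # Evict the entry with the lowest predicted probability.
            self.slots.remove(min(self.slots, key=lambda slot: slot[1]))
        return any(idx == frame_idx for idx, _ in self.slots)
```

Compared with an unselective buffer that keeps every recent frame, a gate like this keeps memory size constant while holding on to the few frames judged most likely to matter later.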

2606.20110 2026-06-19 cs.CV 新提交

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

FrozenDrive:基于无参数冻结扩散模型的零样本文本引导驾驶场景生成与数据增强

Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

发表机构 * KAIST, Visual Intelligence Lab(韩国科学技术院视觉智能实验室)

AI总结 提出FrozenDrive框架,利用冻结的预训练扩散模型,通过知识保留的时空注意力实现多视图一致性和时间连贯性,无需微调即可生成恶劣天气下的驾驶场景,提升自动驾驶模型鲁棒性。

Comments Accepted to ECCV 2026

详情
AI中文摘要

自动驾驶的合成数据正在激增,这得益于扩散模型能够实现可扩展的场景生成。然而,关键障碍依然存在,因为强制执行多视图和时间一致性通常依赖于骨干网络微调或添加层,这会侵蚀预训练知识并削弱文本对齐。模型也保持接近训练分布,在恶劣天气和未见配置下表现不佳,并且保真度偏向频繁类别而非稀有类别。我们通过FrozenDrive解决这些差距,这是一个可控生成框架,在保持预训练扩散模型知识的同时实现强一致性。FrozenDrive以丰富的驾驶堆栈信号和文本提示为条件,并引入知识保留的时空注意力,在无参数的冻结扩散骨干中单次通过时施加跨视图对齐和时间连贯性。额外的对象聚焦约束提高了稀有类别的每个对象保真度。无需任何天气或场景特定的微调,我们的模型从文本合成全局连贯的多视图驾驶场景,特别是在恶劣和稀有条件下,并超越了先前的基线。在nuScenes上,FrozenDrive增强数据显著提升了AD模型的性能,尤其是在夜间和雨天,当使用我们的场景定向数据训练时,展示了更强的鲁棒性。

英文摘要

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves AD models' performance, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.

2606.20189 2026-06-19 cs.CV cs.AI cs.RO 新提交

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

HilDA:利用扩散的分层蒸馏推进自监督LiDAR预训练

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) Linköping University(林雪平大学) TRATON AB(TRATON公司) Qualcomm Auto Ltd Sweden Filial(高通汽车有限公司瑞典分公司)

AI总结 提出HilDA框架,通过分层蒸馏(多层蒸馏和全局上下文蒸馏)结合时间占用扩散目标,自监督预训练LiDAR骨干网络,在3D检测、场景流和语义占用预测任务上达到最先进水平。

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

详情
AI中文摘要

利用视觉基础模型(VFM)进行相机到LiDAR的知识蒸馏为解决真实世界自动驾驶中巨大的几何和运动多样性所需的标注数据稀缺问题提供了一种有前景的方案。然而,当前方法通常将VFM视为黑盒教师,仅依赖逐帧特征相似性。因此,它们未能充分利用教师的逐层语义结构和全局上下文,以及LiDAR序列中固有的丰富时空信息。我们提出HilDA,一个用于LiDAR骨干网络的自监督预训练框架,能更好地捕捉驾驶任务所需的语义“是什么”和几何“在哪里”。HilDA结合了分层蒸馏(包括用于渐进语义对齐的多层蒸馏和用于场景级语义的全局上下文蒸馏)与一个促进时空一致性的时间占用扩散目标。使用HilDA预训练的模型在跨模态蒸馏基准上取得了最先进的结果,并在3D目标检测、场景流和语义占用预测任务上优于通过先前蒸馏方法训练的模型。代码可在 https://maxiuw.github.io/hilda 获取。

英文摘要

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2606.20515 2026-06-19 cs.CV 新提交

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

S-Agent:空间工具使用激发空间智能推理

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

发表机构 * NTU(南洋理工大学) THU(清华大学) ByteDance(字节跳动) NWPU(西北工业大学)

AI总结 提出S-Agent空间工具使用智能体范式,通过时空证据积累和层次化工具集,将VLM作为语义规划器,实现连续多视图图像和视频的空间推理,在无训练下提升开源和闭源VLM性能,并基于S-300K轨迹微调得到紧凑空间智能体S-Agent-8B。

Comments Project Page : https://Ropedia.github.io/S-Agent

详情
AI中文摘要

现实世界的空间智能需要对连续且不断变化的三维世界进行推理,然而现有的VLM和工具增强智能体大多仍局限于从孤立的视觉观察中进行静态、无状态的推理。我们引入了S-Agent,一种用于理解和推理连续多视图图像和视频的空间工具使用智能体范式。通过将空间推理表述为时空证据积累而非孤立的帧级预测,S-Agent将空间感知重塑为以场景为中心的理解,超越以帧为中心的识别。具体而言,S-Agent将VLM作为语义规划器,决定需要哪些证据,而层次化的空间工具和专家将物体锚定在2D中,将其提升为3D几何证据,并将这些证据聚合为高级空间知识(例如,计数、测量、方向和相对位置)。此外,时间记忆机制,包括用于维护不断演变的场景状态的场景记忆和用于积累推理上下文的智能体记忆,实现了跨帧和推理步骤的证据整合。在多视图和视频空间推理基准上的全面实验表明,S-Agent以无需训练的方式持续提升开源和闭源VLM的性能。除了推理时增强,在S-Agent生成的空间轨迹S-300K上进行监督微调(SFT)得到了S-Agent-8B,一个紧凑的空间智能体,显著超越了类似规模的基线(例如,Qwen3-VL-8B),并与先进的闭源模型(例如,GPT-5.4和Gemini 3)性能相当。

英文摘要

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.20521 2026-06-19 cs.CV 新提交

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

HumanScale: 以自我为中心的人类视频在具身预训练中可超越真实机器人数据

Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

发表机构 * PKU(北京大学) NUS(新加坡国立大学) MIT(麻省理工学院) UCSB(加州大学圣塔芭芭拉分校) NVIDIA(英伟达)

AI总结 本文通过系统比较发现,经过精心设计的过滤和标注流程,以自我为中心的人类视频在具身基础模型预训练中不仅可行,而且性能优于遥操作真实机器人数据,验证了“预训练于人类视频+少量机器人数据适配”的可扩展范式。

Comments Github: https://github.com/DAGroup-PKU/HumanNet/

详情
AI中文摘要

具身基础模型有望像大型语言模型一样从数据扩展中受益,但面临更严重的数据瓶颈。遥操作真实机器人轨迹因其精确的动作监督和具身对齐而仍然是主要的预训练来源,但其可扩展性受限于高采集成本、获取难度以及低行为和环境多样性。这些限制引发了对以自我为中心的人类视频作为可扩展、成本显著更低且更多样化的具身模型预训练替代方案的兴趣。然而,与遥操作真实机器人数据相比,其有效性仍未得到充分探索。为了解决这个问题,我们在固定的后训练和验证协议下,进行了一项系统研究,比较以自我为中心的人类视频和遥操作真实机器人轨迹作为具身基础模型的预训练数据源。令人惊讶的是,我们发现经过精心设计的过滤和标注流程处理的以自我为中心的数据,不仅是模型预训练的可行替代品,而且可以带来更优的性能。在相同预训练数据量下,在以自我为中心数据上预训练的模型在真实机器人动作预测上的验证损失降低了24%,在分布内和分布外真实机器人任务执行上的成功率分别提高了52.5%和90%。这一发现验证了具身基础模型的一种可扩展范式:在以自我为中心的人类视频上预训练以学习多样化的世界表征,然后使用少量标注的真实机器人数据进行适配以实现动作空间对齐。我们希望这项研究能鼓励对以自我为中心数据的更广泛探索,并在昂贵的机器人数据收集之前为数据质量评估提供指导。

英文摘要

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 * New York University(纽约大学) Tsinghua University(清华大学) University of Michigan(密歇根大学)

AI总结 提出HUG模型,利用人类抓取数据(1M-HUGs数据集)和流匹配方法,从单张RGB-D图像生成多样化抓取姿态,并重定向到机器人手,实现零样本抓取,在HUG-Bench的真实世界测试集上比最先进的抓取基线高出23%和34%。

Comments 28 pages, 20 figures, 7 tables

详情
AI中文摘要

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.19641 2026-06-19 cs.RO cs.CV 交叉投稿

Scaling Self-Play for End-to-End Driving

扩展端到端驾驶的自我对弈

Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

发表机构 * Mila(米拉研究所) Université de Montréal(蒙特利尔大学) Polytechnique Montréal(蒙特利尔理工学院) Torc Robotics NYU Tandon School of Engineering(纽约大学坦登工程学院) McMaster University(麦克马斯特大学) Princeton University(普林斯顿大学)

AI总结 提出大规模自我对弈训练策略,通过高效模拟器Gigapixel实现像素级自我对弈,结合DAgger蒸馏和感知适应,提升端到端驾驶模型性能。

详情
AI中文摘要

端到端自动驾驶模型通常基于离线的人类演示数据集进行训练,这些数据集提供的状态覆盖有限,且通常没有闭环反馈,使得模型在闭环部署时容易出现复合误差,并对长尾智能体交互脆弱。为克服这些限制,我们提出了一种替代策略:直接在模拟中的像素上进行大规模自我对弈。虽然先前的自我对弈方法已显示出向真实世界驾驶的有前景的迁移,但它们通常假设向量化的鸟瞰图(BEV)观测,这与直接基于传感器观测的端到端策略不兼容。为此,我们引入了Gigapixel,一个具有透视渲染的高吞吐量批处理驾驶模拟器,实现了直接从像素观测的可扩展自我对弈。Gigapixel并非针对计算成本高的逼真传感器模拟,而是渲染一个简化的边界框世界,保留基本场景结构,同时实现每秒5万智能体步的吞吐量。由于直接像素空间的自我对弈强化学习在端到端模型规模下样本效率极低,我们提出了自我对弈DAgger训练:通过从特权RL教师进行在线策略蒸馏来训练基于像素的策略。为弥合模拟到现实的差距,我们随后通过轻量级感知适应将自我对弈训练的策略迁移到真实世界传感器数据。在Gigapixel中训练并适应真实世界传感器数据的策略在HUGSIM和NAVSIM-v2基准测试中取得了竞争性表现,无需人类轨迹监督。此外,扩展自我对弈训练带来策略性能的成比例提升,确立了自我对弈作为训练端到端模型的实用且可扩展的策略。

英文摘要

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.
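
The DAgger idea behind "on-policy distillation from a privileged teacher" can be sketched on a toy problem: the student drives, the teacher labels every state the student actually visits, and the student is refit on the growing dataset. Everything below is a hypothetical stand-in, including the 1-D corridor, the rule-based teacher, and the lookup-table student. It shows only the single-agent DAgger loop, not the multi-agent self-play, the Gigapixel simulator, or the paper's training setup.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def teacher_action(state: int) -> int:
    """Toy expert: step toward the goal at position 0."""
    return -1 if state > 0 else (1 if state < 0 else 0)


def fit_student(dataset: List[Tuple[int, int]]) -> Dict[int, int]:
    """Refit the student as a majority-vote lookup table over labeled states."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


def rollout(policy: Dict[int, int], start: int, horizon: int = 10,
            default_action: int = 1) -> List[int]:
    """Run the student's policy; unseen states fall back to a naive default."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        state = max(-5, min(5, state + policy.get(state, default_action)))
    return states


def dagger(iterations: int = 5, starts: Sequence[int] = (-3, 3)) -> Dict[int, int]:
    """Aggregate teacher labels on student-visited states, then refit."""
    dataset: List[Tuple[int, int]] = []
    policy: Dict[int, int] = {}
    for _ in range(iterations):
        for start in starts:
            for state in rollout(policy, start):                 # student chooses where to go
                dataset.append((state, teacher_action(state)))  # teacher says what to do there
        policy = fit_student(dataset)
    return policy
```

The first rollouts drift to the wall because the untrained student always steps right. Those off-course states are exactly the ones that receive teacher labels, and this is how DAgger addresses the compounding-error problem of purely offline imitation.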

2606.19836 2026-06-19 cs.RO cs.CV 交叉投稿

World Engine: Towards the Era of Post-Training for Autonomous Driving

World Engine:迈向自动驾驶后训练时代

Tianyu Li, Li Chen, Caojun Wang, Haochen Liu, Kashyap Chitta, Zhenjie Yang, Yuhang Lu, Naisheng Ye, Yihang Qiu, Yufei Wang, Luoxi Zou, Jiaxin Peng, Jin Pan, Zhaoyu Su, Andrei Bursuc, Shengbo Eben Li, Andreas Geiger, Peng Su, Hongyang Li

AI总结 提出World Engine生成式框架,通过从真实日志重建高保真交互环境并外推安全关键变体,利用强化后训练对齐策略与安全约束,显著减少罕见安全关键场景故障,提升自动驾驶安全性。

Comments Technical Report. Project Page: https://opendrivelab.com/WorldEngine/

详情
AI中文摘要

自动驾驶车辆必须在现实世界中安全运行,而错误可能带来严重后果。尽管现代端到端驾驶策略在常规场景中表现出色,但其可靠性受限于真实驾驶数据集中安全关键的“长尾”事件的稀缺性。这些罕见交互定义了学习策略的实际安全边界,但在现实世界中难以大规模收集。我们展示了这一根本限制可以通过在合成的关键交互上对预训练驾驶模型进行后训练来解决。我们引入了World Engine,一个生成式框架,从真实日志中重建高保真交互环境,并系统性地将其外推为现实的安全关键变体。这一范式使得基于强化的后训练能够将策略与安全约束对齐,规避现实世界探索中固有的物理风险。在基于nuPlan构建的公开基准上,World Engine显著减少了罕见安全关键场景中的故障,并且相比仅扩展预训练数据带来了更大的增益。此外,当部署到生产级自动驾驶系统时,所得策略减少了模拟碰撞,并在道路测试中显示出可衡量的改进,表明在合成的安全关键交互上进行后训练为更安全的自动驾驶提供了一条可扩展且有效的途径。完整的代码库套件(包括训练)已向公众发布。

英文摘要

Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Tri-Info: 基于信息论的VLA模型可泛化、可解释的故障预测

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

发表机构 * InfoBodied AI Lab, The University of Hong Kong(香港大学信息具身人工智能实验室) HKU Musketeers Foundation Institute of Data Science(香港大学同心基金数据科学研究院)

AI总结 提出Tri-Info方法,通过信息论信号捕捉动作多样性、时间一致性和状态耦合,实现跨架构、环境及仿真到现实的零样本故障检测,准确率达83%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在各种任务中,但它们仍然是黑箱,其物理交互可能导致不可逆的伤害,因此需要可泛化和可解释的故障检测。我们观察到成功和失败的轨迹具有系统不同的信息论特征。基于此,我们将VLA控制形式化为闭环信息管道,并推导出三重信息论(Tri-Info)信号,这些信号捕捉动作是否保持多样性、时间一致性以及与状态转换的耦合。在六个VLA模型和三个基准环境中,Tri-Info在域内匹配最强的基线。此外,Tri-Info无需重新训练即可跨架构、环境和仿真到现实差距迁移,在现实世界任务中达到83%的准确率,而先前的检测器则降至随机水平。这确立了Tri-Info作为一种简单而强大的方法,不仅能够检测故障并具有强大的跨域泛化能力,还能提供底层故障模式的可解释诊断。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
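The three properties named in the abstract (diversity, temporal consistency, coupling to state transitions) can be illustrated with standard plug-in estimators over discretized rollouts. The sketch below is only an assumption about what such signals might look like, not the Tri-Info definitions: it uses Shannon entropy for diversity and mutual information for the other two, and the function names and the discretization are hypothetical.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Plug-in estimate I(X;Y) = H(X) + H(Y) - H(X,Y) for paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def rollout_signals(actions, state_deltas):
    """Three illustrative signals for one rollout of discretized actions.

    diversity   : entropy of the action distribution
    consistency : information shared by consecutive actions
    coupling    : information shared by actions and the state changes they cause
    """
    return {
        "diversity": entropy(actions),
        "consistency": mutual_information(actions[:-1], actions[1:]),
        "coupling": mutual_information(actions, state_deltas),
    }
```

Under this reading, a policy that repeats one action regardless of the scene scores zero on all three signals, while a policy whose actions track the resulting state changes scores high on coupling.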

2606.20491 2026-06-19 cs.RO cs.CV 交叉投稿

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

用于自主导航中注视引导主动感知的快速人类注意力预测

Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis

发表机构 * Norwegian University of Science and Technology (NTNU)(挪威科技大学)

AI总结 提出GazeLNN,一种基于液态神经网络和MobileNetV3的轻量级扫描路径预测模型,在MIT低分辨率数据集上达到最优性能,计算成本降低99.40%,推理速度提升6倍,并集成到强化学习训练的主动相机-机器人控制策略中,实现自主导航中的注视引导感知。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

人类视觉注意力依赖于结构化的扫描路径来高效处理场景,但将这种行为注入机器人自主性仍处于初级阶段,且受到现有预测模型高计算成本的阻碍。为了解决这一问题,我们提出了GazeLNN,一种计算轻量级的扫描路径预测模型,该模型采用液态神经网络作为其循环引擎,并使用MobileNetV3进行特征提取。该架构以自回归方式运行,根据当前视觉刺激和注视历史预测顺序注视热图。尽管仅需0.61 GFLOPs,GazeLNN在MIT低分辨率数据集上达到了最先进的性能,获得了0.47的ScanMatch分数。它在多种评估指标上优于现有的循环基线,同时将计算成本降低了99.40%,并将推理速度提高了六倍。为了研究人类注意力建模在机器人自主性中的作用,并展示这种高效架构的实际效用,我们将GazeLNN集成到通过强化学习训练的主动相机-机器人控制策略中。这种集成使得在自主导航过程中能够实现人类注视引导的感知,并通过在无人机上的成功实际部署得到了验证。

英文摘要

Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.

2505.17006 2026-06-19 cs.CV cs.RO 版本更新

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

CoMo: 从互联网视频中学习连续潜在运动以实现可扩展的机器人学习

Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

发表机构 * Nanjing University(南京大学) Shanghai AI Lab(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) Fudan University(复旦大学) Tongji University(同济大学)

AI总结 提出CoMo方法,通过早期时间差分和时序对比学习从互联网视频中学习连续潜在运动,避免离散化信息损失,实现零样本泛化生成伪动作标签,联合训练策略在仿真和真实实验中表现优异。

Comments CVPR 2026

详情
AI中文摘要

从互联网视频中无监督学习潜在运动对于机器人学习至关重要。现有的离散方法通常通过小码本大小的向量量化来减轻提取过多静态背景导致的捷径学习,但它们存在信息损失,难以捕捉更复杂和细粒度的动态。此外,离散潜在运动与连续机器人动作之间存在固有分布差距,阻碍了统一策略的联合学习。我们提出CoMo,旨在从互联网规模视频中学习更精确的连续潜在运动。CoMo采用早期时间差分(Td)机制来增加捷径学习难度并显式增强运动线索。此外,为确保潜在运动更好地捕捉有意义的前景,我们进一步提出时序对比学习(Tcl)方案。具体地,正样本对通过小的未来帧时间偏移构建,而负样本对则通过直接反转时间方向形成。所提出的Td和Tcl协同工作,有效确保潜在运动更好地关注前景并增强运动线索。关键的是,CoMo表现出强大的零样本泛化能力,使其能够为未见过的视频生成有效的伪动作标签。大量的仿真和真实实验表明,使用CoMo伪动作标签联合训练的策略在扩散和自回归架构下均实现了优越性能。

英文摘要

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.
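The pair construction described above (positives from a small future offset, negatives from reversing time) can be made concrete with frame indices. The sketch below is one possible reading of that sentence, not CoMo's sampling code: the `gap` and `offset` parameters, the function names, and the use of flat lists as frames are all assumptions made for illustration.

```python
def temporal_difference(frame_a, frame_b):
    """Early temporal difference: subtract raw frames before any encoding."""
    return [b - a for a, b in zip(frame_a, frame_b)]

def contrastive_triplets(num_frames, gap=2, offset=1):
    """Return (anchor, positive, negative) frame-index pairs.

    anchor   = (t, t + gap)
    positive = the same direction, shifted by a small future offset
    negative = the anchor pair with its temporal direction reversed
    """
    triplets = []
    for t in range(num_frames - gap - offset):
        anchor = (t, t + gap)
        positive = (t + offset, t + gap + offset)
        negative = (t + gap, t)
        triplets.append((anchor, positive, negative))
    return triplets
```

Because the negative pair contains exactly the same two frames as the anchor, static background content cannot separate them; only the direction of motion can, which is consistent with the abstract's claim that the scheme pushes the latent toward foreground motion.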

2603.00654 2026-06-19 cs.CV 版本更新

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

RC-GeoCP:雷达-相机协同感知的几何一致性

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息科学与电子工程学院) School of Automotive Studies, Tongji University(同济大学汽车学院) Thrust of Artificial Intelligence, Hong Kong University of Science and Technology(香港科技大学人工智能学域)

AI总结 提出首个4D雷达与相机协同感知框架RC-GeoCP,通过雷达锚定几何一致性解决深度模糊和空间分散导致的错位,实现高效通信与全局一致表示。

Comments 11 pages, 6 figures, 9 tables

详情
AI中文摘要

协同感知(CP)通过多智能体信息共享增强场景理解。尽管以LiDAR为中心的系统提供精确几何,但高成本和恶劣天气下的性能下降需要多模态替代方案。尽管具有密集的视觉语义和鲁棒的空间测量,相机与4D雷达之间的协同在协作环境中仍未得到充分探索。本文介绍RC-GeoCP,这是首个探索CP中4D雷达与图像融合的框架。为解决由深度模糊和跨智能体空间分散引起的错位,RC-GeoCP建立了雷达锚定的几何一致性。具体而言,几何结构修正(GSR)将视觉语义与雷达导出的几何对齐,以生成空间有根基的、几何一致的表示。不确定性感知通信(UAC)将选择性传输表述为条件熵减少过程,基于智能体间分歧优先处理信息特征。最后,共识驱动聚合器(CDA)通过共享几何锚聚合多智能体信息,形成全局一致的表示。我们在V2X-Radar和V2X-R上建立了首个统一的雷达-相机CP基准,展示了最先进的性能,同时显著降低了通信开销。代码即将发布。

英文摘要

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(德国弗赖堡大学计算机科学系) Qualcomm SARL France(法国 Qualcomm SARL) Automated Driving, Qualcomm Technologies, Inc.(Qualcomm Technologies, Inc. 自动驾驶部门)

AI总结 提出类别增量运动预测新任务,通过端到端框架结合伪标签与开放词汇分割,利用3D-2D投票机制和查询特征方差重放策略,缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

详情
AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而,现有方法通常假设一个封闭世界设定,具有固定的对象分类法并依赖高质量感知,限制了其在现实世界中的应用,因为现实世界中感知不完美,且新对象类别可能随时间出现。在这项工作中,我们引入了类别增量运动预测,这是一个新颖的设定,其中新对象类别随时间顺序引入,并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架,该框架适应新引入的类别,同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签,并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测,而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明,我们的方法成功地在已知类别上保持性能,同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移,并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在 https://omen.cs.uni-freiburg.de 上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2606.18960 2026-06-19 cs.CV cs.RO 版本更新

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
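The policy-evaluation claim above is stated in terms of the Pearson correlation between success estimated inside the world model and success measured on the real robot. For reference, a minimal implementation of that statistic is sketched below. The function name is illustrative, and the paper's evaluation protocol (which policies, how many rollouts) is not specified here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In this setting `xs` would hold one world-model success rate per policy and `ys` the matching real-world success rates; a value near 1 means the simulated ranking of policies agrees with the real one.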

2606.18112 2026-06-19 cs.RO cs.CV 版本更新

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出 Qwen-RobotNav 可扩展导航模型,通过参数化接口支持多种任务模式和可调观测参数,在15.6M样本上训练,联合视觉语言数据防止行为坍缩,在多个导航基准上取得新最优结果,并展示零样本泛化能力。

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

3. 图像识别、检索与分类 6 篇

2606.19684 2026-06-19 cs.CV 新提交

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

探索多模态大语言模型与两阶段微调在时尚图像检索中的应用

Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(胡志明市国家大学下属理科大学) Vietnam National University, Ho Chi Minh(胡志明市国家大学)

AI总结 提出融合多模态大语言模型(LLaVA)生成属性感知三元组,并采用两阶段微调策略增强对比学习,以解决时尚图像检索中标注数据稀缺和负采样简单的问题。

Comments SOICT 2025

详情
AI中文摘要

组合图像检索通过参考图像和修改文本描述的复合查询来检索目标图像。在时尚领域,该任务需要理解颜色、图案和纹理等细微属性变化。然而,现有方法因标注数据稀缺和负采样简单而面临局限性。我们提出了一种新颖框架,该框架集成多模态大语言模型(LLaVA)以生成属性感知三元组,并引入两阶段微调策略来增强对比学习。我们利用预训练的视觉-语言模型(如CLIP-ViT/B32)生成句子级提示并与相对描述拼接,以及使用静态表示来增加负样本数量。实验结果表明,该框架增强了组合推理能力并改进了细粒度检索行为,突显了所提框架在时尚检索中的可行性和潜力。

英文摘要

Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.

2606.20044 2026-06-19 cs.CV 新提交

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

FUSE:面向多模态目标重识别的频域统一与频谱能量对齐

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

发表机构 * School of Cyber Science and Engineering, Xi'an Jiaotong University(西安交通大学网络空间安全学院) School of Informatics, Xiamen University(厦门大学信息学院) National University of Singapore(新加坡国立大学)

AI总结 提出频域框架FUSE,通过频谱解耦和能量对齐两阶段处理,解决多模态重识别中低频偏置问题,在三个数据集上mAP提升9.1%。

Comments Accepted in ICML 2026

详情
AI中文摘要

尽管多模态重识别(ReID)取得了显著进展,现有方法往往强调低频线索。因此,它们关注颜色、光照和粗略外观等属性,而忽略了编码几何、纹理和身份判别细节的中高频结构。这种不平衡导致频谱表示不完整和跨模态对齐不稳定。为了克服这些限制,我们引入了FUSE,一个频域框架,将多模态ReID重新表述为频谱解耦和能量对齐的两阶段过程。所提出的频谱分解模块(SDM)自适应地将特征划分为低频、中频和高频子空间,实现分层频谱建模。跨模态对齐模块(CAM)进一步通过频率一致性正则化强制实现跨模态的能量对齐和子空间互补性。此外,FUSE结合了可学习的频率调制,以增强在不同光照和异构传感器条件下的鲁棒性。在RGBNT201、RGBNT100和MSVR310上的大量实验表明,FUSE实现了9.1%的mAP和9.5%的Rank-1改进,为多模态表示学习建立了一个可解释的频域范式。

英文摘要

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
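Splitting a feature into low, mid, and high-frequency parts and comparing their energies can be shown on a 1-D signal with a plain discrete Fourier transform. The sketch below is a simplified stand-in for the idea, not the SDM module: it uses fixed cutoffs (`low_cut`, `high_cut`, both hypothetical parameters) where the abstract describes an adaptive partition, and it operates on a list of floats rather than on learned feature maps.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(signal, low_cut, high_cut):
    """Partition a real signal into low/mid/high band components that sum back to it."""
    n = len(signal)
    spectrum = dft(signal)
    bands = {"low": [0j] * n, "mid": [0j] * n, "high": [0j] * n}
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # treat bins k and n - k as the same frequency
        name = "low" if freq <= low_cut else "mid" if freq <= high_cut else "high"
        bands[name][k] = coeff
    return {name: idft(spec) for name, spec in bands.items()}

def energy(signal):
    """Sum of squared samples, used here as the per-band spectral energy."""
    return sum(v * v for v in signal)
```

Comparing `energy(bands["low"])` with the mid and high values for two modalities gives a crude picture of the low-frequency bias the abstract describes; an alignment loss would then penalize mismatched band energies across modalities.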

2606.20199 2026-06-19 cs.CV 新提交

Evaluation of Image Matching for Art Skills Assessment

艺术技能评估中的图像匹配评价

Asaad Alghamdi, Michael Poor, Trung-Nghia Le, Tam V. Nguyen

发表机构 * University of Dayton(代顿大学) University of Science, VNU-HCM(胡志明市国家大学理科大学) Vietnam National University, Ho Chi Minh City(胡志明市国家大学)

AI总结 提出通过手绘图像与模板匹配来评估绘画技能的方法,比较SIFT特征与孪生网络,发现SIFT关键点匹配更有效。

Comments MAPR 2024

详情
AI中文摘要

虽然有些人天生具有绘画天赋,但掌握这项技能需要专门的训练和练习。确定一个人的绘画技能需要适当的全面评估。在本文中,我们提出了一种通过将手绘图像与原始模板匹配来衡量绘画技能的方法。现有技术通常涉及复杂的过程。然而,计算机视觉的进步使我们能够训练计算机以类似人类的水平进行这些比较,从而解决了繁琐且耗时的传统过程。使用计算机视觉应用,确定图像相似性涉及识别图像与参考图像的相似程度。我们实现并分析了SIFT特征和孪生网络来衡量图像相似性。我们的结果表明,评估艺术技能水平是可行的。通过特征分析,我们发现基于SIFT的关键点匹配为检测绘画技能提供了更有效的手段。

英文摘要

While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining one's skill in the art of drawing requires proper comprehensive assessment. In this paper, we propose a method to measure drawing skill by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby resolving the tedious and overwhelming traditional process. Using computer vision applications, determining image similarity involves identifying the level of similarities in an image with a reference image. We have implemented and analyzed the SIFT feature and Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based key point matching provides a more effective means of detecting drawing skills.
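Keypoint matching of the kind the abstract refers to is commonly done by comparing descriptor vectors with a nearest-neighbour ratio test. The sketch below shows that matching step on plain lists of floats, assuming the descriptors were already extracted (for example with OpenCV's SIFT implementation, which is not called here). The 0.75 ratio, the function names, and the choice of "fraction of matched keypoints" as the similarity score are assumptions for illustration; the paper's own scoring rule is not stated in the abstract.

```python
import math

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to desc_b, keeping only unambiguous matches."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        # Accept only if the best candidate is clearly closer than the runner-up.
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

def similarity(desc_template, desc_drawing, ratio=0.75):
    """Fraction of template keypoints that find an unambiguous match in the drawing."""
    if not desc_template:
        return 0.0
    return len(ratio_test_matches(desc_template, desc_drawing, ratio)) / len(desc_template)
```

A drawing that reproduces the template's distinctive structures yields many unambiguous matches and a score near 1, while a drawing whose keypoints all look alike fails the ratio test and scores near 0.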

2508.04424 2026-06-19 cs.CV 版本更新

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

组合对象检索:通过组合表达式进行对象级检索

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China(新一代人工智能技术与交叉应用教育部重点实验室,东南大学,江苏,中国) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学(MBZUAI),阿布扎比,阿联酋)

AI总结 提出组合对象检索(COR)任务,通过组合参考对象、掩码和检索文本进行对象级检索,并构建COR125K基准和CORE模型,显著优于现有方法。

详情
AI中文摘要

基于用户意图检索细粒度视觉内容在多模态系统中仍然是一个挑战。尽管当前的组合图像检索(CIR)方法结合了参考图像和检索文本,但它们局限于图像级匹配,无法定位特定对象。为此,我们提出了组合对象检索(COR),一种新的对象级检索任务,从目标图像中的候选对象中检索目标对象,并用像素级掩码对检索结果进行定位。给定一个参考对象、其掩码、一个目标图像以及描述所需修改的检索文本,COR要求模型执行组合视觉-文本推理,而不是依赖显式的类别名称。这一设置带来了若干挑战,包括细粒度组合匹配、在视觉相似干扰物下的负对象过滤以及灵活的单对象或多对象检索。我们构建了COR125K,第一个大规模COR基准,包含408个类别的125,541个检索三元组,并划分基础/新类别以评估类别级泛化能力。我们还提出了CORE,一个统一的端到端模型,集成了参考区域编码、自适应视觉-文本交互和区域级对比学习,以将组合表示与目标对象对齐,同时抑制背景和干扰物。大量实验表明,CORE在基础和新类别上均显著优于现有的基于CIR的流程和强基线,为细粒度对象级多模态检索建立了一个简单而有效的基础。代码将在 https://github.com/wangtong627/COR 公开发布。

英文摘要

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2512.03199 2026-06-19 cs.CV 版本更新

Does Head Pose Correction Improve Biometric Facial Recognition?

头部姿态校正是否能提升生物特征面部识别?

Justin Norman, Hany Farid

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究探讨了AI驱动的头部姿态校正与图像修复对面部识别准确率的影响,发现选择性应用CFR-GAN与CodeFormer可提升识别性能。

详情
AI中文摘要

生物特征面部识别模型在处理现实世界图像时常表现出显著的准确性下降,通常表现为图像质量差、非正面姿态和主体遮挡。我们调查了针对这些挑战的AI驱动头部姿态校正和图像修复是否能提高识别准确率。使用模型无关的大规模法医评估流程,我们评估了三种修复方法:3D重建(NextFace)、2D正面化(CFR-GAN)和特征增强(CodeFormer)。我们发现这些技术的简单应用会显著降低面部识别准确率。然而,我们还发现选择性应用CFR-GAN结合CodeFormer可以带来有意义的提升。

英文摘要

Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

2604.19196 2026-06-19 cs.CV 版本更新

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

面向域泛化人脸反欺骗的视觉基础模型基准测试

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

发表机构 * Graduate School of Information Sciences, Tohoku University, Japan(东北大学信息科学研究生院,日本)

AI总结 本文系统评估15种预训练视觉模型在人脸反欺骗域泛化中的表现,发现自监督ViT(尤其是DINOv2+Registers)结合数据增强和注意力损失在MICO协议上达到最优,且计算高效。

Comments 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

详情
AI中文摘要

人脸反欺骗(FAS)由于需要在未见过的环境中进行鲁棒的域泛化而仍然具有挑战性。尽管最近的趋势利用视觉-语言模型(VLM)进行语义监督,但这些多模态方法通常需要高昂的计算资源并表现出高推理延迟。此外,它们的有效性本质上受限于底层视觉特征的质量。本文重新审视仅视觉基础模型建立高效鲁棒FAS基线的潜力。我们在严苛的跨域场景下(包括MICO和有限源域(LSD)协议)对15个预训练模型进行了系统基准测试,例如有监督CNN、有监督ViT和自监督ViT。我们的全面分析表明,自监督视觉模型,特别是带有寄存器的DINOv2,显著抑制了注意力伪影并捕获了关键的细粒度欺骗线索。结合人脸反欺骗数据增强(FAS-Aug)、分块数据增强(PDA)和注意力加权分块损失(APL),我们提出的仅视觉基线在MICO协议上达到了最先进的性能。该基线在数据受限的LSD协议下优于现有方法,同时保持优越的计算效率。这项工作为FAS提供了一个确定的仅视觉基线,表明优化的自监督视觉变换器可以作为仅视觉和未来多模态FAS系统的骨干。项目页面见:https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ 。

英文摘要

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

4. 目标检测、分割与定位 8 篇

2606.20032 2026-06-19 cs.CV 新提交

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

ReA-OVCD:通过语义和空间精炼的可靠性感知开放词汇变化检测

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) College of Surveying and Geo-Informatics, Tongji University(同济大学测绘与地理信息学院)

AI总结 提出一种无需训练的可靠性感知开放词汇变化检测框架,通过语义变化推理和边界感知精炼策略,解决实例级比较忽略细粒度变化和像素级比较不可靠的问题,在多个数据集上F1提升2.13%-9.75%。

详情
AI中文摘要

与依赖预定义类别的传统遥感变化检测不同,开放词汇变化检测(OVCD)使用任意文本提示灵活识别土地覆盖变化。然而,现有方法在建模变化时存在固有折衷:实例级比较忽略了细粒度语义变化(例如部分建筑扩建),而直接像素比较不可靠,由于语义模糊和空间不一致导致不稳定响应和边界伪影。为此,我们提出一种高效的无训练可靠性感知开放词汇变化检测(ReA-OVCD)框架。它首先从像素级语义差异中推导候选变化区域,以确保灵活和详细的定位。为确保可靠性,随后引入协作精炼策略,从语义和空间角度显式建模变化有效性。具体而言,我们开发了语义变化推理(SCR)模块,通过联合分析分布差异和响应变化重新评估变化,从而抑制偶然不一致性同时保留可靠的语义转变。此外,设计了边界感知变化精炼(BCR)模块,通过验证候选区域是否得到可靠内部像素支持来减轻由边界错位和不确定性引起的伪影。在多个数据集(LEVIR-CD、WHU-CD、DSIFN和SECOND)上的大量实验表明,我们的方法持续优于现有技术,在更高计算效率下实现了2.13%至9.75%的F1提升。代码已公开于 https://github.com/Funny0101/ReA-OVCD 。

英文摘要

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD

2606.20130 2026-06-19 cs.CV 新提交

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

SAM3自蒸馏用于细粒度GOOSE 2D语义分割

Xuesong Wang

发表机构 * Wayne State University(韦恩州立大学)

AI总结 提出基于SAM3图像编码器与轻量解码器的分割模型,通过自蒸馏、多尺度测试增强和光度畸变迁移,在GOOSE 2D挑战赛达69.73% mIoU。

Comments 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

详情
AI中文摘要

我们描述了在ICRA 2026 GOOSE 2D细粒度语义分割挑战赛中获得第四名的方案,该方案在官方1815张图像测试集上达到了69.73%的复合平均交并比(mIoU)。我们的模型适配了近期视觉基础模型Segment Anything Model 3(SAM3)的图像编码器,并搭配轻量级解码器。除此之外,我们贡献了两项技术和一项经验发现:(i)一种自蒸馏方案,该方案重新利用SAM3本身,以真实边界框作为提示,在SAM3性能优于我们自身模型的类别上充当教师;(ii)一种图像级多尺度测试时增强方案,通过重新缩放图像而非模型输入,为固定输入尺寸的模型恢复多尺度推理;(iii)一项发现:来自2025年GOOSE 2D获胜方案的一种激进光度畸变,移植到我们的流程中,是单一最大的改进来源。

英文摘要

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

2606.20161 2026-06-19 cs.CV 新提交

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

ARTEMIS: 基于智能体引导的可靠性感知时间掩码演化用于不完美监督的视频息肉分割

Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education(东南大学教育部新一代人工智能技术及其跨学科应用重点实验室) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) School of Medicine, Case Western Reserve University(凯斯西储大学医学院)

AI总结 提出ARTEMIS框架,利用视觉语言智能体选择可靠时间锚点,结合SAM2传播和可靠性感知鲁棒学习,从不完美监督(点、涂鸦、少量密集标签)中学习高质量视频息肉分割掩码,在多个基准上达到最优性能。

详情
AI中文摘要

不完美监督的视频息肉分割(VPS)旨在从廉价监督中学习密集、时间一致的掩码,包括弱标注(点、涂鸦)和少量密集标注帧的半监督。该设置具有临床价值,但由于弱对比、模糊边界、运动模糊和镜面高光,加上稀疏的像素级指导,具有挑战性。虽然SAM2可以从稀疏输入生成密集掩码,但直接伪标签通常会产生几何退化的掩码,存在边界泄漏,未充分利用时间一致性,并忽略可靠性。为解决这些问题,我们提出ARTEMIS,一个由智能体引导的可靠性感知时间掩码演化驱动的统一框架,用于不完美监督的VPS。ARTEMIS从可用监督初始化粗掩码:SAM2转换点/涂鸦,而密集标签作为可靠锚点。一个辩论-判断视觉语言智能体在弱监督下选择可靠的时间锚点,这些锚点通过SAM2双向传播以细化不可靠或未标注的帧。最后,ARTEMIS使用时间可靠性感知鲁棒学习训练分割器,结合可靠性引导的参考选择、参考原型传输模块和可靠性感知鲁棒损失。这些组件评估掩码可靠性,随时间演化锚点,跨帧传输目标身份,并降低噪声监督的权重而非丢弃困难样本。在SUN-SEG和CVC-ClinicDB-612上的涂鸦、点和有限标签设置下的实验表明,ARTEMIS达到了最先进的性能。代码将发布于 https://github.com/wangtong627/ARTEMIS 。

英文摘要

Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.

2606.20282 2026-06-19 cs.CV 新提交

U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

U$^2$Mamba:用于显著目标检测的两级嵌套U结构Mamba

Junhui Li, Jialu Li, Youshan Zhang

发表机构 * University of Science and Technology Liaoning(辽宁科技大学) Chuzhou University(滁州学院) Yeshiva University(叶史瓦大学)

AI总结 提出U$^2$Mamba,一种两级嵌套U结构网络,通过多尺度Mamba U块增强深度和上下文信息,并采用分层训练监督,在显著目标检测上达到先进性能。

Comments 6 pages, 2 figures

详情
AI中文摘要

基于Mamba的模型已成为显著目标检测(SOD)的有前途的替代方案,在长序列建模方面具有显著优势。然而,现有模型往往未能充分利用上下文信息和整个架构的深度。本文介绍了U$^2$Mamba,一种用于显著目标检测的强大且创新的U结构网络。我们提出了多尺度Mamba U块(MMUBs),增强了模型深度以改进局部特征提取能力。我们新开发的嵌套U结构结合了MMUBs,使网络能够整合来自浅层和深层的不同感受野,从而收集更丰富的上下文信息和更长距离的数据,而不受分辨率限制。我们提出了一种分层训练监督方法,在训练过程中在每个层级计算损失,而不是使用传统的深度监督方案和顶层监督训练。大量实验表明,U$^2$Mamba在显著目标检测上取得了与最先进方法高度竞争的性能。源代码可在 https://github.com/JL021/U2Mamba 获取。

英文摘要

Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U$^2$Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U$^2$Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at https://github.com/JL021/U2Mamba

2606.20300 2026-06-19 cs.CV 新提交

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

CMDS-AD: 跨模态双流解耦用于少样本异常检测

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

发表机构 * Shenzhen University(深圳大学) Guangzhou Maritime University(广州航海学院) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出跨模态双流异常检测框架CMDS-AD,通过扩散模型生成多样本并利用低频正常估计辅助解耦高频缺陷,在1-shot设置下MVTec 3D-AD上I-AUROC提升5.7%。

Comments Accepted to ECCV 2026!

详情
AI中文摘要

少样本异常检测由于训练数据有限仍然具有挑战性。多模态异常检测(MAD)提供了一种可行的解决方案,利用3D几何线索丰富2D RGB表示并弥补这一稀缺性。然而,现有的MAD方法采用空间均匀的特征处理,混淆了稳定的宏观结构与高频局部缺陷信号,加剧了跨模态错位并增加了假阳性率。为了克服这一问题,我们提出了CMDS-AD,一种跨模态双流异常检测框架。一个LoRA引导的扩散模型生成多样的RGB样本以缓解极端数据稀缺。对于3D正常增强,我们采用预训练的扩散模型作为正常估计器。关键的是,该估计器本质上充当非线性低通滤波器,直接从RGB输入中提取低频正常表示。这建立了一个纯低频信息的辅助估计流,锚定稳健的结构模板,并帮助包含耦合高低频分量的未压缩真实流精确隔离微缺陷。一个坐标感知的分层特征映射器自适应地对齐跨模态语义,而一个乘法评分机制过滤模态特定噪声。在极端1-shot设置下,CMDS-AD在MVTec 3D-AD上实现了5.7%(I-AUROC)和2.0%(AUPRO)的绝对性能提升,在EyeCandies上分别提升了7.7%和5.6%,确立了新的最先进水平。

英文摘要

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.

2507.21460 2026-06-19 cs.CV 版本更新

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

用于低光场景光场目标跟踪的角-时交互网络

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

发表机构 * Engineering Research Center of Learning-Based Intelligent System (Ministry of Education)(教育部学习驱动智能系统工程研究中心) Key Laboratory of Computer Vision and System (Ministry of Education)(教育部计算机视觉与系统重点实验室) School of Computer Science and Engineering, Tianjin University of Technology(天津理工大学计算机科学与工程学院)

AI总结 提出一种光场极线平面结构图像表示和角-时交互网络,通过显式建模几何结构和自监督优化,在低光场景下实现高效目标跟踪,性能达到最优。

详情
AI中文摘要

高质量的四维光场表示结合高效的角特征建模对于场景感知至关重要,因为它可以提供判别性的空间-角度线索来识别移动目标。然而,近期的发展仍然难以在时间域中提供可靠的角建模,尤其是在复杂的低光场景中。在本文中,我们提出了一种新颖的光场极线平面结构图像(ESI)表示,该表示显式定义了光场内的几何结构。通过利用极线平面内光线角度的突变,这种表示可以增强低光场景中的视觉表达,并减少高维光场的冗余。我们进一步提出了一种用于光场目标跟踪的角-时交互网络(ATINet),该网络从光场的几何结构线索和角-时交互线索中学习角感知表示。此外,ATINet还可以通过自监督方式进行优化,以增强时间域上的几何特征交互。最后,我们引入了一个大规模的光场低光数据集用于目标跟踪。大量实验表明,ATINet在单目标跟踪中达到了最先进的性能。此外,我们将所提方法扩展到多目标跟踪,这也显示了高质量光场角-时建模的有效性。

英文摘要

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.

2510.24399 2026-06-19 cs.CV cs.RO 版本更新

GenTrack: A New Generation of Multi-Object Tracking

GenTrack:新一代多目标跟踪

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人实验室,南丹麦大学)

AI总结 提出GenTrack多目标跟踪方法,采用随机与确定性混合策略,结合粒子群优化与社会交互,在弱检测器、遮挡等场景下有效维持目标身份一致性并减少ID切换。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本文介绍了一种新颖的多目标跟踪(MOT)方法,称为GenTrack,其主要贡献包括:第一,一种混合跟踪方法,采用随机和确定性方式,以鲁棒地处理未知且时变的目标数量,特别是在维持目标身份(ID)一致性和管理非线性动态方面;第二,利用粒子群优化(PSO)和一些提出的适应度度量,引导随机粒子朝向其目标分布模式,从而即使在弱且噪声大的目标检测器下也能实现有效跟踪;第三,整合目标间的社会交互,以增强PSO引导的粒子,并改进强(匹配)和弱(未匹配)轨迹的连续更新,从而减少ID切换和轨迹丢失,尤其是在遮挡期间;第四,基于GenTrack重新定义的视觉MOT基线,结合了基于空间一致性、外观、检测置信度、轨迹惩罚和社会分数的综合状态与观测模型,以实现系统且高效的目标更新;第五,首个公开可用的最小依赖源代码参考实现,包含三种变体,包括GenTrack Simple、Strengthen和Super,便于灵活重新实现。实验结果表明,与最先进的跟踪器相比,GenTrack在标准基准和现实场景中提供了优越的性能,并集成了基线实现以进行公平比较。还讨论了未来工作的潜在方向。所提方法和比较跟踪器的源代码参考实现已在GitHub上提供:https://github.com/SDU-VelKoTek/GenTrack

英文摘要

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions are: first, a hybrid tracking approach that combines stochastic and deterministic mechanisms to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics; second, the use of particle swarm optimization (PSO) with proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors; third, the integration of social interactions among targets to enhance PSO-guided particles and to improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions; fourth, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates; and fifth, the first publicly available source-code reference implementation with minimal dependencies, featuring three variants (GenTrack Simple, Strengthen, and Super) to facilitate flexible reimplementation. Experimental results show that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
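
To make the PSO-guided particle idea in the abstract more concrete, here is a minimal sketch of a generic particle swarm update that pulls a set of 2D particles toward the best-scoring location. It is not the GenTrack implementation: the function name `pso_refine`, the coefficients `w`, `c1`, `c2`, and the toy fitness (negative squared distance to a single assumed detection center) are all illustrative assumptions, whereas the paper describes richer fitness measures.

```python
import random

def pso_refine(particles, fitness, iters=20, w=0.5, c1=1.5, c2=1.5, seed=0):
    """Generic PSO: move particles toward personal and global best positions."""
    rng = random.Random(seed)
    pos = [list(p) for p in particles]
    vel = [[0.0] * len(p) for p in particles]
    pbest = [list(p) for p in pos]          # best position seen by each particle
    gbest = list(max(pbest, key=fitness))   # best position seen by the swarm
    for _ in range(iters):
        for i, p in enumerate(pos):
            for d in range(len(p)):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - p[d])
                             + c2 * r2 * (gbest[d] - p[d]))
                p[d] += vel[i][d]
            if fitness(p) > fitness(pbest[i]):
                pbest[i] = list(p)
            if fitness(p) > fitness(gbest):
                gbest = list(p)
    return gbest

# Toy fitness (assumption): closeness to one detection center at (5, 5).
target = (5.0, 5.0)

def fitness(p):
    return -((p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2)

particles = [(0.0, 0.0), (10.0, 2.0), (3.0, 9.0)]
best = pso_refine(particles, fitness)
```

Because the global best is only replaced by a strictly better position, the returned estimate is never worse than the best initial particle under the chosen fitness.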

2510.24410 2026-06-19 cs.CV cs.RO 版本更新

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

GenTrack2: 一种改进的多目标跟踪混合方法

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人实验室,南丹麦大学)

AI总结 提出结合随机粒子滤波与确定性关联的多目标跟踪方法,通过粒子群优化和新型代价矩阵解决非线性动态下的标识一致性问题,性能优于现有方法。

Comments The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

详情
AI中文摘要

本文提出一种视觉多目标跟踪方法,联合使用随机和确定性机制,以确保在非线性动态下未知且时变目标数量的标识一致性。随机粒子滤波处理非线性动态和非高斯噪声,并借助粒子群优化(PSO)将粒子引导至状态分布模式,通过提出的适应度度量(包含运动一致性、外观相似性和与邻近目标的社交互动线索)减轻发散。确定性关联通过提出的代价矩阵进一步强制标识一致性,该矩阵包含粒子与当前检测之间的空间一致性、检测置信度和轨迹惩罚。随后,提出一种新颖方案,在保持目标身份的同时平滑更新目标状态,特别是对于与其他目标交互和长时间遮挡期间的弱轨迹。此外,对过去状态的速度回归提供趋势种子速度,增强粒子采样和状态更新。所提出的跟踪器设计灵活,适用于预录视频和相机直播流(未来帧不可用)。实验结果表明,与最先进的跟踪器相比,性能优越。所提出方法和对比跟踪器的源代码参考实现已在GitHub上提供:https://github.com/SDU-VelKoTek/GenTrack2

英文摘要

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
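
The "velocity regression over past states" mentioned in the abstract can be pictured as fitting a straight line to a track's recent positions and taking its slope as the velocity. The sketch below does this with ordinary least squares per coordinate. The function name `trend_velocity`, the unit frame spacing, and the sample track are assumptions made for illustration, and the paper's exact regression may differ.

```python
def trend_velocity(track):
    """Least-squares slope of each coordinate against the frame index.

    `track` is a list of past positions (one tuple per frame, oldest first);
    frames are assumed to be equally spaced, so the slope is in units per frame.
    """
    n = len(track)
    if n < 2:
        # Not enough history to estimate a trend: fall back to zero velocity.
        return tuple(0.0 for _ in track[0]) if track else ()
    t_mean = (n - 1) / 2.0
    denom = sum((t - t_mean) ** 2 for t in range(n))
    velocity = []
    for d in range(len(track[0])):
        x_mean = sum(p[d] for p in track) / n
        num = sum((t - t_mean) * (p[d] - x_mean) for t, p in enumerate(track))
        velocity.append(num / denom)
    return tuple(velocity)

# Hypothetical box centers over four frames: moving right and slightly up.
past_centers = [(0.0, 10.0), (2.0, 9.0), (4.0, 8.0), (6.0, 7.0)]
vx, vy = trend_velocity(past_centers)
```

A velocity estimated this way could seed particle sampling for the next frame, since a slope fitted over several frames is less sensitive to a single noisy detection than a two-frame difference.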

5. 视频理解与时序视觉 9 篇

2606.19682 2026-06-19 cs.CV 新提交

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

Vortex: 面向智能视频检索的多模态融合系统

Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, Thanh-Tien Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(越南国立大学胡志明市理科大学) Vietnam National University, Ho Chi Minh City(越南国立大学胡志明市)

AI总结 提出Vortex系统,融合自适应关键帧提取、多模态元数据生成及混合检索策略(CLIP与SigLIP2的倒数秩融合),结合Rocchio反馈和多阶段时序搜索,在比赛中取得优异成绩。

Comments SOICT 2025

详情
AI中文摘要

本文介绍了Vortex,这是我们的团队FocusOnFun为胡志明市AI挑战赛2025开发的多模态视频检索系统,旨在推进智能多媒体搜索和时间推理。该系统集成了自适应关键帧提取、来自视觉语言和语音模型的多模态元数据生成,以及通过倒数秩融合融合CLIP和SigLIP2嵌入的混合检索策略,以平衡全局和细粒度语义。为了增强交互性,Vortex引入了基于Rocchio的相关性反馈和多阶段时序搜索机制,用于顺序事件对齐。该系统基于Milvus和Elasticsearch构建,支持可扩展的索引和高效检索。在官方比赛中,我们的FocusOnFun团队的系统在初赛中获得了79.6/88(90.5%)的分数,并在决赛中进一步评估,整体表现达到“优秀”,在问答(QA)任务中取得“杰出”成绩。这证明了CLIP和SigLIP2的互补优势,并确认了混合检索方法的有效性。该系统为未来在智能、上下文感知和交互式视频检索方面的研究奠定了坚实基础。

英文摘要

This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. This demonstrates the complementary strengths of CLIP and SigLIP2 and confirms the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
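
As a rough illustration of the Reciprocal Rank Fusion step named in the abstract, the sketch below fuses two ranked keyframe lists using the standard RRF score, the sum over lists of 1 / (k + rank). The function name `reciprocal_rank_fusion`, the keyframe IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration and are not taken from the Vortex system.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 keyframes returned by two embedding models for one query.
clip_ranking = ["kf_12", "kf_07", "kf_33"]
siglip_ranking = ["kf_07", "kf_33", "kf_12"]
fused = reciprocal_rank_fusion([clip_ranking, siglip_ranking])
```

RRF uses only rank positions, so the two models' similarity scores do not need to be on a comparable scale before fusion.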

2606.19706 2026-06-19 cs.CV cs.CL 新提交

NEST: Narrative Event Structures in Time for Long Video Understanding

NEST:面向长视频理解的时间叙事事件结构

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系)

AI总结 提出NEST数据集(1005部全长电影),通过多模态叙事事件标注和关系链接,评估模型在长视频中理解事件结构、时间顺序和长程依赖的能力,实验表明事件检测等任务极具挑战性。

详情
AI中文摘要

视觉-语言模型的最新进展使得处理越来越长的视频序列成为可能,但处理扩展令牌流的能力并不能转化为对长视频中叙事结构的理解。现有的长视频基准侧重于大海捞针式检索,而不是评估低级动作如何形成事件、事件如何跨时间交互以及叙事如何进展,例如,模型是否能够将早期的挫折(如失业)与后来的关系破裂联系起来,尽管存在长时间间隔、中间场景或重新诠释事件的闪回。我们引入了NEST(面向长视频理解的时间叙事事件结构),一个包含1005部全长电影(平均98分钟)的数据集,每部电影都标注了102个基于视觉内容、对话和音频的多模态叙事事件。NEST通过基于视觉内容、对话和音频的结构化标注捕捉多模态叙事事件,并通过反映叙事结构的关系(包括时间顺序、层次组合和长程依赖)将它们联系起来。我们引入了事件触发检测(ETD)、事件定位(EL)、事件论元抽取(EAE)和事件关系抽取(ERE)的基线。该基准对于基于事件发现极具挑战性,ETD低于8%,EL低于6%,EAE低于11%。相比之下,一旦事件给定,ERE更容易处理,零样本F1达到35.45%,微调后F1达到44.42%。

英文摘要

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress; for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19849 2026-06-19 cs.CV 新提交

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

ViCoStream: 流式视频大模型通过阶段协调推理可运行超过100 FPS

Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang, Xiaoyu Shen

发表机构 * Southeast University(东南大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出ViCoStream框架,通过阶段协调的流水线(分块执行、CUDA流重叠、视觉令牌控制、有界视觉注意力、查询端检索)实现流式视频大模型的高吞吐低延迟推理,在单A100上达到134 FPS视频吞吐和<50 ms首令牌延迟,精度接近全历史基线。

Comments 19 pages, 7 figures, 13 tables

详情
AI中文摘要

流式视频大模型必须持续处理传入的视频,同时保持低查询延迟,这使得视频摄入吞吐量和查询时间响应性对于实时部署至关重要。现有方法主要集中于加速单个模块,如视觉编码、令牌剪枝或KV缓存压缩,但对由此产生的系统能否维持实时流式性能提供的见解有限。我们将流式视频大模型推理形式化为一个协调的流水线,涵盖视觉预处理、视觉编码、令牌丢弃和LLM预填充/解码。基于这一形式化,我们提出了ViCoStream(视频协调流式处理),一个阶段协调的流式框架,结合了分块执行、CUDA流重叠、视觉令牌控制、有界视觉注意力和查询端检索,以限制每块的计算和内存成本。我们进一步对瓶颈迁移进行了系统研究,揭示了块大小、令牌保留、注意力局部性和检索范围如何影响吞吐量-准确率权衡。在多个流式基准测试上使用Qwen2.5-VL-3B/7B-Instruct进行的实验表明,ViCoStream在单块A100 GPU上实现了134 FPS的视频吞吐量和小于50 ms的首令牌延迟,同时保持接近全历史基线的准确率。

英文摘要

Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.

2606.19927 2026-06-19 cs.CV 新提交

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

CARE: 面向视频多模态大语言模型的自适应推理长度的能力感知奖励塑形

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出CARE框架,通过能力感知奖励塑形自适应优化推理长度,利用指数移动平均估计能力并分阶段调整奖励偏好,结合批次归一化和后验放大器提升效率与准确性。

详情
AI中文摘要

在多模态视频推理中,基于强化学习的方法通常依赖简单且不灵活的推理长度控制策略,无法适应模型不断变化的能力。这种不匹配可能在早期阶段抑制必要的探索,而在模型变得更有能力后鼓励冗余推理和低效解码。本文提出CARE,一种用于多模态推理中自适应推理长度优化的能力感知奖励塑形框架。具体来说,CARE通过通过率的指数移动平均维护平滑的能力估计,并利用它将训练路由到渐进阶段,将奖励偏好从探索导向的长形式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混淆,CARE进一步使用批次级统计归一化推理努力,并引入后验放大器以增强对历史上困难样本上意外强性能的奖励信号。所提出的机制无缝集成到GRPO训练流程中,且不增加额外推理开销。在多个视频推理和通用视频理解基准上的大量实验表明,CARE持续提高推理准确性,稳定强化学习,并显著提升令牌效率。此外,CARE在训练过程中展现出推理长度的特征性倒U型轨迹,并在收敛时产生更短但信息更丰富的推理轨迹,表明推理预算的有效自适应分配。我们在以下网址提供CARE框架和实验的源代码:https://github.com/1Pansy/Video-CARE 。

英文摘要

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
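
The competence estimate described in the abstract, an exponential moving average of pass rates that routes training into progressive stages, can be sketched as follows. This is a simplified illustration under stated assumptions: the smoothing factor `alpha`, the thresholds `low` and `high`, the three stage names, and the sample pass rates are all hypothetical and do not come from the CARE paper.

```python
def update_competence(prev, pass_rate, alpha=0.1):
    """Exponential moving average: new = (1 - alpha) * prev + alpha * pass_rate."""
    return (1.0 - alpha) * prev + alpha * pass_rate

def route_stage(competence, low=0.3, high=0.7):
    """Map the smoothed competence to a (hypothetical) reward-shaping stage."""
    if competence < low:
        return "explore"   # favor long-form, exploratory reasoning
    if competence < high:
        return "balance"   # neutral length preference
    return "concise"       # favor shorter, efficient reasoning

competence = 0.0
stages = []
# Hypothetical per-step pass rates observed during training.
for pass_rate in [0.2, 0.6, 0.9, 0.9, 1.0]:
    competence = update_competence(competence, pass_rate, alpha=0.5)
    stages.append(route_stage(competence))
```

Smoothing with an EMA keeps a single unusually good or bad batch from flipping the stage, which is presumably why a smoothed estimate is preferred over the raw pass rate.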

2606.20140 2026-06-19 cs.CV 新提交

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

SA-VIS: 用于训练视频实例分割的稀疏帧标注

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

发表机构 * CVL, ETH Zurich(计算机视觉实验室,苏黎世联邦理工学院) Align Technology VISICS, KU Leuven(VISICS,鲁汶大学) INSAIT, Sofia(INSAIT,索非亚)

AI总结 提出稀疏帧标注的SA-VIS方法,通过过去帧特征传播模块利用低维特征,在仅使用1/5标注帧时性能仅下降0.4%,显著降低标注成本。

详情
AI中文摘要

最近的在线视频实例分割(VIS)方法取得了令人印象深刻的结果,因此成为视频中实例分割的首选方法。尽管令人印象深刻的单图像模型(例如基于SAM的模型)重新兴起,但在线(或半在线)VIS方法通过在训练期间使用长序列的密集标注帧,优于单图像模型。然而,这种VIS的训练设置在计算和所需密集标注方面成本高昂。为了解决这些主要缺陷,我们认为实例及其在视频中的演变的有效建模并不需要密集标注的帧。为此,我们提出了一个简单有效的模块,称为过去帧特征传播(PFP),它聚合来自多个帧的图像编码器的低维特征。这个简单的低计算量模块为使用稀疏视频帧标签进行端到端训练提供了巨大的学习能力。结合轻量级的帧特定实例查询,我们的稀疏帧标注VIS(SA-VIS)显著提高了其基线的性能。最有趣的是,我们避免复杂性的简单设计有效地弥合了在稀疏和密集标注视频序列上训练之间的精度差距。这意味着当仅使用数据集中1/5图像的标注时,SA-VIS的性能仅下降0.4%。实验上,SA-VIS在YouTube-VIS 2019/2021/2022和Occluded VIS(OVIS)上显示出相对于基线的强劲改进,并且在有限标注场景下,AP比最先进方法提高了1%以上。

英文摘要

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup for VIS is expensive in terms of both compute and the dense annotations required. To address these major flaws, we argue that the effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific instance queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an over 1% improvement in AP over the state of the art in a limited-annotation scenario.

2606.20312 2026-06-19 cs.CV 新提交

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

面向冻结姿态流视频异常检测的可靠性感知原型校准

Ning Dong, Yingna Su, Xin Dong, Ziyun Jiao, Xinnian Guo, Zhuangzhuang Pan

AI总结 提出一种后验评分校准方法RPC,通过标准化潜在空间中的最近原型偏差修正冻结姿态流检测器的排名,在8个骨干-数据集组合上平均提升AUROC 2.03个百分点。

Comments 15 pages, 5 figures, 7 tables. Code available at https://github.com/iNing10/RPC

详情
AI中文摘要

姿态流视频异常检测器因其能为跟踪的骨架窗口提供基于似然的排名,在一类监控中具有吸引力。然而,单个似然分数可能隐藏多模态正常行为,并对姿态观测噪声敏感。我们研究了一个冻结检测器设置,其中姿态流骨干网络、缓存的骨架轨迹和评估流程是固定的。可靠性感知原型校准(RPC)是针对该设置的一种后验评分校准方法。它在冻结潜在空间中添加标准化的最近原型偏差到标准化的流分数,并仅使用关键点置信度来门控这一新增的几何证据。因此,RPC在保留原始密度信号的同时,利用姿态可靠性下的经验正常模式结构修正排名。在两个冻结姿态流骨干网络和四个数据集上,RPC在所有八个骨干-数据集对中提升了帧级AUROC,增益范围为0.34到4.49个百分点,平均为2.03个百分点。消融和可靠性分析表明,原型偏差是主要的修正信号,而可靠性门控在姿态观测不可靠时最为有用。这些结果表明,当重新训练或复现完整姿态流程不可行时,轻量级后验校准可以增强缓存的姿态流系统。

英文摘要

Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
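The score combination described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the multiplicative confidence gate, the `weight` parameter, and the use of Euclidean distance are all assumptions, and the prototypes are taken as given (the abstract does not say how they are obtained). Scores are assumed to be oriented so that higher means more anomalous.

```python
import math
from statistics import mean, pstdev

def zscore(values, ref):
    """Standardize values with the mean/std of a reference (normal) set."""
    mu, sigma = mean(ref), pstdev(ref)
    sigma = sigma if sigma > 0 else 1.0
    return [(v - mu) / sigma for v in values]

def nearest_prototype_distance(z, prototypes):
    """Euclidean distance from latent z to its closest prototype (assumed metric)."""
    return min(math.dist(z, p) for p in prototypes)

def rpc_scores(flow_scores, latents, confidences, prototypes,
               ref_flow, ref_latents, weight=1.0):
    """Standardized flow score plus a confidence-gated, standardized
    nearest-prototype deviation. The gate form is hypothetical."""
    ref_dev = [nearest_prototype_distance(z, prototypes) for z in ref_latents]
    dev = [nearest_prototype_distance(z, prototypes) for z in latents]
    s_flow = zscore(flow_scores, ref_flow)
    s_dev = zscore(dev, ref_dev)
    # Confidence gates only the added geometric term, never the flow score.
    return [f + weight * c * d for f, c, d in zip(s_flow, confidences, s_dev)]
```

With the gate at zero the output reduces to the standardized flow score, which mirrors the abstract's claim that the original density signal is preserved.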

2606.20559 2026-06-19 cs.CV cs.LG 新提交

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

UNIEGO:代理作为中介的统一自我中心视频表示学习

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

AI总结 提出分层多教师蒸馏框架UNIEGO,通过代理模型将异构教师知识转化为同质自我中心空间,并采用选择性代理蒸馏自适应筛选可靠监督,在三个自我中心视频理解任务上达到最优。

详情
AI中文摘要

自我中心视频理解本质上受限于可穿戴摄像头的狭窄视角:单一视角、单一模态、单一模型无法捕捉人类动作的全部丰富性。我们认为,真正富有表现力的自我中心表示必须包含跨视角、跨模态和基础模型表示的互补知识,同时仍能仅从自我中心视频部署。为此,我们引入了一个分层多教师蒸馏框架,生成UNIEGO,一个统一的自我中心编码器,使用九个教师(涵盖自我-外部视角、RGB、深度和骨架模态)以及四个基础模型进行训练。我们的框架不是直接从异构教师中蒸馏(其不兼容的架构和特征几何会导致冲突梯度),而是在其中插入一层表示特定的代理模型,将多样的教师知识转化为同质的自我中心空间。第二阶段蒸馏,即选择性代理蒸馏(SPD),然后自适应地为每个训练样本选择既正确又自信的代理子集,仅从可靠监督中蒸馏并抑制错误信号。SPD进一步通过将UNIEGO初始化为代理参数的凸组合来稳定,在蒸馏开始前将统一模型置于损失景观的良好条件区域。UNIEGO在三个自我中心视频理解任务(动作识别、视频检索和动作分割)上,在三个具有挑战性的自我-外部基准测试中达到了最先进的性能,优于朴素的多教师蒸馏基线,并证明了结构化的、代理中介的知识转移能产生更丰富、更具判别性的自我中心表示。

英文摘要

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
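The two SPD ingredients named in the abstract, per-sample proxy selection and convex-combination initialization, might look roughly like this. It is a sketch under stated assumptions: the confidence threshold `min_conf`, the softmax parameterization of the mixing weights, and the flat parameter vectors are illustrative choices, not details from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_proxies(proxy_logits, label, min_conf=0.5):
    """Indices of proxies whose top prediction equals the label
    with probability at least min_conf (hypothetical threshold)."""
    selected = []
    for i, logits in enumerate(proxy_logits):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred == label and probs[pred] >= min_conf:
            selected.append(i)
    return selected

def convex_init(proxy_params, raw_weights):
    """Mix same-shaped proxy parameter vectors with softmax weights,
    which are non-negative and sum to one (a convex combination)."""
    w = softmax(raw_weights)
    dim = len(proxy_params[0])
    return [sum(w[k] * proxy_params[k][j] for k in range(len(proxy_params)))
            for j in range(dim)]
```

In this reading, a proxy that is correct but unsure and a proxy that is confident but wrong are both excluded from the distillation target for that sample.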

2606.20561 2026-06-19 cs.CV 新提交

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

TimeProVe: 先提出后验证,实现日常活动中的高效长视频时间推理

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

AI总结 提出TimeProVe框架,先通过轻量模块生成基于动作的候选假设,再调用昂贵VLM验证,在长视频问答中降低75%VLM调用和93%推理成本,性能提升7.3%。

详情
AI中文摘要

长视频问答(LVQA)需要在数小时未修剪的视频中识别稀疏的、与查询相关的证据。现有方法要么使用大型视觉语言模型(VLM)密集处理视频,导致计算成本过高,要么依赖稀疏的基于字幕的推理,这往往会遗漏时间局部化和以运动为中心的证据。我们提出TimeProVe,一种用于长视频中时间基础推理的高效混合框架。TimeProVe首先使用轻量模块生成基于动作的答案-证据假设,随后仅调用昂贵的VLM进行针对性验证。我们框架的核心在于基于动作的候选证据(ACE)模块,该模块通过轻量级LLM推理将时间局部化的动作转换为查询条件化的候选答案和支持证据窗口。我们进一步引入OpenTSUBench(OTB),一个开放基准测试,旨在评估真实世界日常活动(ADL)场景中的时间基础推理。实验表明,TimeProVe在OTB上比最强基线高出7.3%,同时减少了75%的VLM调用和93%的推理成本。此外,在没有显式时间基础训练的情况下,TimeProVe在Charades-STA上取得了竞争性性能,并在结合基础VLM增强时达到了最先进的结果。

英文摘要

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
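The propose-then-verify control flow can be illustrated with a short sketch. Everything here is hypothetical scaffolding: the candidate tuple layout, the `max_checks` budget, and the `verify` callable (standing in for the expensive VLM) are assumptions, and the real ACE module is not reproduced.

```python
def propose_then_verify(candidates, verify, max_checks=2):
    """candidates: list of (answer, (start_s, end_s), proposal_score).
    verify: expensive callable (answer, window) -> bool, e.g. a VLM check.
    Returns (answer, window, number_of_verify_calls)."""
    # Cheap step: rank hypotheses by the lightweight proposal score.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    calls = 0
    # Expensive step: verify only the top few evidence windows.
    for answer, window, _ in ranked[:max_checks]:
        calls += 1
        if verify(answer, window):
            return answer, window, calls
    return None, None, calls
```

The point of the design is that the expensive model sees a handful of short windows instead of the whole video, which is where a reduction in VLM calls would come from.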

2606.09547 2026-06-19 cs.CV cs.LG 版本更新

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

AI总结 提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力,并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments The project page is available at https://apratimbh.github.io/livecookv2/

详情
AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

6. 生成式视觉与世界模型 26 篇

2606.19495 2026-06-19 cs.CV 新提交

LooseControlVideo: Directorial Video Control using Spatial Blocking

LooseControlVideo: 使用空间分块进行导演式视频控制

Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

发表机构 * Adobe Research(Adobe研究院)

AI总结 提出LooseControlVideo框架,通过稀疏定向3D框作为“分块”代理,实现文本到视频生成中多对象场景的直观布局与轨迹控制,显著优于现有2D框和流方法。

Comments Project page at https://shariqfarooq123.github.io/LooseControlVideo/

详情
AI中文摘要

在文本到视频生成中,精确的3D空间编排仍然是一个重大挑战,特别是对于语义布局和时间动态经常纠缠的多对象场景。虽然现有的深度条件模型实现了良好的结构保真度,但它们需要密集的、帧精确的指导,这对于涉及可变形对象的动态事件来说,制作起来非常费力。我们提出了LooseControlVideo,一个通过使用稀疏的、定向的3D框作为“分块”代理来实现直观和表达性控制的框架。这允许用户创作高级布局和轨迹,同时利用视频生成模型生成逼真的遮挡、动态和交互。我们通过在带有DNOCS(一种用于3D大小、方向和深度排序遮挡的新型编码)注释的视频数据集上微调Wan 2.2骨干网络来实现这一点。此外,我们的方法允许局部细化,例如调整跳跃轨迹或添加交互,而对全局场景上下文的干扰最小。在nuScenes、HO-3D和BEHAVE基准上的广泛评估表明,LooseControlVideo显著优于现有的2D框和基于流的基线。我们的结果表明,与当前最先进的布局条件模型相比,轨迹误差提高了1.2倍到3倍;刚体运动一致性提高了2倍;遮挡精度提高了1.5倍到2倍,表明定向3D基元为复杂的多智能体视频创作提供了良好的几何先验。

英文摘要

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.

2606.19662 2026-06-19 cs.CV 新提交

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

学习何时去噪:优化潜在扩散的异步调度

Bingshuo Qian, Xiang Cheng

AI总结 提出学习异步调度策略,通过调度校正目标优化多表示扩散模型的去噪顺序,在ImageNet 256x256上以不到1%额外训练计算实现4倍加速,FID达1.02。

Comments 25 pages, 9 figures, 4 tables

详情
AI中文摘要

多表示扩散模型可以通过对图像的互补视图进行去噪来改善视觉合成,但其性能关键取决于决定每个表示何时去噪的异步调度。我们提出学习这种调度。我们的方法在多个表示空间上制定异步流匹配,并使用调度校正目标,该目标在调度变化时保持每个表示的局部噪声时间权重固定。我们用一个灵活的参数类实例化调度,该类通过构造是凸且单调的,并使用快速联合探针进行学习,额外训练计算少于1%。在ImageNet 256x256上,学习的调度在匹配的675M参数XL骨干下显著提高了收敛速度和最终质量。使用AutoGuidance,我们的200 epoch模型达到FID 1.05,与800 epoch的SFD-XL基线相当,训练量减少4倍。训练到600 epoch进一步改善到FID 1.02,优于1B参数的SFD-XXL结果(FID 1.04),同时使用更小的模型。在无引导设置中,我们的200 epoch模型达到FID 2.37,已经低于最佳800 epoch SFD-XL结果(2.54),训练量减少4倍,并在600 epoch时改善到FID 2.14。代码可在 https://github.com/bsq532087/LWD 获取。

英文摘要

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
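To make "convex and monotone by construction" concrete, here is one illustrative family with that property. It is not the parametric class used in the paper, which the abstract does not specify: the mixture-of-powers form, the softmax weights, and the softplus-shifted exponents are all assumptions chosen so that any parameter values yield a valid schedule on [0, 1].

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def schedule(u, raw_weights, raw_exponents):
    """Map global time u in [0, 1] to a per-representation time.
    s(u) = sum_i w_i * u**p_i with w = softmax(raw_weights) and
    p_i = 1 + softplus(raw_exponents[i]) >= 1. Each term is convex and
    non-decreasing on [0, 1], so the mixture is too, with s(0)=0, s(1)=1."""
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    powers = [1.0 + softplus(a) for a in raw_exponents]
    return sum(w * u ** p for w, p in zip(weights, powers))
```

Because the constraints hold for every value of the raw parameters, such a schedule can be optimized with unconstrained gradient steps.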

2606.19676 2026-06-19 cs.CV cs.AI 新提交

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

TeleMorpher: 迈向鲁棒的同步运动-位置编辑

Haengbok Chung

AI总结 提出TeleMorpher,一种基于扩散模型的单样本(one-shot)框架,通过运动先验、姿态扭曲和基线运动编辑器注入,实现视频中主角运动与位置的同步编辑,在定量和定性评估中表现优异。

详情
AI中文摘要

扩散模型在图像和视频生成与编辑中取得了显著成功。尽管最近的研究将工作扩展到运动编辑,但同步变换运动与位置(尽管具有实际重要性)仍基本未被探索。为了更好地理解鲁棒的运动-位置编辑,我们首先分析了降低其质量的根本因素。基于此分析,我们提出了TeleMorpher,据我们所知,这是首批用于同步运动-位置编辑的单样本(one-shot)框架之一。我们的方法利用运动先验(从现成模型生成的目标运动中心视频作为运动编辑指导)和真实运动,实现更可控和精确的运动-位置编辑。我们的框架工作如下:(1) 首先通过预训练的分割和修复模型分离主角和背景。(2) 然后,我们引入一种无需训练的姿态扭曲,以运动先验为指导编辑主角的运动。(3) 扭曲后的运动视频在推理时直接注入基线运动编辑器,减轻源运动与目标运动之间的差异,同时保留源视频的外观。(4) 为提高定量评估的可靠性,我们提出了两个新的基于LPIPS的指标,分别测量运动编辑前后的背景一致性,以及通过测量从源视频和目标视频中提取的主角骨架差异来评估运动编辑的保真度。在野外视频和TaiChi数据集上的实验表明,TeleMorpher在定量和定性测量(真实人类评估)中均取得了优越性能,凸显了其有效性。

英文摘要

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped-motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics that measure the background consistency before and after motion editing and the fidelity of motion editing by measuring the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

2606.19718 2026-06-19 cs.CV 新提交

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

基于3D先验引导扩散模型的单样本新视角与姿态人体图像合成

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院教育部高维信息智能感知与系统重点实验室、江苏省社会安全图像与视频理解重点实验室及PCA实验室) Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center(国防科技大学电子工程学院安徽省先进激光技术实验室及江淮前沿技术中心)

AI总结 提出一种基于条件去噪扩散模型的方法,利用3D人体先验(法线图和颜色提示)作为几何和颜色条件,从单张参考图像合成任意姿态和视角的高质量人体图像,包括被遮挡部分。

Comments 30 pages, 10 figures

详情
AI中文摘要

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态,或基于可泛化人体NeRF(使用人体模型先验提取逐点特征)合成人体图像。然而,基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态,而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题,我们提出了一种基于条件去噪扩散模型的新方法,用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言,为了生成具有复杂和任意姿态的人体,我们将3D人体先验(即3D法线图和颜色提示)作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体,我们的扩散模型能够实现高质量合成,包括被遮挡/不可见部分。此外,我们提出了一种基于自重建的自定义细化方法,以在新人物上测试时增强细节。在多个公共数据集上的实验结果表明,我们的方法显著优于先前方法,并显示出更好的跨数据集泛化能力。代码将在 https://github.com/Yankeegsj/3DPGDM 上公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints, or synthesize human images based on generalizable human NeRFs, which use human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to recover occluded/invisible human parts accurately when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction-based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.

2606.19889 2026-06-19 cs.CV 新提交

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

SurgVista:具有合理器械-组织动力学的长程手术世界建模

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

发表机构 * The Chinese University of Hong Kong(香港中文大学) EPFL(瑞士联邦理工学院洛桑) Imperial College London(伦敦帝国学院)

AI总结 提出SurgVista手术世界模型,通过变形一致性正则化和漂移适应训练,解决空间交互不连贯和时间保真度崩溃问题,在长程预测中显著优于现有方法。

详情
AI中文摘要

将机器人策略学习扩展到自主手术面临挑战,因为专家演示成本高昂且体内探索存在重大安全风险。手术世界模型通过从初始观测生成逼真的、动作条件下的未来帧来解决这一问题,但现有方法存在两种持续失效模式:空间交互不连贯,即可见器械接触未能引起空间一致的组织变形;以及时间保真度崩溃,即预测误差在自回归展开中累积并逐渐破坏视觉质量。我们提出SurgVista,一种通过两种训练策略缓解这两种失效的手术世界模型。变形一致性正则化从训练视频中提取场景点轨迹,并通过潜在对比学习强制跨帧一致性,增强物理一致的器械-组织动力学。漂移适应训练通过用在线预测残差和根据长程漂移统计校准的光度增强扰动条件帧,减轻长程漂移,在扩展展开中维持视觉保真度。为了进行严格评估,我们进一步引入SurgWorld-Bench,包含多样化的手术类型、长程展开以及用于器械运动精度和组织响应保真度的解耦指标。大量实验表明,SurgVista在视觉质量、时间一致性和交互保真度方面持续优于最先进方法,且随着预测视界增长优势扩大。

英文摘要

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

2606.19958 2026-06-19 cs.CV 新提交

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

SketchKeyAnime:基于参考锚点的稀疏关键草图动画合成

Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出SketchKeyAnime视频扩散框架,通过双分支条件机制和可学习门控的草图交叉注意力,从单张参考RGB图像和稀疏关键草图生成结构可控、外观一致且时间连贯的动画,在Sakuga-42M数据集上显著优于基线方法。

详情
AI中文摘要

传统动画制作严重依赖手工绘制和迭代细化,特别是关键姿势设计、中间帧生成和角色着色。虽然现有的动画和视频生成方法取得了显著进展,但它们通常依赖于RGB边界帧、密集的帧级条件或完整的草图序列,限制了在低成本输入条件下的适用性。我们提出了SketchKeyAnime,一个视频扩散框架,用于从稀疏关键草图输入生成结构可控、外观一致且时间连贯的动画。给定单个参考RGB图像和几个按时间索引的关键草图,SketchKeyAnime引入了一种双分支条件机制,以编码局部几何约束以及语义-时间上下文。它利用草图交叉注意力,通过可学习门控融合参考图像和草图条件,并加入自适应加权损失以加强对关键草图帧和线条艺术区域的监督。在Sakuga-42M的Aesthetic子集上的实验结果表明,我们的方法始终优于代表性的动画插值和草图引导生成基线。与最佳基线相比,SketchKeyAnime将EDMD降低了31.9%,FVD降低了9.5%,展示了卓越的草图保真度和时间连贯性,同时在大多数定量指标上实现了最佳整体性能。这些结果验证了所提出的框架,并突显了其在低成本、高度可控动画创作中的潜力。

英文摘要

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.

2606.19970 2026-06-19 cs.CV 新提交

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

CrossFlow: 跨潜在空间与像素空间的单步生成

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Tencent(腾讯) Fudan University(复旦大学)

AI总结 提出CrossFlow,一种跨空间流模型,将噪声潜在输入直接映射到像素图像,通过无速度单步目标实现潜在到像素的生成,并替代潜在扩散中的解码器,在ImageNet-1k上达到1.62 FID。

Comments Preprint, Under Review

详情
AI中文摘要

大多数扩散和流匹配生成器在相同的表示空间中定义先验、概率路径和预测目标。潜在扩散通过将该路径移动到自编码器潜在空间来提高效率,但最终样本仍由单独训练的解码器生成。这种分离造成了不匹配:生成器针对潜在空间预测进行优化,而最终质量取决于解码器如何处理可能与干净编码器输出不同的生成潜在变量。我们引入了CrossFlow,一种跨空间流公式,将噪声潜在输入直接映射到像素空间图像。关键技术步骤是一个无速度的单步目标:潜在轨迹定义了训练路径,但监督预测是图像而非潜在位移。这使得一个模型既可以作为单步潜在到像素生成器,也可以作为潜在扩散管道的解码器替代品。在类别条件ImageNet-1k $256\times256$上,CrossFlow-XL通过一次函数评估达到了1.62 FID。消融实验表明,潜在编码器以及像素空间感知和对抗损失对保真度很重要。这些结果表明,跨空间流目标可以结合潜在表示的效率与直接像素空间监督,而无需在推理时使用单独的解码器。

英文摘要

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.
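The contrast between a velocity target and the image target described above can be sketched in a few lines. This is a toy illustration only: the linear interpolation convention (t = 0 clean latent, t = 1 pure noise), the plain mean-squared error, and the `model` callable are assumptions, and the perceptual and adversarial losses mentioned in the abstract are omitted.

```python
import random

def interpolate(latent, noise, t):
    """Point on a linear latent path; t=0 is the clean latent,
    t=1 is pure noise (assumed convention)."""
    return [(1.0 - t) * z + t * e for z, e in zip(latent, noise)]

def cross_space_loss(model, latent, image, t, rng):
    """The latent path defines the input z_t, but the regression target is
    the pixel-space image, not a latent displacement such as noise - latent."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    z_t = interpolate(latent, noise, t)
    pred = model(z_t, t)  # hypothetical network: noisy latent -> image
    return sum((p - x) ** 2 for p, x in zip(pred, image)) / len(image)
```

At inference the same network would be called once on a pure-noise latent, which is what makes a separate decoder unnecessary in this formulation.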

2606.20076 2026-06-19 cs.CV cs.AI 新提交

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

基于可学习全局合并的可变长度分词用于扩散变换器

Dong Hoon Lee, Seunghoon Hong

发表机构 * Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea(韩国科学技术院金载哲人工智能研究生院,大田,韩国) School of Computing, KAIST, Daejeon, South Korea(韩国科学技术院计算学院,大田,韩国)

AI总结 针对固定压缩比限制扩散模型质量-计算权衡的问题,提出基于可学习全局合并的可变长度分词器,通过合并令牌实现跨长度表示对齐,在ImageNet 256×256生成中实现更优的gFID-计算权衡。

详情
AI中文摘要

潜在扩散模型(LDM)在视觉合成中占据主导地位,但其质量-计算权衡很大程度上受限于分词器的固定压缩比。可变长度分词器(VLT)通过改变令牌数量实现自适应压缩,使扩散模型能够灵活平衡质量和计算。然而,传统的VLT通过截断有序令牌序列来调节长度,这使得令牌语义依赖于令牌位置,并破坏了跨长度的表示对齐。这导致潜在分布出现跨长度偏移,阻碍单个可变长度扩散模型有效运行。为了解决这个问题,我们提出了一种新颖的可变长度分词器,通过合并令牌来调节长度。我们表明,当扩散变换器根据合并模式运行时,鼓励相似令牌合并可以实现直接的跨长度表示对齐。由于传统的合并方法是数据依赖的,使得生成过程中无法访问合并模式,我们引入了可学习的全局合并,它是数据独立的,以确保与扩散变换器的兼容性。在ImageNet 256×256生成中,我们的基于合并的可变长度分词器与扩散变换器集成,相比之前的VLT方法实现了更优的gFID-计算权衡。代码可在 https://github.com/movinghoon/lgm 获取。

英文摘要

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm

2606.20083 2026-06-19 cs.CV 新提交

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Holo-World: 视频世界模型的统一相机、物体和天气控制

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

AI总结 提出Holo-World,一种从单张图像联合控制相机、物体运动和天气的统一视频世界模型,通过场景适配器和解耦CFG实现世界保持与天气迁移。

Comments Project Page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

详情
AI中文摘要

视频世界模型正朝着在可控相机和物体运动下保持观察到的世界,同时允许其环境状态变化的方向发展。然而,这些控制仍然是孤立的,天气生成通常依赖于已经指定未来结构的源视频或重建场景。我们研究了一种基于第一帧锚定的源到状态设置,其中模型从单张图像开始,遵循明确的相机和物体控制以及可选的天气指令,然后生成一个视频,该视频要么保持源世界,要么将其转移到目标天气状态。为了解决这些挑战,我们首先构建了HoloStateData,一个状态视频数据集,将多样化的视频转换为用于相机、物体和天气监督的统一控制样本。其次,我们引入了Holo-World,一个统一的、可控制的视频世界模型,从单张图像联合控制场景。其统一场景适配器将世界保持和天气迁移分解为不同的参数子空间,使用渲染背景、几何缓冲区和物体控制来维持受控场景结构,同时建模依赖天气的外观和粒子效果。此外,场景-天气解耦CFG分别引导场景和天气残差,增强目标天气效果而不过度放大完整条件。定量和定性实验表明,Holo-World在保持精确的相机和物体控制以及一致场景结构的同时,将场景迁移到多样化的目标天气状态,在天气状态生成上优于视频到视频的天气编辑基线。我们的项目页面可在 https://xiangchenyin.github.io/Holo-World/ 获取。

英文摘要

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
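Guiding scene and weather residuals separately can be pictured with a generic two-term classifier-free guidance rule. The abstract does not give the exact formula, so the decomposition below is an assumption borrowed from common compositional CFG practice: the names `eps_uncond`, `eps_scene`, `eps_full` and the two guidance weights are hypothetical.

```python
def decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene, w_weather):
    """Combine three denoiser outputs elementwise:
      eps_uncond: no conditioning
      eps_scene:  scene controls only
      eps_full:   scene controls plus weather instruction
    The scene residual (eps_scene - eps_uncond) and the weather residual
    (eps_full - eps_scene) get independent guidance weights."""
    return [u + w_scene * (s - u) + w_weather * (f - s)
            for u, s, f in zip(eps_uncond, eps_scene, eps_full)]
```

With both weights at 1 this returns the fully conditioned prediction; raising only `w_weather` strengthens the weather effect without also amplifying the scene term, which is the behavior the abstract attributes to the decomposition.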

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 新提交

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror:在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon(亚马逊)

AI总结 提出MakeupMirror扩散模型,通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器,在保持面部特征和肤色的同时实现高质量化妆迁移,相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

AI中文摘要

化妆迁移模型能够实现有趣的增强现实(AR)体验以及在线化妆购物的虚拟试妆(VTO)。尽管最近最先进的基于扩散的解决方案(如Stable-Makeup)显著提高了化妆迁移的准确性和逼真度,但在身份和肤色保持方面仍存在局限性,使得用于化妆购物的生产级VTO不切实际。在这项工作中,我们提出了MakeupMirror,一种基于扩散的化妆迁移方法,在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新:(1)将面部几何条件与ControlNets集成以保持面部保真度;(2)区域特定的化妆迁移控制,以便在面部区域(如皮肤、眼睛和嘴唇)实现精确的化妆应用;(3)基于肤色的化妆迁移调制,防止跨主体迁移场景中的肤色改变;(4)集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及(本文新收集的、更多样化的)MakeupSelfies数据集上的实验表明,与Stable-Makeup相比,MakeupMirror将相对面部识别相似度提高了+60%,将相对肤色差异降低了-50%,延迟为0.7秒,同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping impractical. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that, relative to Stable-Makeup, MakeupMirror improves facial recognition similarity by 60% and reduces skin tone difference by 50%, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.20233 2026-06-19 cs.CV 新提交

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

使用角色-环境协调视频生成模型的电影级合成

Tianyi Xiang, Mingming He, Li Ma, Jing Liao

发表机构 * City University of Hong Kong(香港城市大学) Independent Researcher(独立研究员)

AI总结 提出端到端视频扩散框架,通过三掩码引导和RGB-D联合去噪建模角色与环境的双向物理与光照交互,实现高质量动态视频合成。

AI中文摘要

电影级合成旨在将绿幕角色融入新环境,同时保持物理和光度真实性。先前的方法通常未能捕捉角色与其周围环境之间的复杂双向交互,我们将其表征为角色到环境(C2E)的物理交互和环境到角色(E2C)的光照协调。为了解决这个问题,我们提出了一个端到端的视频扩散框架,联合建模C2E和E2C交互,特别处理交互道具的挑战。我们的方法引入了一种三掩码引导架构,结合RGB-D联合去噪,以确保角色、道具和环境之间的物理一致交互。我们进一步开发了一种高效的先验驱动数据整理流程,无需昂贵的渲染即可构建高质量的重光照对。最后,参考条件机制实现了可控的环境合成和精确的道具替换。大量实验表明,我们的框架在电影级动态视频合成方面显著优于现有方法。

英文摘要

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

2606.20310 2026-06-19 cs.CV 新提交

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

通过PRISM:视频扩散模型中间状态中的偏好表示

Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

发表机构 * City University of Hong Kong(香港城市大学) Video Rebirth The Chinese University of Hong Kong(香港中文大学)

AI总结 提出PRISM方法,利用冻结的视频扩散骨干网络和轻量级查询聚合头从噪声潜变量中解码偏好信号,实现高精度偏好预测和噪声鲁棒性,支持早期最佳采样以降低计算成本并提升视频质量。

AI中文摘要

使用干净的、基于像素的奖励模型评估视频生成,会使评估与噪声扩散过程脱节,并产生巨大的VAE解码成本。在本文中,我们通过提出一个基本问题来挑战这一范式:一个强大的视频生成器能否直接从噪声潜变量中内在地区分偏好?为了回答这个问题,我们引入了PRISM(Preference Representation in Intermediate States of Diffusion Models)。PRISM采用一个轻量级的基于查询的聚合头,配合冻结的视频扩散骨干网络,从噪声潜变量中解码偏好信号。令人惊讶的是,PRISM不仅达到了最先进的偏好准确率,还解锁了强大的噪声鲁棒性,从而实现了早期Best-of-N采样。这使得在去噪的初始阶段就能过滤掉次优候选,大幅减少计算量并提升视频质量。我们还揭示了骨干网络的生成性能与其内在评估能力之间的强正相关性,从而实现了视频骨干网络的自我改进。

英文摘要

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-N sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
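The compute saving from early-stage Best-of-N sampling can be illustrated with a small cost model. The sketch below is an assumption-laden toy: `score_fn` stands in for a preference head that scores noisy latents, candidates are represented by plain numbers, and cost is counted in denoising steps only. None of these names come from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def full_best_of_n_cost(num_candidates: int, total_steps: int) -> int:
    """Denoising steps spent when every candidate is fully denoised."""
    return num_candidates * total_steps


def early_best_of_n(candidates: Sequence[float],
                    score_fn: Callable[[float], float],
                    keep: int,
                    total_steps: int,
                    early_step: int) -> Tuple[List[float], int]:
    """Score all candidates at an early step and finish only the best ones.

    Returns the surviving candidates and the total denoising steps spent.
    """
    # Every candidate is denoised up to the early checkpoint.
    cost = len(candidates) * early_step
    # Rank by the (hypothetical) preference score on the noisy latent.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    survivors = ranked[:keep]
    # Only the survivors pay for the remaining steps.
    cost += len(survivors) * (total_steps - early_step)
    return survivors, cost


survivors, cost = early_best_of_n([0.2, 0.9, 0.5, 0.7], score_fn=lambda c: c,
                                  keep=1, total_steps=50, early_step=5)
print(survivors, cost)               # [0.9] 65
print(full_best_of_n_cost(4, 50))    # 200
```

The saving depends entirely on how reliable the early score is, which is the noise-robustness property the abstract claims for PRISM.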

2606.20404 2026-06-19 cs.CV 新提交

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FlowBender: 面向自校正条件流的反馈感知训练

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达) University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 针对条件扩散/流模型常违反任务约束的问题,提出FlowBender闭环框架,将对齐误差作为输入训练网络学习校正策略,在图像翻译、复原和3D纹理贴图中同时提升保真度与合理性。

Comments Project page: https://flow-bender.github.io/

AI中文摘要

条件扩散和流模型通常无法满足定义其任务的约束条件。例如,深度条件模型经常产生重新提取的深度与输入不一致的图像,尽管定义约束的前向算子(深度预测器)在训练和推理期间都可用。现有方法通常分为两类:将条件信号视为静态线索并在推理时忽略对齐信息的监督模型,以及通过手动调整的线性更新咨询约束的基于引导的方法,通常以生成样本的合理性为代价来换取对条件的保真度。我们认为这两种范式的根本差距在于模型从未被训练利用自身的对齐误差。我们引入FlowBender,一个闭环框架,将此误差视为一等输入,训练网络学习基于推理时反馈的校正策略。在每一步,无引导的前瞻传递估计干净信号,通过前向算子计算特定任务的偏差,然后细化传递消耗此信号以产生校正速度。我们提出了FlowBender的几种变体,包括用于可微算子的基于梯度的公式和用于不可微设置(如JPEG压缩)的零阶变体。为了实现高效采样,我们引入了一个前一步捷径,使得以最小的额外计算成本实现闭环校正。在图像到图像翻译、复原和3D网格纹理贴图中,FlowBender始终优于标准监督基线、对齐损失增强训练和最先进的推理时引导,同时提高保真度和合理性,而不是在它们之间进行权衡。项目页面:https://flow-bender.github.io/

英文摘要

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator (the depth predictor defining the constraint) is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/
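The three-part step described in the abstract (look-ahead pass, deviation via the forward operator, refinement pass) can be sketched as a sampling loop on a one-dimensional toy problem. Everything concrete here is an assumption for illustration: the forward operator is `A(x) = 2x`, and both `base_velocity` and `refine_velocity` are hand-written stand-ins for what the paper trains as neural networks.

```python
def forward_operator(x: float) -> float:
    """Toy constraint extractor A(x); stands in for e.g. a depth predictor."""
    return 2.0 * x


def base_velocity(x: float, t: float) -> float:
    """Stand-in for the unguided model: it flows toward 1.0 and ignores y."""
    return (1.0 - x) / (1.0 - t)


def refine_velocity(t: float, v: float, deviation: float) -> float:
    """Stand-in for the learned refinement pass that consumes the deviation.

    Since A(x) = 2x, a deviation d in constraint space is d / 2 in x space.
    """
    return v - (deviation / 2.0) / (1.0 - t)


def sample(x0: float, y: float, steps: int, closed_loop: bool = True) -> float:
    """Euler sampling from t = 0 to t = 1 with optional closed-loop feedback."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = base_velocity(x, t)                    # unguided look-ahead pass
        if closed_loop:
            x1_hat = x + (1.0 - t) * v             # estimate of the clean signal
            deviation = forward_operator(x1_hat) - y  # task-specific error
            v = refine_velocity(t, v, deviation)   # corrected velocity
        x = x + dt * v
    return x


open_loop = sample(0.0, y=3.0, steps=10, closed_loop=False)
closed = sample(0.0, y=3.0, steps=10)
print(forward_operator(open_loop), forward_operator(closed))
```

In this toy the open-loop sample ends near `A(x) = 2` while the closed-loop sample ends near the requested `y = 3`, which is the constraint-satisfaction behavior the framework targets.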

2606.20506 2026-06-19 cs.CV cs.AI 新提交

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

FreeStyle: 从社区LoRA挖掘中实现风格-内容双参考生成的自由控制

Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang, Rui Wang, Yufeng Yang, Xuanyang Zhang, Xianfang Zeng, Difan Zou, Gang Yu, Chi Zhang

AI总结 提出FreeStyle框架,利用社区LoRA作为锚点,通过两阶段课程学习(注意力级约束和频率感知RoPE调制)解决双参考生成中的内容泄露问题,并引入新基准和评估指标,实现风格对齐、内容保持与泄露抑制的平衡。

Comments 35 pages, 26 figures. Project page: https://github.com/Blue2Giant/FreeStyle

AI中文摘要

风格-内容双参考生成旨在合成一张图像,该图像保留内容参考的结构和语义,同时采用单独风格参考的风格。尽管近期有所进展,但这一设置仍然具有挑战性,因为模型必须平衡内容保真度、风格对齐和指令遵循,同时避免风格参考的语义泄露。一个关键瓶颈是缺乏大规模的三元组数据,这些数据具有清晰的内容-风格分离和广泛的长尾风格。在这项工作中,我们提出了FreeStyle,一个基于社区LoRA的可扩展双参考生成框架。我们将社区LoRA视为风格和内容的组合锚点,并设计了一个严格的生成和过滤流水线,以在多个基础模型上构建大规模的风格参考和内容参考三元组。为了解决内容泄露,我们采用了两阶段课程学习,并设计了特定阶段的解耦机制:在风格迁移阶段,采用注意力级增强约束来抑制风格参考泄露;在更困难的双参考阶段,采用频率感知的RoPE调制策略来针对基于位置对应的泄露。我们还引入了一个基准,涵盖风格参考和双参考生成,并在风格相似性、内容保持、美学质量、指令遵循和泄露拒绝方面进行评估。该基准包含一个风格不变的内容对齐分数(CAS),并引入了一个基于校准的VLM的拒绝分数,用于评估生成可靠性和泄露。大量实验表明,我们的模型在风格对齐、内容保持和泄露抑制之间实现了强平衡。

英文摘要

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.

2606.20543 2026-06-19 cs.CV 新提交

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

SSD: 空间推测解码加速自回归图像生成

Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao

发表机构 * Rutgers University(罗格斯大学)

AI总结 提出空间推测解码(SSD),利用二维空间相关性同时预测相邻水平与下方令牌,突破视觉推理中的内存瓶颈,实现高达13.3倍的自回归图像生成加速。

AI中文摘要

自回归模型通过将图像视为离散令牌的一维序列,在视觉生成中表现出色,类似于语言建模。然而,这种扁平化处理丢弃了视觉信号固有的二维空间局部性,在推理过程中造成严重的计算瓶颈。我们提出空间推测解码(SSD),一种将预测目标与图像自然几何结构对齐的框架。我们的模型不是仅预测一维序列中的下一个令牌,而是同时预测相邻的水平令牌和正下方的令牌。通过利用这种二维空间相关性,空间推测解码克服了视觉推理中的内存墙。我们的方法在DPG-Bench和GenEval上保持高保真度的同时,将自回归图像生成速度提升高达13.3倍。我们的结果表明,尊重视觉的底层几何结构可以释放巨大的计算效率,为实时、高分辨率自回归生成模型铺平道路。

英文摘要

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
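To see why predicting the right-hand and lower neighbors together can shorten decoding, the sketch below counts steps for an idealized wavefront schedule on a token grid. It assumes every proposed token is accepted and ignores draft verification, so it is an upper bound on parallelism for this simplified picture and does not reproduce the paper's algorithm or its reported speedup. The function names are hypothetical.

```python
from typing import List


def sequential_steps(height: int, width: int) -> int:
    """Raster-order decoding emits one token per step."""
    return height * width


def spatial_schedule(height: int, width: int) -> List[List[int]]:
    """Step at which each grid token becomes available.

    Idealized assumption: each available token proposes its right and lower
    neighbors in the next step, and every proposal is accepted.
    """
    step = [[-1] * width for _ in range(height)]
    step[0][0] = 0
    frontier = [(0, 0)]
    current = 0
    while frontier:
        current += 1
        next_frontier = []
        for r, c in frontier:
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < height and nc < width and step[nr][nc] == -1:
                    step[nr][nc] = current
                    next_frontier.append((nr, nc))
        frontier = next_frontier
    return step


def parallel_steps(height: int, width: int) -> int:
    """Number of steps needed to fill the grid under the idealized schedule."""
    schedule = spatial_schedule(height, width)
    return max(max(row) for row in schedule) + 1


print(sequential_steps(3, 4), parallel_steps(3, 4))  # 12 6
```

Under these assumptions the grid fills along anti-diagonals, so the step count grows with height plus width rather than with their product.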

2606.20563 2026-06-19 cs.CV 新提交

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

JanusMesh: 通过跨空间去噪实现快速零样本3D视觉错觉生成

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学)

AI总结 提出一种无需训练的快速框架,通过跨空间双分支去噪和视图条件纹理合成,在3-5分钟内生成高真实感双语义3D视觉错觉,优于现有方法。

Comments ECCV 2026. Project page: https://siang1105.github.io/JanusMesh.github.io/

AI中文摘要

创建3D视觉错觉(即从不同视角揭示完全不同语义的单一3D网格)是一个迷人但艰巨的挑战。现有的基于优化的方法速度慢且可能产生过饱和颜色。相比之下,简单的拼接方法无法生成几何一致的物体,导致可见的不自然接缝和语义泄露。在本文中,我们提出了一个快速且无需训练的框架,用于生成文本驱动的3D视觉错觉。我们的方法将生成过程解耦为两个阶段。首先,我们提出一个跨空间双分支去噪过程。该过程动态地将3D潜在变量解码到体素空间,用于CLIP引导的方向对齐和符号距离场(SDF)混合,确保无缝的几何融合。其次,我们引入一个视图条件纹理合成模块,将特定视图的2D扩散先验投影并聚合到融合的几何上。大量实验表明,我们的方法在仅3-5分钟内生成高度逼真的双语义3D错觉,在几何完整性、语义可识别性和效率上显著优于现有方法。项目页面:https://siang1105.github.io/JanusMesh.github.io/

英文摘要

Creating a 3D visual illusion, a single 3D mesh that reveals entirely different semantics from different viewing angles, is a fascinating but difficult challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects, resulting in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/
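SDF blending is the step the abstract credits with seamless geometric fusion. The abstract does not say which blend is used, so the sketch below shows one common choice as an assumption: a polynomial smooth union of two signed distance values, which matches the plain minimum far from the seam and rounds the junction where the two surfaces meet.

```python
import math
from typing import Sequence


def sphere_sdf(point: Sequence[float], center: Sequence[float],
               radius: float) -> float:
    """Signed distance from a point to a sphere (negative inside)."""
    return math.dist(point, center) - radius


def smooth_union(d1: float, d2: float, k: float) -> float:
    """Polynomial smooth minimum of two signed distances.

    k controls the width of the blended region; where |d1 - d2| >= k the
    result is exactly min(d1, d2).
    """
    h = max(k - abs(d1 - d2), 0.0) / k
    return min(d1, d2) - h * h * k * 0.25


# Two overlapping unit spheres evaluated at a point on the seam between them.
p = (0.5, 0.9, 0.0)
a = sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)
b = sphere_sdf(p, (1.0, 0.0, 0.0), 1.0)
print(min(a, b), smooth_union(a, b, 0.5))
```

At the seam the smooth union returns a smaller distance than the hard minimum, which fills in the crease that a hard union would leave.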

2606.20416 2026-06-19 cs.LG cs.CV 交叉投稿

On the Redundancy of Timestep Embeddings in Diffusion Models

扩散模型中时间步嵌入的冗余性研究

José A. Chávez

发表机构 * Independent Researcher, Lima, Peru(独立研究者,秘鲁利马)

AI总结 本文通过理论和实验证明,在U-Net和Diffusion Transformer架构中,扩散模型无需显式时间步嵌入也能达到全局最优,甚至在某些指标上超越有条件模型。

Comments 17 pages

AI中文摘要

扩散模型严重依赖显式的时间步嵌入来调节不同噪声尺度下的去噪过程。在这项工作中,我们通过分析时间步嵌入对U-Net和Diffusion Transformer架构的影响,挑战了这些时间信号的必要性。除了经验证据外,我们提供了一个理论框架,证明在某些条件下,无需显式时间步条件即可达到扩散训练目标的全局最小值。我们的发现揭示了当完全移除时间步嵌入时令人惊讶的鲁棒性。在CelebA和CIFAR-10数据集上的大量消融研究表明,这些时间无关模型可以保持高结构保真度,甚至在竞争性指标(包括FID、精确率和召回率)上超越其有条件对应模型。我们的分析表明,这些架构可以在特定假设下从损坏输入中隐式推断噪声尺度,使得显式时间条件变得冗余。这项研究挑战了长期以来的时间条件范式,并为更高效、更注重结构的生成架构铺平了道路。

英文摘要

Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.
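The claim that a network can infer the noise scale from the corrupted input alone can be made concrete with a toy estimator. The setup below is an assumption chosen for simplicity and is not the paper's theoretical framework: data points lie in {-1, +1}^d, corruption is additive (`x = x0 + sigma * eps`), and the noise level is recovered from the second moment of the noisy vector, since E[x_i^2] = 1 + sigma^2.

```python
import math
import random
from typing import List


def make_noisy_sample(dim: int, sigma: float, rng: random.Random) -> List[float]:
    """Draw x0 uniformly from {-1, +1}^dim and add Gaussian noise of scale sigma."""
    return [rng.choice((-1.0, 1.0)) + sigma * rng.gauss(0.0, 1.0)
            for _ in range(dim)]


def estimate_sigma(x: List[float]) -> float:
    """Estimate the noise scale from the input alone, with no timestep given.

    Uses E[x_i^2] = 1 + sigma^2 for this toy data distribution.
    """
    second_moment = sum(v * v for v in x) / len(x)
    return math.sqrt(max(second_moment - 1.0, 0.0))


rng = random.Random(0)
for true_sigma in (0.1, 0.5, 2.0):
    x = make_noisy_sample(20000, true_sigma, rng)
    print(true_sigma, round(estimate_sigma(x), 3))
```

The estimate sharpens as the dimension grows, which is consistent with the intuition that high-dimensional inputs carry enough information to make an explicit timestep signal redundant under such assumptions.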

2506.06952 2026-06-19 cs.CV 版本更新

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

LaTtE-Flow: 基于层间时间步专家流的Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Maryland(马里兰大学) Nvidia(英伟达) Salesforce AI Research(Salesforce AI研究) Intuit AI Research(Intuit AI研究)

AI总结 提出LaTtE-Flow,一种基于预训练视觉语言模型的高效统一架构,通过层间时间步专家流和条件残差注意力机制,实现图像理解与生成,生成速度提升约6倍。

Comments Unified multimodal model, Flow-matching

AI中文摘要

多模态基础模型在统一图像理解与生成方面取得了最新进展,为在单一框架内处理广泛的视觉-语言任务开辟了令人兴奋的途径。尽管取得了进展,现有的统一模型通常需要大量的预训练,并且与专门针对每项任务的模型相比,难以达到相同的性能水平。此外,许多这些模型存在图像生成速度慢的问题,限制了它们在实时或资源受限环境中的实际部署。在这项工作中,我们提出了基于层间时间步专家流的Transformer(LaTtE-Flow),一种新颖且高效的架构,可在单个多模态模型中统一图像理解与生成。LaTtE-Flow建立在强大的预训练视觉语言模型(VLM)之上,以继承强大的多模态理解能力,并通过新颖的层间时间步专家流架构扩展它们,以实现高效的图像生成。LaTtE-Flow将流匹配过程分布到专门的Transformer层组中,每组负责不同的时间步子集。这种设计通过在每个采样时间步仅激活一小部分层,显著提高了采样效率。为了进一步提升性能,我们提出了一种时间步条件残差注意力机制,用于跨层高效的信息重用。实验表明,LaTtE-Flow在多模态理解任务上取得了强劲的性能,同时与最近的统一多模态模型相比,实现了具有竞争力的图像生成质量,推理速度提高了约6倍。

英文摘要

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to match the performance of models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
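The idea of assigning each group of Transformer layers to its own subset of timesteps can be sketched as a routing function. The layout below is an assumption for illustration: equal-sized contiguous layer groups and a uniform partition of the timestep range [0, 1]. The abstract does not specify how groups or timestep subsets are chosen.

```python
from typing import List


def active_layers(t: float, num_layers: int, num_groups: int) -> List[int]:
    """Indices of the layers that run at timestep t under the assumed layout.

    Layers are split into num_groups contiguous groups of equal size, and
    group g handles timesteps in [g / num_groups, (g + 1) / num_groups).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    group_size = num_layers // num_groups
    # Clamp so that t = 1.0 falls into the last group.
    group = min(int(t * num_groups), num_groups - 1)
    start = group * group_size
    return list(range(start, start + group_size))


# With 24 layers in 4 groups, only 6 layers run at any sampling step.
print(active_layers(0.3, num_layers=24, num_groups=4))  # [6, 7, 8, 9, 10, 11]
```

Under this layout the per-step cost scales with the group size rather than with the full depth, which is the source of the sampling efficiency the abstract describes.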

2601.21081 2026-06-19 cs.CV 版本更新

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

思维形状:通过视觉思维链进行渐进式物体组装

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)科学与工程学院) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) Sun Yat-sen University(中山大学) The Hong Kong University of Science and Technology, Guangzhou(香港科技大学(广州)) Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)(深圳未来网络智能研究所(FNii-Shenzhen)) Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK(SZ)(广东省未来网络智能重点实验室,CUHK(SZ))

AI总结 提出Shape-of-Thought (SoT)框架,通过视觉思维链在渲染2D域中逐步组装形状,解决文本到图像生成中的组合结构约束问题,在组件计数和结构拓扑上显著优于直接生成。

Comments ICML2026

AI中文摘要

用于文本到图像生成的多模态模型已实现强视觉保真度,但在组合结构约束(特别是生成计数、属性绑定和部分级关系)下仍然脆弱。为解决这些挑战,我们提出了Shape-of-Thought (SoT),一种视觉思维链框架,用于在渲染2D域中进行过程监督的渐进式形状组装,推理时无需外部引擎。SoT训练一个统一的多模态自回归模型,生成交错文本计划和渲染中间状态,帮助模型在不产生显式几何表示的情况下捕捉形状组装逻辑。与纯文本思维链不同,每个决策都基于渲染状态,使得计数、连接、拓扑和中间部件添加错误在整个轨迹中可检查。为支持这一范式,我们引入了SoT-26K,一个基于部件CAD层次结构的大规模接地组装轨迹数据集,以及T2S-CompBench,一个用于评估结构完整性和轨迹忠实度的基准。在SoT-26K上微调在组件计数上达到88.4%,在结构拓扑上达到84.8%,在组件计数上比直接生成高出24.2个百分点,在结构拓扑上高出19.3个百分点。SoT为渲染域结构感知生成建立了一个透明测试平台。代码见 https://github.com/yuhuo03/Shape-of-Thought。

英文摘要

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.

2601.21542 2026-06-19 cs.CV cs.AI 版本更新

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

双锚点插值求解器加速生成建模

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出BA-solver,通过轻量SideNet(1-2%主干大小)学习双向时间感知和双锚点速度积分,在不重新训练主干的情况下,以极低训练成本实现10步内达到100+步Euler求解器质量,支持即插即用。

AI中文摘要

流匹配(FM)模型已成为高保真合成的前沿范式。然而,它们对迭代常微分方程(ODE)求解的依赖造成了显著的延迟瓶颈。现有解决方案面临两难:无训练求解器在低神经函数评估(NFE)下性能严重下降,而基于训练的一步或几步生成方法则面临高昂的训练成本且缺乏即插即用的通用性。为弥合这一差距,我们提出了双锚点插值求解器(BA-solver)。BA-solver保留了标准无训练求解器的通用性,同时通过引入轻量级SideNet(主干大小的1-2%)与冻结主干并行,实现了显著加速。具体而言,我们的方法基于两个协同组件:1)双向时间感知,其中SideNet学习近似未来和过去的速度,无需重新训练重型主干;2)双锚点速度积分,利用带有两个锚点速度的SideNet高效近似中间速度,用于批量高阶积分。通过利用主干建立高精度“锚点”并利用SideNet加密轨迹,BA-solver能够以最小误差实现大步长。在ImageNet-256^2上的实验结果表明,BA-solver仅需10次NFE即可达到与100+次NFE的Euler求解器相当的生成质量,并在仅5次NFE时保持高保真度,且训练成本可忽略不计。此外,BA-solver确保与现有生成流水线的无缝集成,便于图像编辑等下游任务。

英文摘要

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low numbers of Neural Function Evaluations (NFEs), while training-based one-step or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% of the backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity with as few as 5 NFEs, while incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
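The general shape of integrating with two anchor velocities plus a cheap intermediate estimate can be sketched on a toy ODE. This is not the paper's solver. The assumptions are: the toy dynamics `dx/dt = -x` replace the backbone, the end anchor is evaluated at an Euler-predicted state, and `side_velocity` is a hypothetical stand-in for the SideNet that simply averages the two anchors. With that linear stand-in the Simpson-style update below reduces to Heun's method.

```python
import math


def backbone_velocity(x: float, t: float) -> float:
    """Stand-in for the expensive frozen backbone; toy dynamics dx/dt = -x."""
    return -x


def side_velocity(v_start: float, v_end: float) -> float:
    """Hypothetical cheap estimate of the midpoint velocity between anchors."""
    return 0.5 * (v_start + v_end)


def euler_solve(x0: float, steps: int) -> float:
    """Plain Euler integration from t = 0 to t = 1 (one backbone call per step)."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        x += h * backbone_velocity(x, i * h)
    return x


def bi_anchor_solve(x0: float, intervals: int) -> float:
    """Two backbone anchors per interval, combined with a Simpson-style rule."""
    x, h = x0, 1.0 / intervals
    for i in range(intervals):
        t = i * h
        v_start = backbone_velocity(x, t)               # anchor at interval start
        x_end_guess = x + h * v_start                   # Euler predictor
        v_end = backbone_velocity(x_end_guess, t + h)   # anchor at interval end
        v_mid = side_velocity(v_start, v_end)           # cheap intermediate estimate
        x += h / 6.0 * (v_start + 4.0 * v_mid + v_end)
    return x


exact = math.exp(-1.0)
# Both runs use 10 backbone evaluations.
print(abs(euler_solve(1.0, 10) - exact), abs(bi_anchor_solve(1.0, 5) - exact))
```

At equal backbone cost the anchored scheme has a noticeably smaller error on this toy problem; a learned intermediate estimator would be needed to go beyond second-order accuracy.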

2602.15819 2026-06-19 cs.CV 版本更新

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

VideoSketcher:利用视频模型先验的序列草图生成

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

发表机构 * MIT(麻省理工学院)

AI总结 提出VideoSketcher方法,结合LLM的语义规划与视频扩散模型的时序渲染,通过两阶段微调从少量样本学习笔画顺序与风格,生成高质量序列草图。

AI中文摘要

素描本质上是序列化的:笔画逐步绘制以探索和完善想法。然而,大多数生成方法将草图视为静态图像,忽略了创造性探索背后的时间过程。建模这种序列结构仍然具有挑战性:先前的方法要么依赖大规模但多样性有限的人类绘制数据集,要么使用大型语言模型(LLM)生成绘制指令,但往往以视觉保真度为代价。我们提出VideoSketcher,一种通过将预训练的文本到视频扩散模型适应于草图形成的稀疏连续性质来生成高质量绘制过程的方法。我们的关键洞察是LLM和视频扩散模型提供互补优势:LLM作为语义规划器,将概念分解为逐步指令,而视频扩散模型作为强大的“渲染器”,将它们转化为时间连贯的草图序列。我们引入一种两阶段微调策略,将时间结构与视觉外观解耦:笔画顺序从合成形状组合中学习,而风格则从少至七幅手绘示例中提炼。尽管监督极少,我们的方法能够生成多样、高质量的序列草图,并忠实遵循指定的绘制顺序。我们的框架自然扩展到笔刷风格控制和自回归生成,支持艺术应用。

英文摘要

Sketching is inherently sequential: strokes are drawn progressively to explore and refine ideas. Yet most generative approaches treat sketches as static images, ignoring the temporal process underlying creative exploration. Modeling this sequential structure remains challenging: prior methods either rely on large-scale human-drawn datasets with limited diversity, or use large language models (LLMs) to produce drawing instructions, often at the cost of visual fidelity. We present VideoSketcher, a method for generating high-quality sketching processes by adapting pretrained text-to-video diffusion models to the sparse, continuous nature of sketch formation. Our key insight is that LLMs and video diffusion models offer complementary strengths: LLMs act as semantic planners that decompose concepts into step-by-step instructions, while video diffusion models serve as powerful "renderers" that translate them into temporally coherent sketch sequences. We introduce a two-stage fine-tuning strategy that decouples temporal structure from visual appearance: stroke ordering is learned from synthetic shape compositions, while style is distilled from as few as seven hand-drawn examples. Despite minimal supervision, our method can generate diverse, high-quality sequential sketches that faithfully follow specified drawing orders. Our framework naturally extends to brush style control and autoregressive generation, supporting artistic applications.

2603.12252 2026-06-19 cs.CV cs.CL 版本更新

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

EndoCoT:扩散模型中的内生思维链推理扩展

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出EndoCoT框架,通过迭代思维引导模块激活MLLM的推理潜力,并利用终端思维接地模块确保推理轨迹与文本监督对齐,使DiT逐步执行复杂任务,在多个基准上平均准确率达92.1%。

Comments 23 pages, 18 figures. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这种范式存在两个关键限制:(i)MLLM文本编码器表现出不足的推理深度。单步编码无法激活思维链过程,而这对MLLM为复杂任务提供准确指导至关重要。(ii)在解码过程中,指导保持不变。即使有正确的MLLM编码,解码过程中的不变指导也阻止了DiT逐步将复杂指令分解为可执行的去噪步骤。为此,我们提出了内生思维链(EndoCoT),一种新颖的框架,首先通过迭代思维引导模块迭代细化潜在思维状态来激活MLLM的推理潜力,然后将这些状态桥接到DiT的去噪过程。其次,应用终端思维接地模块,通过将最终状态与真实答案对齐,确保推理轨迹保持与文本监督的接地。通过这两个组件,MLLM文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在多个基准(如Maze、TSP、VSP和Sudoku)上的广泛评估实现了平均准确率92.1%,比最强基线高出8.3个百分点。代码和数据集公开于 https://internlm.github.io/EndoCoT/。

英文摘要

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2603.29924 2026-06-19 cs.CV 版本更新

Abstraction in Style: Beyond Texture and Color

风格中的抽象:超越纹理与色彩

Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang

发表机构 * Shenzhen University(深圳大学) Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE)(视觉计算研究中心(VCC),计算机科学与软件工程学院) Peking University(北京大学)

AI总结 提出Abstraction in Style (AiS)框架,将结构抽象与视觉风格分离,通过中间抽象代理实现几何保真度放松,从而支持更广泛的非真实感风格迁移。

Comments SIGGRAPH 2026

AI中文摘要

艺术风格通常嵌入超越表面外观的抽象,涉及对结构的有意重新诠释,而不仅仅是纹理或色彩的变化。传统的风格迁移方法通常保留输入几何结构,因此难以捕捉这种更深层次的抽象行为,尤其是对于插画和非真实感风格。在这项工作中,我们引入了Abstraction in Style (AiS),一个将结构抽象与视觉风格化分离的生成框架。给定目标图像和少量风格样本,AiS首先推导出一个中间抽象代理,该代理根据风格所展现的抽象逻辑重新诠释目标的结构。代理捕捉语义结构,同时放松几何保真度,使得后续的风格化能够在抽象表示而非原始图像上进行操作。在第二阶段,渲染抽象代理以产生最终风格化输出,保持与参考风格的视觉一致性。两个阶段都使用共享的图像空间类比实现,使得变换可以从视觉样本中学习,无需显式的几何监督。通过将抽象与外观解耦,并将抽象视为显式、可迁移的过程,AiS支持更广泛的风格变换,提高了可控性,并实现了更具表现力的风格化。

英文摘要

Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

2605.31158 2026-06-19 cs.CV cs.LG 版本更新

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Light Interaction:交互式视频世界模型的免训练推理加速

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

发表机构 * Zhejiang University(浙江大学) NVIDIA

AI总结 针对交互式视频世界模型推理成本高的问题,提出免训练加速框架Light Interaction,通过自适应上下文管理、去噪缓存加速和3D块稀疏注意力实现最高2.59倍加速。

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

详情
AI中文摘要

交互式视频世界模型根据用户控制的相机运动逐块生成视频,支持实时游戏模拟、虚拟场景导航和具身AI训练等应用。然而,由于上下文记忆增长、二次注意力复杂度和重复去噪步骤,扩展到长交互轨迹的成本过高。我们提出Light Interaction,一种用于交互式视频世界模型的免训练推理加速框架。我们的关键洞察是,交互自然支持轨迹依赖的自适应计算:在探索新区域时可丢弃检索到的空间记忆,根据局部潜在动态调整时间上下文,当相机重新访问熟悉区域时可重用早期步骤的模型输出。基于此洞察,Light Interaction结合了自适应上下文管理、去噪缓存加速以及硬件-软件协同设计的3D块稀疏注意力(融合Triton内核)。在HY-WorldPlay和Matrix-Game-3.0上的评估表明,Light Interaction在无需模型重训练的情况下实现了最高2.59倍加速,同时保持有竞争力的视觉质量。

英文摘要

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2606.15015 2026-06-19 cs.CV cs.AI 版本更新

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
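The energy-and-dissipation formulation described above can be illustrated with a toy example. The following is a minimal sketch, not the NEXUS implementation: it assumes a single 1D point mass, replaces the learned energy with a hand-written spring-plus-gravity energy, uses a quadratic Rayleigh dissipation function, and differentiates both numerically. All function names and constants (`energy`, `dissipation`, `rollout`, `k`, `c`) are hypothetical.

```python
def energy(x, k=10.0, m=1.0, g=9.81):
    """Conservative part: additive elastic + gravitational energy (toy, hand-written)."""
    return 0.5 * k * x * x + m * g * x


def dissipation(v, c=0.5):
    """Rayleigh-style dissipation function R(v) = 0.5 * c * v^2 (toy, hand-written)."""
    return 0.5 * c * v * v


def derivative(f, z, eps=1e-6):
    """Central finite difference; a learned model would use autodiff instead."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


def force(x, v):
    """Total force: F = -dE/dx - dR/dv."""
    return -derivative(energy, x) - derivative(dissipation, v)


def rollout(x, v, dt, steps, substeps=4, m=1.0):
    """Semi-implicit (symplectic) Euler: update velocity first, then position."""
    h = dt / substeps
    traj = [x]
    for _ in range(steps):
        for _ in range(substeps):
            v = v + h * force(x, v) / m   # velocity from the current force
            x = x + h * v                 # position from the updated velocity
        traj.append(x)
    return traj
```

With damping present, the rollout settles at the static equilibrium where the spring force balances gravity (x = -m·g/k for the constants assumed here), which is a quick sanity check that the conservative and non-conservative terms compose as intended.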

2601.03112 2026-06-19 eess.IV cs.CV 版本更新

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

DiT-JSCC:基于扩散变换器与语义表示的深度JSCC再思考

Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Jiao Tong University(上海交通大学) University of Shanghai for Science and Technology(上海理工大学)

AI总结 提出DiT-JSCC框架,联合学习语义优先表示编码器和扩散变换器生成解码器,通过粗细粒度条件解码和基于Kolmogorov复杂度的自适应带宽分配,在极端信道条件下提升语义一致性与传输效率。

Comments 14 pages, 14 figures, 2 tables

详情
AI中文摘要

生成式联合源信道编码(GJSCC)已成为一种新的深度JSCC范式,用于在极端无线信道条件(如超低带宽和低信噪比)下实现高保真和鲁棒的图像传输。近期研究通常采用扩散模型作为生成解码器,但经常产生视觉上逼真但语义一致性有限的结果。这种局限性源于面向重建的JSCC编码器与生成解码器之间的根本性不匹配,因为前者缺乏显式的语义判别能力,无法提供可靠的条件线索。在本文中,我们提出DiT-JSCC,一种新颖的GJSCC骨干网络,能够联合学习语义优先的表示编码器和基于扩散变换器(DiT)的生成解码器,我们的开源项目旨在促进GJSCC的未来研究。具体来说,我们设计了一个语义-细节双分支编码器,与从粗到细的条件DiT解码器自然对齐,在极端信道条件下优先考虑语义一致性。此外,受Kolmogorov复杂度启发,引入了一种无需训练的自适应带宽分配策略,以进一步提高传输效率,从而真正重新定义生成解码时代的信息价值概念。大量实验表明,DiT-JSCC在语义一致性和视觉质量上始终优于现有JSCC方法,尤其是在极端条件下。

英文摘要

Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that jointly learns a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder. Our open-source project aims to promote future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve transmission efficiency, thereby redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.

7. 3D视觉、点云与空间智能 15 篇

2606.19733 2026-06-19 cs.CV cs.AI 新提交

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

QueryGaussian: 可扩展且无需训练的开词汇3D实例检索

Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) State Key Laboratory of Communication Content Cognition(通信内容认知国家重点实验室) Peng Cheng Laboratory(鹏城实验室)

AI总结 提出QueryGaussian,一种无需训练的开词汇3D实例检索框架,通过实例级查询机制解耦语义与几何,结合2D视觉模型和时序融合模块,在保持精度的同时降低70%以上GPU内存并加速180倍,支持城市级场景。

Comments 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

详情
AI中文摘要

通过自然语言提示从大规模场景中高效检索特定3D实例仍然是多媒体分析中的一个严峻挑战。现有方法主要遵循“场景级嵌入”范式,需要将高维语义特征蒸馏到每个3D基元中。这种策略存在一个根本性的架构瓶颈:内存和计算成本随场景复杂度线性增长,不可避免地导致城市级环境中的内存溢出(OOM)故障。为了解决这一障碍,我们提出了QueryGaussian,一个无需训练的框架,用于快速且可扩展的开词汇3D实例检索。与整体语义蒸馏不同,QueryGaussian采用实例级查询机制,将语义理解与几何表示解耦。具体来说,我们利用预训练的2D视觉模型解释用户提示,并通过并发最大权重关联策略将分割掩码提升到3D,确保语义-视觉一致性。为了缓解投影歧义,我们引入了一个具有多阶段自适应密度聚类的时间融合模块。实验结果表明,QueryGaussian不仅匹配了最先进方法的准确性,还实现了决定性的效率飞跃,将GPU内存使用减少超过70%,并将推理速度提升180倍。关键的是,QueryGaussian能够在包含数千万个高斯的城市级场景中,使用消费级硬件实现快速的实例检索。

英文摘要

Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

2606.19776 2026-06-19 cs.CV 新提交

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Occ-VLM: 面向室内场景理解的占用接地视觉语言模型

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

发表机构 * School of Electronic Science and Engineering, Nanjing University(南京大学电子科学与工程学院)

AI总结 提出Occ-VLM,仅用姿态RGB图像和单一2D视觉编码器,通过重建3D占用作为几何先验,实现统一的3D场景理解,在占用预测、3D VQA和密集描述任务上达到领先水平。

详情
AI中文摘要

近期,视觉语言模型(VLM)在3D场景理解方面取得了显著进展,推动了具身智能和机器人视觉等应用的发展。然而,现有方法通常要么直接依赖显式的3D输入(如点云或RGB-D序列),要么引入额外的3D几何编码器从2D图像中推导出3D感知的视觉标记。这种设计在结构上将3D几何感知与通过视觉语言预训练学到的丰富2D语义解耦,阻碍了统一3D视觉语言表示的发展。在这项工作中,我们提出了Occ-VLM,一个仅基于姿态RGB图像并采用单一2D视觉编码器的3D场景理解新框架。具体而言,Occ-VLM重建3D场景占用作为辅助几何先验,用于将前景2D标记与3D空间进行空间关联。然后,这些标记由大型语言模型(LLM)解码,实现统一的场景理解。大量实验表明,Occ-VLM实现了准确的几何感知和稳健的视觉语言推理:在多视角占用预测上达到最先进性能,同时在3D视觉问答(VQA)和3D密集描述基准上与使用3D输入的VLM表现相当。

英文摘要

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.19805 2026-06-19 cs.CV cs.AI 新提交

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

ParaScale: 通过规范不变视差数进行尺度校准的相机运动迁移

Zijie Meng

发表机构 * Peking University(北京大学)

AI总结 提出ParaScale模块,通过规范不变的视差数Pi实现尺度忠实相机运动迁移,无需重新训练,在四个数量级尺度上降低视差一致性误差3倍以上。

Comments Accepted by SCA 2026 (poster)

详情
AI中文摘要

将参考视频的相机运动迁移到新生成的视频中,可以让创作者重复使用电影级运镜。然而,参考视频和目标视频往往处于不兼容的尺度——例如跨越银河系的扫视与桌面上的轻推——直接复用恢复的轨迹会导致运动要么不可察觉,要么剧烈夸张。我们将此归结为一个几何事实:平移引起的图像运动与||T||/Z成比例,因此单目轨迹仅在深度尺度规范下才有意义。我们将此提炼为视差数Pi = ||Delta T|| / Zbar,这是一个无量纲、规范不变的描述符,用于衡量相机运动的感知强度,并证明它是尺度忠实迁移必须保持的量,而非原始轨迹。ParaScale是一个即插即用模块,它从任何参考视频中读取Pi,并针对目标场景的深度逐帧重新实现它,保持旋转不变。它位于姿态提取和姿态注入之间,无需重新训练,可插入任何姿态条件生成器。我们进一步引入了视差一致性误差(PCE),这是一种尺度对称的度量,与相似性对齐的TransErr不同,它能暴露场景尺度不匹配。在跨越四个数量级的尺度范围和多个骨干网络上,ParaScale将实现的视差保持在恒等线上,并将PCE比未校准的迁移降低3倍以上,且不损失视觉保真度。

英文摘要

Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as $\|T\|/Z$, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number $\Pi = \|\Delta T\| / \bar{Z}$, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads $\Pi$ off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
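The idea of preserving the Parallax Number rather than the raw trajectory can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the ParaScale module: it assumes camera positions are given per frame, that a single representative (mean) scene depth is available for every frame of both videos, and that only the magnitude of each translation step is rescaled while its direction is kept. The function names are hypothetical.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def parallax_numbers(translations, mean_depths):
    """Per-step Pi_k = ||T_k - T_{k-1}|| / Zbar_k for one trajectory."""
    pis = []
    for k in range(1, len(translations)):
        delta = [a - b for a, b in zip(translations[k], translations[k - 1])]
        pis.append(_norm(delta) / mean_depths[k])
    return pis


def retarget_translations(ref_translations, ref_depths, tgt_depths,
                          start=(0.0, 0.0, 0.0)):
    """Rebuild a target trajectory whose per-step Pi matches the reference.

    Each reference step is rescaled by Zbar_target / Zbar_reference, so that
    ||step|| / Zbar is unchanged; step direction (and rotation, not modeled
    here) is left untouched.
    """
    out = [list(start)]
    for k in range(1, len(ref_translations)):
        delta = [a - b for a, b in zip(ref_translations[k], ref_translations[k - 1])]
        scale = tgt_depths[k] / ref_depths[k]
        out.append([p + scale * d for p, d in zip(out[-1], delta)])
    return out
```

For instance, a reference that moves 1 unit per frame in a scene 10 units deep maps to 0.1 units per frame in a scene 1 unit deep: the step lengths differ by a factor of ten, but the per-step parallax numbers of the two trajectories coincide.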

2606.20103 2026-06-19 cs.CV 新提交

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

3D高斯溅射中保持几何结构的LiDAR-相机外参标定

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

发表机构 * Kyung Hee University(庆熙大学)

AI总结 针对LiDAR-相机标定中跨模态特征稀缺问题,提出通过多视图LiDAR深度监督和阻止光度梯度更新高斯空间参数来保持3DGS代理的度量几何,提升标定精度。

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

详情
AI中文摘要

精确的LiDAR-相机标定对于鲁棒的多模态感知至关重要。无目标方法避免了手动设置,但仍受限于跨模态判别特征的稀缺性。最近的方法通过在可微模型中重建场景,通过密集光度监督实现外参优化。其中,3D高斯溅射(3DGS)被广泛用作几何代理,在单一可微框架内桥接LiDAR和相机。然而,由于3DGS最初是为新视图合成设计的,现有方法倾向于优先考虑渲染质量,导致代理几何偏离真实的LiDAR结构。我们提出了一种框架,通过聚合多视图LiDAR观测进行密集深度监督,并阻止光度梯度更新高斯空间参数,从而保持高斯代理的度量几何。我们在公开驾驶数据集上验证了该方法,在标定精度上持续优于现有无目标方法。

英文摘要

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

2606.20131 2026-06-19 cs.CV cs.GR 新提交

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

TriFlow: 通过最近顶点向量场生成类艺术家3D网格拓扑

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) AUDI AG(奥迪股份公司) University of Virginia(弗吉尼亚大学)

AI总结 提出TriFlow,一种基于最近顶点向量场(NVF)的生成方法,通过流匹配模型合成NVF并引导拓扑感知的网格简化,直接从输入几何条件生成紧凑且具有类艺术家拓扑的3D网格。

详情
AI中文摘要

我们提出了TriFlow,一种新的生成方法,能够直接从输入几何条件(如符号距离场)生成具有类艺术家三角形拓扑的紧凑3D网格。我们的关键见解是将网格拓扑表示为在表面上定义的最近顶点向量场(NVF),其中每个点编码其在局部重心坐标系中与最近三角形顶点的关联。我们训练一个潜在流匹配模型来合成该场,从而实现基于输入几何条件的拓扑生成。为了提取连贯的网格,我们使用生成的NVF对表面区域进行聚类,并引导具有拓扑感知优化的约束二次误差度量(QEM)网格简化。这产生了与输入几何紧密匹配且具有结构化、类艺术家连接性的输出网格。实验表明,与最先进的基于学习方法相比,TriFlow实现了更强的泛化能力和显著提高的拓扑质量,同时Chamfer距离降低了90%,速度提升了8倍。

英文摘要

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
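To make the nearest-vertex vector field concrete, here is a minimal sketch under simplifying assumptions. It is not the TriFlow pipeline: it computes the field directly from a known set of vertices instead of generating it with a flow-matching model, it uses plain Euclidean offsets in world coordinates instead of the local barycentric frame mentioned in the abstract, and it covers only the clustering step, not the QEM simplification. Both function names are hypothetical.

```python
def nearest_vertex_vectors(points, vertices):
    """For each surface sample, return the offset vector to its nearest vertex."""
    field = []
    for p in points:
        best = min(vertices,
                   key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
        field.append(tuple(b - a for a, b in zip(p, best)))
    return field


def cluster_by_target(points, field, ndigits=6):
    """Group sample indices whose point + offset lands on the same location."""
    clusters = {}
    for idx, (p, f) in enumerate(zip(points, field)):
        target = tuple(round(a + b, ndigits) for a, b in zip(p, f))
        clusters.setdefault(target, []).append(idx)
    return clusters
```

Samples that point to the same location form one surface region, and each region corresponds to one vertex of the output mesh; in this toy version the recovered cluster centers are simply the input vertices.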

2606.20531 2026-06-19 cs.CV 新提交

VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

VisDom: 具有可见域约束的稀疏新视角合成

Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung, Daniel Cremers

发表机构 * TU Munich(慕尼黑工业大学) MCML(慕尼黑机器学习中心)

AI总结 提出VisDom,一种无学习的几何约束,通过最小多视角可见性要求增强视觉外壳重建,作为稀疏新视角合成中的空间先验,集成到NeRF和GS管线中,从四张输入图像实现高质量重建。

详情
AI中文摘要

稀疏新视角合成(NVS)由于从少量输入视角恢复3D几何的歧义性仍然具有挑战性。虽然基于NeRF和高斯泼溅(GS)的方法在密集监督下表现良好,但在稀疏设置中它们往往过拟合,产生漂浮伪影和不一致的几何。轮廓一致性通常用作正则化器,但还不够,因为轮廓一致区域可能超出真实物体几何。我们引入VisDom,一种无学习的几何约束,通过强制执行最小多视角可见性要求来增强经典的基于雕刻的视觉外壳重建。具体地,我们将可见域定义为至少被$K$个视角观察到的3D空间子集,并将其用作标准基于轮廓重建之上的额外过滤标准。这在稀疏视角设置中提供了更强的空间先验。我们通过限制体积采样和指导优化过程中的高斯放置,将VisDom集成到隐式(NeRF)和显式(GS)管线中。在三个具有挑战性的数据集上的实验表明,稀疏视角NVS获得了一致的改进,仅用四张输入图像即可实现高质量的以物体为中心的重建。我们的方法是领域无关的,仅需要轮廓,并且不引入学习参数,使其成为现有方法的简单补充。在GaussianObject之上应用VisDom进一步提高了在Omni3D和MipNeRF360上的性能,同时以低22倍的训练成本达到或超越其性能。

英文摘要

Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least $K$ views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22 $\times$ lower training cost.
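The visible-domain filter lends itself to a short sketch. The code below is an assumption-laden illustration rather than the authors' implementation: it assumes the scene is discretized into voxels, that for every view we already know whether each voxel falls inside the image frustum and whether it projects inside the silhouette, and that classical carving only removes voxels contradicted by a view that actually observes them. The function name and input layout are hypothetical.

```python
def visible_domain(in_silhouette, in_view, k_min):
    """Return one keep-flag per voxel.

    in_view[v][i]       -- voxel i is observed by (inside the frustum of) view v
    in_silhouette[v][i] -- voxel i projects inside the silhouette of view v
    k_min               -- minimum number of observing views (K in the abstract)
    """
    num_views = len(in_view)
    num_voxels = len(in_view[0])
    keep = []
    for i in range(num_voxels):
        seen = [v for v in range(num_views) if in_view[v][i]]
        # Classical carving: consistent with every silhouette that sees the voxel.
        hull_ok = all(in_silhouette[v][i] for v in seen)
        # Extra filter: the voxel must be observed by at least k_min views.
        keep.append(hull_ok and len(seen) >= k_min)
    return keep
```

Setting `k_min=0` reduces the filter to plain carving, which keeps voxels that no view observes; raising `k_min` removes exactly those weakly observed regions, which is the stronger spatial prior the abstract refers to.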

2606.20556 2026-06-19 cs.CV 新提交

Thinking in Boxes: 3D Editing in Real Images Made Easy

Thinking in Boxes: 真实图像中的3D编辑变得简单

Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D. A. Forsyth, Anand Bhattad

发表机构 * Indian Institute of Science(印度科学研究所) Apple(苹果公司) UIUC(伊利诺伊大学厄巴纳-香槟分校) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出使用3D盒子作为结构化规范,通过用户提供输入和输出盒子来精确控制真实图像中的平移、旋转、缩放和视角变化,同时保持场景和物体身份,恢复未见的物体区域。

Comments Project Page: https://thinking-in-boxes.github.io/

详情
AI中文摘要

文本和2D条件接口在图像编辑中提供对空间变换的弱、模糊控制——特别是在大物体运动和相机变化下。先前的工作使用了如盒子这样的3D基元,但仅作为松散的调节信号指示近似物体位置,而非指定变换。我们则使用3D盒子作为结构化规范:用户提供编辑的输入和输出盒子,将编辑视为一个适定的几何问题。这种“在盒子中思考”的界面,其中每个盒子面都带有颜色编码以传达3D方向,提供了对真实图像中平移、旋转、缩放和视角变化的精确控制,同时保留场景和物体身份,并恢复之前未见的物体区域。为了将变换与场景外观联系起来,我们引入了一个深度对齐的平面地板作为全局参考框架,并用深度感知线索进行着色。基于这种结构,图像生成器在大变换下产生一致的结果。该系统在两个阶段训练——在合成多物体场景和来自Objectron的小型真实世界视频集上——能够泛化到复杂的、野外真实图像。我们的方法直接作用于真实照片,并在大型3D编辑上显著优于最近的最先进方法。

英文摘要

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing -- particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages -- on synthetic multi-object scenes and a small set of real-world videos from Objectron -- the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 交叉投稿

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

AI总结 提出3D-DLP模型,通过自监督学习将场景级RGB-D或体素观测分解为3D潜在粒子,每个粒子编码解耦属性,实现可解释的逐粒子分割图,并支持场景操控和下游机器人操作。

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19874 2026-06-19 cs.RO cs.CV 交叉投稿

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

MMD-SLAM:结构增强的多元高斯分布引导视觉SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院合肥物质科学研究院) University of Science and Technology of China(中国科学技术大学) Aarhus University(奥胡斯大学) University of Tokyo(东京大学) Beijing University of Chemical Technology(北京化工大学) North China Electric Power University(华北电力大学)

AI总结 提出MMD-SLAM,利用亚特兰大世界假设引导多元高斯表示,通过点线融合、主导方向编码和高斯进化策略,提升视觉SLAM的跟踪精度与建图质量。

Comments ICRA 2026

详情
AI中文摘要

3D高斯泼溅(3DGS)显著提升了新视角合成和高保真场景重建,扩展了基于3DGS的视觉同步定位与建图(SLAM)方法的潜力。然而,大多数现有系统未能充分利用底层结构信息,这限制了渲染质量并常常导致地图不一致。为了解决这些限制,我们提出了MMD-SLAM,一个结构增强的视觉SLAM框架,利用亚特兰大世界(AW)假设来引导多元高斯表示以实现逼真的建图。首先,我们引入了一种点线融合策略用于位姿优化,其中3D线段被纳入以提高跟踪鲁棒性并为建图提供额外约束。其次,我们设计了一种具有主导方向的多元高斯表示,显式编码来自AW假设的结构先验。最后,我们提出了一种高斯进化策略,该策略适应场景几何并将结构线索融入全局优化。大量实验表明,这些创新使MMD-SLAM在跟踪精度和建图质量方面均达到了最先进的性能。例如,与MonoGS相比,我们的方法在ScanNet上实现了48.56%的ATE RMSE降低,在Replica上实现了5.71%的PSNR提升。

英文摘要

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.

2508.15228 2026-06-19 cs.CV 版本更新

Collaborative Multi-Modal Coding for High-Quality 3D Generation

协作多模态编码用于高质量3D生成

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University, Singapore(南洋理工大学S实验室) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出TriMM,首个前馈式3D原生生成模型,通过协作多模态编码融合RGB、RGBD和点云特征,结合辅助2D/3D监督和三平面潜在扩散模型,实现高质量3D资产生成。

详情
AI中文摘要

3D内容本质上具有多模态特性,可投影到不同模态(如RGB图像、RGBD和点云)。每种模态在3D资产建模中表现出独特优势:RGB图像包含生动的3D纹理,而点云定义精细的3D几何。然而,现有大多数3D原生生成架构要么主要在单模态范式下运行——从而忽略了多模态数据的互补优势,要么局限于3D结构,从而限制了可用训练数据集的范围。为了全面利用多模态进行3D建模,我们提出了TriMM,这是第一个从基本多模态(如RGB、RGBD和点云)学习的前馈式3D原生生成模型。具体来说,1) TriMM首先引入协作多模态编码,该编码在保留各模态独特表示优势的同时整合模态特定特征。2) 此外,引入辅助2D和3D监督以提高多模态编码的鲁棒性和性能。3) 基于嵌入的多模态编码,TriMM采用三平面潜在扩散模型生成更高质量的3D资产,增强了纹理和几何细节。在多个知名数据集上的大量实验表明,TriMM通过有效利用多模态,尽管使用少量训练数据,仍能达到与在大规模数据集上训练的模型相竞争的性能。此外,我们在最近的RGB-D数据集上进行了额外实验,验证了将其他多模态数据集纳入3D生成的可行性。

英文摘要

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2512.00850 2026-06-19 cs.CV 版本更新

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Smol-GS: 抽象3D高斯溅射的紧凑表示

Haishan Wang, Mohammad Hassan Vali, Arno Solin

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Aalto University(阿尔托大学)

AI总结 提出Smol-GS方法,通过八叉树位置编码和熵压缩学习高效溅射特征,实现3D高斯溅射的紧凑表示,在保持渲染质量的同时大幅降低存储。

详情
AI中文摘要

我们提出Smol-GS,一种学习3D高斯溅射(3DGS)紧凑表示的新方法。我们的方法学习高效的逐溅射特征来建模3D空间,这些特征捕获抽象线索,包括颜色、不透明度、变换和材质属性。我们提出八叉树导出的位置编码,显式建模空间局部性并增强表示效率。我们进一步应用基于熵的压缩来利用特征冗余,并使用递归体素层次压缩溅射坐标。这种设计在保持表示灵活性的同时,实现了数量级的存储减少。Smol-GS在标准基准测试上以高渲染质量实现了最先进的压缩性能。

英文摘要

We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient splat-wise features to model 3D space, which capture abstracted cues, including color, opacity, transformation, and material properties. We propose octree-derived positional encoding, which explicitly models spatial locality and enhances representation efficiency. We further apply entropy-based compression to exploit feature redundancy and compress splat coordinates using a recursive voxel hierarchy. This design enables orders-of-magnitude reduction in storage while preserving representation flexibility. Smol-GS achieves state-of-the-art compression performance on standard benchmarks while maintaining high rendering quality.

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Bosch Research(博世研究院) University of Haifa(海法大学)

AI总结 提出潜在高斯泼溅(LaGS)方法,通过特征高斯体作为动态关键点实现多视图特征聚合,用于4D全景占据跟踪,在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

详情
AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而,现有方法通常只解决部分问题:它们要么通过边界框提供粗略的几何跟踪,要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中,我们提出了潜在高斯泼溅(LaGS)用于4D全景占据跟踪(4D-POT)。我们重新审视底层表示,将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点,在泼溅到体素网格进行解码之前,能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互,这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节,进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型:https://lags.cs.uni-freiburg.de/

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2606.15908 2026-06-19 cs.CV 版本更新

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉:基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR(谷歌XR) University of Science and Technology of China (USTC)(中国科学技术大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出无需模板和标记的多视角系统,通过跨视角几何与时间线索的Transformer初始化,结合物理感知高斯优化,实现鲁棒且无伪影的4D手-物体交互重建。

Comments Project page: https://hostpg.github.io/

详情
AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互(HOI)数据的需求日益增长,但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果,但它们对手和物体姿态的初始估计高度敏感。然而,从图像中估计这些姿态具有挑战性,尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统,用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体,无需任何模板或标记。我们的系统包含两个主要创新组件:(1)一个多视角前馈Transformer模型,聚合跨视角几何和时间线索,为姿态和密集物体几何提供可靠的、度量一致的初始化;(2)一个手-物体物理感知高斯优化框架,用于细化初始估计,集成四面体约束、碰撞细化和外观分解,以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明,我们的流程实现了高度鲁棒、无伪影的重建,为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2606.15966 2026-06-19 cs.CV cs.GR 版本更新

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

AI总结 提出面向有限视角(约20个)的端到端手部动态捕捉与配准管线,通过无掩膜神经方法和物理启发框架解决几何歧义与自接触变形难题,在12000+序列上验证了高保真重建与配准。

详情
AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础,但在实际多视角系统中仍具挑战性,这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线,专为视图高效设置(约20个视角)设计。我们通过两项主要创新应对关键挑战。首先,为克服重建困难(如视角重叠有限和背景杂乱),我们的无掩膜神经方法通过场景参数化和场景特定密度正则化,从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次,针对配准挑战(如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果),我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数,将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下,捕捉精细表面变形,确保在严重关节运动和自接触下的合理结果,并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性,并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面:https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://vephand.github.io/.

2503.01425 2026-06-19 cs.GR cs.CV 版本更新

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

MeshPad: 交互式草图条件艺术家风格网格生成与编辑

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑技术大学) AUDI AG(奥迪股份公司)

AI总结 提出MeshPad,一种基于草图输入的交互式3D网格生成与编辑方法,通过分解为网格区域的删除和添加操作,结合Transformer和顶点对齐推测策略,实现快速迭代编辑,在Chamfer距离上提升22%以上质量,并获90%用户偏好。

Comments Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

详情
AI中文摘要

我们介绍了MeshPad,一种从草图输入生成3D网格的生成方法。基于最近在艺术家风格三角形网格生成方面的进展,我们的方法解决了交互式网格创建的需求。为此,我们专注于通过将编辑分解为网格区域的“删除”和随后新网格几何的“添加”来实现一致编辑。这两个操作都由用户对草图图像的简单编辑触发,促进了迭代内容创建过程,并能够构建复杂的3D网格。我们的方法基于三角形序列网格表示,利用大型Transformer模型进行网格三角形的添加和删除。为了交互式地执行编辑,我们在加法网格生成器之上引入了一种顶点对齐的推测预测策略。该推测器预测对应于一个顶点的多个输出标记,从而显著降低推理的计算成本并加速编辑过程,使得每个编辑步骤只需几秒钟即可完成。综合实验表明,MeshPad优于最先进的草图条件网格生成方法,在Chamfer距离上实现了超过22%的网格质量改进,并且在感知评估中被90%的参与者所偏好。

英文摘要

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
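
The abstract above reports mesh quality in terms of Chamfer distance without stating the exact variant. As a rough illustration only, the following sketch assumes the common symmetric form (mean squared nearest-neighbor distance in both directions) on points sampled from two meshes; the function name and the brute-force search are illustrative choices, not the authors' evaluation code.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: mean squared nearest-neighbor distance, summed over
    both directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(
            min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
            for s in src
        ) / len(src)

    return one_way(a, b) + one_way(b, a)


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
score = chamfer_distance(generated, reference)
```

A lower value means the two point sets lie closer together; identical sets give zero.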

8. 医学影像与生物视觉 30 篇

2606.19460 2026-06-19 cs.CV cs.AI cs.LG 新提交

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

使用整流流变换器扩展胸部X光片的生成式基础模型

Fabio De Sousa Ribeiro, Emma A. M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

发表机构 * Imperial College London(帝国理工学院) Causality in Healthcare AI Hub(医疗AI因果关系中心) University of Edinburgh(爱丁堡大学) Cleveland Clinic London(克利夫兰诊所伦敦) Department of Perioperative Medicine, CHU Clermont-Ferrand(克莱蒙费朗大学医院围手术期医学科) Department of Medicine, Massachusetts General Hospital(麻省总医院医学部) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所)

AI总结 提出首个十亿参数级胸部X光片生成基础模型,通过整流流变换器实现高保真可控合成,显著提升合成图像与真实图像的不可区分性。

Comments Project page: https://RadiT-project.github.io

详情
AI中文摘要

我们引入了首个从零开始在十亿参数规模上训练的胸部X光片合成生成基础模型。现有的放射学AI模型通常在不同患者亚群、机构和采集设置下泛化能力差,导致实际临床效用有限。可控、高保真的胸部X光片合成是多样化临床数据集和评估诊断模型鲁棒性的有前景途径。因此,我们提出了迄今为止最大的胸部X光片专用生成基础模型,拥有超过13亿参数,在包含120万张X光片和临床专家指导元数据的精选异质数据集上训练了1.6万亿个token。我们的模型支持跨多个人口统计亚组、采集视图和十二种病理的可控X光片生成和编辑。此外,我们显著推进了X光片合成保真度的最新技术,生成的图像对临床专家而言与真实X光片无法区分。

英文摘要

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

2606.19804 2026-06-19 cs.CV 新提交

HypOProto: Hyperbolic Ordinal Prototypes for Left Ventricular Filling Pressure Classification

HypOProto: 用于左心室充盈压分类的双曲序数原型

Victoria Wu, Nima Hashemi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang

发表机构 * The University of British Columbia(不列颠哥伦比亚大学) Vancouver General Hospital(温哥华综合医院)

AI总结 提出HypOProto框架,利用双曲空间中的序数原型对左心室充盈压进行分类,通过冻结的可解释基础模型实现高精度与临床可解释性。

详情
AI中文摘要

超声心动图(echo)是一种广泛用于评估心脏功能的成像模态,左心室充盈压(LVFP)是心力衰竭等疾病的关键生理标志物。将LVFP分为正常和升高类别的标准依赖于多普勒衍生的$E/e'$比值,该比值依赖于操作者,且在资源有限的环境中通常不可用,这促使了直接从B模式超声推断LVFP的方法。现有的深度学习方法实现了高性能,但大多是黑盒模型,限制了临床可解释性。我们提出了HypOProto,一个基于双曲序数原型的可解释LVFP分类框架,使用冻结的可解释基础模型骨干。HypOProto沿着生理$E/e'$尺度排列原型,将边界情况放置在双曲面根附近,其中小的角度差异区分相似情况,而正常和升高情况占据向外位置,反映诊断确定性的增加。这种双曲几何编码了临床上有意义的序数关系,并提高了可解释性。我们还引入了一种新的双曲原型角度分离(HyperPAS)损失,强制在双曲空间中实现类间原型分离。HypOProto在保持透明性的同时实现了最先进的性能,并在可视化中突出显示临床相关区域。这项工作代表了超声中LVFP分类的第一个基于原型的框架。我们的代码可在以下网址找到:https://github.com/DeepRCL/HypOProto。

英文摘要

Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived $E/e'$ ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological $E/e'$ scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at https://github.com/DeepRCL/HypOProto.
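
The geometric intuition in this abstract (the same angular gap matters less near the hyperboloid root than far from it) can be checked numerically. The sketch below assumes the standard unit hyperboloid (Lorentz) model with curvature -1; the helper names `lift`, `polar`, and `hyperbolic_distance` are hypothetical and do not come from the HypOProto code base.

```python
import math


def lift(v):
    """Lift a Euclidean vector v onto the unit hyperboloid: x0 = sqrt(1 + |v|^2)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining coordinate products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def hyperbolic_distance(x, y):
    """Geodesic distance on the unit hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1 to guard against rounding just below the arccosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


def polar(radius, angle):
    """Point at geodesic distance `radius` from the root, in direction `angle` (2D)."""
    return lift([math.sinh(radius) * math.cos(angle),
                 math.sinh(radius) * math.sin(angle)])


root = lift([0.0, 0.0])
# Same angular gap (0.3 rad), measured close to the root and far from it.
near = hyperbolic_distance(polar(0.5, 0.0), polar(0.5, 0.3))
far = hyperbolic_distance(polar(2.0, 0.0), polar(2.0, 0.3))
```

Here `near` comes out much smaller than `far`, which matches the idea that prototypes placed further outward are separated more strongly for a given angle.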

2606.19824 2026-06-19 cs.CV cs.AI 新提交

CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images

CSWinUNETR: 医学图像中薄解剖结构的分割

Junho Moon, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(汉阳大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出CSWinUNETR通用骨干网络,通过交叉形条带自注意力、循环移位、细节增强多尺度自注意力和稀疏控制动态蛇形卷积,解决薄结构分割中的低对比度、断裂和类不平衡问题,在眼科、神经血管和皮肤科基准上超越现有方法。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

准确分割薄而曲折的解剖结构,如视网膜血管、脑血管和面部皱纹,由于低对比度、频繁断裂和严重的类别不平衡仍然具有挑战性。尽管最近的卷积和基于Transformer的模型提高了性能,但它们常常产生碎片化的预测,并且无法恢复细小的分支。我们提出了CSWinUNETR,一个用于2D和3D薄结构分割的通用骨干网络。它采用交叉形条带自注意力来建模长距离主轴上下文,并结合循环移位以增强条带间的信息交换。为了更好地保留细粒度细节,我们进一步引入了一个细节增强的多尺度自注意力模块,该模块从多分辨率表示中聚合上下文特征。此外,我们提出了稀疏控制动态蛇形卷积,它从稀疏预测的控制点重建可靠的密集曲线核,以更好地跟随曲折的几何形状。在眼科、神经血管成像和皮肤科的四个基准上的大量实验表明,CSWinUNETR在没有任务特定后处理或拓扑感知损失的情况下,始终优于最先进的方法。代码可在 https://github.com/labhai/CSWinUNETR 获取。

英文摘要

Accurate segmentation of thin, tortuous anatomical structures, such as retinal vessels, cerebral vasculature, and facial wrinkles, remains challenging due to low contrast, frequent discontinuities, and severe class imbalance. Although recent convolutional and Transformer-based models have improved performance, they often yield fragmented predictions and fail to recover fine branches. We propose CSWinUNETR, a general-purpose backbone for 2D and 3D thin-structure segmentation. It employs cross-shaped stripe self-attention to model long-range principal-axis context and incorporates cyclic shifts to enhance information exchange across stripes. To better preserve fine-grained details, we further introduce a detail-enhanced multi-scale self-attention module that aggregates contextual features from multi-resolution representations. In addition, we propose sparse-control dynamic snake convolution, which reconstructs reliable dense curvilinear kernels from sparsely predicted control points to better follow tortuous geometry. Extensive experiments on four benchmarks across ophthalmology, neurovascular imaging, and dermatology demonstrate that CSWinUNETR consistently outperforms state-of-the-art methods without task-specific post-processing or topology-aware losses. The code is available at https://github.com/labhai/CSWinUNETR.

2606.19838 2026-06-19 cs.CV 新提交

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

OTCHA: 基于最优传输的置信度感知潜在中心对齐用于多视图医学图像分类

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(汉阳大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出OTCHA模块,通过最优传输对齐多视图补丁令牌与共享潜在中心令牌,结合置信度门控和部分匹配,消除无关特征,提升多视图医学图像分类鲁棒性。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

多视图成像(如乳腺X线摄影和胸部X线摄影)是临床实践的标准组成部分。然而,医学图像通常未配准,且包含视图特定的伪影或无关背景线索,这些可能掩盖诊断相关发现。许多现有方法直接融合每个视图的表征,使得此类无关内容污染融合嵌入,并在不同视图配置下降低鲁棒性。我们提出OTCHA,一种基于最优传输(OT)的置信度感知潜在中心令牌对齐模块,在融合前细化补丁令牌以用于多视图分类。OTCHA引入一组跨视图共享的可学习潜在中心令牌。对于每个视图,我们计算补丁令牌与中心令牌之间的OT计划,该计划联合考虑特征相似性和几何结构,并通过令牌条件尘埃箱增强OT公式以实现部分匹配并丢弃无关令牌。所得传输计划提供令牌级匹配置信度,该置信度门控中心介导的消息传递,并加权一种新的基于最优传输的表征对齐损失以稳定细化。在三个多视图医学图像数据集上的实验表明,在不同解剖结构和视图配置下,相比竞争基线方法取得一致改进。我们的代码可在 https://github.com/labhai/OTCHA 获取。

英文摘要

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.
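
To make the idea of partial matching with a dustbin more concrete, here is a minimal entropic OT (Sinkhorn) sketch. It makes several simplifying assumptions that are not taken from the paper: a single shared dustbin column with a constant cost (the abstract describes token-conditional dustbins), uniform patch mass, a fixed fraction of mass reserved for the dustbin, and per-token confidence defined as the share of a patch's mass that reaches real hub tokens. All names are hypothetical.

```python
import math


def sinkhorn_with_dustbin(cost, dustbin_cost, dustbin_mass=0.3, eps=0.1, iters=200):
    """Entropic OT between n patch tokens (rows) and m hub tokens plus one dustbin.

    cost: n x m list of lists of patch-to-hub costs.
    Returns (plan, confidence); plan is n x (m + 1), last column is the dustbin.
    """
    n, m = len(cost), len(cost[0])
    # Assumption: one shared dustbin column with a constant cost.
    C = [list(row) + [dustbin_cost] for row in cost]
    a = [1.0 / n] * n                                      # uniform patch mass
    b = [(1.0 - dustbin_mass) / m] * m + [dustbin_mass]    # hub capacities + dustbin
    K = [[math.exp(-c / eps) for c in row] for row in C]   # Gibbs kernel
    u, v = [1.0] * n, [1.0] * (m + 1)
    for _ in range(iters):
        # Alternate row and column scaling to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m + 1)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m + 1)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m + 1)] for i in range(n)]
    # Confidence: fraction of each patch's transported mass that avoids the dustbin.
    confidence = [1.0 - row[m] / sum(row) for row in plan]
    return plan, confidence


# Patches 0 and 1 each sit close to one hub; patch 2 is far from both.
cost = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
plan, confidence = sinkhorn_with_dustbin(cost, dustbin_cost=0.5)
```

In this toy setup the third patch sends most of its mass to the dustbin and therefore receives the lowest confidence, which is the kind of signal that could gate later message passing.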

2606.19867 2026-06-19 cs.CV cs.AI 新提交

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

PSCT-Net: 通过可微反投影和注意力引导细化实现几何感知的儿科颅骨CT重建

Dong Yeong Kim, Jaewon Choi, Youmin Shin, Jungyu Lee, Myeongseop Kim, Jinwook Choi, Joo Whan Kim, Young-Gon Kim

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University(首尔大学生物工程跨学科项目) Department of Transdisciplinary Medicine, Seoul National University Hospital(首尔大学医院跨学科医学系) Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) Department of Medicine, Seoul National University College of Medicine(首尔大学医学院医学系) Healthcare AI Research Institute, Seoul National University Hospital(首尔大学医院医疗人工智能研究所)

AI总结 提出PSCT-Net,利用可微反投影建立空间先验,结合注意力引导投影和双向Mamba模块,从稀疏双平面X射线重建3D CT,缓解深度模糊并改善骨边界。

Comments 11 pages, 5 figures

详情
AI中文摘要

计算机断层扫描(CT)对于诊断儿科颅面异常至关重要,但对发育中的解剖结构存在辐射风险。从稀疏双平面X射线重建3D CT提供了一种低剂量替代方案,但问题严重不适定。现有方法采用几何无关的特征提升,将2D特征天真地投影到3D中,缺乏显式空间建模,导致深度模糊和骨边界退化。我们提出PSCT-Net,一种具有可微反投影的几何感知框架。可微反投影建立了空间保真的体积先验,缓解了深度模糊。然后,注意力引导投影(AGP-3D)模块学习2D区域与3D位置之间的非线性体素级对应关系。双向Mamba(BiM-3D)模块以线性复杂度捕获长程体积依赖关系。我们进一步整理了一个私有的机构儿科颅骨CT数据集PedSkull-CT,包含正常和病理病例用于内部评估,弥补了以成人中心和躯干为主的数据集的空白。

英文摘要

Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

2606.19908 2026-06-19 cs.CV 新提交

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

用于内窥镜视频的高斯过程先验变分自编码器

Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

发表机构 * Department of Electromechanics, InViLab, University of Antwerp(安特卫普大学机电工程系InViLab实验室) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Department of Electrical Engineering, Eindhoven University of Technology(埃因霍温理工大学电气工程系)

AI总结 提出高斯过程先验变分自编码器(GPVAE),通过时间高斯过程先验替代因子化先验,结合两种可扩展GP近似和镜面反射掩码,实现内窥镜视频缺失帧的插值与修复,在C3VDv2数据集上平均降低RMSE 21.9%。

详情
AI中文摘要

内窥镜视频分析对于胃肠道诊断和计算机辅助干预至关重要,但视频序列经常受到镜面反射、运动伪影和缺失帧的退化影响。这些瞬态损坏会分散临床医生的注意力,降低图像可解释性,并干扰下游任务(如3D重建和导航)。因此,有效的修复需要利用时间连续性而非孤立处理帧的方法。我们提出了一种用于内窥镜视频修复的高斯过程先验变分自编码器(GPVAE)框架,该框架用时间高斯过程先验替代标准因子化潜在先验,从而能够以不确定性感知的重建方式插值缺失帧。该框架结合了内窥镜专用编码器(包括卷积EndoVAE骨干网络和来自GastroNet-5M的预训练Vision Transformer编码器)以及两种可扩展GP近似:层次先验近似(HPA)和稀疏精度近似(SPA)。镜面反射通过基于DUCKNet的掩码流水线处理,该流水线从重建目标中排除损坏像素。在C3VDv2结肠镜数据集上,最佳GPVAE变体相对于匹配的VAE基线,图像重建RMSE平均降低21.9%,最高降低26.1%。下游轨迹RMSE在经典视觉里程计和预训练PoseNet上平均降低12.7%,而每epoch训练时间平均增加27.3%。最后,GP后验提供每帧不确定性估计,反映时间支持并为修复帧提供置信度信号。

英文摘要

Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.
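
The role of a temporal GP prior in filling in missing frames can be illustrated on a single latent dimension. The sketch below is exact GP regression with an RBF kernel over frame timestamps; it is not the HPA or SPA approximation named in the abstract, and the kernel choice, lengthscale, noise level, and function names are all assumptions made for illustration.

```python
import math


def rbf(t1, t2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel over frame timestamps."""
    return variance * math.exp(-0.5 * ((t1 - t2) / lengthscale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(t_obs, z_obs, t_new, noise=1e-4):
    """Posterior mean and variance of one latent dimension at time t_new."""
    K = [[rbf(ti, tj) + (noise if i == j else 0.0)
          for j, tj in enumerate(t_obs)] for i, ti in enumerate(t_obs)]
    k_star = [rbf(t_new, ti) for ti in t_obs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, z_obs)))
    var = rbf(t_new, t_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, var


# Frame 3 is missing; its latent is interpolated from the neighbouring frames.
t_obs = [0.0, 1.0, 2.0, 4.0, 5.0]
z_obs = [math.sin(t) for t in t_obs]
mean_obs, var_obs = gp_posterior(t_obs, z_obs, 2.0)
mean_gap, var_gap = gp_posterior(t_obs, z_obs, 3.0)
```

The posterior variance at the missing frame is larger than at an observed frame, which is the per-frame uncertainty signal the abstract refers to.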

2606.19950 2026-06-19 cs.CV cs.AI 新提交

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

多模态大语言模型的置信度校准:基于医学视觉问答的实证研究

Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Zhihui Medical Technology (Shanghai) Co., Ltd.(智汇医疗科技(上海)有限公司)

AI总结 针对多模态大语言模型在医学任务中置信度与准确性不匹配的问题,提出结合多策略融合询问与专家大语言模型评估的方法,在三个医学VQA数据集上将期望校准误差平均降低40%,提升了模型可靠性。

Comments Accepted by MICCAI 2025

详情
AI中文摘要

多模态大语言模型(MLLMs)在医学任务中展现出巨大潜力,但其引发的置信度常常与实际准确性不一致,可能导致误诊或忽略正确建议。本研究首次全面分析了医学MLLMs中准确性与置信度之间的关系。提出了一种新方法,将多策略融合询问(MS-FBI)与辅助专家大语言模型评估相结合,旨在改善医学视觉问答(VQA)中的置信度校准。实验表明,我们的方法在三个医学VQA数据集上将期望校准误差(ECE)平均降低了40%,显著增强了MLLMs的可靠性。研究结果强调了领域特定校准对医疗领域MLLMs的重要性,为AI辅助诊断提供了更可信的解决方案。

英文摘要

Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
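
Since the headline metric here is Expected Calibration Error, a short sketch of how ECE is typically computed may help. It assumes the usual equal-width binning over stated confidences in [0, 1]; the bin count and the half-open bin edges are conventional choices rather than details confirmed by the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(conf for conf, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


# A model that answers with 90% stated confidence but is right only half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
```

In this toy case the gap between stated confidence (0.9) and accuracy (0.5) gives an ECE of 0.4; a perfectly calibrated model would score 0.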

2606.19966 2026-06-19 cs.CV cs.LG 新提交

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

语义锚定证据融合用于域鲁棒的全切片生存分析

Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng

发表机构 * National University of Singapore(新加坡国立大学) Imperial College London(帝国理工学院) Hunan University(湖南大学)

AI总结 提出SAEFS框架,通过视觉问答提取语义锚点,结合双流证据提取和狄利克雷主观逻辑建模不确定性,实现跨域零样本生存分析,平均C-index提升10.2%。

详情
AI中文摘要

全切片图像(WSIs)广泛用于计算癌症预后。然而,现有方法主要关注域内性能,难以泛化到不同临床中心。这一局限性源于它们依赖像素级表示,极易受到染色协议和扫描硬件导致的域特定伪影影响。我们假设高级病理语义(如肿瘤分级和微环境结构)提供了域不变的语义表示,反映了人类病理学家的鲁棒诊断逻辑。因此,我们提出了语义锚定证据融合生存(SAEFS)框架,其中SAEFS通过视觉问答(VQA)从WSIs中推导语义锚点,采用双流WSI证据提取架构,使用基于狄利克雷的主观逻辑建模不确定性,并通过谨慎合取规则融合语义和视觉证据,以避免来自相关源的过度自信融合。仅在单一源域上训练并在四个未见域上进行零样本评估,SAEFS在预测准确性和可靠性上均一致优于最先进模型,平均C-index提升10.2%。定量分析进一步表明,VQA导出的语义特征比像素级特征表现出显著更低的跨中心差异,突显了其在跨中心临床应用中的鲁棒性。

英文摘要

Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
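
For readers unfamiliar with Dirichlet-based Subjective Logic, the sketch below shows the commonly used mapping from non-negative class evidence to belief masses and an explicit uncertainty mass. It assumes the standard evidential formulation with Dirichlet parameters equal to evidence plus one; it does not implement the cautious conjunction rule, and the function name is hypothetical.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty mass).

    Assumed formulation: alpha_k = e_k + 1, S = sum(alpha),
    b_k = e_k / S, u = K / S, so that sum(b) + u = 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


confident_beliefs, confident_u = dirichlet_opinion([8.0, 1.0, 1.0])
vacuous_beliefs, vacuous_u = dirichlet_opinion([0.0, 0.0, 0.0])
```

With no evidence the opinion is fully uncertain (u = 1), and uncertainty shrinks as evidence accumulates, which is what lets a fusion rule down-weight an unreliable stream.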

2606.20027 2026-06-19 cs.CV 新提交

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

QG-MIL:一种用于医学影像中领域无关多实例学习的门控Transformer聚合器

Luca Zedda, Davide Antonio Mura, Cecilia Di Ruberto, Maurizio Atzori, Muhammed Furkan Dasdelen, Carsten Marr, Andrea Loddo

发表机构 * Department of Mathematics and Computer Science, University of Cagliari(卡利亚里大学数学与计算机科学系) Institute of AI for Health, Helmholtz Munich(亥姆霍兹慕尼黑人工智能健康研究所)

AI总结 提出QG-MIL门控Transformer聚合器,通过RMSNorm预归一化、逐头QK归一化、细粒度注意力输出门控和SwiGLU前馈模块,解决注意力集中问题,在六个基准上平均提升+6.1个宏F1分数。

详情
AI中文摘要

医学影像中基于注意力的多实例学习聚合器容易出现注意力集中,导致预测过于自信且不稳定。我们引入QG-MIL,一种门控Transformer聚合器,通过四个协同架构组件解决这一问题:基于RMSNorm的预归一化、逐头QK归一化、细粒度注意力输出门控和SwiGLU风格的前馈模块。这些设计选择共同稳定了训练,并将注意力更均匀地分布在实例上,无需辅助损失、掩码或多阶段正则化。我们在涵盖全切片病理学和细胞级血液学的六个基准上评估了QG-MIL,覆盖两种根本不同的MIL尺度。性能最佳的QG-MIL变体在所有六个基准上均优于领先的基线,平均提升+6.1个宏F1分数。注意力覆盖图和注意力质量分析证实了更分布的实例权重。消融研究表明,虽然单个组件在特定数据集上可以匹配完整模型,但与所选基线相比,QG-MIL设计提供了最一致的跨域性能和最紧凑的方差。我们发布了一个可配置的实现以支持可重复性,网址为:https://github.com/unica-visual-intelligence-lab/QG-MIL

英文摘要

Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: https://github.com/unica-visual-intelligence-lab/QG-MIL
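
Three of the building blocks named in this abstract have well-known generic definitions, sketched below on plain Python lists. This is only an illustration of RMSNorm, a SwiGLU-style feed-forward layer, and a query-key score computed after normalization; how QG-MIL actually wires them together (including which norm it uses for per-head QK normalization and how the output gate is built) is not specified here, so the names and details are assumptions.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU-style feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    hidden = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, hidden)


def qk_norm_score(q, k):
    """Attention logit after normalizing q and k (assumed: RMSNorm, unit gain)."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


normed = rms_norm([3.0, 4.0])
logit = qk_norm_score([100.0, -50.0, 25.0, 0.0], [90.0, -60.0, 30.0, 5.0])
```

Because both vectors are normalized first, the logit stays bounded by the square root of the head dimension regardless of how large the raw activations are, which is one plausible reason such normalization helps against attention concentration.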

2606.20035 2026-06-19 cs.CV cs.LG 新提交

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

PU-UNet:用于医学图像分割的稳定乘法交互

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

发表机构 * Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz(科布伦茨应用科学大学数学、信息学与技术系) Technical University of Munich(慕尼黑工业大学)

AI总结 提出PU-UNet,通过稳定乘积单元残差块在低分辨率阶段实现显式乘法特征交互,在三个医学图像分割数据集上提升Dice和IoU,降低假阳性率。

Comments Accepted to ICANN 2026

详情
AI中文摘要

许多密集预测网络依赖于加性特征变换,并且仅隐式地建模高阶特征交互。乘积单元为乘法特征建模提供了显式机制,但其对数-指数公式可能导致数值不稳定性,这限制了它们在深度密集预测网络中的使用。在这项工作中,我们提出了乘积单元U-Net(PU-UNet),这是一种残差U-Net,它将稳定的乘积单元残差块集成到丰富的低分辨率阶段,用于医学图像分割。所提出的公式结合了平滑正性映射和对数域裁剪,实现了稳定的乘法特征学习,且计算开销可忽略不计。在ISIC 2018、Kvasir-SEG和BUSI上,PU-UNet分别达到了0.942、0.959和高达0.925的Dice分数。与匹配的残差U-Net基线相比,PU-UNet在保持参数、FLOPs和推理延迟几乎不变的情况下,持续提高了Dice和IoU,并将正常BUSI病例的图像级假阳性率从0.077降至零。消融研究表明,这些增益与乘积单元交互相关,在低分辨率放置下最强,并受益于所提出的稳定化设计。这些结果表明,稳定的乘积单元残差学习可以成为通过显式乘法交互增强U-Net风格分割网络的有效方式。

英文摘要

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
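
A product unit computes a weighted product of its inputs through the identity prod(x_i ** w_i) = exp(sum(w_i * log(x_i))), which breaks down for non-positive or extreme inputs. The sketch below shows one way the two stabilizers mentioned in the abstract could look: softplus is assumed as the smooth positivity mapping, and the log terms are clipped to a fixed range. The choice of softplus, the clip bound, and the epsilon are illustrative assumptions, not the paper's settings.

```python
import math


def softplus(v):
    """Numerically stable softplus, log(1 + exp(v)); always non-negative."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def stable_product_unit(x, w, clip=5.0, eps=1e-6):
    """exp(sum_i w_i * clip(log(softplus(x_i) + eps))): a stabilized weighted product."""
    log_terms = [
        # Log-domain clipping keeps each factor within [exp(-clip), exp(clip)].
        max(-clip, min(clip, math.log(softplus(v) + eps)))
        for v in x
    ]
    return math.exp(sum(wi * lt for wi, lt in zip(w, log_terms)))


regular = stable_product_unit([1.0, 2.0], [1.0, 1.0])
extreme = stable_product_unit([-1000.0], [2.0])
```

For moderate inputs the unit reduces to an ordinary product of the softplus-mapped features, while for an extreme negative input the clipped log keeps the output at exp(-10) instead of collapsing toward zero.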

2606.20108 2026-06-19 cs.CV cs.LG 新提交

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

EFIQA: 基于解剖先验的可解释眼底图像质量评估

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(维也纳医科大学医学数据科学中心人工智能研究所) Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria(维也纳医科大学视网膜人工智能克里斯蒂安·多普勒实验室)

AI总结 提出无需质量标签的EFIQA框架,利用解剖先验通过掩膜解剖修复学习正常结构,生成空间质量图,在多个基准上超越监督方法,兼具可解释性。

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

详情
AI中文摘要

图像质量控制对于广泛的下游应用至关重要。基于深度学习的图像质量评估方法通常根据数据集特定的质量标签训练分类器,这继承了两种局限性:(1)泛化能力受限于训练集的标注标准;(2)这些方法无法提供质量下降的空间反馈,缺乏可解释性。在这项工作中,我们提出了EFIQA,一个无需质量相关监督的框架,并通过设计生成空间质量图。EFIQA不是从人工标注的标签中学习“什么是退化”,而是通过利用解剖先验来学习“应该有什么”。对于眼底摄影,我们将其实例化为两阶段方法:首先通过掩膜解剖修复训练无监督异常检测器,以识别缺失血管区域;然后将这一先验知识蒸馏到一个浅层适配器中,将冻结基础模型的特征映射到精确的质量图。外部数据集评估表明,这种无需标签且只需最小适配的方法,在不同质量标准的基准上,与监督方法相比,实现了更好的性能和可解释性,突显了其在现实应用中的潜力。

英文摘要

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

2606.20112 2026-06-19 cs.CV eess.IV 新提交

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

像素级残差扩散Transformer:可扩展的3D CT体生成

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院)

AI总结 提出像素级残差扩散Transformer(PRDiT),通过两阶段训练(局部MLP盲估计器分离低频结构+全局残差扩散Transformer建模高频残差)实现高保真3D CT体生成,在LIDC-IDRI和RAD-ChestCT数据集上优于现有方法。

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

详情
AI中文摘要

由于现有生成模型固有的巨大计算需求和优化困难,生成具有精细细节的高分辨率3D CT体仍然具有挑战性。在本文中,我们提出了像素级残差扩散Transformer(PRDiT),这是一种可扩展的生成框架,可直接在体素级别合成高质量的3D医学体。PRDiT引入了一个两阶段训练架构,包括:1)一个局部去噪器,形式为基于MLP的盲估计器,作用于重叠的3D块,以有效分离低频结构;2)一个全局残差扩散Transformer,采用内存高效注意力来建模和细化整个体上的高频残差。这种从粗到细的建模策略简化了优化,增强了训练稳定性,并有效保留了细微结构,而无需自编码器瓶颈。在LIDC-IDRI和RAD-ChestCT数据集上进行的大量实验表明,PRDiT始终优于最先进的模型,如HA-GAN、3D LDM和WDM-3D,在3D FID、MMD和Wasserstein距离指标上显著降低。

英文摘要

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at the voxel level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

2606.20143 2026-06-19 cs.CV 新提交

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

头颈肿瘤 (HECKTOR) 2025 挑战赛:多模态 PET/CT 中的分割、诊断与预后基准

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) Amsterdam UMC(阿姆斯特丹大学医学中心) The Netherlands Cancer Institute(荷兰癌症研究所) Radboud University Medical Centre(拉德堡德大学医学中心) University College London(伦敦大学学院) Imperial College London(帝国理工学院) Shenzhen Technology University(深圳技术大学) Shenzhen University(深圳大学) Newland Digital Technology(新大陆数字技术) The University of Sydney(悉尼大学) Shanghai Jiao Tong University(上海交通大学) University Hospital, Nantes(南特大学医院) Nantes Université, Centrale Nantes, CNRS, LS2N(南特大学、南特中央理工学院、法国国家科学研究中心、LS2N实验室) Hangzhou Dianzi University(杭州电子科技大学) Tsinghua University(清华大学) Central South University(中南大学) University of Glasgow(格拉斯哥大学) China Mobile System Integration Co., Ltd.(中移系统集成有限公司) Subtle Medical Inc.(Subtle Medical公司) University Hospital, LMU Munich(慕尼黑大学医院) Munich Center for Machine Learning(慕尼黑机器学习中心) BC Cancer Research Institute(不列颠哥伦比亚癌症研究所) HES-SO Valais-Wallis University of Applied Sciences and Arts(HES-SO瓦莱州应用科学与艺术大学) Lausanne University Hospital (CHUV)(洛桑大学医院) LaTIM, INSERM, UMR 1101, Univ Brest(LaTIM实验室、法国国家健康与医学研究院、UMR 1101、布雷斯特大学)

AI总结 HECKTOR 2025 挑战赛利用多模态 PET/CT 和电子健康记录,建立了头颈癌自动分析的基准,涵盖肿瘤分割、复发预测和 HPV 分类三个任务,最佳算法分别达到 Dice 0.75、C-index 0.66 和平衡准确率 0.56。

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

详情
AI中文摘要

头颈癌 (HNC) 构成显著的全球健康负担,准确的肿瘤勾画对于有效的放疗计划至关重要。口咽部解剖结构的复杂性,加上肿瘤在影像上的异质性表现,使得手动分割耗时且存在观察者间差异。除分割外,从非侵入性影像预测长期临床结局(如无复发生存期 RFS)和确定人乳头瘤病毒 (HPV) 状态,仍然是具有挑战性但临床价值高的目标。HECKTOR 2025 挑战赛通过使用多模态 PET/CT 影像和电子健康记录,建立了一个用于自动 HNC 分析的全面基准。基于前几届(2020-2022),本次挑战赛采用了扩展的多机构数据集,包含来自全球 10 个中心的 1100 多名患者。参与者需完成三个互补目标:(1) 分割原发肿瘤体积 (GTVp) 和转移淋巴结 (GTVn),(2) 预测无复发生存期,(3) 分类 HPV 状态。挑战赛吸引了 35 个注册团队,其中 15 个最终提交在保留测试集上进行了评估。表现最佳的算法在分割上达到平均 Dice 相似系数 0.75,在生存预测上达到一致性指数 0.66,在 HPV 分类上达到平衡准确率 0.56。本文对所提交的方法进行了全面分析,评估了它们在不同病变特征上的性能,并讨论了它们在自动化肿瘤学工作流程和决策支持系统中临床转化的意义。

英文摘要

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
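
The three headline metrics in this abstract (Dice similarity coefficient, concordance index, balanced accuracy) can be sketched with their textbook definitions. This is a minimal illustration on flat Python lists; the function names are hypothetical, and the challenge's official evaluation code may handle ties, censoring, and empty masks differently.

```python
def dice(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    inter = sum(p and t for p, t in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    return 1.0 if denom == 0 else 2.0 * inter / denom  # two empty masks agree

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a majority-class guesser scores 1/K."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def concordance_index(times, events, risks):
    """Harrell-style C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before time j; it is concordant when the
    earlier event received the higher predicted risk. Risk ties count 0.5.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Under these definitions, a balanced accuracy of 0.5 on a binary task and a C-index of 0.5 both correspond to chance-level performance, which helps put the reported 0.56 and 0.66 in context.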

2606.20223 2026-06-19 cs.CV q-bio.QM 新提交

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

DeepForestVisionV2:面向非洲热带森林相机监测的生态驱动分类扩展

Hugo Magaldi, Theau d'Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson, Claire Auger, Noemie Cappelle, Daniel Cornelis, Raphael Cornette, Tobias Deschner, Gabriel Dubus, Davy Fonteyn, Rosa M. Garriga, Jennifer Hatlauf, Innocent Kasekendi, Raymond Katumba, Aram Kazandjian, Alfred Ngomanda, Stephan Ntie, Simone Pika, Xavier Rufray, Harold Rugonge, John Justice Tibesigwa, Peter van Lunteren, Hadrien Vanthomme, Joeri A. Zwerts, Sabrina Krief

发表机构 * UMR7206 Eco-Anthropologie, MNHN(UMR7206 生态人类学,法国国家自然历史博物馆) One Forest Vision initiative(One Forest Vision 倡议) Sebitoli Chimpanzee Project(塞比托利黑猩猩项目) Centre National de la Recherche Scientifique et Technologique(国家科学技术研究中心) Institut de Recherche en Ecologie Tropicale(热带生态研究所) Tacugama Chimpanzee Sanctuary(塔库加马黑猩猩保护区) Biotope(Biotope 公司) CIRAD(法国农业发展国际合作研究中心) Max Planck Institute for Evolutionary Anthropology(马克斯·普朗克进化人类学研究所) BOKU University(维也纳自然资源与生命科学大学) Agence Nationale des Parcs Nationaux du Gabon(加蓬国家公园管理局) Uganda Wildlife Authority(乌干达野生动物管理局) Addax Data Science(Addax 数据科学公司) Utrecht University(乌得勒支大学)

AI总结 针对非洲热带森林相机监测中生态梯度(垂直分层、场景开放度、人为界面)导致原35类分类过粗的问题,提出扩展至64类的DeepForestVisionV2,在保持离线工作流的同时提升野外实用性。

Comments Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop

详情
AI中文摘要

非洲热带森林中的相机监测正从封闭冠层内部扩展到河岸、空地和公园边缘。在现有的非洲森林相机分类开放工具中,DeepForestVision是唯一提供照片和视频匹配离线工作流的工具,先前研究表明其在可比基准上优于其他基线。然而,它专为封闭冠层、地面森林内部设计,使用35类预测空间,当部署遇到树栖灵长类、鸟类、半水生类群或家畜等人为混杂因素时,该空间变得过于粗糙。我们提出DeepForestVisionV2,这是一个从35类扩展到64类预测空间(61个动物类加上人类、车辆和空白)的生态驱动扩展,旨在解决三个反复出现的部署梯度:垂直分层、场景开放度和人为界面。DeepForestVisionV2保留相同的离线工作流,并在来自多国非洲热带森林项目的1,535,010张照片和243,354个视频上训练。评估结合了一个跨国家裁剪照片验证集(用于评估跨站点和相机设置的鲁棒性)和三个涵盖目标梯度的留出乌干达视频基准。在验证集上,DeepForestVisionV2达到0.86准确率、0.82宏F1和0.81平衡准确率。在部署基准上,尽管分类任务更困难,它仍保持或提高了基线准确率,同时将识别的类群数量从森林内部视频的22个增加到29个,河岸视频从4个增加到9个。在公园边缘用例中,它将准确率从0.62提高到0.86,并将误报从11次减少到0次。这些结果表明,DeepForestVisionV2在保持跨站点、栖息地和相机设置鲁棒性的同时,显著提高了野外实用性。

英文摘要

Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.

2606.20250 2026-06-19 cs.CV 新提交

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

单阶段层次化校正用于弱监督组织病理学分割

Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

发表机构 * VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam(越南河内VinUniversity VinUni-Illinois智慧健康中心) The Computer Vision and Medical AI Lab, VinUniversity, Hanoi, Vietnam(越南河内VinUniversity计算机视觉与医学人工智能实验室) Posts and Telecommunications Institute of Technology, Hanoi, Vietnam(越南河内邮电技术学院)

AI总结 提出单阶段层次化校正框架,通过层次化特征校正模块在单次训练中直接生成高保真激活图,解决多阶段弱监督分割中的误差传播和计算开销问题。

Comments Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

详情
AI中文摘要

现有的计算病理学中的弱监督语义分割方法依赖于多阶段范式:类激活图生成、离线伪掩码细化和全监督再训练。虽然这种解耦方法已被广泛采用,但它存在根本性缺陷。多阶段过程不仅导致高计算训练成本,还遭受误差传播:浅层CNN中的局部纹理偏差产生假阳性伪影,后续细化步骤往往无法纠正。为了通过简单而高效的方法解决这些持续存在的挑战,我们提出了单阶段层次化校正(SSHR)框架。我们的方法不是事后被动地细化CAM,而是在前向传播过程中主动净化中间特征表示。我们引入了一个层次化特征校正模块(HFRM),利用深层全局语义上下文过滤浅层中的局部异常。该机制在单个训练循环内直接生成高保真激活图。在LUAD-HistoSeg和BCSS数据集上的实验表明,SSHR优于最先进的多阶段方法。此外,SSHR将训练时间减少了2到5倍。这种效率降低了计算开销,并加速了大规模组织病理学工作流的临床转化。代码可在以下网址获取:https://github.com/trongduc-nguyen/SSHR

英文摘要

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

2606.20390 2026-06-19 cs.CV 新提交

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

几何感知超像素图变换器结合元数据用于皮肤病变分类

Muhammad Azeem, Tanveer Hussain, Amr Ahmed, Ardhendu Behera

发表机构 * Edge Hill University(埃奇希尔大学)

AI总结 提出一种基于区域的图学习框架,将病变建模为超像素图,利用几何边属性和元数据上下文节点,通过边缘感知图变换器实现多模态融合,在四个公开数据集上取得优于现有方法的分类性能。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

由于病变结构异质性、类内变异大以及良恶性病例间细微视觉差异,从皮肤镜图像进行自动化皮肤癌分类仍然具有挑战性。现有的CNN/ViT流程通常依赖全局或补丁级特征,并常通过后期融合结合患者元数据,这限制了空间基础的多模态推理。我们提出一种新颖的基于区域的图学习框架,将病变显式建模为空间连贯的超像素区域图,这些区域表示为冻结的CNN特征。为了捕捉细粒度的病变排列,我们将区域间几何编码为边属性,并引入一个与所有区域相连的专用元数据上下文节点,从而在同一关系空间内结构化地整合人口统计学/临床变量。节点表示通过我们的边缘感知图变换器进行更新,随后进行注意力驱动的传播,最终生成用于良恶性分类的图级嵌入。在四个公开基准上的实验表明,显式的区域级关系建模和图原生多模态融合相较于现有技术取得了持续改进。因此,我们建立了一种新的以图为中心的视角,其中CNN特征被建模为关系节点,并通过上下文整合得到改进,从而产生更具表现力和鲁棒性的分类结果。

英文摘要

Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, producing a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.

2606.20449 2026-06-19 cs.CV 新提交

InfantFace: Detecting infant faces in neonatal clinical environments

InfantFace:新生儿临床环境中的婴儿面部检测

Abdullah Bin-Obaid, Maria M. Cobo, Rebeccah Slater, Lionel Tarassenko, Mauricio Villarroel

AI总结 针对新生儿临床环境中的遮挡和光照问题,提出基于YOLOv11m的单阶段面部检测模型,在多个公开数据集预训练后,通过临床数据微调,AP50从0.87提升至0.96。

Comments 32 pages, 7 figures, 4 tables; supplementary information included

详情
AI中文摘要

新生儿面部的可靠定位是基于视频摄像头的非接触式评估的第一步,例如疼痛和痛苦相关的面部表情分析、疼痛评分、心肺信号提取和呼吸停止警报。然而,新生儿临床环境中仍存在重大挑战。杂乱的背景、光照变化和不良照明条件会降低面部检测模型的准确性。临床干预、监测设备以及在某些情况下的医疗设备可能会遮挡面部,使视觉评估变得困难。我们提出了一种基于YOLOv11m的单阶段模型,专门用于新生儿临床环境中的婴儿面部检测。我们结合了多个公开数据集(VGGFace2、CelebA、FDDB、WIDER FACE)来训练和评估我们提出的模型。然后,我们在一个新生儿研究数据集上对模型进行了微调,该数据集包含来自114个记录会话的228个视频,涉及113名独立婴儿。在微调之前,我们的模型达到了0.87的AP50,超过了三个最先进的通用面部检测器的性能。在临床领域适应后,性能进一步提高到0.96的AP50。由于缺乏公开的新生儿数据集,评估不同数据集上的面部检测性能仍然是一个挑战。优先创建此类数据集,同时在其创建和使用中维护适当的隐私保护措施和伦理标准,将极大地支持该领域的进一步进展。

英文摘要

Reliable localisation of the neonatal face is the first step for several video-camera-based non-contact assessments, such as pain- and distress-related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.
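
AP50, the metric reported in this abstract, counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, assuming boxes in `(x1, y1, x2, y2)` corner format; the function names are hypothetical, and the full average-precision computation (ranking by confidence, integrating the precision-recall curve) is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height are zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return 0.0 if union <= 0 else inter / union

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """AP50-style match: the prediction counts if IoU >= 0.5."""
    return iou(pred_box, gt_box) >= threshold
```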

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 新提交

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center – University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves results competitive with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.19371 2026-06-19 cs.LG cs.AI cs.CV 交叉投稿

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

ProMUSE: 渐进式多模态不确定性引导的分阶段证据阿尔茨海默病分类

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

发表机构 * Kennesaw State University(肯尼索州立大学) Michigan Technological University(密歇根理工大学) University of Iowa(爱荷华大学)

AI总结 提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,通过自适应决定何时需要额外模态,在保持准确性的同时降低数据采集成本。

详情
AI中文摘要

阿尔茨海默病(AD)是一种致命性疾病,会破坏老年人的记忆和认知能力。大多数AD治疗在早期阶段有效,导致对早期AD诊断的需求日益增加。AD诊断越来越依赖多模态数据,如临床评估、结构磁共振成像(MRI)和正电子发射断层扫描(PET)成像。然而,MRI和PET采集仍然昂贵且不易普及,使得全模态推理在现实临床工作流程中不切实际。我们提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,该网络自适应地确定何时需要额外模态,有助于在保持准确性的同时降低数据采集的总体成本。ProMUSE首先使用低成本临床数据进行证据分类,并通过基于Dirichlet的主观逻辑模型量化不确定性。当不确定性超过学习阈值时,ProMUSE逐步引入MRI或PET特征,通过Dempster-Shafer理论融合模态层面的信念和不确定性,获得校准的多模态预测。这种分阶段采集策略能够在最小化对昂贵成像依赖的同时实现准确诊断。在ADNI、AIBL和OASIS数据集上针对CN-AD、CN-MCI和MCI-AD任务的实验表明,ProMUSE在减少50-90%的MRI/PET使用量的同时,实现了与全模态基线相当或更优的准确性,从而大幅节省成本。这些结果突显了ProMUSE作为现实世界AD筛查中一种实用、不确定性感知且资源高效的解决方案。

英文摘要

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
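
The staged pipeline in this abstract (Dirichlet-based subjective logic for uncertainty, then Dempster-Shafer fusion when a further modality is acquired) can be sketched as follows. The sketch uses the standard subjective-logic mapping from evidence to belief and the reduced Dempster combination rule common in evidential multi-view learning; whether ProMUSE uses exactly this form is an assumption, and the fixed `threshold` stands in for the learned threshold mentioned in the abstract. All function names are hypothetical.

```python
def opinion(evidence):
    """Map non-negative per-class evidence to (beliefs, uncertainty).

    With alpha_k = e_k + 1 and S = sum(alpha), b_k = e_k / S and u = K / S,
    so that sum(b) + u == 1.
    """
    k = len(evidence)
    s = sum(evidence) + k
    return [e / s for e in evidence], k / s

def ds_fuse(op1, op2):
    """Reduced Dempster combination of two opinions over the same classes."""
    (b1, u1), (b2, u2) = op1, op2
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j]
                   for i in range(len(b1)) for j in range(len(b2)) if i != j)
    scale = 1.0 / (1.0 - conflict)
    fused_b = [scale * (x * y + x * u2 + y * u1) for x, y in zip(b1, b2)]
    return fused_b, scale * u1 * u2

def staged_predict(evidence_by_stage, threshold=0.2):
    """Use cheap modalities first; add the next one only while u > threshold.

    Returns (predicted class index, uncertainty, number of stages used).
    """
    fused = opinion(evidence_by_stage[0])
    used = 1
    for evidence in evidence_by_stage[1:]:
        if fused[1] <= threshold:
            break  # confident enough, skip the remaining (costlier) modalities
        fused = ds_fuse(fused, opinion(evidence))
        used += 1
    beliefs, u = fused
    return beliefs.index(max(beliefs)), u, used
```

In this form, a confident clinical-only opinion stops the pipeline at the first stage, which is the mechanism by which imaging usage would be reduced.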

2606.19372 2026-06-19 eess.IV cs.CV cs.LG 交叉投稿

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

全自诊断(FSD): 通过逆问题和算子学习从智能手机视频进行基于物理的可视生物标志物推断

Jonathan Thomas, Harsh Thaker

AI总结 提出全自诊断(FSD)框架,结合物理前向模型、信息论可观测性、正则化逆问题、算子学习和随机变分推断,从9秒面部视频恢复生理状态,在59名受试者38812次扫描中验证,血糖MARD达29.86%。

Comments 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

详情
AI中文摘要

我们提出全自诊断(FSD),一个统一的数学框架,用于从消费级智能手机拍摄的无约束9秒面部视频中恢复潜在生理状态。该方法整合了五个相互增强的组件:(1)基于辐射传输方程和发色团吸收的物理前向模型,将相机观测映射到生物标志物浓度;(2)信息论可观测性理论,证明多通道视觉信号(光谱、脉搏、呼吸、微表情和眼动)与生理状态包含严格递增的互信息;(3)具有域均匀可辨识性保证的稳定Tikhonov正则化逆问题;(4)算子学习公式,实现跨设备、分辨率和人群的泛化;(5)可解释为随机变分推断的监督学习过程,从配对生物传感器真实值持续优化模型,性能随配对观测数量的平方根倒数比例提升。在59名受试者的38812次真实世界配对扫描上的实证验证展示了实际性能。第一作者自采数据(血糖范围35-550 mg/dL)的MARD为29.86%,97.57%的预测落在Clarke误差网格A+B区,仅0.27%在危险E区。一位管理良好的糖尿病参与者在较窄的70-180 mg/dL范围内达到MARD 17%。这些结果证实,消费级面部视频编码了足够的结构化信息,可在完全无约束条件下进行临床相关的非侵入性生物标志物推断,且性能随更多配对数据的可用性可预测地提升。

英文摘要

We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38,812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
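
MARD (mean absolute relative difference), the accuracy figure quoted in this abstract, is the average of |predicted - reference| / reference, expressed in percent. A minimal sketch of that standard definition follows; the function name and the sample values in the usage note are illustrative and not data from the paper.

```python
def mard(reference, predicted):
    """Mean absolute relative difference, in percent.

    Each error is normalised by the reference value, so a 10 mg/dL miss
    weighs more at a reference of 50 mg/dL than at 250 mg/dL.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs(p - r) / r for r, p in zip(reference, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

For example, `mard([100, 200], [110, 180])` averages two relative errors of 10 percent each.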

2606.19651 2026-06-19 cs.AI cs.CV cs.LG 交叉投稿

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

BrainG3N:用于可控3D脑MRI生成的双用途分词器

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

发表机构 * Department of Biomedical Data Science, Stanford University School of Medicine(斯坦福大学医学院生物医学数据科学系) Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University(根特大学数学建模、统计与生物信息学系) Department of Electrical Engineering, Stanford University(斯坦福大学电气工程系)

AI总结 提出基于3D掩码自编码器的分词器,解耦编码器与解码器,在23项线性探测任务中21项超越SOTA,并支持条件生成和纵向预测。

详情
AI中文摘要

三维(3D)脑MRI是临床神经病学和神经肿瘤学的核心,生成模型可以增强代表性不足的队列、模拟疾病轨迹并支持隐私保护的数据共享。潜在扩散已成为建模成像数据的首选解决方案,但它对分词器提出了两个竞争性要求:编码器嵌入必须保留下游任务所需的临床信息,解码器必须重建解剖学上准确的体积。现有的重建驱动分词器以牺牲前者为代价实现了后者。为了解决这个问题,我们引入了一种基于全体积掩码自编码器(MAE)的分词器,用于3D脑MRI潜在扩散,解耦编码器和解码器:冻结的3D MAE编码器产生临床信息丰富的嵌入,而专用的CNN解码器从这些嵌入的线性投影重建体素。我们在来自18个公共队列的35,309个体积上预训练编码器,涵盖四种模态、十种疾病类别和200多个采集站点,并在两种设置中展示了其双重用途。首先,在23项线性探测基准测试中,编码器在21项任务上优于或匹配SOTA模型(即BrainIAC、BrainSegFounder和MedicalNet)。其次,在这些临床信息丰富的嵌入上训练的条件扩散变压器(DiT)支持跨六个变量的条件生成和患者特定的纵向预测。这些结果共同建立了一个单一的3D脑MRI嵌入空间,能够同时支持下游临床任务和可控生成。

英文摘要

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

2606.19767 2026-06-19 eess.IV cs.CV physics.med-ph 交叉投稿

Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance

面向头颈外科引导的带参数表征的轮廓约束可变形配准

Qingyun Yang, Jon S. Heiselman, Ayberk Acar, Morgan J. Ringel, Michael I. Miga, Matthieu Chabanas, Michael C. Topf, Jie Ying Wu

AI总结 提出一种基于正则化Kelvinlet基函数的可变形配准框架,通过表面点云、基准标记和轮廓约束校正术后组织变形,在9例头颈标本上将配准误差从刚性配准的11.11mm降至5.62mm,降幅达49.41%。

详情
AI中文摘要

全球每年新增89万例头颈部鳞状细胞癌,其复发率在实体恶性肿瘤中最高。尽管冰冻切片分析是术中切缘评估的标准方法,但由于切除标本与切除床之间的对准不精确,加上切除后黏膜组织收缩,准确地将检测到的阳性切缘重新定位到切除床上仍然具有挑战性。我们提出了一种生物力学驱动的可变形配准框架,用于校正术后组织变形以提供术中引导。该方法基于正则化Kelvinlet基函数的可变形配准方法,将3D标本网格配准到术中切除床点云。配准匹配表面点云、基准标记和边界轮廓约束,直接惩罚标本与切除床边界之间的垂直距离一致性。在来自皮肤、颊粘膜和舌部位的9个标本上,使用刚性配准的整体平均目标配准误差为$11.11 \pm 4.07$ mm,使用无轮廓约束的可变形配准则降至$8.20 \pm 2.68$ mm(降低26.19%)。所提出的轮廓约束可变形配准进一步将误差降至$5.62 \pm 2.28$ mm,相对于刚性配准降低了49.41%。我们在临床最具挑战性的舌标本中观察到最大降幅。我们还进行了系统的两阶段参数搜索,以表征表面配准、基准对应、轮廓约束和应变能正则化的相对重要性。该搜索表明,对于具有大侧向变形的组织类型,轮廓权重主导配准精度,而算法在广泛的参数组合范围内均可运行。

英文摘要

With 890,000 new cases annually worldwide, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using deformable registration based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was $11.11 \pm 4.07$ mm using rigid registration, which decreased to $8.20 \pm 2.68$ mm (26.19% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to $5.62 \pm 2.28$ mm, a 49.41% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.

2606.20115 2026-06-19 cs.LG cs.CV 交叉投稿

When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

当校准失败于脆弱的医院:通过风险曲线收缩实现联邦共形风险控制

Nafis Fuad Shahid

AI总结 针对联邦部署中标准共形风险控制(CRC)对个体机构覆盖不足的问题,提出基于风险曲线收缩的联邦CRC协议,在真实脑肿瘤数据上实现2.7/20的违规率且预测集仅扩大2.0倍。

Comments 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026

详情
AI中文摘要

共形风险控制(CRC)通过在保留数据上校准预测集阈值,提供分割质量的无分布保证。在联邦部署中,标准方法将各站点的校准分数合并为一个阈值。我们在真实多机构脑肿瘤数据(FeTS-2022,1251名受试者,20个机构)上首次量化表明,这种朴素的合并CRC保护了平均医院,但违反了40%个体机构的覆盖,最差站点的假阴性率超出目标7.8个百分点。朴素的替代方案——每个站点本地CRC——基本恢复了覆盖,但将预测集扩大了83倍,使其在临床上无用。我们提出一种基于收缩的联邦CRC协议:每个站点仅将其经验风险曲线(G个标量)传输到服务器,服务器为每个站点计算收缩正则化阈值。单个超参数n0平滑地权衡最坏情况覆盖与预测集效率;留一站点敏感性分析确定n0=19,在2.0倍拉伸下实现2.7/20的违规。我们进一步表明,覆盖预算的直接拉格朗日优化失败,将风险集中在脆弱的医院,并且有限样本修正项是必不可少的:移除它会使违规增加三倍。在所述站点混合假设下,边际CRC保证通过构造得以保留;在三个种子下针对四个目标验证了每个站点的覆盖。没有患者级别的图像、掩膜或每体积分数离开任何站点。

英文摘要

Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
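
The trade-off between pooled, local, and shrinkage-based thresholds can be illustrated with a small sketch. The finite-sample-corrected threshold rule below is the standard CRC rule; the shrinkage step is an assumption made for illustration (a convex blend of the site and pooled risk curves with weight `n_site / (n_site + n0)`), since the abstract does not give the exact estimator. All function names and the toy risk curves are hypothetical.

```python
from typing import List, Optional


def shrink_curve(site_curve: List[float], pooled_curve: List[float],
                 n_site: int, n0: float) -> List[float]:
    """Blend a site's empirical risk curve with the pooled curve.

    ASSUMPTION (not from the paper): convex combination with weight
    n_site / (n_site + n0). n0 = 0 gives local CRC; large n0 approaches pooled.
    """
    w = n_site / (n_site + n0)
    return [w * s + (1.0 - w) * p for s, p in zip(site_curve, pooled_curve)]


def crc_threshold(risk_curve: List[float], grid: List[float], n: int,
                  alpha: float, loss_bound: float = 1.0) -> Optional[float]:
    """Smallest grid value whose corrected risk is <= alpha.

    Uses the standard CRC finite-sample correction
    (n / (n + 1)) * risk + loss_bound / (n + 1), and assumes the risk
    is non-increasing along the (ascending) grid.
    """
    for lam, risk in zip(grid, risk_curve):
        if (n / (n + 1)) * risk + loss_bound / (n + 1) <= alpha:
            return lam
    return None  # no grid value satisfies the target risk


# Toy risk curves (false-negative rate per threshold); values are made up.
grid = [0.1, 0.2, 0.3, 0.4, 0.5]
site = [0.30, 0.20, 0.12, 0.06, 0.02]    # small, harder site with 19 cases
pooled = [0.20, 0.10, 0.05, 0.02, 0.01]  # all sites pooled, 1000 cases

local_lam = crc_threshold(site, grid, n=19, alpha=0.10)
shrunk_lam = crc_threshold(shrink_curve(site, pooled, 19, n0=19), grid, n=19, alpha=0.10)
pooled_lam = crc_threshold(pooled, grid, n=1000, alpha=0.10)
print(pooled_lam, shrunk_lam, local_lam)  # 0.3 0.4 0.5
```

In this toy setting the shrunk threshold sits between the pooled one (too permissive for the small site) and the local one (most conservative), which mirrors the role the abstract assigns to n0. Only the G values of the site's risk curve are needed on the server side, matching the communication pattern described above.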

2602.22959 2026-06-19 cs.CV 版本更新

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

智能体能否在零样本设置中区分视觉上难以分离的疾病?一项初步研究

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

发表机构 * Department of Diagnostic and Interventional Radiology, University Hospital Aachen, 52074 Aachen, Germany(诊断与介入放射科,亚琛大学医院,德国亚琛,52074)

AI总结 本研究探索多模态大语言模型智能体在零样本下区分视觉混淆疾病(如黑色素瘤与不典型痣、肺水肿与肺炎)的能力,提出基于对比裁决的多智能体框架,在皮肤镜数据上准确率提升11个百分点,但总体性能仍不足临床部署。

Comments Code available at https://github.com/TruhnLab/Contrastive-Agent-Reasoning. Accepted by MICCAI 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)的快速进展引发了对基于智能体系统的日益关注。尽管大多数医学影像先前工作集中于自动化常规临床工作流程,我们研究了一个未被充分探索但临床意义重大的场景:在零样本设置中区分视觉上难以分离的疾病。我们在两个仅基于影像的代理诊断任务上对代表性智能体进行基准测试:(1)黑色素瘤与不典型痣,以及(2)肺水肿与肺炎,尽管临床管理存在显著差异,但视觉特征高度混淆。我们引入了一种基于对比裁决的多智能体框架。实验结果显示诊断性能提升(在皮肤镜数据上准确率提高11个百分点),并在定性样本上减少了无根据的声明,尽管整体性能仍不足以用于临床部署。我们承认人类注释中固有的不确定性以及临床背景的缺失,这进一步限制了向真实世界场景的转化。在此受控设置中,这项初步研究为视觉混淆场景下的零样本智能体性能提供了初步见解。

英文摘要

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

2603.01250 2026-06-19 cs.CV cs.AI 版本更新

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

MAMA-MIA挑战:推进乳腺MRI肿瘤分割与治疗反应预测的泛化性和公平性

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

发表机构 * Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona(巴塞罗那人工智能在医学实验室(BCN-AIM),巴塞罗那大学数学与计算机学院)

AI总结 提出MAMA-MIA挑战,通过标准化基准评估乳腺MRI肿瘤分割和病理完全缓解预测,在跨洲多中心数据上分析模型泛化性与公平性,发现性能与亚组公平性之间存在权衡。

详情
AI中文摘要

乳腺癌是全球女性中最常诊断的恶性肿瘤,也是癌症相关死亡的主要原因之一。动态对比增强磁共振成像在肿瘤表征和治疗监测中发挥核心作用,尤其是接受新辅助化疗的患者。然而,现有的乳腺磁共振成像人工智能模型通常使用异质性数据集、研究人群和评估协议进行开发和评估,使得直接比较困难,并限制了跨机构和临床相关患者亚组的模型鲁棒性理解。MAMA-MIA挑战旨在通过提供标准化基准来解决这些问题,该基准用于联合评估原发性肿瘤分割和仅使用治疗前磁共振成像预测病理完全缓解。训练队列包括来自美国多家机构的1506名患者,而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行,以评估跨大陆和跨机构的泛化性。统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度方面的亚组一致性。26个国际团队参加了最终评估阶段。结果表明,在共同的外部评估框架下,性能存在显著差异,并揭示了整体准确性与亚组公平性之间的权衡。该挑战提供了标准化数据集、评估协议和公共资源,以促进开发稳健且公平的乳腺癌影像人工智能系统。

英文摘要

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

2605.00665 2026-06-19 cs.CV 版本更新

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

基于深度学习的视网膜图像预测阿尔茨海默病风险因素:英国生物银行中生物学相关形态学关联的开发和验证

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

发表机构 * J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida(朱·克雷顿·普瑞特生物医学工程系,佛罗里达大学) University of Florida Research Computing(佛罗里达大学研究计算中心) Meta AI (FAIR)(Meta AI(FAIR)) School of Behavioral and Brain Sciences, University of Texas at Dallas(德克萨斯大学达拉斯分校行为与脑科学学院) Dept. of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Dept. of Computer and Information Science and Engineering, University of Florida(佛罗里达大学计算机与信息科学与工程系) Center for Cognitive Aging and Memory, University of Florida(佛罗里达大学认知衰老与记忆中心)

AI总结 利用深度学习从视网膜彩色眼底照片预测12个阿尔茨海默病相关风险因素,并揭示其背后的视网膜结构特征,发现视神经头和视网膜血管等区域与风险因素及阿尔茨海默病前期变化相关。

Comments Accepted to the "Journal of Alzheimer's Disease" for publication

详情
AI中文摘要

系统性的、代谢性的、生活方式的因素已通过流行病学和AD特异性生物标志物研究与阿尔茨海默病(AD)建立关联。彩色眼底摄影(CFP)是否包含与这些AD相关风险域相对应的视网膜结构特征仍不清楚。为了确定深度学习(DL)模型能否从CFP预测12个AD相关风险因素,并表征这些预测背后的视网膜结构,从而评估CFP是否反映AD易感性的通路。使用来自英国生物银行的44,501名独特参与者的62,876张CFP,训练DL模型预测与AD发病率相关的12个因素:6个分类变量(性别、吸烟、失眠、经济状况、饮酒、抑郁)和6个连续变量(年龄、受教育完成年龄、BMI、收缩压、舒张压、HbA1c)。评估模型性能、模型显著性和显著性衍生得分(CAM-Score),并与视网膜形态测量进行比较。还将得分在AD发病病例(平均发病前8.55年)与匹配对照之间进行比较。DL的性能范围为分类变量的AUROC=0.5654-0.9480,连续变量的R2=-0.0291-0.7620,优于大多数形态测量-机器学习模型。基于显著性的得分一致地突出了生物学上有意义的区域,特别是视神经头和视网膜血管。它也与现有的形态测量变异一致。多个基于显著性的得分在AD发病病例与匹配对照之间存在显著差异,表明风险因素的视网膜相关性与临床前AD相关变化之间存在潜在重叠。CFP编码了与AD风险因素相关的视网膜特征。尽管不具有诊断性,但DL衍生的视网膜表征可能揭示反映潜在AD易感性的生物学上有意义的风险相关结构变化。

英文摘要

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether color fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. We aimed to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from $R^2$ = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature, and aligned with existing morphometric variations. Several saliency-based scores differed significantly between incident-AD cases and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful, risk-related structural changes that mirror potential AD vulnerability.

2606.14957 2026-06-19 cs.CV 版本更新

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

学习用于多模态神经影像的稀疏潜在预测基础模型

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

发表机构 * New York University, Center for Data Science(纽约大学数据科学中心) NYU Grossman School of Medicine, Department of Radiology(纽约大学格罗斯曼医学院放射学系) State University of New York at Binghamton, School of Computing(纽约州立大学宾汉姆顿分校计算机学院) NYU Grossman School of Medicine, Department of Neurology(纽约大学格罗斯曼医学院神经病学系) NYU Grossman School of Medicine, Department of Neurosurgery(纽约大学格罗斯曼医学院神经外科学系) NYU Grossman School of Medicine, Department of Pathology(纽约大学格罗斯曼医学院病理学系) School of Medicine, Department of Radiology, Stanford(斯坦福大学医学院放射学系) NYU Grossman School of Medicine, Department of Neuroscience(纽约大学格罗斯曼医学院神经科学系) NYU Grossman School of Medicine, Neuroscience Institute(纽约大学格罗斯曼医学院神经科学研究所)

AI总结 提出Neuro-JEPA模型,结合潜在预测目标和专家混合架构,学习T1w、T2w和FLAIR三种MRI序列的统一表示,在25项临床任务和22项公开数据集任务上优于现有基础模型和CNN基线。

Comments Under Review Preprint

详情
AI中文摘要

脑部MRI通常作为多个互补序列采集,具有独特的对比度加权,包括T1加权成像(T1w)解剖对比和液体敏感T2加权(T2w)对比。然而,在健康系统规模上,跨多种MRI对比机制学习统一表示的方法尚缺乏。在本研究中,我们引入了Neuro-JEPA,一种稀疏多模态神经影像基础模型,它结合了潜在预测目标和专家混合架构,以编码跨核心T1w、T2w和液体抑制FLAIR成像(FLAIR)的脑部MRI。我们进一步对架构、掩码、目标和稀疏性设计选择进行了系统的方法论研究,这些选择有利于稳健的神经影像多模态表示学习。Neuro-JEPA在428,647项研究的1,551,862次扫描上进行了预训练,这些扫描经过了模态特定的预处理和跨三种核心结构脑部MRI序列的数据整理。我们在临床和研究环境中评估了学习到的表示,包括来自三个健康系统(NYU Langone、NYU Long Island和Massachusetts General Hospital)的25项任务,以及来自12个公开数据集的22项任务,涵盖了单模态、多模态和跨域评估配置。在这些基准测试中,现有的神经影像基础模型相对于简单的卷积神经网络(CNN)基线显示出不一致的提升,而Neuro-JEPA在所有评估设置中实现了更强且更一致的性能。这些结果建立了一个可扩展的多模态神经影像表示学习方法论框架,并强调了基础模型评估协议需要包括简单基线、临床异质性队列和受控的多模态比较。

英文摘要

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

2405.10705 2026-06-19 eess.IV cs.CV 版本更新

3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

基于血管概率引导衰减学习的稀疏视角动态DSA图像三维血管重建

Zhentao Liu, Huangxuan Zhao, Wenhui Qin, Zhenghong Zhou, Xinggang Wang, Wenping Wang, Xiaochun Lai, Chuansheng Zheng, Dinggang Shen, Zhiming Cui

发表机构 * School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials Devices, ShanghaiTech University, Shanghai, China National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China School of Electronic Information Communications, Huazhong University of Science Department of Computer Science & Engineering, Texas A&M University, USA

AI总结 提出血管概率引导衰减学习框架,通过静态与动态衰减场互补加权实现稀疏视角DSA重建,降低辐射剂量,并采用渐进训练和时间扰动损失提升质量。

Comments Accepted by Medical Image Analysis (MedIA), 2026

详情
AI中文摘要

数字减影血管造影(DSA)是血管疾病诊断的金标准之一。借助造影剂,时间分辨的二维DSA图像提供全面的血流信息,可用于重建三维血管结构以进行医学评估。当前的商用DSA系统通常需要数百个扫描视角进行重建,导致大量辐射暴露。在本研究中,我们提出了一种基于神经渲染的优化框架,专门用于高质量稀疏视角DSA重建,以减少辐射剂量。我们的方法称为血管概率引导衰减学习,将DSA成像表示为静态和动态衰减场的互补加权组合,权重来自时间无关的血管概率场。作为前景掩膜,血管概率为静态和动态场提供适应不同场景类型的适当梯度。该机制实现了静态背景与动态造影剂流的自监督分解,并显著提高了重建质量。我们的模型通过最小化合成投影与真实DSA图像之间的差异进行训练。我们进一步采用两种训练策略来提高重建质量:(1)由粗到细的渐进训练以改善几何结构,以及(2)时间扰动渲染损失以保持时间一致性。实验结果表明了高质量的三维血管重建和二维DSA图像合成。

英文摘要

Digital Subtraction Angiography (DSA) is one of the gold standards for vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive blood flow information and can be utilized to reconstruct 3D vessel structures for medical assessment. Current commercial DSA systems typically require hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. In this study, we propose a neural rendering-based optimization framework tailored for high-quality sparse-view DSA reconstruction to reduce radiation dosage. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the time-independent vessel probability field. Functioning as a foreground mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism enables self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the discrepancy between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training for better geometry and (2) temporal perturbed rendering loss for temporal consistency. Experimental results have demonstrated high-quality 3D vessel reconstruction and 2D DSA image synthesis.
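
The "complementary weighted combination" of the two attenuation fields can be sketched at a single 3D point as follows. This is a minimal illustration under an assumption: the vessel probability weights the dynamic field and its complement weights the static field. The abstract does not state which field receives which weight, and the function name is hypothetical.

```python
def blended_attenuation(p_vessel: float, mu_static: float, mu_dynamic: float) -> float:
    """Attenuation at one point and time as a complementary blend.

    ASSUMPTION: p_vessel (time-independent, in [0, 1]) weights the dynamic
    field and (1 - p_vessel) weights the static field.
    """
    if not 0.0 <= p_vessel <= 1.0:
        raise ValueError("p_vessel must lie in [0, 1]")
    return (1.0 - p_vessel) * mu_static + p_vessel * mu_dynamic


# Background point: only the static field contributes.
print(blended_attenuation(0.0, mu_static=0.2, mu_dynamic=0.9))
# Vessel point: only the dynamic (contrast-agent) field contributes.
print(blended_attenuation(1.0, mu_static=0.2, mu_dynamic=0.9))
```

Because the two weights sum to one, a gradient with respect to the blended value is split between the static and dynamic fields in proportion to the vessel probability, which is one way to read the "proper gradients" remark in the abstract.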

2503.23179 2026-06-19 eess.IV cs.CV 版本更新

OncoReg: Medical Image Registration for Oncological Challenges

OncoReg:面向肿瘤学挑战的医学图像配准

Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Tim Hable, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich

发表机构 * Institute of Medical Informatics, University of Lübeck(吕贝克大学医学信息学研究所) Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院放射科和核医学研究所) Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute(伦塞拉塞尔理工学院生物医学工程系和生物技术与跨学科研究中心) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院) Division of Informatics, Imaging and Data Sciences, University of Manchester(曼彻斯特大学信息学、成像和数据科学系) DAMO Academy, Alibaba Group(阿里集团DAMO学院) Hangzhou Shengshi Technology Co., Ltd(杭州盛世科技有限公司) Department of Radiation Oncology, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院放射肿瘤科) EchoScout GmbH Radboud University Medical Center, Nijmegen(奈密根大学医学中心) Institute of Interventional Radiology, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院介入放射科)

AI总结 提出OncoReg挑战,通过两阶段框架在保护患者隐私的同时开发可泛化的图像配准方法,用于放射治疗中锥束CT与扇束CT的配准,发现特征提取是关键,深度学习和经典方法结合最有效。

Comments 21 pages, 13 figures

详情
AI中文摘要

在现代癌症研究中,由于患者隐私相关的挑战,产生的大量医学数据往往未被充分利用。OncoReg挑战通过一个两阶段框架解决了这一问题,该框架使研究人员能够在确保患者隐私的同时开发和验证图像配准方法,并促进更可泛化的AI模型的发展。第一阶段涉及使用公开可用的数据集,第二阶段则专注于在安全的医院网络内对私有数据集进行模型训练。OncoReg建立在Learn2Reg挑战的基础上,纳入了放射治疗中介入性锥束计算机断层扫描与标准计划扇束CT图像的配准。准确的图像配准在肿瘤学中至关重要,特别是在图像引导放射治疗的动态治疗调整中,需要精确对齐以最小化对健康组织的辐射暴露,同时有效靶向肿瘤。本文详细介绍了OncoReg挑战的方法和数据,并对竞赛参赛作品和结果进行了全面分析。研究发现,特征提取在此配准任务中起着关键作用。从该挑战中涌现的一种新方法展示了其多功能性,而现有方法的表现与新技术相当。深度学习和经典方法在图像配准中仍扮演重要角色,尤其是方法的组合,特别是在特征提取方面,被证明最为有效。

英文摘要

In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography with standard planning fan-beam CT images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods, particularly in feature extraction, proving most effective.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV 版本更新

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics(数学系) Department of Political and Social Sciences(政治与社会科学系)

AI总结 通过受控基准测试,比较量子与经典生成器在脑MRI数据增强中的性能,发现两者均未显著优于仅用真实数据训练,且量子生成器无额外优势。

详情
AI中文摘要

医学图像分类常受限于有限的标注数据,因此生成式增强被提出;最近,量子生成模型被用于此目的,并经常报告准确率提升。然而,这些声称通常基于单次训练运行,未匹配量子与经典生成器的参数预算,也未表征任何收益出现的数据范围。我们提出了一个受控基准测试,隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中,在该空间中,使用变分量子生成器或参数数量几乎相同的经典生成器(1648 vs. 1632)训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器,覆盖从5%到100%的标注数据比例,通过八个随机种子进行配对显著性检验(多重比较校正)以及集内多样性和潜在分布分析。在所有比例下,没有增强变体显著优于仅用真实数据训练,且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展:合成样本分布外移,并且在数据稀缺时严重模式崩溃,而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

9. 文档图像、OCR与图表理解 1 篇

2606.19939 2026-06-19 cs.CV 新提交

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

DiffMath:面向手写数学表达式生成的符号与图感知潜在扩散Transformer

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

发表机构 * South China University of Technology(华南理工大学) Huawei Technologies Co., Ltd.(华为技术有限公司)

AI总结 提出DiffMath框架,利用LaTeX层次结构作为先验,通过关系抽象语法树、结构保持潜在表示和条件去噪,无需位置监督即可生成结构一致的手写数学表达式。

详情
AI中文摘要

手写数学表达式生成(HMEG)由于数学表达式的复杂二维布局和长程结构依赖而具有挑战性。现有方法通常依赖显式空间监督,如符号级边界框,这导致高标注成本并限制可扩展性。在这项工作中,我们提出了DiffMath,一个符号与图感知的潜在扩散框架,利用LaTeX固有的层次结构作为结构先验,消除了位置监督的需求。首先,我们设计了关系抽象语法树(RelAST),一种面向生成的表示,将MathML树蒸馏为紧凑的三元组序列[S, R, D],其中每个标记直接编码符号身份、空间关系或嵌套深度。其次,我们引入了MathVAE,通过符号感知和关系感知的感知正则化学习保持结构的潜在表示,确保潜在空间同时捕获字符语义和空间拓扑。第三,MathDiT在这个结构化潜在空间中进行条件去噪,并通过自适应层归一化(AdaLN)进一步由全局符号计数先验引导,以改善结构一致性。实验表明,DiffMath生成结构一致的手写表达式,在现有方法上实现了优越性能,并通过合成数据增强提高了下游OCR模型的准确性。

英文摘要

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.

10. 低层视觉、计算成像与图像增强 9 篇

2606.19617 2026-06-19 cs.CV cs.GR cs.LG 新提交

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

GB-LSR:一种具有单一全局带宽的快速局部光谱图像表示,用于连续重建和超分辨率

Max Shad, Naeem Khoshnevis

发表机构 * Harvard University(哈佛大学)

AI总结 提出GB-LSR,一种基于全局带宽的局部光谱表示,通过共享卷积编码器预测截断傅里叶基系数,实现连续图像重建,在Kodak等基准上PSNR提升2.8-3.6 dB,推理速度比最慢基线快约4倍。

详情
AI中文摘要

我们提出GB-LSR(全局带宽局部光谱表示),一种用于连续图像重建的固定网格局部光谱表示。图像域被划分为非重叠的方形块,每个块携带从共享卷积编码器特征预测的截断傅里叶基系数。一个可训练的标量带宽在所有块和图像中全局共享,在任何连续坐标处的重建是固定大小的基收缩,其成本与图像大小无关。我们研究了三种带宽处理变体:可训练的全局标量(主要)、固定的全局标量和逐块带宽场。在Kodak、Set14和Urban100上的标准化原生重建基准测试中,主要变体在匹配预算的LIIF/LTE/WIRE重实现上PSNR高出2.8-3.6 dB,LPIPS低0.11-0.15,同时推理成本约为最慢基线的四分之一。经验上,单个全局标量就足够了:逐块自适应带宽替代方案在闭式局部性诊断或端到端消融中均未带来改进。在独立的任意尺度超分辨率(ASR)扩展中,GB-LSR在标准SR协议下实现了具有竞争力的PSNR-Y,并在x4时比LIIF-RDN快1.44倍,比LTE-SwinIR快3.25倍;在同一扩展中,一个变体在训练和评估时不使用四角局部集成平均,速度提升1.77倍,峰值内存降低35%,PSNR变化可忽略,而将RDN编码器从64通道扩展到96通道时,PSNR略有提升,速度提升1.58倍,峰值内存降低31%。原生重建声明限定于匹配预算的摊销协议,ASR声明限定于独立的标准SR协议。

英文摘要

We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
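
The idea of a patch-local truncated Fourier basis with one shared bandwidth can be sketched in one dimension. This is a simplified illustration, not the paper's implementation: the real method works on 2D patches with coefficients predicted by a convolutional encoder, whereas here the coefficients are given by hand and the names `PatchCoeffs` and `reconstruct` are hypothetical.

```python
import math
from typing import List, Tuple

# Per-patch coefficients: (constant term, cosine coeffs, sine coeffs).
PatchCoeffs = Tuple[float, List[float], List[float]]


def reconstruct(x: float, patches: List[PatchCoeffs], bandwidth: float) -> float:
    """Evaluate the signal at a continuous coordinate x.

    Patch i covers [i, i + 1). The cost depends only on the number of
    basis terms per patch, not on the number of patches.
    """
    idx = max(0, min(int(math.floor(x)), len(patches) - 1))
    t = x - idx  # local coordinate inside the patch
    dc, cos_coeffs, sin_coeffs = patches[idx]
    value = dc
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        # One global scalar bandwidth scales every frequency in every patch.
        phase = 2.0 * math.pi * bandwidth * k * t
        value += a * math.cos(phase) + b * math.sin(phase)
    return value


patches = [
    (0.5, [0.0], [0.0]),  # flat patch
    (0.2, [1.0], [0.0]),  # one cosine term
    (0.0, [0.0], [1.0]),  # one sine term
]
print(reconstruct(0.3, patches, bandwidth=1.0))
print(reconstruct(1.0, patches, bandwidth=1.0))
print(reconstruct(2.25, patches, bandwidth=1.0))
```

Querying at any resolution only changes the set of coordinates passed to `reconstruct`, which is what makes the representation continuous.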

2606.19901 2026-06-19 cs.CV 新提交

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

基于语义调制的线性递归单元用于图像超分辨率

Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

发表机构 * Korea University(高丽大学) DGIST(大邱庆北科学技术院)

AI总结 提出一种结合语义调制单元的线性递归网络,通过调制、空间分类和原型增强实现高效图像超分辨率,性能超越现有方法。

Comments Accepted to CVPR 2026 Findings

详情
AI中文摘要

线性递归单元(LRU)基于稳定线性递归的原则性设计,已在长程依赖任务上展现出有前景的准确性和鲁棒性。然而,其静态参数化和单扫描方法限制了其在二维视觉任务中的适用性。在本研究中,我们提出了一种基于LRU的恢复网络,并配备语义调制单元(SMU),以在单图像超分辨率中实现性能与效率的和谐平衡。SMU扮演三个关键角色:LRU调制、空间分类和通过学习原型进行特征增强。大量实验表明,我们的方法在定量和定性上均超越了近期最先进的方法。值得注意的是,我们的方法在计算复杂度与现有方法相当的情况下实现了更优的性能。源代码和模型可在以下网址获取:https://github.com/MingyuChoi-run/LSM

英文摘要

The linear recurrent unit (LRU), designed with a principled formulation for stable linear recurrence, has demonstrated promising accuracy and robustness on long-range dependency tasks. However, its static parameterization and single-scan method limit its applicability to 2D vision tasks. In this study, we propose an LRU-based restoration network with a semantic modulating unit (SMU) to achieve a harmonious balance between performance and efficiency in single-image super-resolution. The SMU plays three key roles: LRU modulation, spatial categorization, and feature enhancement through learned prototypes. Extensive experiments demonstrate that our method quantitatively and qualitatively surpasses recent state-of-the-art methods. Notably, our approach achieves superior performance with computational complexity on par with existing methods. The source code and models are available at https://github.com/MingyuChoi-run/LSM

2606.19938 2026-06-19 cs.CV cs.AI 新提交

Triangular Consistency as a Universal Constraint for Learning Optical Flow

三角一致性作为光流学习的通用约束

Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

发表机构 * Louisiana State University(路易斯安那州立大学) University of California, Los Angeles(加州大学洛杉矶分校) Yale University(耶鲁大学)

AI总结 提出三角一致性约束,通过组合两个光流诱导第三个光流并强制三者一致,适用于不同网络架构、监督类型和数据集,在监督、无监督和迁移学习中均提升性能。

Comments Accepted by ECCV 2026

详情
AI中文摘要

我们提出三角一致性作为光流的第一性原理约束,该约束与网络架构、监督类型和数据集无关,适用于图像对和多帧设置。这个简单但强大的约束是通过组合两个光流来诱导第三个光流,并强制三者之间的一致性。组合的光流可能来自:(i) 图像对,产生循环一致性;(ii) 多个视频帧,通过时间链产生更长范围的运动;或 (iii) 图像对与受控合成变换相结合,这成为数据增强。这种三角一致性引入的计算开销可忽略不计,且不需要额外的标注。由于它直接源自光流的几何特性,不依赖于模型特定的假设,因此可作为光流训练的“通用”即插即用组件。实验表明,在监督、无监督和迁移学习设置中均有一致的改进。

英文摘要

We propose triangular consistency as a first-principles constraint for optical flow that is agnostic to network architecture, supervision type, and dataset, and that applies to both image-pair and multi-frame settings. The constraint is simple but powerful: compose two flows to induce a third flow, and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which acts as data augmentation. Triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a "universal" plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
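
Flow composition and the resulting consistency penalty can be sketched on a tiny grid. This is a minimal illustration under stated assumptions, not the authors' loss: it uses nearest-neighbour lookup with border clamping and a mean L1 penalty, whereas a training implementation would more likely use differentiable bilinear warping and occlusion handling. All names are hypothetical.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy)


def sample_nearest(flow: Flow, x: float, y: float) -> Tuple[float, float]:
    """Look up the flow at the pixel nearest to (x, y), clamped to the grid."""
    h, w = len(flow), len(flow[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return flow[yi][xi]


def compose(flow_ab: Flow, flow_bc: Flow) -> Flow:
    """Induced flow A->C: F_ac(p) = F_ab(p) + F_bc(p + F_ab(p))."""
    composed = []
    for y, row in enumerate(flow_ab):
        out_row = []
        for x, (dx, dy) in enumerate(row):
            ddx, ddy = sample_nearest(flow_bc, x + dx, y + dy)
            out_row.append((dx + ddx, dy + ddy))
        composed.append(out_row)
    return composed


def triangular_loss(flow_ab: Flow, flow_bc: Flow, flow_ac: Flow) -> float:
    """Mean L1 gap between the composed flow and the directly estimated A->C flow."""
    composed = compose(flow_ab, flow_bc)
    total, count = 0.0, 0
    for row_c, row_p in zip(composed, flow_ac):
        for (cx, cy), (px, py) in zip(row_c, row_p):
            total += abs(cx - px) + abs(cy - py)
            count += 1
    return total / count


def constant_flow(dx: float, dy: float, h: int = 3, w: int = 3) -> Flow:
    return [[(dx, dy) for _ in range(w)] for _ in range(h)]


ab = constant_flow(1.0, 0.0)
bc = constant_flow(0.0, 2.0)
print(triangular_loss(ab, bc, constant_flow(1.0, 2.0)))  # 0.0 (consistent triangle)
print(triangular_loss(ab, bc, constant_flow(1.0, 1.0)))  # 1.0 (inconsistent by 1 px in y)
```

Setting the third frame equal to the first, so that the target A->C flow is zero, turns the same penalty into the cycle-consistency case (i) described in the abstract.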

2606.19961 2026-06-19 cs.CV 新提交

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

解决潜在扩散模型中RGB到SWIR图像翻译的细节瓶颈

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

发表机构 * imec imec-IPI-Ghent University(imec-IPI-根特大学) Yale University(耶鲁大学)

AI总结 针对潜在扩散模型在RGB到SWIR图像翻译中丢失空间细节的问题,提出源条件自编码器和可学习引导编码器两种轻量级改进,在驾驶场景下将检测mAP提升至2倍,小目标提升3.4倍,并达到最优FID。

详情
AI中文摘要

潜在扩散模型(LDM)能够高效地进行图像到图像的翻译,但在压缩过程中丢弃了精细的空间细节,从而降低了下游感知任务的性能。我们识别出两个瓶颈:自编码器(丢失空间信息)和条件路径(通过朴素下采样进一步退化源信号)。我们提出了两种轻量级、与骨干网络无关的修复方法:源条件自编码器(SCAE),通过跳跃连接将高分辨率源特征注入解码器;以及可学习引导编码器(LGE),用学习到的条件信号替代朴素下采样。在驾驶场景的RGB到SWIR翻译任务上,使用两种去噪骨干网络(U-Net和DiT)进行评估,我们的方法在潜在扩散基线基础上将检测mAP提升了高达2倍,小目标(COCO-small,<32^2像素^2)上提升高达3.4倍,同时达到了最先进的FID。我们进一步表明FID与检测性能相关性较差,从而激励多轴评估。结果零样本泛化到公开的RASMD基准。我们将公开发布带有标注的测试数据、所有检查点和训练代码。

英文摘要

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.

2606.19985 2026-06-19 cs.CV 新提交

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

发表机构 * Johannes Kepler University(约翰·开普勒大学)

AI总结 提出结合光场积分与视觉语言模型的框架,通过多视图融合和语义先验恢复被遮挡场景,在合成和真实数据上取得最优性能。

详情
AI中文摘要

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战,特别是在自然环境中,密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架,该框架结合了光场积分(LFI)的可见性恢复能力和视觉语言模型(VLM)的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡,生成初始的可见性增强表示。然后,引入VLM作为条件语义先验,在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影,我们引入了一种多样本融合策略,将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明,该方法达到了最先进的性能,在四个合成光场基准场景(4-Syn)上取得了最高的平均SSIM,并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性,可应用于搜索救援和探索性机器人导航。

英文摘要

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.15648 2026-06-19 cs.CV 新提交

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出一种无需配对标签的迁移学习方法,将水下图像增强分解为全局颜色校正、去雾和背景噪声抑制,利用跨域先验监督各步骤,实现物理一致的增强。

Journal ref Information Fusion (2026): 104557

详情
AI中文摘要

水下图像在不同水质条件下拍摄,导致复杂的退化,包括颜色偏差、低对比度和模糊效应。最近,基于学习的方法已显示出在水下图像增强(UIE)方面的潜力。然而,以往的大多数工作侧重于训练策略或网络设计,使增强结果与数据集中的标签良好对齐,忽略了标签是从先前UIE方法的增强结果中选取的,这些伪标签存在噪声。因此,它们的模型性能在一定程度上并不令人满意。然而,收集水下图像的真实标签具有挑战性。在这项工作中,我们提出了一种基于迁移学习的UIE方法,该方法不需要水下图像具有成对的噪声或真实标签来学习。相反,首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后,利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式,通过迁移学习实现了一种新颖的UIE,并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明,我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能,并有效提升了下游视觉任务,显著优于基准方法。项目仓库:https://github.com/Haru2022/P2-UIE。

英文摘要

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design so that the enhanced result aligns well with the labels in datasets, ignoring that these labels are selected from the enhanced results of earlier UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of such models is limited. Collecting true labels for underwater images, however, is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on fusing physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.19574 2026-06-19 eess.IV cs.CV 交叉投稿

FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

FrequencyFormer: 面向频域视觉Transformer推理的协同设计传感器到处理器流水线

Chengwei Zhou, Ovishake Sen, Xuming Chen, Rishith Paramasivam, Shaahin Angizi, Swarup Bhunia, Baibhab Chatterjee, Gourav Datta

AI总结 提出FrequencyFormer,通过多尺度DCT标记化将图像压缩为频域令牌,结合近传感器LUT硬件和低功耗通信架构,实现高达128倍数据压缩和28.8 TOPS/W能效,兼容多种视觉任务。

详情
AI中文摘要

在传感器边缘系统上部署视觉Transformer(ViT)不仅受限于设备计算能力,还受限于从传感器到处理器传输高维图像数据所需的能量和带宽。虽然传感器内和近传感器计算通过早期特征提取降低了这一成本,但现有方法通常仅提供适度的压缩。我们观察到频域提供了视觉信息的自然紧凑表示,并且可以在传感器级别利用以减少传感器到处理器的数据移动。基于这一见解,我们提出了FrequencyFormer,一种用于高效ViT推理的协同设计传感器到处理器流水线。FrequencyFormer包括:(1)多尺度DCT标记化器,将224x224图像压缩为紧凑的频域令牌,实现高达128倍的片外数据量减少,且精度损失较小;(2)基于查找表(LUT)的近传感器硬件实现,利用固定DCT系数实现无乘法器、节能且面积高效的标记化;(3)改进的基于MIPI的低功耗通信架构,进一步降低传输能量。FrequencyFormer可作为标准ViT补丁嵌入的直接替代,并与分类、检测和分割任务的预训练骨干网络兼容。该流水线实现了28.8 TOPS/W的能效,将通信能量降低230倍,并将总传感器侧能量降低2.22倍,展示了频域标记化作为传感器内ViT部署的可扩展基础。

英文摘要

Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
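
To make the idea of frequency-domain tokenization concrete, the sketch below applies an orthonormal 2D DCT-II to a single square block and keeps only its lowest-frequency coefficients. It is a minimal illustration and does not reproduce the paper's multi-scale tokenizer, its LUT hardware, or the reported 128x figure. The block size, the ordering rule, and the names `dct_matrix`, `dct2`, and `low_freq_token` are assumptions.

```python
import math
from typing import List


def dct_matrix(n: int) -> List[List[float]]:
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector.

    The entries are fixed constants, which is what makes a lookup-table
    implementation plausible in hardware.
    """
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def dct2(block: List[List[float]]) -> List[List[float]]:
    """2D DCT of a square block, computed as C * block * C^T."""
    n = len(block)
    c = dct_matrix(n)
    tmp = [[sum(c[k][i] * block[i][j] for i in range(n)) for j in range(n)]
           for k in range(n)]
    return [[sum(tmp[k][j] * c[l][j] for j in range(n)) for l in range(n)]
            for k in range(n)]


def low_freq_token(block: List[List[float]], keep: int) -> List[float]:
    """Keep the `keep` lowest-frequency coefficients, ordered by u + v."""
    n = len(block)
    coeffs = dct2(block)
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda p: (p[0] + p[1], p[0]))
    return [coeffs[u][v] for u, v in order[:keep]]
```

Keeping 4 of the 64 coefficients of an 8x8 block, for example, cuts the coefficient count by 16x before any quantization. How the actual system reaches its overall reduction is not specified in the abstract.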

2606.19802 2026-06-19 cs.LG cs.CV 交叉投稿

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

流映射去噪器:遍历逆问题的失真-感知平面

Nicolas Zilberstein, Morteza Mardani, Santiago Segarra

发表机构 * Rice University(莱斯大学) NVIDIA Inc.(英伟达公司)

AI总结 提出流映射模型,通过单一参数t在MMSE和感知质量间连续调节,实现逆问题的失真-感知权衡,无需额外监督或调参。

详情
AI中文摘要

图像复原面临一个基本权衡:最小化误差的方法产生模糊重建,而最大化感知质量的方法产生锐利但不够保真的图像。现有方法要么在失真-感知(DP)前沿上固定一个操作点,要么需要配对数据监督、辅助模型或对采样器进行超参数调优以访问不同点。我们证明,流映射模型——一种用于少步采样的流匹配的近期扩展,学习一个平均场——隐式定义了一个单参数去噪器族,连续跨越DP前沿。前瞻参数t充当MMSE和感知区域之间的控制旋钮。对于高斯目标,我们证明改变t精确恢复最优DP前沿;对于自然图像,我们在经验上观察到类似行为。在即插即用求解器中,相同机制扩展到一般逆问题,控制感知对齐与数据一致性之间的权衡。尽管在此设置中缺乏精确最优性保证,单个训练的流映射跨越DP权衡,在两端匹配或超越专门基线。在CelebA(128×128)和AFHQ(256×256)上的多个线性和非线性逆任务的广泛实验验证了我们的发现。

英文摘要

Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA (128×128) and AFHQ (256×256) across several linear and nonlinear inverse tasks validate our findings.

2602.01391 2026-06-19 cs.CV 版本更新

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

通过增强潜在本征属性将重光照作为视觉先验的探针

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands(阿姆斯特丹大学UvA-Bosch Delta实验室) The University of Chicago, Chicago, USA(芝加哥大学) Johns Hopkins University, Baltimore, USA(约翰霍普金斯大学)

AI总结 提出增强潜在本征属性(ALI)方法,融合密集像素对齐视觉特征到潜在本征重光照模型,平衡语义与光度保真度,提升复杂材质重光照质量。

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

详情
AI中文摘要

图像到图像的重光照需要能够将光照与场景属性分离,同时保留密集几何、材质和光度线索的表征。我们将此任务用作视觉先验的探针:与奖励不变性的识别任务不同,重光照测试视觉特征是否保留光传输所需的信息。通过一个受控的生成式重光照框架,我们发现强语义编码器会降低重光照质量,揭示了抽象与物理保真度之间的语义-光度权衡。我们引入了增强潜在本征属性(ALI),通过将密集的、像素对齐的视觉特征融合到潜在本征重光照模型中,并在未标注的真实图像对上通过自监督进行细化,来平衡这一权衡。ALI提高了重光照质量,尤其是在光泽、金属和透明材质上,并证明了生成式重光照是量化视觉编码器对物理世界编码内容的有效工具。

英文摘要

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

11. 鲁棒性、安全、隐私与可信视觉 11 篇

2606.19565 2026-06-19 cs.CV 新提交

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Mix-QVLA:任务证据感知的视觉-语言-动作模型混合精度量化

Navin Ranjan, Andreas Savakis

发表机构 * Rochester Institute of Technology(罗彻斯特理工学院)

AI总结 提出Mix-QVLA框架,通过任务证据感知的混合精度后训练量化,在保持任务性能的同时大幅降低VLA模型的内存和计算开销,在LIBERO上实现4.1GB内存和1.52倍加速。

详情
AI中文摘要

我们提出Mix-QVLA,一种针对VLA模型的任务证据感知混合精度PTQ框架。Mix-QVLA将每个量化变体锚定到全精度动作令牌参考决策,并评估量化是否在关键VLA功能边界上保留了任务相关证据。它从边界激活计算归一化的梯度加权任务证据图,并使用证据质量和归因分布失真比较全精度和量化图,捕捉决策支持证据的强度和分配变化。一个软瓶颈目标将边界级退化聚合为层敏感度分数。Mix-QVLA进一步在整个任务执行过程中建模敏感度,捕捉层重要性的阶段依赖变化,而不是假设固定的敏感度分布。由此产生的证据和时间感知分数指导在模型大小和BitOps预算下的混合精度位分配。在OpenVLA风格策略上的广泛评估表明,Mix-QVLA改善了低比特VLA部署的精度-效率权衡。在LIBERO上,Mix-QVLA将OpenVLA-OFT内存从15.4 GB减少到4.1 GB,保留了96.3的平均成功率(BF16模型为97.1),并实现了1.52倍的推理加速。

英文摘要

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
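
The final step, turning per-layer sensitivity scores into a bit allocation under a size budget, can be illustrated with a simple greedy rule. This is a hypothetical stand-in, not the allocation procedure used by Mix-QVLA: the abstract does not describe the solver, and the BitOps budget and the time-aware scoring are not modeled here. The function name, the layer names, and the candidate bit-widths are all assumptions.

```python
from typing import Dict, Sequence


def allocate_bits(sensitivity: Dict[str, float],
                  params: Dict[str, int],
                  budget_bits: int,
                  choices: Sequence[int] = (4, 8, 16)) -> Dict[str, int]:
    """Greedy mixed-precision allocation (illustrative only).

    Every layer starts at the lowest precision. Layers are then visited from
    most to least sensitive, and each is raised to the highest bit-width that
    still fits within the total size budget (in bits).
    """
    low = min(choices)
    bits = {name: low for name in params}
    used = sum(params[name] * low for name in params)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")
    for name in sorted(params, key=lambda n: sensitivity[n], reverse=True):
        for b in sorted(choices, reverse=True):
            extra = params[name] * (b - bits[name])
            if b > bits[name] and used + extra <= budget_bits:
                used += extra
                bits[name] = b
                break
    return bits
```

With toy parameter counts, a small but highly sensitive layer ends up at high precision while a large, insensitive one stays at 4 bits, which is the qualitative behavior a sensitivity-guided scheme aims for.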

2606.19736 2026-06-19 cs.CV 新提交

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

VFACamou: 视图融合的对抗性伪装用于环境自适应物理规避

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

发表机构 * State Key Laboratory of Intelligent Vehicle Safety Technology(智能汽车安全技术国家重点实验室) School of Cyber Science and Engineering, Huazhong University of Science and Technology(华中科技大学网络空间安全学院) School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Hebei Energy College of Vocation And Technology(河北能源职业技术学院)

AI总结 提出一种端到端框架,结合UV体积渲染与扩散纹理生成器,并引入照明颜色一致性估计器和多尺度动态训练策略,生成可穿戴对抗图案,在无人机侦察等动态视角和光照变化下实现稳定物理攻击。

Comments Accepted by ICME 2026

详情
AI中文摘要

物理世界中的对抗性伪装仍然极具挑战性,尤其是在无人机侦察场景下,目标会经历连续的几何变化和极端光照变化。现有方法要么优化无法泛化到动态视角的2D数字扰动,要么产生视觉上不自然的纹理而无法在实际场景中部署。因此,我们提出一个端到端的对抗性伪装生成框架,能够自动生成可穿戴的对抗图案,并在视角、姿态和光照条件变化的真实物理环境中保持稳定的攻击性能。我们的方法将UV体积渲染与基于扩散的纹理生成器相结合,使得在不同尺度、姿态和光照条件下外观保持一致。为了确保环境真实性,我们提出一个照明颜色一致性估计器,提取主导背景属性并引导自然纹理损失,使生成的UV纹理与周围环境对齐。多尺度动态训练策略进一步增强了对抗视角变化和身体变形的鲁棒性。在多个主流检测器上的大量实验表明,我们的方法在保持高感知自然性的同时实现了强大且稳定的物理攻击性能,在不引入不自然伪影的情况下降低了人类检测率。

英文摘要

Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.

2606.20155 2026-06-19 cs.CV cs.CL 新提交

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

NAMESAKES: 探究文本到图像模型中的身份记忆

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tel Aviv University(特拉维夫大学) Cornell University(康奈尔大学)

AI总结 提出一种黑盒行为探针,无需参考照片或训练数据,即可区分文本到图像模型生成的图像是记忆还是虚构,并在NAMESAKES数据集上验证其有效性。

详情
AI中文摘要

文本到图像(T2I)模型在提示其姓名时,会生成某些个体的逼真肖像,这引发了隐私问题。然而,区分生成的面孔是记忆还是虚构的,目前需要真实照片、训练数据访问权限或模型内部的白盒访问,限制了适用性。我们引入了一种完全黑盒的行为探针,可以在无需参考照片或事先了解训练数据的情况下区分这两种情况。为了基准测试这一任务,我们提出了NAMESAKES数据集,包含一千多个不同知名度水平的公众人物的姓名和面孔,以及经过扰动的、知名度较低的姓名。对最先进的T2I模型的实验表明,我们的探针能够显著预测身份记忆,并将记忆的姓名与未识别的姓名区分开来,并进一步揭示了不同模型系列之间的差异。

英文摘要

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

2606.20302 2026-06-19 cs.CV 新提交

CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection

CUPID: 重构UV纹理图用于可解释的特定人物深度伪造检测

Giovanni Affatato, Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro

发表机构 * Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano(米兰理工大学电子、信息与生物工程系(DEIB))

AI总结 提出CUPID方法,利用3D人脸重建的UV纹理图和掩码自编码器,无需深度伪造视频训练即可检测特定人物深度伪造,并实现可解释性和鲁棒性。

详情
AI中文摘要

针对高知名度人物(Person-of-Interest, POI)的深度伪造对现代民主和社会构成威胁。当前的POI深度伪造检测方法仍难以兼顾对后处理的鲁棒性、效率和可解释性,而这些正是现代深度伪造检测器的关键方面。本文提出CUPID,一种POI视频深度伪造检测器,结合了UV纹理图(源自3D人脸重建的面部外观表示)和掩码自编码器(MAE)的表征学习能力。我们的方法在训练阶段不需要任何深度伪造视频,甚至无需在训练集中包含特定POI:从真实视频帧中提取的UV纹理图与MAE上下文引导重构相结合,产生的潜在空间能够捕获丰富且具有判别性的面部特征,即使对于训练中未见过的身份也是如此。在测试阶段,从描述POI的查询视频中提取的嵌入可以与真实参考视频进行匹配,以评估视频真实性。此外,在UV空间中操作自然提供了额外的可解释性层。具体来说,我们可以提取解码残差图,突出显示测试视频中哪些面部区域与相应POI的身份表示偏差最大。在四个深度伪造数据集上的实验表明,CUPID在大多数数据集上优于当前最先进方法,并在强下采样和压缩下实现了最佳的整体鲁棒性,同时提供了显著更快的推理速度。我们的实验代码将在以下网址发布:https://github.com/polimi-ispl/CUPID。

英文摘要

Deepfakes targeting a high-profile individual, known as a Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency, and interpretability, which are key aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require a specific POI to be included in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features even for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video's authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms the current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, while also providing substantially faster inference. Our experimental code will be released at https://github.com/polimi-ispl/CUPID.
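
The test-time matching step can be pictured as a similarity check between a query embedding and a set of reference embeddings of the same person. The sketch below is an assumption-laden illustration, not CUPID's scoring rule: the abstract does not state which similarity measure, aggregation, or threshold is used, so cosine similarity, mean aggregation, and the 0.5 threshold are placeholders, as are the function names.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def authenticity_score(query: Sequence[float],
                       references: Sequence[Sequence[float]]) -> float:
    """Mean similarity of the query embedding to pristine reference embeddings."""
    return sum(cosine(query, ref) for ref in references) / len(references)


def looks_authentic(query: Sequence[float],
                    references: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> bool:
    """Flag the query as consistent with the POI if it is close to the references.

    The threshold is a placeholder; in practice it would be calibrated on
    held-out real videos.
    """
    return authenticity_score(query, references) >= threshold
```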

2606.20488 2026-06-19 cs.CV 新提交

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

无训练AI生成图像检测器有多脆弱?对分数方向、预处理和压缩的受控审计

Jingwen Zhou, Mingzhe Wang

发表机构 * Xidian University(西安电子科技大学)

AI总结 本文通过统一协议审计两种无训练检测分数(自编码重建和噪声扰动特征相似性)及kNN基线,发现实现细节、分数方向选择和数据集格式偏差会导致AUROC变化高达0.38,且简单融合无法超越最佳单分数。

详情
AI中文摘要

无训练的AI生成图像检测器承诺无需分类器训练即可实现生成器无关的部署,但其报告的数字很少在单一受控协议下进行比较。我们审计了两种代表性的无训练分数——一种自编码器重建分数(AEROBLADE风格)和一种噪声扰动特征相似性分数(RIGID风格),外加一个朴素的特征kNN控制,在包含七个生成器和JPEG压缩质量70和50的公共1,500图像GenImage衍生基准上进行。审计得出三个警示性发现。(i)实现细节伪装成方法差异:将LPIPS骨干网络(AlexNet -> VGG-16)替换使整体AUROC变化+0.085,在resize-to-512和原始分辨率预处理之间切换使每个生成器的结论翻转高达0.38 AUROC。(ii)分数方向不是方法的属性而是其超参数的属性:RIGID风格分数在噪声水平sigma=0.05时对SD1.5和Wukong反转(AUROC < 0.5),在sigma=0.01时对所有生成器恢复至>0.5,在sigma=0.3时降至0.15。(iii)数据集格式偏差夸大鲁棒性声明:没有统一重新编码时,JPEG-50下的AUROC超过AlexNet骨干重建分数的干净条件;偏差校正后残余异常定位到单个生成器(BigGAN)。审计的分数具有互补的逐生成器失败集,但朴素z-score融合未能击败最佳单分数,表明利用互补性需要方向感知的组合。

英文摘要

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
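
Finding (ii) rests on a basic property of AUROC: if a score is read in the wrong direction, the AUROC is mirrored around 0.5. The short sketch below shows this with a pairwise AUROC on toy numbers. It assumes the convention that a higher score means "more likely generated"; the scores are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def auroc(scores_fake: Sequence[float], scores_real: Sequence[float]) -> float:
    """Probability that a random fake outscores a random real image.

    Ties count as half a win. Assumes higher score = more likely fake.
    """
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


def flipped(scores: Sequence[float]) -> list:
    """Negate scores, i.e. read the same detector in the opposite direction."""
    return [-s for s in scores]
```

An AUROC below 0.5 therefore does not mean the score carries no signal. It means the sign convention is wrong for that generator and setting, which is why a fusion rule that ignores direction can cancel out otherwise useful scores.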

2606.19735 2026-06-19 cs.AI cs.CV 交叉投稿

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University(俄勒冈州立大学)

AI总结 提出基于LLM的交互接口GLARE,将自然语言问题转换为SQL查询以聚合局部解释数据,提升全局解释的可访问性和可用性。

Comments 16 pages, 2 figures

详情
AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要,但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案,而不是静态产物,我们提出了一种基于LLM的交互接口,提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者,将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能,而无需向用户暴露低级表示。对于每个查询,接口输出统计增强的自然语言响应,支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明,LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
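
The kind of aggregation such a mediator might emit can be sketched with an in-memory SQLite table. Everything about the schema is hypothetical: the abstract only says that questions are translated into SQL over local explanation data, so the table name, the columns, and the hand-written query standing in for the LLM's output are assumptions.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a toy table of per-image (local) explanation records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE local_explanations ("
        "image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
    )
    rows = [
        (1, "zebra", "stripes", 0.8), (1, "zebra", "grass", 0.1),
        (2, "zebra", "stripes", 0.6), (2, "zebra", "grass", 0.3),
        (3, "horse", "mane", 0.7), (3, "horse", "grass", 0.2),
    ]
    conn.executemany("INSERT INTO local_explanations VALUES (?, ?, ?, ?)", rows)
    return conn


def top_concepts(conn: sqlite3.Connection, predicted_class: str,
                 limit: int = 3) -> List[Tuple[str, float, int]]:
    """Aggregate local attributions into a class-level (global) summary.

    This is the sort of query a question like "what does the model rely on
    for zebras?" could be mapped to.
    """
    sql = (
        "SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n "
        "FROM local_explanations WHERE predicted_class = ? "
        "GROUP BY concept ORDER BY mean_attr DESC LIMIT ?"
    )
    return conn.execute(sql, (predicted_class, limit)).fetchall()
```

Passing the class name as a bound parameter rather than formatting it into the string is a sensible default when the query text originates from a language model.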

2606.20527 2026-06-19 cs.CL cs.CV 交叉投稿

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

StylisticBias: 少数人类视觉线索驱动多模态大语言模型中的大部分社会偏见

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Princeton Center for Information and Technology Policy(普林斯顿信息与技术政策中心)

AI总结 提出StylisticBias基准,通过控制单一视觉属性变化,发现年龄和体型主导身份层面偏见,而时尚风格等约15个属性解释近80%的偏见变化,偏见集中于少数视觉线索。

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在个人和社会影响重大的场景中,但影响这些模型判断人物的视觉线索仍知之甚少。先前的工作通常比较不同的(群体)个体,难以将外貌效应与身份差异分离。我们引入StylisticBias,一个用于评估MLLMs中属性级社会偏见的受控基准。我们生成500张逼真的基础人脸,每张脸创建约50个单一属性变体,产生约25K张图像。这种设计保持身份不变,每次改变一个视觉属性,使我们能够测量特定线索如何改变模型判断。我们在25个二元社会判断场景中评估了六个MLLMs。我们发现年龄和体型主导身份层面的效应,而时尚风格和其他视觉线索驱动最大的属性级变化。我们进一步发现,约15个属性解释了近80%的总变异,表明偏见集中在少数视觉线索上。在与外貌语义对齐的判断中,尤其是社会经济和风格相关判断,敏感性最强。我们发布StylisticBias作为多模态模型细粒度偏见评估的基准。代码和数据集:https://github.com/timo-cavelius/StylisticBias 和 https://hf.co/datasets/shaghayegh/stylistic-bias-dataset。

英文摘要

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time, which lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

2511.04260 2026-06-19 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学)

AI总结 提出Proto-LeakNet,利用扩散模型中的信号泄漏痕迹,结合闭集分类与密度开集评估,实现可解释的生成器归因,在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

详情
AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与在学习到的嵌入上进行的基于密度的开集评估相结合,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC,Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取:https://github.com/claudiunderthehood/Proto-LeakNet。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2510.27285 2026-06-19 cs.CV cs.CR 版本更新

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

重新思考扩散模型中的鲁棒对抗性概念擦除

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Yue Ming, Xueming Li, Yue Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University(计算机科学与技术系,人工智能研究院,清华大学) University of Chinese Academy of Sciences(中国科学院大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

AI总结 针对扩散模型中概念擦除的对抗训练忽视概念语义导致拟合不足的问题,提出语义引导的鲁棒对抗概念擦除方法S-GRACE,显著提升擦除性能26%并减少90%训练时间。

详情
AI中文摘要

概念擦除旨在选择性地遗忘扩散模型(DMs)中的不良内容,以降低敏感内容生成的风险。作为概念擦除的一种新范式,现有方法大多采用对抗训练来识别和抑制目标概念,从而减少敏感输出的可能性。然而,这些方法常常忽视对抗训练在DMs中的特异性,导致仅能部分缓解。在这项工作中,我们从概念空间的角度调查并量化了这种特异性,即对抗样本能否真正拟合目标概念空间?我们观察到现有方法在生成对抗样本时忽视了概念语义的作用,导致对概念空间的拟合效果不佳。这种忽视导致了以下问题:1)当对抗样本较少时,它们无法全面覆盖目标概念;2)反之,它们会破坏其他目标概念空间。受这些发现的启发,我们引入了S-GRACE(语义引导的鲁棒对抗概念擦除),它优雅地利用概念空间内的语义引导来生成对抗样本并执行擦除训练。使用七种最先进方法和三种对抗提示生成策略在各种DM遗忘场景下进行的实验表明,S-GRACE将擦除性能显著提高了26%,更好地保留了非目标概念,并将训练时间减少了90%。我们的代码可在 https://github.com/Qhong-522/S-GRACE 获取。

英文摘要

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which gracefully leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

2605.07821 2026-06-19 cs.CV cs.AI 版本更新

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

通过对象共现分析缓解OOD检测中的简单性偏差

Boyang Dai, Chaoqi Chen, Yizhou Yu

发表机构 * The University of Hong Kong(香港大学) Shenzhen University(深圳大学) Shenzhen Loop Area Institute(深圳环城区域研究所)

AI总结 提出基于对象共现的OOD检测框架,通过解耦表示和分治策略区分近OOD,缓解简单性偏差,在多种设置下取得竞争结果。

Comments This paper has been accepted by CVPR2026

详情
AI中文摘要

分布外(OOD)检测对于确保深度学习模型的可靠性至关重要。现有方法大多关注常规的纠缠表示以区分分布内(ID)和OOD数据,忽略了图像中丰富的上下文信息。这一问题在检测近OOD时尤其具有挑战性,因为具有简单性偏差的模型难以在解耦表示中学习判别性特征。人类视觉系统可以利用自然环境中对象的共现来促进场景理解。受此启发,我们提出了一种以对象为中心的OOD检测框架,学习捕捉图像中的对象共现(OCO)模式。该方法引入了一种新的OOD检测范式,通过预测测试样本的解耦表示来理解图像中的对象共现,然后根据ID训练数据中观察到的对象共现模式自适应地将模式分为三种场景,最后以分治方式进行OOD检测。通过这种方式,OCO可以通过考虑图像中存在的语义上下文关系来区分近OOD,避免仅关注简单、易学习区域的倾向。我们通过在具有挑战性和全频谱OOD设置下的实验评估了OCO,展示了竞争性结果,并证实了其处理语义和协变量偏移的能力。代码发布在:https://github.com/Michael-McQueen/OCO。

英文摘要

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2502.03227 2026-06-19 cs.LG cs.CV 版本更新

Adversarial Dependence Minimization

对抗性依赖最小化

Pierre-François De Plaen, Tinne Tuytelaars, Marc Proesmans, Luc Van Gool

发表机构 * CVL, ETH Zürich, Switzerland(CVL,苏黎世联邦理工学院,瑞士) INSAIT, Sofia University, Bulgaria(INSAIT,索菲亚大学,保加利亚)

AI总结 提出ADM算法,通过对抗博弈最小化特征维度间的统计依赖性,证明全局最优时达到相互独立,并应用于非线性去相关、图像分类泛化提升和自监督学习维度坍塌预防。

详情
AI中文摘要

最小冗余表示通常通过最小化特征协方差来学习。然而,基于协方差的方法无法消除所有依赖/冗余,因为线性不相关的变量仍可能表现出非线性关系。为了解决这个问题,我们引入了ADM,一种可微分的算法,通过对抗博弈最小化特征维度之间的统计依赖性:辅助网络识别依赖关系,而编码器去除它们。我们证明了在全局最优时实现了相互独立,经验验证了收敛性,并研究了三个潜在应用:将PCA扩展到非线性去相关、提高图像分类的泛化能力以及防止自监督学习中的维度坍塌。通过促进统计独立的表示,ADM为在多种应用中学习更鲁棒、更压缩和更泛化的表示铺平了道路。

英文摘要

Minimally redundant representations are typically learned by minimizing feature covariance. However, covariance-based methods fail to eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. To address this, we introduce ADM, a differentiable algorithm that minimizes statistical dependence between feature dimensions through an adversarial game: auxiliary networks identify dependencies, while the encoder removes them. We prove that mutual independence is achieved at the global optimum, empirically verify convergence, and study three potential applications: extending PCA to nonlinear decorrelation, improving generalization in image classification, and preventing dimensional collapse in self-supervised learning. By promoting statistically independent representations, ADM paves the way for learning more robust, compressed, and generalizable representations across diverse applications.
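The abstract's starting point, that linearly uncorrelated variables can still be dependent, can be checked numerically. The sketch below is an illustration only and is not taken from the ADM paper: the helper `covariance` and the toy data are assumptions. It builds a variable `y = x²` that is fully determined by `x` yet has (numerically) zero covariance with it.

```python
# Illustrative sketch (not the ADM algorithm): zero covariance does not imply independence.
from statistics import fmean


def covariance(xs, ys):
    """Population covariance of two equal-length sequences (hypothetical helper)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))


# Symmetric grid on [-1, 1]; y is a deterministic, nonlinear function of x.
xs = [i / 100 for i in range(-100, 101)]
ys = [x * x for x in xs]

# Covariance between x and y vanishes (up to floating-point error),
# so a covariance penalty alone would treat the two features as non-redundant.
cov_xy = covariance(xs, ys)

# After a nonlinear map of x (here x -> x^2), the dependence becomes visible.
cov_x2_y = covariance([x * x for x in xs], ys)

print(f"cov(x, y)   = {cov_xy:.2e}")
print(f"cov(x^2, y) = {cov_x2_y:.4f}")
```

A covariance-only objective would see nothing to remove in this pair, which is the gap a dependence-based criterion is meant to close.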

12. 数据集、基准、评测与训练方法 24 篇

2606.19483 2026-06-19 cs.CV 新提交

LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation

LEAP: 通过自适应进度实现视觉Transformer蒸馏的层跳过效率

Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero

发表机构 * Brown University(布朗大学) Rice University(莱斯大学)

AI总结 提出LEAP训练课程,通过自适应选择教师中间特征图作为渐进式目标,加速学生ViT的知识蒸馏,在ImageNet-100上提升12.24%准确率,并节省25.1%训练FLOPs。

详情
AI中文摘要

基于视觉Transformer(ViT)骨干的视觉基础模型(VFMs),如DINOv2,已成为目标识别和语义分割等下游任务的关键。骨干网络的巨大计算需求通常需要将其蒸馏到更小的架构中以便在边缘部署。基于特征的知识蒸馏(KD)常受师生差距影响;学生由于容量有限难以模仿教师复杂的特征图。为缓解这一瓶颈,我们提出LEAP:通过自适应进度实现层跳过效率,一种用于ViT特征知识蒸馏的训练课程。通过利用教师的中间特征图作为一系列逐渐困难的渐进目标,我们的课程允许学生在处理更高层抽象之前构建基础表示。我们的结果表明,这种范式通过在不同学生模型大小和数据集规模上自适应选择难度,显著加速了收敛。采用我们的课程,LEAP蒸馏的ViT-S在ImageNet-100上达到90.1%的准确率,相比基线提升12.24%。在ImageNet-1K上,LEAP在Oxford和Paris数据集上的实例检索任务分别提升3.84%和7.75%。此外,该课程通过在训练初始阶段对教师推理实施早停,在ImageNet-100上节省了25.1%的训练FLOPs和21%的训练时间。代码可在以下网址获取:https://github.com/KevinZ0217/LEAP

英文摘要

Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap: the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement compared with the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP.

2606.19817 2026-06-19 cs.CV 新提交

Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

无需训练的合成目标检测数据度量:检测器性能的代理指标

Myeongseok Nam, Donghoon Yeo, Seungwook Kim

发表机构 * GenGenAI

AI总结 提出CCDM度量族,无需训练即可评估合成数据集对下游目标检测的效用,在VisDrone-DET上实现与YOLOv8性能的完全Spearman相关。

Comments 9 pages, 4 figures

详情
AI中文摘要

随着近期图像生成模型的出现,合成数据越来越多地被用于补充有限的真实数据集,以训练计算机视觉模型。然而,并非所有合成数据集都能同等提升性能,其有效性只能通过训练下游模型来评估,这计算成本高且耗时。这个问题在目标检测任务中尤为突出,因为边界框所需的标注更为密集。在本文中,我们提出了一种可预先计算的度量族,称为条件-组合域匹配(CCDM),作为候选合成训练集对下游检测相对效用的代理指标。在VisDrone-DET数据集上的实验表明,CCDM度量族与YOLOv8的下游性能实现了1.0的Spearman相关性,明显优于现有的合成图像评估度量。

英文摘要

With the recent advent of image generative models, synthetic data are increasingly being used to supplement limited real datasets for training computer vision models. However, not all synthetic datasets improve performance equally, and their effectiveness can only be assessed by training a downstream model, which is computationally expensive and time-consuming. This problem is pronounced in the task of object detection, where the required annotations are much denser due to bounding boxes. In this paper, we propose a pre-computable metric family, dubbed Conditional-Composition Domain Match (CCDM), which serves as a proxy for the relative utility of candidate synthetic training sets for downstream detection. Experiments on the VisDrone-DET dataset show that the CCDM metric family achieves a Spearman correlation of 1.0 with the downstream performance of YOLOv8, clearly outperforming existing metrics for synthetic image evaluation.
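To make the reported figure concrete: a Spearman correlation of 1.0 means the metric orders the candidate datasets exactly as the trained detector's scores do. The sketch below computes the tie-free Spearman coefficient from ranks. All numbers are hypothetical placeholders, not results from the paper, and `ranks` and `spearman` are illustrative helpers rather than the authors' code.

```python
# Minimal sketch of Spearman rank correlation (no ties assumed).
def ranks(values):
    """Return ranks 1..n, where rank 1 is the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical proxy-metric scores for four candidate synthetic datasets,
# and hypothetical downstream detector scores after training on each.
metric_scores = [0.12, 0.35, 0.50, 0.81]
detector_scores = [21.0, 24.5, 24.9, 30.2]

rho_same_order = spearman(metric_scores, detector_scores)

# Swapping two neighbouring datasets in the detector ranking lowers rho.
rho_one_swap = spearman(metric_scores, [21.0, 24.9, 24.5, 30.2])

print(rho_same_order, rho_one_swap)
```

Because the coefficient depends only on ranks, a value of 1.0 says nothing about how large the gaps between datasets are, only that their order is preserved.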

2606.19932 2026-06-19 cs.CV cs.AI 新提交

Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

空间感知缩减框架:迈向高效且忠实的视觉状态空间模型

Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Qing Ye, Yueqi Duan, Wentao Feng, Jiancheng Lv

发表机构 * Sichuan University(四川大学) Tsinghua University(清华大学)

AI总结 提出STORM框架,通过保持空间结构完整性解决视觉Mamba模型在token缩减时的性能崩溃问题,无需训练即可实现高精度剪枝。

Comments Accepted by ICML 2026

详情
AI中文摘要

Mamba在建模长视觉序列方面表现出强大的效率。然而,当将token缩减应用于结构增强的Mamba变体时,这些模型会出现严重的性能崩溃。我们将这种退化归因于现有缩减方法在空间上的不可知性,这违反了选择性扫描机制所需的二维结构前提。在这项工作中,我们提出了STORM,一个空间感知的token缩减框架,旨在在压缩过程中保持结构完整性。STORM将缩减重新表述为对空间单元的结构化操作,强制局部约束以保持网格拓扑和邻域一致性。作为一个即插即用模块,STORM无需任何训练即可为现有缩减流程赋予明确的空间感知能力。实验结果表明,STORM在无训练设置下,在多种视觉Mamba骨干网络上实现了最先进的剪枝精度。值得注意的是,STORM在VMamba上实现了显著的精度恢复,在top-1准确率上比先前方法高出63.3%。同时,STORM在PlainMamba上仅造成1.0%的准确率下降,达到了与ViT相当的性能。

英文摘要

Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.

2606.19934 2026-06-19 cs.CV cs.AI 新提交

Speeding up the annotation process in semantic segmentation industrial applications

加速工业应用中的语义分割标注过程

Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

发表机构 * Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence, DaSCI, University of Granada(格拉纳达大学计算机科学与人工智能系,安达卢西亚数据科学与计算智能研究所,DaSCI) Department of Computer Science and Automatic Control, National Distance Education University (UNED)(国立远程教育大学计算机科学与自动控制系)

AI总结 本文利用无监督算法将材料科学中语义分割的标注时间从170小时降至37小时(减少78%),并发布了最大的公开钢微观结构分割数据集。

详情
AI中文摘要

当前的机器学习模型通常需要大量且标注良好的数据集。然而,标注过程常常成为瓶颈,随着复杂性的增加,人为错误的机会也更高。在此背景下,本文旨在利用无监督算法提高工业材料科学中复杂语义分割问题的数据标注效率。以往的研究量化了标注时间,并探索了无监督方法。但据我们所知,这是首次量化无监督算法加速标注过程程度的研究。我们旨在验证这一繁琐过程可以加速的程度,重点关注涉及高分辨率图像每个像素标注的语义分割任务,例如材料科学中的微观结构表征挑战。具体来说,我们证明通过使用无监督计算机视觉算法,标注过程所需的时间可以从170小时减少到37小时,实现了约78%的减少。我们处理的数据集包括尺寸为1280x959和960x703的大图像,这进一步增加了标注任务的复杂性。尽管存在这些挑战,我们创建并共享了迄今为止最大的公开钢微观结构分割数据集,在MIT许可下提供,并具有永久DOI,为该领域贡献了一个完全标注的高分辨率数据集。此外,这是首次将从头开始标注的时间(以往研究中的常见方法)与使用这些无监督算法作为预标注步骤时的标注时间进行比较。此外,我们提供了一个在此数据集上训练的深度学习模型,该模型经过领域专家验证,并部署在工业环境中,作为该公共数据集的初始基准。

英文摘要

Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to a higher chance of human error. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Some previous studies have quantified labeling time, and others have explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, a reduction of approximately 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under the MIT License with a permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a deep learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
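The headline reduction can be reproduced directly from the two durations quoted in the abstract. The short calculation below uses only those two numbers; the variable names are illustrative, and the speed-up factor is derived here rather than stated by the authors.

```python
# Worked arithmetic for the reported annotation-time reduction.
hours_from_scratch = 170        # labeling from scratch, as quoted in the abstract
hours_with_preannotation = 37   # labeling with unsupervised pre-annotation, as quoted

hours_saved = hours_from_scratch - hours_with_preannotation
relative_reduction = hours_saved / hours_from_scratch        # fraction of time saved
speedup_factor = hours_from_scratch / hours_with_preannotation  # derived, not quoted

print(f"saved {hours_saved} h, reduction {relative_reduction:.1%}, speed-up {speedup_factor:.2f}x")
```

The relative reduction rounds to the 78% given in the abstract.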

2606.19965 2026-06-19 cs.CV cs.AI 新提交

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE:多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

发表机构 * Sun Yat-sen University(中山大学) Shaanxi Normal University(陕西师范大学)

AI总结 提出ROSE基准,通过固定视觉场景并变化区域约束与符号输出,测试多模态大模型在不同上下文中将相同视觉证据转化为所需行动的能力,发现模型性能下降高达44.5个百分点,揭示感知到行动的瓶颈。

Comments 29 pages, 11 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越被期望基于视觉信息采取行动,然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动?为了回答这个问题,我们引入了ROSE(Reference-conditioned Oddity and Symbolic Execution),一个受控基准,它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务,ROSE测试模型是否能够推断出隐含的多数参考,并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中,从计数导向任务到区域条件行动的性能下降高达44.5个百分点,而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在,即使同一模型在这些场景和区域上返回正确的计数,而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失,揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

英文摘要

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.20095 2026-06-19 cs.CV 新提交

Stitching and dimensionality effects on large artificially generated volume datasets

拼接和维度对大规模人工生成体数据集的影响

Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

发表机构 * GFZ Helmholtz-Zentrum für Geoforschung(亥姆霍兹地球科学中心) Max Delbrück Center for Molecular Medicine in the Helmholtz Association(亥姆霍兹协会马克斯·德尔布吕克分子医学中心) Helmholtz Imaging(亥姆霍兹成像) Humboldt-Universität zu Berlin(柏林洪堡大学) University of Potsdam(波茨坦大学)

AI总结 研究深度学习生成大图像时的拼接伪影对风格迁移的影响,比较2D与3D模型,发现FID无法检测影响下游任务的细微伪影,3D模型略优但计算成本高。

详情
AI中文摘要

通过深度学习生成大图像需要对输入数据进行分块以适应硬件内存限制,然后组装输出块,这一过程在相邻块边界不对齐时可能引入拼接伪影。虽然已知这些伪影会影响分割任务,但它们对风格迁移生成模型的影响尚不清楚。我们使用在冷冻电镜数据集上训练的cycleGAN模型,研究了三种拼接方法和两种块维度(2D vs 3D)。我们评估了感知质量和下游线粒体分割的性能。主要发现如下:(1)FID分数无法检测到显著影响下游分割性能的细微拼接伪影;(2)具有无伪影拼接的3D模型在下游任务上略优于2D模型,尽管改进勉强证明计算成本合理;(3)2D模型由于更大的批量大小而训练更稳定。此外,我们证明从三个正交方向集成预测可以改善低质量体,但对高质量输出无益。这些结果表明,在大型科学数据集上最大化生成模型性能需要仔细考虑和减轻拼接伪影,并且仅凭感知指标不足以评估生物医学成像中的域适应质量。

英文摘要

Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.

2606.20100 2026-06-19 cs.CV 新提交

WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

WeGenBench:面向文本到图像模型优化的多维诊断基准

Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Dalian University of Technology(大连理工大学) Weixin, Tencent(腾讯微信)

AI总结 提出WeGenBench基准,包含4000个中英双语提示,通过场景分类和多维标签实现跨维度评估,并设计基于视觉语言模型的新颖指标,精准定位模型在特定生成类别中的缺陷。

详情
AI中文摘要

最近的文本到图像生成模型在仅从文本输入合成高度逼真的图像方面展现了卓越的能力。尽管现有基准可以在一定程度上评估各种模型的生成能力,但它们难以全面准确地衡量多个维度的性能,往往无法揭示模型在特定类别中的固有缺陷。为了解决这些局限性,我们提出了WeGenBench,一个新颖的基准,旨在对文本到图像生成能力进行全面、多视角的评估。我们的基准总共包含4000个测试提示,涵盖两个主要类别,并在中英文之间精心平衡,以评估双语和跨文化生成能力。除了宏观场景分类外,我们根据每种语言的不同内容和挑战为每个提示标注了多维标签,从而将生成任务细化为更具体的子类别。通过利用场景分类和多维标签的跨维度评估机制,WeGenBench可以精确定位模型在特定生成类别中的不足。此外,为了更准确地衡量生成质量,我们通过整合视觉语言模型(VLM)设计并验证了几种新颖的评估指标,这些指标从三个核心方面评估模型在特定领域任务上的性能。至关重要的是,我们的方法既产生评估结果,也产生详细的推理轨迹,有助于对评估结果的准确性和合理性进行严格验证。最后,我们对当前最先进的方法进行了系统性的基准测试,并深入分析了现有模型中存在的局限性。

英文摘要

Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.

2606.20196 2026-06-19 cs.CV 新提交

Distill Once, Adapt Life-Long: Exploring Dataset Distillation for Continual Test-Time Adaptation

一次蒸馏,终身适应:探索数据集蒸馏用于持续测试时适应

Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, Kuk-Jin Yoon

发表机构 * KAIST, Visual Intelligence Lab(韩国科学技术院,视觉智能实验室) Chung-Ang University, FOV Lab(中央大学,FOV实验室)

AI总结 提出DO-ALL框架,通过数据集蒸馏生成紧凑的合成锚点,在持续测试时适应中提供稳定参考,无需保留原始源数据,提升长期鲁棒性。

Comments ECCV 2026

详情
AI中文摘要

持续测试时适应(CTTA)旨在通过在线适应无标签数据,在目标域不断变化的情况下保持模型性能。然而,实际部署中由于隐私或许可限制,通常无法保留源数据集,而纯无源CTTA方法在长期分布偏移下容易变得不稳定,遭受累积的自训练错误和灾难性遗忘。我们提出DO-ALL(一次蒸馏,终身适应),一个即插即用的框架,通过数据集蒸馏(DD)以紧凑且保护隐私的形式重新利用源信息。在部署前,DO-ALL执行DD生成一小组合成蒸馏锚点,总结源分布。在适应过程中,每个目标样本与其语义最匹配的锚点对齐,该锚点通过源重放、表示对齐和流形平滑正则化为各种CTTA提供稳定参考。DO-ALL可以无缝集成到现有CTTA算法中,在CIFAR100-C、ImageNet-C和CCC基准测试中持续提升长期鲁棒性。这展示了利用DD在不保留原始源数据的情况下实现稳定连续适应的潜力。代码可在 https://github.com/blue-531/DOALL 获取。

英文摘要

Continual Test-Time Adaptation (CTTA) aims to maintain model performance under evolving target domains by adapting online without labeled data. However, practical deployments often cannot retain the source dataset due to privacy or licensing constraints, and purely source-free CTTA methods tend to become unstable under long-term distribution shift, suffering from compounding self-training errors and catastrophic forgetting. We introduce DO-ALL (Distill Once, Adapt Life-Long), a plug-and-play framework that revisits source information in a compact and privacy-conscious form via Dataset Distillation (DD). Before deployment, DO-ALL performs DD to produce a small set of synthetic distilled anchors that summarize the source distribution. During adaptation, each target sample is matched with its most semantically aligned anchor, which provides a stable reference for various CTTA via source replay, representation alignment, and manifold-smoothing regularization. DO-ALL can be seamlessly integrated into existing CTTA algorithms, consistently improving long-term robustness across CIFAR100-C, ImageNet-C, and the CCC benchmark. This demonstrates the potential of leveraging DD to enable stable and continuous adaptation without retaining raw source data. The code is available at https://github.com/blue-531/DOALL.

2606.20241 2026-06-19 cs.CV 新提交

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

BAFIS:评估现代文本到图像模型中的职业偏见与人类偏好的数据集与框架

Thomas Klassert, Adrian Ulges, Biying Fu

发表机构 * RheinMain University of Applied Sciences(莱茵美因应用科学大学)

AI总结 本研究提出BAFIS平台和包含21,140张多语言提示生成图像的数据集,评估五种文本到图像模型在职业生成中的性别和种族偏见,结合人类偏好反馈,发现系统性偏见并强调纳入人类偏好的必要性。

Comments Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026

详情
AI中文摘要

生成式人工智能有潜力提高生产力并改变创意内容的制作。然而,现有研究表明图像生成模型受到偏见的显著影响。本文研究了文本到图像模型在职业相关图像生成中存在的固有偏见和语言诱导偏见,并通过人类偏好反馈补充了现有指标。我们对五种当前文本到图像模型进行了全面评估:Midjourney v6.1、Stable Diffusion 3 Medium、DALL-E 3、Playground v2.5和FLUX.1-dev,重点关注性别和种族偏见、图像质量以及提示对齐。为促进这一评估,我们开发了“公平图像合成竞技场”(BAFIS),一个旨在收集生成图像中偏见的人类反馈的平台。此外,我们创建了一个包含21,140张使用多语言提示生成的合成图像的数据集,作为我们分析的基础。我们进一步将结果置于更广泛的社会背景中,与德国联邦就业局的官方统计数据进行比较。我们的发现揭示了文本到图像模型中的系统性偏见,且现有评估指标与主观用户评分存在部分相关性。因此,我们的研究强调了纳入人类偏好以开发更公平、更包容的文本到图像模型的必要性。

英文摘要

Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics only partially correlated with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.

2606.20303 2026-06-19 cs.CV 新提交

GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

GEN-Guard:纠正可部署联邦手术AI的泛化失败

Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy

发表机构 * University of Strasbourg, CNRS, INSERM, ICube, UMR7357(斯特拉斯堡大学,法国国家科学研究中心,法国国家健康与医学研究院,ICube实验室,UMR7357) Bioimage Analysis Center, Fondazione Policlinico Universitario Agostino Gemelli IRCCS(生物图像分析中心,阿戈斯蒂诺·杰梅利大学综合医院基金会IRCCS) Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, University of Milan(米兰IRCCS卡格兰达基金会马焦雷综合医院,米兰大学) Monaldi Hospital, AORN dei Colli(莫纳尔迪医院,AORN dei Colli)

AI总结 提出GEN-Guard框架,通过客户端阻塞评估检测性能泄漏,并利用分歧感知蒸馏进行特征级校正,提升联邦手术AI的跨机构泛化能力。

Journal ref Int J Comput Assist Radiol Surg. 2026 Jun 14

详情
AI中文摘要

联邦学习(FL)在手术视频AI中实现了协作模型训练,无需共享敏感数据。然而,标准评估实践——仅基于参与医院的验证数据选择“最佳”全局模型——可能导致次优的部署选择。我们将这种关键失败模式识别为性能泄漏,即所选模型过拟合内部联邦数据,无法泛化到未见机构。我们提出GEN-Guard,一个实用的后处理框架,用于检测和纠正联邦手术AI中的泛化失败。它集成了通过客户端阻塞评估(CBE)进行泛化检测,该方法在隔离的客户端分布上验证性能以防止性能泄漏,以及通过分歧感知蒸馏(DAD)进行泛化纠正,该方法学习自适应的特征级校正以实现跨机构鲁棒性。两个组件在标准FL收敛后运行,同时为零样本适应未见环境提供鲁棒支持。我们首先量化了性能泄漏的严重性,观察到在标准评估下模型选择失败(MSF)超过80%。GEN-Guard在两个多中心临床挑战上进行了评估:腹腔镜胆囊切除术中的手术阶段识别和结肠镜中的息肉分割。在两个数据集上,GEN-Guard一致地纠正了这些失败,将联邦内F1分数提高了最多2个点,未见机构性能提高了最多3个点,最差情况机构性能提高了3-9个点。性能泄漏是联邦手术AI中一个系统性且以前未被充分认识的风险。GEN-Guard为检测和纠正此类失败提供了实用解决方案。通过提高跨机构鲁棒性和零样本泛化,它增强了FL在真实世界手术部署中的可靠性。

英文摘要

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.

2606.20455 2026-06-19 cs.CV 新提交

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

PCFootprint:用于从航空LiDAR点云中提取矢量化建筑足迹的大规模数据集与基准

Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu

发表机构 * School of Architecture and Urban Planning, Shenzhen University(深圳大学建筑与城市规划学院)

AI总结 提出首个大规模航空激光扫描点云建筑足迹提取数据集PCFootprint,含33000个瓦片及跨域测试集,通过评估主流方法揭示复杂地理环境下的挑战。

Comments 14 pages, 9 figures

详情
AI中文摘要

建筑足迹提取是摄影测量、遥感和计算机视觉中的基本任务。近年来,基于图像的方法在高分辨率光学影像的矢量化足迹提取方面取得了显著进展。然而,光学影像本质上易受遮挡、透视畸变和残余地形位移的影响,导致足迹提取不完整或错位。此外,缺乏显式高程信息限制了其在细节层次建筑建模中的直接适用性。本文提出PCFootprint,这是首个用于从机载激光扫描点云中提取足迹的大规模公共数据集。PCFootprint包含来自爱沙尼亚土地和空间发展局的33000个瓦片,覆盖多样化的城市和乡村景观。每个瓦片大小为128×128米,并配有与点云对齐的系统性矢量化足迹。该数据集包括一个3000个瓦片的跨域测试集,用于评估跨地理区域的泛化能力。我们通过评估主流方法建立了全面的基准。实验结果表明,在复杂地理环境中存在高类内方差、数据不平衡和噪声等显著挑战。我们相信PCFootprint将推动建筑建模、城市场景理解和地理空间分析的未来研究。PCFootprint数据集公开于:https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint。

英文摘要

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m, with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB 新提交

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

SARLO-80:全球斜距SAR语言光学数据集80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

发表机构 * DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DEMR-ONERA,巴黎-萨克雷大学) DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DTIS-ONERA,巴黎-萨克雷大学) Hugging Face

AI总结 为解决高分辨率SAR与光学图像及文本对齐的数据稀缺问题,基于Umbra SLC数据构建了80cm斜距网格的SAR-光学-文本三元组数据集,支持跨模态检索与生成任务。

详情
AI中文摘要

多模态基础模型因大规模光学基准而快速发展,但合成孔径雷达(SAR)的类似资源仍然有限。现有的SAR-光学数据集主要依赖低分辨率、仅强度的地面距离检测(GRD)产品,未保留复值SAR测量或原生采集几何,限制了基于物理的多模态学习。特别是,结合甚高分辨率(VHR)SAR SLC、对齐光学图像和自然语言描述的大规模公开数据集仍然缺乏。我们提出了一个基于开源Umbra聚束模式采集的传感器独立复数据(SICD)构建的VHR SAR-光学-文本数据集。从约2500个全球场景(VV/HH,20cm至2m原生分辨率)出发,通过带限FFT重采样将所有SAR数据标准化到80cm斜距网格,并将图像分割为1024×1024的图块。对于每个SAR图块,我们检索高分辨率光学图块,并利用局部坐标对应关系将其扭曲到SAR网格以实现局部像素级对齐。我们进一步为每个样本生成三种描述变体(短/中/长),以支持视觉-语言训练和评估。我们的数据集包含119,566个三元组(复数和幅度斜距SAR图块、对齐光学图块、自然语言描述),覆盖72个国家的257个地点以及广泛的地物类型和基础设施。我们发布固定的训练/验证/测试划分以及完整的预处理和基线代码,以支持在原生SAR几何中进行跨模态检索和条件生成的多模态对齐的可重复基准测试。该数据集在Hugging Face Hub上公开可用,网址为https://huggingface.co/datasets/ONERA/SARLO-80。

英文摘要

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm to 2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024×1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

2606.20536 2026-06-19 cs.CV 新提交

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

FID 彩票:量化生成模型评估中的隐藏随机性

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

发表机构 * Kyutai UC Berkeley(加州大学伯克利分校)

AI总结 研究FID作为随机变量在训练和生成种子上的方差,发现重训练比重采样导致更大FID波动,提出新评估协议:使用逐单元格(per-cell)最优引导、报告多个训练种子的误差条。

Comments Website: https://kyutai.org/fid-lottery

详情
AI中文摘要

Frechet Inception Distance (FID) 是图像生成的事实标准仲裁者,但大多数论文仅报告来自单个训练模型使用单个采样种子的单一数值。如果我们重新训练模型,或仅重新从中采样,该数字的可重复性如何?在本文中,我们将 FID 视为训练和生成种子二维面板上的随机变量,并直接在数百个在类别条件 ImageNet 256x256 上训练的 SiT 网络上测量其方差。我们报告了令人惊讶的发现:(a) 使用相同配方但不同种子重新训练模型所引起的 FID 变动(在 Inception 特征空间中),是从固定网络重新采样所引起变动的 3.2 倍。(b) 这一差距由三个因素驱动:随机初始化、数据排序和流匹配损失的每步高斯噪声。(c) 增加计算量或模型大小几乎不会缩小离散程度,FID 变异系数 (CoV) 始终保持在 1-2% 的范围内。(d) 逐单元格(per-cell)的无分类器引导调优使离散程度减半,但会重新洗牌哪些种子效果最好;幸运的训练种子达到相同 FID 所需的计算量最多可比不走运的种子少 2 倍。基于这些发现,我们推荐一种新的 FID 评估协议:在逐单元格最优引导下进行评估,将任何低于经验测量的约 1.3% CoV 的 FID 差距视为不确定,并报告多个训练种子的误差条,而不是单一的 FID 数值。

英文摘要

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
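
The reporting protocol recommended in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' code: the FID values are made-up placeholders, the function names `coefficient_of_variation` and `gap_is_conclusive` are hypothetical, and comparing the relative gap between mean FIDs directly against the CoV threshold is only one simple reading of "treat any FID gap below the measured ~1.3% CoV as inconclusive".

```python
from statistics import mean, stdev


def coefficient_of_variation(fids):
    """Sample standard deviation divided by the mean, as a fraction."""
    return stdev(fids) / mean(fids)


def gap_is_conclusive(fids_a, fids_b, cov_threshold=0.013):
    """Return True if the relative gap between mean FIDs exceeds the threshold.

    cov_threshold=0.013 mirrors the ~1.3% CoV quoted in the abstract; the
    comparison rule itself is an assumption made for illustration.
    """
    mean_a, mean_b = mean(fids_a), mean(fids_b)
    relative_gap = abs(mean_a - mean_b) / min(mean_a, mean_b)
    return relative_gap > cov_threshold


# Placeholder FIDs, one per training seed (not results from the paper).
model_a = [2.10, 2.13, 2.08, 2.12]
model_b = [2.11, 2.09, 2.14, 2.10]
model_c = [1.95, 1.97, 1.93, 1.96]

# Report an error bar over training seeds rather than a single number.
print(f"model_a: {mean(model_a):.3f} +/- {stdev(model_a):.3f} "
      f"(CoV {coefficient_of_variation(model_a):.2%})")
print("a vs b conclusive:", gap_is_conclusive(model_a, model_b))
print("a vs c conclusive:", gap_is_conclusive(model_a, model_c))
```

In practice the threshold would be measured on the model family at hand instead of being hard-coded.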

2606.20542 2026-06-19 cs.CV 新提交

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

CalTennis:大型多视角网球视频数据集及单目到3D姿态估计基准

Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona

发表机构 * California Institute of Technology(加州理工学院)

AI总结 提出CalTennis大型多视角网球视频数据集(1100万帧,40名球员),用于评估野外单目到3D姿态估计,并发现现有模型在深度估计和足部接触方面存在不足。

详情
AI中文摘要

Caltech网球数据集(CalTennis)是一个大规模视频基准,用于评估野外单目到3D姿态估计。CalTennis包含超过1100万帧(51小时)来自40名球员的网球练习和比赛视频,由2-6台同步摄像机以60 Hz频率采集。它比现有的野外人体运动视频数据集大10倍,比现有的MOCAP真值数据集大3倍,并且是第一个提供专家运动同步多视角记录的大规模基准。多视角设置使得对单目到3D姿态估计算法进行廉价、无标签的评估成为可能。我们描述了一个简单、标准化的协议,无需专业设备或专业知识即可进行数据收集,并实现了全自动视频校准和同步。在CalTennis上对最先进的单目到3D姿态方法进行基准测试,我们发现,虽然3D关节角度恢复现在相当准确,但所有模型在一致地估计深度和足部接触方面仍然存在困难。我们进一步提出了两个新的性能指标——步法和稳定性,并定性研究了身体形状不一致性。这些指标揭示了以前未充分探索的失败模式,并为姿态估计和动作分析的改进提供了具体机会。

英文摘要

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, and qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.

2606.20545 2026-06-19 cs.CV 新提交

Current World Models Lack a Persistent State Core

当前世界模型缺乏持久状态核心

Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

发表机构 * University of Science and Technology of China(中国科学技术大学) Beijing Innovation Center of Humanoid Robotics (X-Humanoid)(北京人形机器人创新中心) NLPR, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别国家重点实验室) Independent Researcher(独立研究者) Dresden University of Technology(德累斯顿工业大学) Peking University(北京大学)

AI总结 提出WRBench基准测试,发现现有世界模型在观测中断时无法维持世界状态演化,强调物理状态核稳定性应成为世界模型设计首要目标。

Comments 39 pages, 16 figures

详情
AI中文摘要

世界模型日益被视为迈向通用人工智能的关键一步,然而对物理世界建模需要的不仅仅是按需生成令人信服的帧:它需要一个内部世界状态随时间持续演化,与观测解耦,使得物体持久存在、事件运行至结束,无论是否有相机在观察,正如月球在无人注视时仍保持轨道运行一样。这一要求是现有基准的盲点,它们奖励表面属性如保真度、运动和相机可控性,却从不询问生成的世界在未被观测时是否持续演化。我们引入 WRBench,首个系统性的诊断基准,将相机运动视为对可观测性的干预,并将评估分解为一个人工校准的链条:询问相机是否执行了请求的交互,场景在视野内是否保持连续和可识别,以及返回的目标是否与已启动的事件保持一致。在来自 23 个模型(涵盖四种控制范式)的 9,600 个视频中,一个发现顽固地存在:当前系统将观测到的世界维持为跟踪镜头,返回的目标恢复为被遗弃时的状态,而非在未被观测时推进事件。由于这一失败在控制范式、模型家族和规模增量中重复出现,稳健的世界状态演化并非来自更清晰的图像、更严格的控制、更丰富的几何先验或单纯的参数数量。因此,我们主张物理状态核的稳定性和视角干预下世界线的一致性应成为世界模型设计的一级目标,使得世界模型捕捉世界将如何展开,而非下一帧如何呈现。

英文摘要

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

2606.19712 2026-06-19 cs.LG cs.CV 交叉投稿

Efficient Neural Network Model Selection for Few-Class Application Datasets

面向少类应用数据集的高效神经网络模型选择

Bryan Bo Cao, Abhinav Sharma, Lawrence O'Gorman, Michael Coss, Shubham Jain

发表机构 * Nokia Bell Labs(诺基亚贝尔实验室)

AI总结 针对实际应用中常见的少类数据集,提出基于数据属性的分类难度度量,实现比传统方法快6-29倍的模型选择,并扩展模型族至更小规模,在移动机器人等场景中提升效率。

Comments 36 pages, 9 tables, 13 figures

详情
AI中文摘要

尽管大量工作集中在开发和基准测试高性能神经网络上,但较少关注已知的数据集属性如何指导高效的模型选择。神经网络模型通常在数千类数据集上评估,然而许多实际应用涉及少于十类。为了解决这一被忽视但常见的情况,我们基于数据侧属性开发了一种分类难度度量,并展示了它如何为少类数据集实现更高效的模型选择,而传统方法在此效果较差。我们将此现象称为“少类独特性”。我们的度量允许比重复训练和测试快6到29倍的模型和数据集比较。利用这一洞察,我们将缩放模型族扩展到已发布的最小模型以下,在相似精度下实现更高效率,例如在移动机器人任务中模型比YOLOv5-nano小42%。针对资源受限的应用,我们在移动机器人、无人机和物联网场景中展示了少类模型选择,突出了在不牺牲性能的情况下效率的实际提升。

英文摘要

While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon "few-class distinctiveness". Our metric allows comparison of models and datasets 6 to 29× faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.

2606.20272 2026-06-19 cs.RO cs.CV 交叉投稿

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

高效连接真实场景与合成数据生成以支持基于AI的认知机器人和计算机视觉应用

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫生产设备和设计技术研究所) TU Berlin(柏林工业大学)

AI总结 本文讨论当前AI视觉模型在认知机器人应用中的局限,并提出通过连接仿真与真实世界训练数据生成来弥合领域差距的方法。

Comments Accepted and best paper award at MHI-Kolloquium 2024

详情
AI中文摘要

AI视觉模型是认知机器人在工业和家庭应用中潜在用例场景的驱动因素。基于最新的AI成就,已经提出了从语义环境分析到6D和抓取姿态估计的大量方法。然而,这些进展仍需要在训练数据和AI架构方面更强大、更高效的方法,使二者能够协同应对当前挑战、精度限制以及跨越领域差距的可扩展性。在本文中,我们讨论了这些当前限制,以及相关最先进技术中正在挑战这些限制的趋势。此外,我们讨论了当前在弥合仿真与真实世界应用之间领域差距方面的工作进展,其思路是在训练数据生成中将二者连接起来。

英文摘要

AI vision models are a driving factor for potential use cases of cognitive robotics in industrial and household applications. A large array of methods, ranging from semantic environment analysis to 6D and grasp pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods with respect to training data and AI architectures that can, in synergy, tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in training data generation.

2411.10077 2026-06-19 cs.CV 版本更新

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

多视角融合的分层互蒸馏:从所有可能的视角组合中学习

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(汉阳大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 本文提出多视角融合分层互蒸馏(HMDMV)方法,对单视角、部分多视角和全多视角等所有视角组合进行预测,并通过分层互蒸馏和基于不确定性的加权提升视角间一致性,降低低置信度视角的影响。

Journal ref Pattern Recognition 178 (2026) 113432

详情
AI中文摘要

多视角学习常常难以有效利用从不同角度和位置拍摄的图像,而面向非结构化多视角图像的学习方法在很大程度上仍未得到充分探索。我们提出了一种新颖的多视角融合分层互蒸馏(HMDMV)方法,可同时处理结构化和非结构化的多视角场景。该方法利用所有可能的视角组合进行预测:单视角、部分多视角和全多视角。方法先为每种视角组合生成预测,再通过分层互蒸馏增强视角间的一致性。基于不确定性的加权机制根据各视角组合的预测置信度调整其影响,进一步优化融合过程,并降低低置信度视角的影响。在大规模结构化和非结构化数据集上的大量实验表明,HMDMV始终达到最先进的分类准确率。HMDMV的另一独特优势是推理更灵活:推理时使用的视角数量可以多于或少于训练时的数量,且无需额外处理。我们还提供了一个轻量版本,通过在每次训练迭代中随机采样视角组合子集的高效策略来降低训练成本。这些结果凸显了HMDMV在视角可用性多变或不完整的真实场景中的鲁棒性。代码公开于 https://github.com/labhai/HMDMV。

英文摘要

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.
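
The "all possible view combinations" in this abstract are the non-empty subsets of the available views, so V views give 2^V - 1 combinations. The sketch below only enumerates and groups them; it is not the HMDMV implementation, and the helper names `all_view_combinations` and `group_by_level` as well as the view labels are hypothetical.

```python
from itertools import combinations


def all_view_combinations(view_ids):
    """Return every non-empty subset of the views, ordered by subset size."""
    views = list(view_ids)
    return [
        combo
        for size in range(1, len(views) + 1)
        for combo in combinations(views, size)
    ]


def group_by_level(view_ids):
    """Group subsets into single-view, partial multi-view, and full multi-view."""
    views = list(view_ids)
    levels = {"single": [], "partial": [], "full": []}
    for combo in all_view_combinations(views):
        if len(combo) == 1:
            levels["single"].append(combo)
        elif len(combo) == len(views):
            levels["full"].append(combo)
        else:
            levels["partial"].append(combo)
    return levels


levels = group_by_level(["front", "side", "top"])
for name, combos in levels.items():
    print(name, combos)
```

Because the number of subsets grows exponentially with the view count, sampling a random subset of these combinations per training iteration, as the light version described in the abstract does, keeps the cost bounded.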

2512.24592 2026-06-19 cs.CV 版本更新

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

GH-ESD:基于假设驱动的实例级视觉任务错误切片发现

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

发表机构 * Apple(苹果公司)

AI总结 提出GH-ESD框架,通过LLM生成假设与视觉语言模型验证,在实例级任务中自动发现空间关系错误切片,并构建GESD基准,显著提升检测和分割任务的错误切片发现精度。

Comments Accepted by ECCV2026

详情
AI中文摘要

视觉模型在语义一致子集上的系统性失败(称为错误切片)揭示了鲁棒性和评估的局限性。现有的切片发现方法主要将切片建模为表示空间中的聚类或预定义属性的组合。虽然对图像级分类有效,但这种公式对于目标检测和分割等实例级任务不足,因为失败通常源于上下文关系性和空间定位的视觉模式。我们提出GH-ESD(基于假设驱动的实例级错误切片发现),一个生成与验证框架,将切片发现重新表述为基于假设的生成和统计验证。GH-ESD利用LLM先验和基于空间的视觉证据构建关系失败假设,通过视觉语言模型在实例级发现假设切片,并通过实例级错误的统计趋势分析进行验证。我们还引入了GESD(基于空间的错误切片数据集),一个用于实例级错误切片发现的新基准,提供由专家定义且基于空间的切片,这些切片源自检测和分割失败。大量实验表明,GH-ESD持续优于基线,在检测任务的GESD基准上Precision@10提高了0.10(0.73对比0.63),同时也支持分割场景。GH-ESD识别出可解释的切片,促进可操作的模型改进。GESD数据集将在接收后公开。

英文摘要

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate and verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.
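
For readers unfamiliar with the reported metric, Precision@10 is the fraction of the ten highest-ranked discovered slices that are correct. The sketch below is a generic illustration: the slice names are invented, and treating a discovered slice as correct when it exactly matches a ground-truth slice is an assumption, since the abstract does not say how GESD matches predictions to expert-defined slices.

```python
def precision_at_k(ranked_slices, true_slices, k=10):
    """Fraction of the top-k ranked slices that appear in the ground-truth set.

    Divides by k even if fewer than k slices were returned, which is the
    usual convention for Precision@k.
    """
    top_k = ranked_slices[:k]
    hits = sum(1 for s in top_k if s in true_slices)
    return hits / k


# Hypothetical slice descriptions, ranked by a discovery method.
ranked = [
    "person behind fence", "cup on cluttered desk", "car at image border",
    "dog under table", "bike next to bus", "bird on wire",
    "boat near pier", "chair behind plant", "sign at night", "bag on lap",
]
ground_truth = {"person behind fence", "car at image border", "bird on wire"}

print(precision_at_k(ranked, ground_truth, k=10))
```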

2604.13240 2026-06-19 cs.CV cs.LG 版本更新

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

基于概念的可解释AI的高分辨率景观数据集及其在物种分布模型中的应用

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

发表机构 * Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554(雷恩第二大学、法国国家科学研究中心、南特大学、布雷斯特大学、LETG、UMR 6554) LTSER Zone Atelier Armorique(阿莫里克长期社会生态研究区) University of Würzburg, Center for Artificial Intelligence and Data Science(维尔茨堡大学人工智能与数据科学中心)

AI总结 提出首个基于概念的可解释AI方法用于物种分布模型,利用高分辨率多光谱和LiDAR无人机影像构建景观概念数据集,通过Robust TCAV量化景观概念对模型预测的影响,案例研究验证了方法的有效性。

详情
AI中文摘要

绘制物种空间分布对于保护政策和入侵物种管理至关重要。物种分布模型(SDMs)是完成此任务的主要工具,具有两个目的:实现稳健的预测性能,同时提供关于分布驱动因素的生态见解。然而,深度学习SDMs日益增长的复杂性使得提取这些见解更具挑战性。为了调和这些目标,我们提出了首个基于概念的可解释AI(XAI)在SDMs中的实现。我们利用Robust TCAV(使用概念激活向量进行测试)方法量化景观概念对模型预测的影响。为此,我们提供了一个新的开放获取的景观概念数据集,该数据集源自高分辨率多光谱和LiDAR无人机影像。它包括跨越15个不同景观概念的653个斑块和1,450个随机参考斑块,旨在适用于广泛的物种。我们通过两类水生昆虫(襀翅目和毛翅目)的案例研究,使用两个卷积神经网络和一个视觉Transformer来展示这种方法。结果表明,基于概念的XAI有助于根据专家知识验证SDMs,同时发现产生新生态假说的新颖关联。Robust TCAV还提供了景观层面的信息,对政策制定和土地管理有用。代码和数据集公开可用。

英文摘要

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

2604.13416 2026-06-19 cs.CV cs.AI 版本更新

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K:用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

发表机构 * University of Technology Sydney(悉尼科技大学) University of Sydney(悉尼大学) National Yang Ming Chiao Tung University(阳明交通大学)

AI总结 为弥补无干扰辐射场领域缺乏大规模真实世界数据集的空白,构建了包含1048个场景、每场景提供干净和杂乱图像集的DF3DV-1K数据集,并基于此基准测试了九种最新方法,识别出最鲁棒的方法和最具挑战的场景。

详情
AI中文摘要

辐射场领域的进展已实现逼真的新视角合成。在多个领域中,已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而,对于无干扰辐射场,每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏,限制了发展。为填补这一空白,我们引入了DF3DV-1K,一个包含1048个场景的大规模真实世界数据集,每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像,模拟随意拍摄,涵盖128种干扰类型和161种场景主题,包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K,我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试,识别出最鲁棒的方法和最具挑战的场景。除了基准测试,我们还展示了DF3DV-1K的一个应用:微调基于扩散的2D增强器以改进辐射场方法,在保留集(例如DF3DV-41)和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展,并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取:https://johnnylu305.github.io/df3dv1k_web/。

英文摘要

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出CADBench,一个统一的多模态CAD程序生成基准,包含18000个样本和六类基准,评估11种视觉语言模型,揭示了CAD程序生成中的三种常见失败模式。

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2606.10136 2026-06-19 cs.CV 版本更新

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

iSAGE: 一种通过稀疏点监督进行遥感语义分割的人机协同框架

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

AI总结 提出iSAGE框架,通过专家点击模型错误像素而非任意像素,无需辅助机制即可匹配密集监督,在BsB Aerial和ISPRS Vaihingen数据集上以极低标注率达到与密集监督相当的性能。

Comments 47 pages, 8 tables, 6 figures

详情
AI中文摘要

遥感中的语义分割需要昂贵的像素级标注,且由于模型很少能在传感器、平台或地理区域间迁移,几乎每个问题都需要新的数据集。现有的人机协同框架通过辅助机制(伪标签、传播、CRF、基础模型提示、辅助头)将稀疏点击扩展为密集监督,这些机制均基于模型的预测分布。在该分布中,一个自信的错误像素与一个自信的正确像素在结构上无法区分,因此任何读取该分布的规则都无法区分两者;区分信号位于模型外部。本文假设,专家针对模型错误(而非任意像素)的点击足以匹配密集监督,无需扩展机制。iSAGE(基于专家指导的迭代稀疏标注)在一个集成的开源平台上实现了这一假设,其中错误加权损失放大了每次点击的梯度,而标注记录本身即为数据集,可扩展、可纠正、可审计。实验采用最小努力策略:每帧每类最多一个标注像素。在BsB Aerial上,iSAGE恢复了密集监督的97.2%(在0.040%的像素上达到74.79% mIoU),并呈现出对比性的类别动态:无定形类别(渗透区域)从种子点开始饱和,而小类别(汽车)需要后期迭代的努力。在ISPRS Vaihingen(外部基准)上,iSAGE以0.011%的像素达到76.78% mIoU,匹配密集基线(76.65%)并超越所有已发表方法。在相同流程下,四种输出读取机制(预算1-100倍的oracle熵、阈值0.90-0.99的伪标签、基于CRF的传播、均匀随机)比iSAGE低7.4至14.5个百分点。在调查的31种方法中,iSAGE是唯一无需辅助机制即可运行的迭代式人机协同框架。

英文摘要

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1-100x, pseudo-labels across thresholds 0.90-0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

2507.23534 2026-06-19 cs.LG cs.CV 版本更新

Continual Learning with Support Boundary Experience Blending

支持边界经验混合的持续学习

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

发表机构 * National Taiwan University(国立台湾大学)

AI总结 提出经验混合框架,通过差分隐私启发的噪声生成支持边界数据,联合训练样本和边界数据以正则化决策边界,在多个数据集上提升持续学习准确率。

详情
AI中文摘要

持续学习旨在减轻模型在顺序任务训练时的灾难性遗忘。常见方法经验回放存储过去的样本,但仅稀疏地近似数据分布,导致决策边界脆弱且过于简化。我们通过引入支持边界数据来解决这一限制,该数据通过向潜在特征注入受差分隐私启发的噪声而生成,得到边界邻近表示,隐式正则化决策边界。基于此,我们提出经验混合框架,通过双模型聚合策略联合训练样本和支持边界数据。经验混合有两个组成部分:(1) 潜在空间噪声注入以生成支持边界数据,(2) 联合利用样本和支持边界数据的端到端训练。与标准经验回放不同,支持边界数据丰富了决策边界附近的特征空间,从而实现更稳定和鲁棒的持续学习。在CIFAR-10、CIFAR-100、Tiny ImageNet和ImageNet1K上的大量实验分别展示了10%、6%、14%和2%的持续准确率提升。

英文摘要

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained with sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 14%, and 2%, respectively.
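
The first EB component, latent-space noise injection, can be pictured with the sketch below. It rests on assumptions the abstract does not confirm: zero-mean Gaussian noise stands in for the "differential-privacy-inspired" noise, `noise_scale` is an arbitrary illustrative value, and the function name `make_support_boundary_data` is hypothetical. The dual-model aggregation and the joint training step are not modeled.

```python
import random


def make_support_boundary_data(latents, noise_scale=0.1, seed=0):
    """Perturb each latent feature vector with zero-mean Gaussian noise.

    latents: list of feature vectors (lists of floats) from stored exemplars.
    Returns new vectors of the same shape; the inputs are left unchanged.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [x + rng.gauss(0.0, noise_scale) for x in vec]
        for vec in latents
    ]


# Placeholder latent features for two stored exemplars.
exemplar_latents = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]]
sbd = make_support_boundary_data(exemplar_latents, noise_scale=0.1, seed=42)

# A replay batch would then mix the original and the perturbed features.
blended_batch = exemplar_latents + sbd
print(len(blended_batch), "feature vectors in the blended batch")
```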

13. 其他/综合视觉 14 篇

2606.19835 2026-06-19 cs.CV 新提交

Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision

神经事件:用于事件视觉的离散异步自编码器

Roberto Pellerito, Daniel Gehrig, Shintaro Shiba, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich(苏黎世大学机器人感知组) University of Pennsylvania(宾夕法尼亚大学) The University of Tokyo(东京大学) Keio University(庆应义塾大学)

AI总结 提出将事件流重新标记为少量高信息量的“神经事件”,每个事件代表一个局部时空上下文窗口的离散可学习编码,在物体检测和分类任务中达到或超越现有方法,同时将事件率降低2.0倍。

详情
AI中文摘要

事件相机通过将动态场景表示为微秒分辨率的连续事件流,以卓越的时间保真度捕捉动态场景。然而,每个单独的事件仅携带最小的语义价值,仅仅表示局部亮度变化。为了获得有意义的信号,下游算法需要快速整合来自潜在大量低信息事件流的线索。然而,当前的架构很容易被淹没,难以在捕捉细粒度时间动态和维持可管理的数据吞吐量之间取得平衡。本文提出一个框架,将事件流重新标记为少量高信息量的“神经事件”,每个事件代表一个局部时空上下文窗口,并带有离散可学习编码。每次该编码翻转时,触发一个神经事件,产生高度压缩的数据流。我们证明,在物体检测和分类任务中,基于神经事件训练的网络与最先进方法性能相当或更优,同时将事件率降低2.0倍。

英文摘要

Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond-resolution events. Each individual event, however, only carries minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics and maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events match or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.
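
The triggering rule described here, emitting a neural event only when the discrete code changes, resembles change detection on a code sequence. The sketch below shows that rule for a single spatial location. It is an assumption-laden illustration: the integer codes are placeholders for the output of the learned autoencoder, which is not modeled, and emitting an event for the very first code is a choice made here, not something the abstract states.

```python
def neural_events(code_stream):
    """Yield (timestamp, code) only when the code differs from the previous one.

    code_stream: iterable of (timestamp, code) pairs for one location,
    ordered by time. The first pair is always emitted.
    """
    previous = None
    for timestamp, code in code_stream:
        if code != previous:
            yield timestamp, code
            previous = code


# Placeholder codes sampled at six consecutive time steps.
stream = [(0, 3), (1, 3), (2, 3), (3, 7), (4, 7), (5, 3)]
events = list(neural_events(stream))
print(events)
print(f"{len(stream)} code samples -> {len(events)} neural events")
```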

2606.19383 2026-06-19 cs.RO cs.CV Cross-listed

3D Scene Graphs: Open Challenges and Future Directions

3D场景图:开放挑战与未来方向

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

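AI summary This survey gives a unified review of the construction, applications, and evaluation of 3D scene graphs (3DSGs), analyzing existing modeling choices and open challenges with the aim of enabling robust deployment.

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Details
AI-generated Chinese abstract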

3D场景图(3DSG)通过将几何基础与环境的语义和关系抽象相结合,已成为空间AI的强大表示。其表现力使其与机器人和计算机视觉中的广泛问题相关,包括操作、导航、任务规划、场景理解等。然而,该领域仍然分散:不同的社区采用不同的公式、构建流程和评估协议,使得比较方法、识别共同假设以及评估鲁棒实际部署的剩余挑战变得困难。本综述提供了对3DSG的统一和批判性回顾,特别强调开放挑战和未来方向。我们首先在共同定义下形式化3DSG,并分析表征现有公式的主要建模选择,包括节点和边属性、层次结构、动态场景表示和可供性感知扩展。然后,我们回顾如何从原始感官观察构建3DSG,讨论最常见的术语、约定和技术。最后,我们检查下游应用和评估策略,从内在图质量到任务级性能。为支持社区,我们还提供了一个专用网站,组织和扩展所调查的内容,可访问此 https URL。

English abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.

2606.20291 2026-06-19 cs.LG cs.CV Cross-listed

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

整合国家森林清查、机载激光雷达和卫星影像,利用计算机视觉实现森林结构的全覆盖制图

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

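Affiliations * Vibrant Planet Public Benefit Corporation

AI summary Proposes the VibrantForests framework, which combines satellite imagery, lidar-derived samples, and computer vision to map canopy cover, canopy height, biomass, and other forest attributes across the contiguous United States at 10-meter resolution, reducing saturation and regression-to-the-mean effects.

Details
AI-generated Chinese abstract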

遥感技术越来越被依赖,以提供可操作的科学研究,用于大型景观的森林和野火风险管理。全覆盖、每年更新的地图是有效森林管理的持续需求。许多规划系统和数据收集结合了不同目的、年份和预测质量的异质数据源,导致运营规划系统中的混淆行为。我们介绍了VibrantForests框架,该框架被开发并应用于绘制森林属性,为有效的森林和野火规划提供一致的基础。VibrantForests包括一个基于卫星的森林结构模型,该模型在激光雷达衍生的样本上训练,并应用于美国本土,以10米分辨率同时生成冠层覆盖度、冠层高度、地上活树生物量、胸高断面积和二次平均直径的估计。我们展示了跨越从稀疏冠层/低生物量到密集冠层/高生物量的全部森林条件的预测能力。结果表明,我们的模型扩展了在类似被动传感器模型中常见的饱和范围,并减少了回归均值行为,该行为通常在小/稀疏条件下高估森林属性,在大/密集条件下低估森林属性。VibrantForests框架通过以年度节奏和10米分辨率提供管理相关属性的一致全覆盖估计,解决了大面积森林和野火规划中的一个关键限制。

English abstract

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG Cross-listed

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Token 是群元素:关于矩阵李群上的李代数注意力

Przemyslaw Musialski

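Affiliations * New Jersey Institute of Technology

AI summary Proposes Lie-Algebra Attention, which defines tokens as matrix Lie group elements and uses the Lie-algebra norm of the relative pose as the attention score, with no learned kernel or representation-theoretic machinery; it applies to non-compact, non-abelian groups such as the affine full-frame groups.

Comments preprint, 19 pages, 3 figures

Details
AI-generated Chinese abstract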

我们将注意力token置于群上:一个token是矩阵李群$G$的一个元素$g_i$——一个纯粹的变换,没有特征负载,也没有外部作用$\rho(g)$承载它。据我们所知,这是第一个token为裸矩阵李群元素的注意力构造:它们的分数是相对位姿的闭式代数范数,而非学习核,并且它达到了每个基于不可约表示或满射指数的方法必须排除的仿射全帧群。我们称之为李代数注意力。一旦token是群元素,其余部分无需通常的表示论机制。一对的相对几何是规范的,即$g_i^{-1} g_j$,因此成对不变量$w_{ij} = \log(g_i^{-1} g_j)$是内在的而非设计的;在$G$对角作用下的等变性是重言式的,且余循环条件自动成立。注意力分数是负平方代数范数$s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$:在块加权Frobenius内积下的规范邻近核,无需不可约表示、球谐函数、Clebsch-Gordan积或学习核。该构造适用于任何矩阵李群,在包含相对位姿的选定对数图上,包括具有尺度和剪切的非紧致非阿贝尔仿射群,这些是向量token注意力方法无法达到的:既不是不可约表示传统,也不是满射指数方法。在SE(2)、SO(3)和Aff(2)上的三个序列补全实验证实了这一点:闭式分数匹配了相同不变量上的学习MLP核,并在SE(2)上优于它,使用的分数参数少50到80倍,而向量token基线破坏了不变量,误差达五到十二个数量级。

English abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
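As a concrete reading of the score $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$, the following is a minimal sketch for SE(2), where the matrix logarithm has a closed form. The `(theta, x, y)` storage format, the two block weights `lam_rot` and `lam_trans`, and all function names are assumptions chosen for illustration; the paper's own parameterisation may differ.

```python
import math


def compose(h, g):
    """Group product h * g for SE(2) elements stored as (theta, x, y)."""
    th, xh, yh = h
    tg, xg, yg = g
    c, s = math.cos(th), math.sin(th)
    return (th + tg, c * xg - s * yg + xh, s * xg + c * yg + yh)


def relative_pose(gi, gj):
    """Return g_i^{-1} g_j."""
    ti, xi, yi = gi
    tj, xj, yj = gj
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the translation difference by R_i^T.
    return (tj - ti, c * dx + s * dy, -s * dx + c * dy)


def log_se2(g):
    """Closed-form logarithm on the chart theta in (-pi, pi]."""
    theta, x, y = g
    theta = math.atan2(math.sin(theta), math.cos(theta))
    if abs(theta) < 1e-9:
        a, b = 1.0, 0.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta
    det = a * a + b * b
    # Invert V = [[a, -b], [b, a]], which maps the algebra translation to t.
    return theta, (a * x + b * y) / det, (-b * x + a * y) / det


def score(gi, gj, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """Negative squared block-weighted Frobenius norm of log(g_i^{-1} g_j)."""
    theta, v1, v2 = log_se2(relative_pose(gi, gj))
    # The rotation block [[0, -theta], [theta, 0]] has squared norm 2 * theta^2.
    sq_norm = lam_rot * 2.0 * theta ** 2 + lam_trans * (v1 ** 2 + v2 ** 2)
    return -sq_norm / tau


def attention_weights(tokens, i):
    """Softmax over the scores of token i against every token."""
    scores = [score(tokens[i], gj) for gj in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = [(0.1, 0.0, 0.0), (0.7, 1.0, 2.0), (-0.4, -1.0, 0.5)]
weights = attention_weights(tokens, 0)
```

Because the score depends only on $g_i^{-1} g_j$, multiplying every token on the left by the same group element leaves all scores unchanged, which is the invariance the abstract refers to.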

2603.07236 2026-06-19 cs.CV Version update

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

HY-WU (第一部分): 一种可扩展的功能性神经记忆框架及其在文本引导图像编辑中的应用

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

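Affiliations * Tencent HY Team

AI summary Proposes the HY-WU framework, whose functional neural memory module generates instance-specific weight updates on the fly, avoiding the interference caused by overwriting shared weights and addressing catastrophic forgetting in continual learning and personalization.

Details
AI-generated Chinese abstract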

基础模型正从离线预测器过渡到期望长时间运行的部署系统。在实际部署中,目标并非固定:领域漂移、用户偏好演变,以及模型发布后出现新任务。这将持续学习和即时个性化从可选功能提升为核心架构要求。然而,大多数适应流程仍遵循静态权重范式:训练后(或任何适应步骤后),推理执行单一参数向量,而不考虑用户意图、领域或实例特定约束。这将训练或适应后的模型视为参数空间中的单个点。在异构且持续演变的机制中,不同目标可能在参数上诱导分离的可行区域,迫使任何单一共享更新陷入妥协、干扰或过度专业化。结果,持续学习和个性化通常实现为对共享权重的重复覆盖,冒着先前学习行为退化的风险。我们提出HY-WU(权重释放),一种记忆优先的适应框架,将适应压力从覆盖单一共享参数点转移。HY-WU将功能性(算子级)记忆实现为神经模块:一个根据实例条件即时合成权重更新的生成器,产生实例特定算子而无需测试时优化。

English abstract

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO Version update

Critique of World Model

世界模型批判

Eric Xing, Mingkai Deng, Jinyu Hou

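AI summary Starting from the psychological notion of "hypothetical thinking", this essay argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and proposes a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.

Details
AI-generated Chinese abstract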

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

English abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue that the primary goal of a world model is simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2605.15231 2026-06-19 cs.LG cs.CV Version update

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Mask-Morph Graph U-Net:一种面向大几何变化下耐撞性场预测的可泛化网格代理模型

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

Affiliations * Dyson School of Design Engineering, Imperial College London; TUM School of Engineering and Design, Technical University of Munich; Faculty of Mechanical Engineering, OTH Regensburg; NVIDIA

AI summary Proposes Mask-Morph Graph U-Net, which uses feature-aligned barycentric parameterisation and node-masking pretraining to improve the generalisability and data efficiency of mesh-based simulation surrogates for crashworthiness design exploration.

Comments 48 pages, 15 figures, journal paper under review

Details
AI-generated Chinese abstract

非线性有限元碰撞模拟准确但计算成本高,限制了其在迭代设计优化中的应用。基于图神经网络(GNN)的机器学习替代模型提供了更快的替代方案。消息传递GNN广泛用于网格模拟,其共享节点和边更新函数在不同图结构中相对通用。相比之下,非共享边特定聚合层能更准确地捕捉非线性关系,但通常需要固定图连接性,限制了通用性。本文提出Mask-Morph Graph U-Net(MMGUNet),一种解决分层图U-Net架构限制的方法,该架构使用边特定下采样和上采样层。固定粗图连接性是边特定层所必需的。为了在保留此连接性的同时提高空间对应性,所提出的方法通过特征对齐的重心参数化将粗化图层次变形到每个输入网格,然后构建跨图边。它进一步在监督预训练中应用节点掩码,随后进行参数高效的微调,其中高参数边特定层被冻结。所提出的方法在分布内、分布外和跨组件迁移设置中使用均欧距离和最大入侵百分比误差进行评估。结果表明,粗图变形相对于固定粗图基线提高了测试准确性,而掩码监督预训练减少了训练-测试差异并提高了迁移期间的数据效率。所提出的模型还比外部基线取得了更低的预测误差。这些结果展示了通往可重用、数据高效网格替代模型的实用路径,用于碰撞worthiness设计探索。

English abstract

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2605.00569 2026-06-19 cs.CV cs.GR Version update

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

Affiliations * TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

Details
English abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for generating photorealistic renderings of a scene in real-time. However, the volumetric nature of 3DGS limits its ability to accurately capture surface geometry. To address this, 2D Gaussian Splatting (2DGS) was proposed to enable view-consistent and geometrically accurate surface reconstruction from multi-view images. However, 2DGS can be sensitive to the initialization of the Gaussian primitives. Reliance on Structure-from-Motion (SfM) initializations, which can produce poor estimates on challenging image sets, may lead to subpar results. In this work, we enhance 2DGS by incorporating monocular depth and normal priors to improve both geometric accuracy and robustness. We propose a depth-guided initialization strategy for Gaussians and introduce a clustering-based technique for pruning degenerate Gaussians. We evaluate our method on the DTU dataset, where it achieves state-of-the-art results in mesh reconstruction while preserving high-quality novel view synthesis.
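One plausible ingredient of a depth-guided initialization is back-projecting a depth map into 3D points that can seed Gaussian centres. The sketch below assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a depth map already scaled consistently with the scene; the abstract does not describe the actual procedure, so this is an illustration of the general idea rather than the authors' method.

```python
def unproject_depth(depth, fx, fy, cx, cy, stride=1):
    """Back-project a depth map (list of rows) to camera-space 3D points.

    Assumptions: pinhole intrinsics, depth measured along the optical axis,
    and non-positive depth values marking invalid pixels.
    """
    points = []
    for v in range(0, len(depth), stride):
        for u in range(0, len(depth[v]), stride):
            z = depth[v][u]
            if z <= 0:
                continue  # skip invalid pixels
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny hypothetical 2x2 depth map with one invalid pixel.
depth_map = [[2.0, 2.0],
             [0.0, 4.0]]
seed_points = unproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Monocular depth is usually predicted only up to scale, so in practice some alignment step would be needed before points like these could be mixed with SfM estimates.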

2511.23071 2026-06-19 cs.CV cs.AI cs.CL Version update

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations * Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

Details
English abstract

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

2603.27698 2026-06-19 cs.CV cs.DL Version update

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

Affiliations * Vesuvius Challenge, USA; Università degli Studi di Napoli Federico II, Italy; University of Glasgow, Scotland, UK; EduceLab, University of Kentucky, USA

Comments 9 pages, 3 figures, 2 tables. Currently under review

Journal ref Scientific Reports (2026)

Details
English abstract

Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.

2601.15119 2026-06-19 eess.IV cs.CV Version update

Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani, Md Sharif Mollah

Affiliations * Department of CSE, CCN University of Science & Technology; Department of EEE, International Islamic University Chittagong; Faculty of Engineering, Multimedia University; Department of CSE, Stamford University of Bangladesh; Department of Biology, Lucknow University; Department of CSE, Bangladesh Army International University of Science & Technology

Details
English abstract

Polycystic Ovary Syndrome (PCOS) is the most common endocrine disorder in women of reproductive age. Many Bangladeshi women suffer from PCOS at an older age. The aim of our research is to identify effective vision-based medical image analysis techniques and evaluate hybrid models for the accurate detection of PCOS. We introduce two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: "infected" (PCOS-positive) and "noninfected" (healthy ovaries). In the initial stage, our first hybrid model, 'DenConST' (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, 'DenConREST' (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy. Among all evaluated models, DenConREST showed the best performance. This research highlights an efficient solution for PCOS detection from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.

2508.21190 2026-06-19 cs.CV Version update

Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

Affiliations * Linköping University; Ericsson Research

Journal ref 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

Details
English abstract

Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly its radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. A reference implementation of the proposed solvers is made available as part of HomLib (https://github.com/marcusvaltonen/HomLib).
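The three distortion configurations can be made concrete with a small sketch. It assumes the one-parameter division model for radial distortion, which is a common choice in this literature but is not named in the abstract; the function names are hypothetical and unrelated to the HomLib API, and the code only evaluates a transfer error for given parameters rather than solving for them.

```python
import math


def undistort(p, lam):
    """Division model (an assumption): x_u = x_d / (1 + lam * |x_d|^2)."""
    x, y = p
    s = 1.0 + lam * (x * x + y * y)
    return (x / s, y / s)


def apply_homography(H, p):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def transfer_error(H, p1, p2, lam1, lam2):
    """Distance between H * undistort(p1) and undistort(p2).

    Case (i):   lam2 = 0      (distortion in one image only)
    Case (ii):  lam1 == lam2  (identical distortion)
    Case (iii): lam1 != lam2  (independent distortion)
    """
    q1 = apply_homography(H, undistort(p1, lam1))
    q2 = undistort(p2, lam2)
    return math.hypot(q1[0] - q2[0], q1[1] - q2[1])


# Synthetic case (i): only the first image is distorted.
H = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p1 = (0.5, 0.0)
lam = -0.2
p2 = apply_homography(H, undistort(p1, lam))

err_with_distortion = transfer_error(H, p1, p2, lam1=lam, lam2=0.0)
err_ignoring_distortion = transfer_error(H, p1, p2, lam1=0.0, lam2=0.0)
```

A minimal solver would estimate `H` and the distortion parameters jointly from a few correspondences; the error function above is the kind of residual one might use to score such estimates.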

2507.23027 2026-06-19 cs.CV cs.AI Version update

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations * Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

Details
English abstract

Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

1902.06202 2026-06-19 cs.CV cs.CG Version update

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

Affiliations * Michigan State University, Dept. of Computational Mathematics, Science and Engineering; Michigan State University, Dept. of Mathematics; Cooperative Institute for Marine and Atmospheric Studies, University of Miami; Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory; University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences

Details
English abstract

The diurnal cycle of tropical cyclones (TCs) is a daily cycle in clouds that appears in satellite images and may have implications for TC structure and intensity. The diurnal pattern can be seen in infrared (IR) satellite imagery as cyclical pulses in the cloud field that propagate radially outward from the center of nearly all Atlantic-basin TCs. These diurnal pulses, a distinguishing characteristic of the TC diurnal cycle, begin forming in the storm's inner core near sunset each day and appear as a region of cooling cloud-top temperatures. The area of cooling takes on a ring-like appearance as cloud-top warming occurs on its inside edge and the cooling moves away from the storm overnight, reaching several hundred kilometers from the circulation center by the following afternoon. The state-of-the-art TC diurnal cycle measurement has a limited ability to analyze the behavior beyond qualitative observations. We present a method for quantifying the TC diurnal cycle using one-dimensional persistent homology, a tool from Topological Data Analysis, by tracking maximum persistence and quantifying the cycle using the discrete Fourier transform. Using Geostationary Operational Environmental Satellite IR imagery data from Hurricane Felix (2007), our method is able to detect an approximate daily cycle.
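The last step of the pipeline, reading a cycle length off a time series of maximum persistence with the discrete Fourier transform, can be sketched as follows. The persistent-homology computation itself is not shown, and the hourly series below is synthetic rather than derived from the Hurricane Felix imagery; the function name and sampling interval are assumptions.

```python
import cmath
import math


def dominant_period(series, dt_hours):
    """Return the period (in hours) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant offset
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    # Frequency index k corresponds to k cycles over the whole record.
    return n * dt_hours / best_k


# Synthetic stand-in for hourly maximum-persistence values over three days.
max_persistence = [5.0 + 2.0 * math.cos(2 * math.pi * t / 24) for t in range(72)]
period_hours = dominant_period(max_persistence, dt_hours=1.0)
```

On real data the spectrum would be noisier, so the peak would indicate only an approximate daily cycle, in line with the wording of the abstract.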